Manipulating XML documents with CDuce

The tutorial below is also a valid CDuce program. You can save it as "" in a fresh directory and learn...

Tutorial (valid source code)

(* -*- tuareg -*-

                CDuce tutorial for the OCaml programmer

   CDuce is a programming language dedicated to the manipulation of XML
   documents. The official documentation is at


   This whole file constitutes a valid CDuce program. 
   -*- tuareg -*- on the first line tells emacs to load
   the tuareg-mode which is normally used for editing OCaml code,
   but works pretty well with CDuce too. 

   Run this program from a fresh directory, by executing:

   It should not display any error message.

   It is recommended to practice both with the interactive mode of cduce,
   and by modifying and compiling the code from emacs with tuareg-mode
   or caml-mode (other text editors are probably fine too).
   To start cduce in interactive mode, just type "cduce" on the command line.
   Most tips for using the ocaml toplevel apply here too.

   Prerequisites for this tutorial:
   - you should be reasonably familiar with XML,
   - you should be reasonably familiar with OCaml,
   - you should realize that CDuce is pretty different from OCaml,
     although it shares some syntactic similarities,
   - you should have a basic idea of what regular expressions are,
     and their usual notations (star, plus, question mark, vertical bar).

   Note about comments: C-style comments using /* and */ should be used
   for text that contains unmatched quotes, while OCaml-style comments
   using (* *) are preferred for commenting out pieces of code.
*)

/* Let's create a simple but realistic example
   that we will use throughout this tutorial. */

type a = <a>[ b* ]
type b = <b ..>[ (<c>String)* | Char* ]

let doc : a = <a> [ <b> [] 
		    <b name="b1"> [ <c> "c text 1"
				    <c> "c text 2" ]
		    <b name="b2"> "Pure Text" ]

/* doc represents the following XML code:

<a>
  <b/>
  <b name="b1">
    <c>c text 1</c>
    <c>c text 2</c>
  </b>
  <b name="b2">Pure Text</b>
</a>
*/

/* You can input and output XML data using some predefined functions.
   Here is a small list that should be enough for us now:
    print_xml: converts any data to a string (type String)
    print: prints a string to stdout
    dump_xml (CDuce versions >= 0.4.1): 
      takes any data and prints it directly to stdout
    dump_to_file: takes a file name (first argument), a string (second arg.), 
                  and writes the string to the file.

    load_xml: takes a file name or a URI, and loads it as XML.

  For a full list of primitives, see:

  Let's get started: the following code defines a function "test_io" which
  writes some XML data to a file, and reads it back from the file. Not very
  useful, but instructive.
*/

let test_io (file : Latin1) (data : Any) : Bool =
  let _ = dump_to_file file (print_xml data) in
  let data2 = load_xml file in
  if data = data2 then `true
  else `false

let _ =
  match test_io "doc.xml" doc with
      `true -> [] (* "nil" *)
    | `false -> raise "test_io didn't work as expected"
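
/* For OCaml programmers, here is a minimal sketch of the same round trip
   in plain OCaml. It is only an analogy: it compares raw strings instead
   of parsed XML values, and load_file is a hypothetical helper, not a
   CDuce primitive.

```ocaml
(* Write a string to a file, replacing any previous contents. *)
let dump_to_file (file : string) (data : string) : unit =
  let oc = open_out_bin file in
  output_string oc data;
  close_out oc

(* Read a whole file back into a string (hypothetical helper). *)
let load_file (file : string) : string =
  let ic = open_in_bin file in
  let n = in_channel_length ic in
  let s = really_input_string ic n in
  close_in ic;
  s

(* Same shape as the CDuce test_io: write, read back, compare. *)
let test_io (file : string) (data : string) : bool =
  dump_to_file file data;
  load_file file = data

let () =
  let file = Filename.temp_file "doc" ".xml" in
  let ok = test_io file "<a><b/></a>" in
  Sys.remove file;
  if not ok then failwith "test_io didn't work as expected"
```

   The CDuce version goes further, since load_xml parses the file and the
   comparison is made on structured values. */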

/* A few notes about the constructs that we saw above:
  - match-with is similar to OCaml, and if-then-else is just a specialization
    of a match-with to booleans;
  - _ has the same meaning as in OCaml;
  - there is no exception type: "raise" accepts data of any type as argument;
  - [] is used here like () in OCaml (unit type). It is actually the empty
    sequence, much like the empty list or nil. Sequences are the CDuce
    equivalent of lists, except that a sequence type can specify which kinds
    of elements occur, in which order, and how many times.
*/

/* We already did some pretty advanced stuff:
  - we defined the structure of an XML document (type a);
  - we defined an XML document (doc) of type a;
  - we exported and imported it back from a file;
  - we saw how to apply functions and one syntax for defining a function.

  Now we will see how to manipulate XML data effectively, i.e. transform
  an XML tree into some other data, which typically would be ready to
  be exported to OCaml.
*/

/* Let's explore several syntactic constructs that will allow us
   to do some common tasks */

/* Task 1: extract all the b nodes from doc.

   The slash operator (/) expects:
     - on the left: a sequence of XML nodes (an expression);
     - on the right: a pattern for matching all subnodes.

   It is important to note that the left-hand expression is a sequence, not
   just a node. This is why we have to put square brackets around doc:
   [doc] is a sequence of one element.
*/
let bnodes = [doc] / <b ..> _

/* The previous example was not extremely useful because it returns
  all the children of the single node <a>. That could have been
  achieved directly using simple pattern matching:
*/
let achildren = 
  match doc with
      <a> children -> children
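
/* For comparison, here is a rough OCaml sketch of the same extraction.
   The xml type below is a hypothetical, untyped tree invented for this
   illustration; it is not a type provided by CDuce or by any OCaml library.

```ocaml
(* Hypothetical untyped XML tree: tag, attributes, children. *)
type xml =
  | Element of string * (string * string) list * xml list
  | Text of string

(* The doc value from the tutorial, rebuilt by hand. *)
let doc =
  Element ("a", [], [
    Element ("b", [], []);
    Element ("b", [("name", "b1")], [
      Element ("c", [], [Text "c text 1"]);
      Element ("c", [], [Text "c text 2"]);
    ]);
    Element ("b", [("name", "b2")], [Text "Pure Text"]);
  ])

(* Counterpart of: match doc with <a> children -> children.
   The OCaml type cannot say that doc is an <a> element, so a
   fallback case is needed to keep the match exhaustive. *)
let achildren =
  match doc with
  | Element ("a", [], children) -> children
  | _ -> []
```

   In CDuce no fallback case is needed, because the type of doc already
   guarantees that the value is an <a> element. */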

/* Before continuing, let's have a closer look at the pattern matching above.
  The pattern "children", on the left side of the arrow, binds the variable
  "children" to the sequence which constitutes the contents of the <a> node.
  The pattern matching is complete because the type of doc is a.
  It is however possible to cast "doc" to a more general type
  (a supertype of a).
  For example, the predefined type "Any" represents any possible CDuce value,
  XML or not: a value of any type can be cast to type Any.
  Let's do it: we create doc2, which is the same document as doc, just
  with the general type Any:
*/

let doc2 = doc : Any

/* But now, if you try to define the "achildren" example using doc2 instead
  of doc, cduce will complain. The peculiarity of this type system, as opposed
  to the type system of OCaml, is that there is no polymorphism that
  uses type parameters (e.g. 'a) as in OCaml. For example, you cannot
  define a polymorphic identity function in CDuce: it would always return
  something of type Any.
  In OCaml, the identity function can be defined as follows:
    let identity x = x
  Its signature is:
    val identity : 'a -> 'a
  So in OCaml, (identity 123) has type int like 123.
  In CDuce, a generic identity function would always return an object of type
  Any. Let's define it:
*/

let identity (x : Any) : Any = x

/* If you try it in the cduce toplevel, you get this:

      # let identity (x : Any) : Any = x;;
      > val identity : Any -> Any = <fun>

      # identity 123;;
      - : Any = 123

  And if you try to use it as an Int, you get one of those common type errors:

      That works all right (by the way, note the funny type "124" 
      which is a subtype of Int):

      # 123 + 1;;
      - : 124 = 124

      That's the problem we are talking about:

      # identity 123 + 1;;
      Characters 0-12:
      This expression should have type:
      { .. } | Int
      but its inferred type is:
      which is not a subtype, as shown by the sample:

  These error messages can be confusing, but they often mean that a more
  specific type was expected. You may have forgotten a downcast
  (see below), or your data may not fit one of your type definitions.

  It is possible to view the same object with another type:
    - a more general type (supertype) is always allowed;
    - a more specific type (subtype) is allowed, if it matches the structure
      of the object.

  The former is also possible in OCaml.
  The latter is a downcast and it is not possible in OCaml,
  since it requires storing some type information at runtime. In CDuce,
  some typing happens at runtime (dynamically), so downcasts
  are possible, and naturally may cause runtime errors.

  1. You can change the type of an object to a supertype (upcast) using ":".
  This is done statically, so you will get a message from the compiler
  if the given type does not include the current type of the object.

  2. You can change the type of an object to any compatible type 
  (downcast or upcast) using ":?". 
  This is done at runtime and raises an exception if the requested type 
  is not compatible with the structure of the object.

  The usefulness of static type conversions is limited, just like in OCaml,
  since there is little need to purposefully set the type of an object
  to a more general type: it is done automatically when the object
  is passed as an argument to a function which expects a more general type.

  Downcasts are not possible directly in OCaml, and are generally
  considered bad practice anyway. Here, we will use them to check and assign
  a type to an XML document, which usually comes from some data loaded
  at runtime. Typically, we would load our "doc.xml" file as follows:
*/

let doc_reloaded = load_xml "doc.xml" :? a
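
/* To make the idea of a runtime structure check concrete, here is a
   minimal OCaml sketch that validates a tree against the shapes of types
   a and b by hand. The xml type and the names is_a, is_b, is_c and check_a
   are hypothetical and exist only for this illustration.

```ocaml
(* Hypothetical untyped XML tree: tag, attributes, children. *)
type xml =
  | Element of string * (string * string) list * xml list
  | Text of string

let is_text = function Text _ -> true | Element _ -> false

(* <c>String : a <c> element without attributes and with text-only content. *)
let is_c = function
  | Element ("c", [], kids) -> List.for_all is_text kids
  | _ -> false

(* type b = <b ..>[ (<c>String)* | Char* ] : any attributes, and content
   that is either only <c> elements or only text. *)
let is_b = function
  | Element ("b", _, kids) ->
      List.for_all is_c kids || List.for_all is_text kids
  | _ -> false

(* type a = <a>[ b* ] : no attributes, only b children. *)
let is_a = function
  | Element ("a", [], kids) -> List.for_all is_b kids
  | _ -> false

(* Rough counterpart of the dynamic cast to type a, once the file has been
   parsed: return the tree unchanged if it conforms, fail otherwise. *)
let check_a (t : xml) : xml =
  if is_a t then t else failwith "value does not conform to type a"
```

   Unlike ":?", this check does not refine the static type: the result
   of check_a is still a plain xml value as far as OCaml is concerned. */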

/* The command above may fail if the file "doc.xml" does not contain
  an XML document that conforms to type a.
  It is now clear that we use the dynamic cast operator ":?" as a way
  of matching the structure of a document against some predefined pattern,
  i.e. a type.
  Once an XML document has been validated, it can be passed as an argument
  to functions that work exclusively on that type.
*/

/* Let's go back to our sheep, as we say in French.
  We wanted to extract some nodes from our data.
  We saw that we can take a sequence of nodes, select 
  and regroup all the children that match some pattern, using the 
  slash operator:

    let bnodes = [doc] / <b ..> _

  We were saying that this thing above was a bit complicated for just
  extracting the children of <a>. Let's jump to task 2.
*/

/* Task 2: Extract only the <b> nodes that have a "name" attribute. 

  Very easy, we just have to make the pattern (right-hand side of the slash)
  a little more specific:
*/

let named_bnodes = [doc] / <b name=_ ..> _

/* Using the same technique twice, we can extract the grandchildren of <a>: */
let cnodes = [doc] / <b ..> _ / <c> _
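
/* One way to picture the slash operator from OCaml is a function that
   takes a sequence of nodes and keeps the children that satisfy a
   predicate. The sketch below is an approximation under that assumption;
   the xml type and the names select, is_b, is_named_b and is_c are
   hypothetical.

```ocaml
(* Hypothetical untyped XML tree: tag, attributes, children. *)
type xml =
  | Element of string * (string * string) list * xml list
  | Text of string

(* Rough counterpart of  seq / pattern : collect, in document order, the
   children of every element of seq that satisfy pred. *)
let select (pred : xml -> bool) (seq : xml list) : xml list =
  List.flatten
    (List.map
       (function
         | Element (_, _, kids) -> List.filter pred kids
         | Text _ -> [])
       seq)

(* <b ..> _ : any <b> element, whatever its attributes. *)
let is_b = function Element ("b", _, _) -> true | _ -> false

(* <b name=_ ..> _ : a <b> element that has a "name" attribute. *)
let is_named_b = function
  | Element ("b", attrs, _) -> List.mem_assoc "name" attrs
  | _ -> false

(* <c> _ : a <c> element without attributes. *)
let is_c = function Element ("c", [], _) -> true | _ -> false

let doc =
  Element ("a", [], [
    Element ("b", [], []);
    Element ("b", [("name", "b1")], [
      Element ("c", [], [Text "c text 1"]);
      Element ("c", [], [Text "c text 2"]);
    ]);
    Element ("b", [("name", "b2")], [Text "Pure Text"]);
  ])

let bnodes = select is_b [doc]
let named_bnodes = select is_named_b [doc]
(* Chaining two selections reaches the grandchildren. *)
let cnodes = select is_c (select is_b [doc])
```

   Because select both takes and returns a list, calls can be chained
   in the same way as slashes. */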

/* Note that the code above only selects the <c> nodes without attributes,
  because we omitted the ".." wildcard.
  It's okay because this is what we want, but using .. may be a good habit 
  in general.

  It is nice to be able to go down the hierarchy using a sequence of 
  node patterns separated by slashes, like for a filesystem. 
  This explains why the expression (on the left) must be
  a sequence of nodes rather than just a node.
*/

/* Task 3: Extract the strings that are enclosed within <c> tags, as a sequence
           of strings (rather than a sequence of <c> nodes) */

/* From the previous example, we know how to extract the <c> nodes,
   and they are already stored in the cnodes variable.
   We are going to convert the sequence of <c> nodes into a sequence
   of the same length containing what we want. For this, we use
   the map-with construct. It is analogous to List.map in OCaml, but
   unlike List.map it is not a function.
*/

let ccontents = map cnodes with <c> x -> x
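
/* The OCaml counterpart of this step can be sketched with List.map.
   The xml type and the text_of helper are hypothetical and serve only
   this illustration.

```ocaml
(* Hypothetical untyped XML tree: tag, attributes, children. *)
type xml =
  | Element of string * (string * string) list * xml list
  | Text of string

(* The <c> nodes of the tutorial document, written out by hand. *)
let cnodes = [
  Element ("c", [], [Text "c text 1"]);
  Element ("c", [], [Text "c text 2"]);
]

(* Concatenate the text found directly inside a node, ignoring
   nested elements. *)
let text_of = function
  | Element (_, _, kids) ->
      String.concat ""
        (List.map (function Text s -> s | Element _ -> "") kids)
  | Text s -> s

(* One string per <c> node, in the same order. *)
let ccontents = List.map text_of cnodes
(* ccontents = ["c text 1"; "c text 2"] *)
```

   Here the mapped function has to handle every constructor of xml, whereas
   the CDuce branch relies on the static type of cnodes. */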

/* Note that what follows the mandatory "with" keyword is a pattern matching,
   not a function. But we can create our own fmap function which
   would take a function as its first argument, and map it over the
   sequence passed as second argument:
*/

let fmap (f : Any -> Any) (seq : [Any*]) : [Any*] =
  map seq with x -> f x

/* As opposed to OCaml's List.map and other polymorphic functions,
   the result of fmap would always be of type [Any*] which is the most
   general type of sequence.
   So if you want to use such a function, the result would have
   to be downcast using ":?", which involves a runtime check of
   your data. So you should probably not use that technique.

   However a workaround is presented there:
*/

/* Task 4: Write a function that selects <b> nodes that have a "name"
           attribute of a certain value. This value should be passed
           as a parameter to the function.

  Here is the solution:
*/

let select_bnode (name : String) (seq : [b*]) : [b*] =
  transform seq with 
    x & <b name=y ..> _ -> if y = name then [x] else []

let b1_nodes = select_bnode "b1" bnodes
let b2_nodes = select_bnode "b2" bnodes
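
/* An OCaml sketch of the same selection may help: each element is mapped
   to a list of zero or one element, and the lists are then flattened.
   The xml type is hypothetical, and the pattern alias "as" stands in for
   the "&" used in the CDuce pattern.

```ocaml
(* Hypothetical untyped XML tree: tag, attributes, children. *)
type xml =
  | Element of string * (string * string) list * xml list
  | Text of string

(* Map every element to a list, then join the lists together. *)
let transform f l = List.flatten (List.map f l)

(* Keep the <b> elements whose "name" attribute equals name. *)
let select_bnode (name : string) (seq : xml list) : xml list =
  transform
    (function
      | Element ("b", attrs, _) as x ->
          (match List.assoc_opt "name" attrs with
           | Some y when y = name -> [x]
           | _ -> [])
      | _ -> [])  (* explicit catch-all: drop everything else *)
    seq

(* The <b> nodes of the tutorial document, written out by hand. *)
let bnodes = [
  Element ("b", [], []);
  Element ("b", [("name", "b1")], [
    Element ("c", [], [Text "c text 1"]);
    Element ("c", [], [Text "c text 2"]);
  ]);
  Element ("b", [("name", "b2")], [Text "Pure Text"]);
]

let b1_nodes = select_bnode "b1" bnodes
let b2_nodes = select_bnode "b2" bnodes
```

   The catch-all case has to be written explicitly in OCaml. */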

/* This solution introduces two main novelties:
   - the transform-with syntactic construct,
   - the "&" operator in patterns.

  First, let's see what transform-with does. Like map-with, it is a language
  construct, not a function. Like map-with, it scans the elements of 
  a sequence and returns another sequence.
  Its role is to allow mapping and filtering of data at the same time.
  Each item of the list is pattern-matched and must be 
  converted into a sequence of zero, one or maybe more elements.
  With map-with it would result in a sequence of sequences, but here
  the result is flattened, i.e. all the sequences are joined together.

  In older versions of OCaml there is no such builtin functionality
  (List.concat_map was added in OCaml 4.10), but an equivalent polymorphic
  function could be written as follows:
    # let transform f l = List.flatten (List.map f l) ;;
    val transform : ('a -> 'b list) -> 'a list -> 'b list = <fun>

  In the transform-with construct, pattern-matching always succeeds, since
  an invisible catch-all case is added and is equivalent to returning 
  the empty sequence []. In other words, all elements that don't match
  are discarded.

  Now let's look at the pattern. It uses "&", placed between two patterns.
  The first pattern "x" matches everything and is just used to bind
  a variable (x) to the whole element. The second pattern "<b name=y ..> _"
  selects <b> elements that have a "name" attribute.
  So the "&" here is used like the "as" keyword in OCaml's pattern matching.
  It is however more general since it forces a single object
  to match two different patterns.

  Please note that CDuce also has a "::" operator, whose role is to name
  subsequences; it only appears within the square brackets of sequence
  patterns, e.g.:
     # match [1 2 3 4] with [ _ x :: (_ _) _ ] -> x;;
     - : [ 2 3 ] = [ 2 3 ]
*/


/* Task 5: Understanding types */

/* CDuce provides a broad set of types, which are reminiscent of OCaml types.
  In addition to those, XML types exist and can be used to represent
  some XML data. However, there are several interesting considerations to take
  into account.
*/

/* 1) so-called XML types can represent more than just XML documents. In XML,
    data are always string-based. Here, other types can be used, such
    as Ints or records. When an object of an XML type is converted to
    real XML, an error occurs if some part of it has no XML representation:
    for instance Ints are translated to their string representation, but
    other types like records cause an error.
    The following object has an XML type: it is a tag <a> that contains
    a record, and it can be manipulated within CDuce:
*/

let xml_with_record = <a> { x = 1; y = 2 }

/* but it cannot be converted to a traditional XML file because
   records don't exist in real XML. So if you try it,
      print_xml (<a> { x = 1; y = 2 })
   would fail.
*/

/* 2) type and variable names can be capitalized or not, 
    but they are case-sensitive, just like XML attribute labels. 
    In addition, type names can be used in pattern-matching, 
    just like capture variables.
    For example, the meaning of
       match 123 with t -> 456
    depends on the context:
    - If a type t was defined, it means that the structure
      of the matched value (here 123) is checked against the pattern
      defined by type t.
    - If there is no such type as t, then t is considered as a variable,
      which here would be bound to 123.

    Test 1: t as a variable (the warning is expected):

      # match 123 with t -> 456;;
      Characters 15-23:
      Warning: The capture variable t is declared in the pattern but not used in the body of this branch. It might be a misspelled or undeclared type or name (if it isn't, use _ instead).
      - : 456 = 456

    Test 2: t as a type

      # type t = Int;;           
      # match 123 with t -> 456;;
      - : 456 = 456

    Test 2 works because 123 actually belongs to type t, i.e. Int.
    Using an incompatible type such as String results in an error:

      # match 123 with String -> 456;;
      Characters 6-9:
      This expression should have type:
      but its inferred type is:
      which is not a subtype, as shown by the sample:
*/


/* Task 6: Making CDuce functions and data available to an OCaml program.

  Here is what you need:
  - a CDuce program (
  - a compatible OCaml interface file for the CDuce program (a.mli)

  A CDuce file will constitute an OCaml module. Essentially, cduce will
  compile it into an OCaml implementation file, which uses the CDuce runtime.

  Sequence of commands to produce the OCaml implementation:

    ocamlfind ocamlc -c a.mli -package cduce
    cduce --compile
    cduce --mlstub a.cdo >

  Then is compiled normally with either ocamlc or ocamlopt, 
   using the CDuce library:

    ocamlfind ocamlopt -c -package cduce

  In a Makefile, in addition to the rules you already use to compile 
  all your .mli and .ml files, you can add those two:

    a.cdo: a.cmi
    	cduce --compile a.cdo
    	cduce --mlstub a.cdo >

  The correspondence between OCaml and CDuce types is described
  in the official documentation at

  We will just give a few remarks and a simple example.

  About translating OCaml types to CDuce types:
  - Not all CDuce types can be converted to OCaml types.
  - Some CDuce types can be converted into different kinds of OCaml types,
    depending on how you define the OCaml interface.
  - Some types remain abstract in OCaml. The most common example is the Char
    type which forms the String type (String is an alias for [Char*]). 
    If you want to use OCaml's string type, you have two options:

    1. If your string only uses Unicode codes 0 to 255, then you can convert
       it from String to Latin1, e.g.  yourstring :? Latin1
    2. If your string may contain Unicode characters above 255, then you 
       may want to export them as-is. The OCaml type you get is 
       Cduce_lib.Encodings.Utf8.t, and it can be converted to a regular
       OCaml string (UTF8 encoded) with the 
       Cduce_lib.Encodings.Utf8.to_string function.
  It is recommended that you browse through the available functions of
  the CDuce library using a tool like ocamlbrowser.

  Example: some OCaml types, followed by their CDuce counterparts

  type opt = string option
  type opt8 = Cduce_lib.Encodings.Utf8.t option
  type stringlist = string list

  type variant = A | B of variant
  type variantpoly = [ `A | `B of variant ]

  type f = bool -> unit
  type flab = lab:bool -> unit

  type point = { x : float;
                 y : float }
*/

type opt = [ Latin1? ]
type opt8 = [ String? ]
type stringlist = [ Latin1* ]

type variant = `A | (`B, variant)
type variantpoly = `A | (`B, variant)

type f = Bool -> []
type flab = Bool -> []

type point = { x = Float;
	       y = Float }

/* Complete information about interfacing CDuce and OCaml is given at
*/