Main concepts

Introduction

The XML Template Language of Lavoisier transforms an input XML data stream into a new output XML data stream by means of declarative rules rather than imperative instructions. This approach makes your application easier to maintain, and it improves performance because Lavoisier can process the data stream on-the-fly without building large in-memory data structures.

An XML Template is itself written in XML, following the syntax described below.

An XML Template is a tree of rules that follows the hierarchy of the nodes in the input XML data stream. You can chain several templates under the tag <processors>.


The 13 rule types

The core of the XML template language is composed of a few keywords:

  • Node types: element, attribute, text, comment
  • Actions: create, ignore, keep/update

The 13 rule types are obtained by combining these keywords:

            create                        ignore               keep/update
element     <element-create>,             <element-ignore>     <element>
            <element-create-as-parent>
attribute   <attribute-create>            <attribute-ignore>   <attribute>
text        <text-create>                 <text-ignore>        <text>
comment     <comment-create>              <comment-ignore>     <comment>

As in XML itself, where only elements can have children, the <element-*> rules (except <element-create>) can contain child rules, while the other rules (<attribute-*>, <text-*>, <comment-*>) cannot contain any child rule.
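
As an illustration, the sketch below applies the three actions to attribute nodes under a single element rule. The node names (root, obsolete, count, generated) are hypothetical and are not taken from any real input.

```xml
<element in="root">
    <!-- ignore: drop the attribute @obsolete -->
    <attribute-ignore in="obsolete"/>
    <!-- keep/update: increment the value of the attribute @count -->
    <attribute in="count">. + 1</attribute>
    <!-- create: add a new attribute @generated holding a literal value -->
    <attribute-create out="generated">'true'</attribute-create>
</element>
```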


Implicit rules
  • <element>: For these rules, child nodes that are not explicitly ignored are implicitly kept.
  • <element-ignore>: For these rules, child nodes that are not explicitly kept are implicitly ignored.
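
The two implicit behaviours can be contrasted on a hypothetical input in which <root> has the child elements <public> and <secret> (these names are assumptions made for the illustration). The block shows two alternative templates, not a single one:

```xml
<!-- Variant 1: keep <root>; its children are kept unless a rule ignores them -->
<element in="root">
    <element-ignore in="secret"/>
</element>

<!-- Variant 2: ignore <root>; its children are ignored unless a rule keeps them -->
<element-ignore in="root">
    <element in="public"/>
</element-ignore>
```

Under these assumptions, variant 1 is expected to drop only <secret>, whereas variant 2 is expected to drop <root> and <secret> and to keep <public>.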

XPath

In each rule, the predicate and the value of created or modified nodes are obtained by evaluating an expression written in the XPath language.

XPath lets you define a path into an XML document, much as you would describe a path in a file system:

cd /usr/local/lib
cd ../include
ls .

The main differences with file system paths are:

  • In addition to the 3 usual axes (child, parent, self), XPath supports other axes:
    • ancestor
    • ancestor-or-self
    • descendant
    • descendant-or-self
    • preceding-sibling
    • preceding
    • following-sibling
    • following
  • XPath has a powerful predicate support on each node of the path. Recursively, a predicate is composed of an XPath expression that can contain predicates.
  • XPath supports functions. In Lavoisier, supported functions are the following:
  • XPath supports namespaces. If the input XML is bound to a namespace, then this namespace must be mapped to a prefix with xmlns:yourPrefix="namespace", and this prefix must be added on every element step of your path.

These few differences with file system paths make XPath a far more powerful language than you may think!
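
A minimal sketch of these features used in a rule, assuming a hypothetical DocBook-like input (the element names db:section and db:title are only illustrative). It combines an axis, a predicate nested in another predicate, the function not(), and a namespace prefix repeated on every element step:

```xml
<processors xmlns:db="http://docbook.org/ns/docbook">
    <element in="db:article">
        <!-- Ignore an untitled section that follows a sibling section
             whose title contains some text -->
        <element-ignore in="db:section"
                        if="not(db:title) and preceding-sibling::db:section[db:title[text()]]"/>
    </element>
</processors>
```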


Context of a rule

The relative position of a rule is the same as the relative position of its matching nodes. Thus, the context of a rule is its matching node, and any XPath expression defined in a rule is relative to this matching node (unless it starts with the character '/', of course).

As a consequence, accessing the child nodes of the parent of the context node requires first navigating up to this parent element (..). This also applies to the nodes being created:

            <element in="root">
                <attribute in="anAttributeOfRoot">../text() + 1</attribute>
                <attribute-create out="anotherAttributeOfRoot">../text() + 2</attribute-create>
            </element>

Relative paths are of course not restricted to the parent axis. Any XPath axis can be used, although some of them (in particular the axes "preceding" and "following") may significantly increase the size of the data structure that Lavoisier needs in order to execute the rule:

            <element in="root">
                <element in="node" out="hasDescendantXXX" if="descendant::XXX"/>
            </element>

The context remains unchanged throughout the current template. Although the processing is done on-the-fly, the result of the execution (i.e. the modified context) is only visible after the template (for example in the next template or adaptor).

In other words, a rule does not modify the context of the other rules of the same template:

            <element in="root">
                <attribute in="anAttributeOfRoot">'newValue'</attribute>
                <element in="node">
                    <!-- will take the old value of attribute @anAttributeOfRoot rather than its new value 'newValue' -->
                    <attribute in="anAttributeOfNode">ancestor::root/@anAttributeOfRoot</attribute>
                </element>
            </element>

As a consequence, writing a rule that gets data from node(s) removed within the same template makes sense, and it will work:

            <element in="root">
                <element-ignore in="node"/>
                <element-create>new_element('new', ../node/@anAttributeOfNode)</element-create>
            </element>

Impact of rules order

The order of the rules may affect which rule is selected for a given node: when several rules match the current node, the first one is chosen. If the rules are mutually exclusive (i.e. at most one rule can match any given node), then the order does not matter.

            <element in="root">
                <element in="node" out="hasChild" if="*"/>
                <element in="node" out="isLeaf" if="not(*)"/>
            </element>
...is the same as:
            <element in="root">
                <element in="node" out="isLeaf" if="not(*)"/>
                <element in="node" out="hasChild" if="*"/>
            </element>
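
The two templates above are interchangeable only because their conditions never overlap. When several rules can match the same node, the order matters. In the sketch below (the attribute @type and the output names are hypothetical), a node with @type='special' matches both rules, so only the first one applies to it:

```xml
<element in="root">
    <!-- More specific rule first: chosen for nodes with @type='special' -->
    <element in="node" out="special" if="@type='special'"/>
    <!-- Fallback rule: chosen for every other node -->
    <element in="node" out="other"/>
</element>
```

Swapping these two rules would rename every node to other, because the unconditional rule would always be the first match.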

The order of the rules does not impact the order in which they are executed. Indeed, the rules are executed in the order of the matching nodes in the input XML stream.


How to set a variable

An additional keyword, <set>, sets a variable that can be used within the template.

Supported text/attribute nodes are:

  • text(): The content of the variable, specified as an XPath expression (required). This XPath expression is relative to the current context. The variable can contain a string, as well as a set of nodes for example.
  • variable: The name of the variable (required). You can read its value in any XPath expression that is under its scope, by prefixing its name with the character '$' (e.g. $myVar).
  • index: The key of each selected node, specified as an XPath expression (optional). This XPath expression is relative to each selected node. This attribute allows for significantly improving performance by converting the current variable to a hash-table, when this variable contains a list of nodes. The obtained hash-table variable must then be used as the first argument of the XPath function find(), the second argument being the value of the key of the entry in the hash-table.

As we would expect, the scope of this variable is the subtree under the node for which the variable has been set.
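
The following sketch shows how these pieces might fit together. The input structure (a root element with an attribute @owner and child elements person and order) and all names are hypothetical. It also assumes that the context of <set> is the node matched by the enclosing rule, and that find() returns the node stored under the given key:

```xml
<element in="root">
    <!-- Plain variable: the value of the attribute @owner of <root> -->
    <set variable="owner">@owner</set>
    <!-- Indexed variable: the child <person> elements, keyed by their @id -->
    <set variable="persons" index="@id">person</set>
    <element in="order">
        <!-- Both variables are in scope anywhere in the subtree under <root> -->
        <attribute-create out="owner">$owner</attribute-create>
        <!-- Look up the person whose @id equals the @customerId of this order -->
        <attribute-create out="customerName">find($persons, ../@customerId)/@name</attribute-create>
    </element>
</element>
```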

Rule attributes

Rule attributes that contain an XPath expression

These attributes contain an XPath expression:

  • if: The rule will match the current node only if the specified XPath predicate evaluates to "true".
  • attributes: The current node will take the list of attributes specified by the XPath expression. You can use this to select existing or new nodes, separated with the union operator ('|'). Selected element nodes will be automatically converted into attribute nodes, using the element name as the attribute name, and the child text node as the attribute value.
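
A minimal sketch combining both attributes, assuming a hypothetical input where each person element may have an attribute @id and a child element name:

```xml
<element in="root">
    <!-- Match only the persons that have an @id; keep @id and convert the
         child element <name> into an attribute @name -->
    <element in="person" if="@id" attributes="@id | name"/>
</element>
```

With an input such as <person id="1"><name>Ada</name></person>, this rule is expected to yield a person element carrying the attributes id="1" and name="Ada".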

All the rules except <element>, <element-ignore> and <element-create-as-parent> take an XPath expression as their text node.

Note that setting a literal value in a field that expects an XPath expression requires enclosing this value in single or double quotes (' or "):

            <element in="root">
                <attribute in="myAttribute">'this is the new value'</attribute>
            </element>

Rule attributes that do not contain an XPath expression

The following attributes do not contain an XPath expression:

  • in: If missing then any node will match the rule (optional). The value of this attribute is a Qualified Name (i.e. a node name with an optional namespace prefix and ':'). Supported by:
    • <element> and <attribute>
    • <element-ignore> and <attribute-ignore>
    Example:
    <processors xmlns:db="http://docbook.org/ns/docbook">
        <element in="db:article"/>
    </processors>
  • out: If specified then the node will be (re)named. The value of this attribute is a Qualified Name (i.e. a node name with an optional namespace prefix and ':'). Supported by:
    • <element> and <attribute> (optional)
    • <element-create-as-parent> and <attribute-create> (required)
    Example:
    <processors xmlns:db="http://docbook.org/ns/docbook" xmlns:x="http://www.w3.org/1999/xhtml">
        <element in="db:article" out="x:html"/>
    </processors>
  • as: Set the relative position of the created nodes. The value of this attribute is an enumeration: last-child (default), first-child, preceding-sibling, following-sibling. Supported by:
    • <element-create> (optional)
    Note that <element-create-as-parent> is defined as a separate rule type because its supported child nodes are element rules, rather than a text node containing an XPath expression.
    Example:
                <element in="root">
                    <element-create as="first-child">new_element('first-child')</element-create>
    
                    <element-create as="preceding-sibling">new_element('preceding-sibling')</element-create>
                    <element in="node"/>
                    <element-create as="following-sibling">new_element('following-sibling')</element-create>
    
                    <element-create-as-parent out="parent">
                        <element in="ex-last-node"/>
                    </element-create-as-parent>
    
                    <element-create as="last-child">new_element('last-child')</element-create>
                </element>
    ... may produce the following output:
    <root>
        <first-child/>
        <ex-first-node/>
    
        <preceding-sibling/>
        <node/>
        <following-sibling/>
    
        <parent>
            <ex-last-node/>
        </parent>
        <last-child/>
    </root>
  • namespace-ignore: A convenient way to remove the namespace of the current element node. The value of this attribute is a boolean. Supported by:
    • <element>
    This attribute must not be set when the attribute out is set, since the latter already allows renaming the node without its initial namespace.
  • recursive: A convenient way to recursively apply a rule. The value of this attribute is a boolean. Supported by:
    • <element> and <element-ignore>
    Example:
                <element in="person" if="@birthday='true'" recursive="true">
                    <attribute in="age">. + 1</attribute>
                </element>
    ... will increment the age of all the persons with attribute @birthday='true' in a genealogy tree (i.e. each person may have one or more child persons).
  • flat: A convenient way to flatten the current XML subtree, by moving the child nodes of the current node so that they become its following siblings. The value of this attribute is a boolean. Supported by:
    • <element> and <element-ignore>
    Example:
                <element in="person">
                    <element in="person" out="descendant" flat="true" recursive="true"/>
                </element>
    ... will return the list of descendants of the root node in a genealogy tree (i.e. each person may have one or more child persons).
  • future: Force Lavoisier to build an in-memory data structure from here. The value of this attribute is a boolean. Use this attribute only when Lavoisier cannot automatically detect the need for it by analysing your XML template, for example when you invoke the function eval() to evaluate an XPath expression that has been dynamically generated by your own code. Supported by:
    • <element> and <element-ignore>
    Example:
                <element in="root">
                    <element in="node" future="true">
                        <attribute in="anAttributeOfNode">eval($generatedXPath)</attribute>
                    </element>
                </element>
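
The attribute namespace-ignore has no example in the list above. As a sketch, assuming an input bound to the DocBook namespace, the following rule would output the element with its local name only:

```xml
<processors xmlns:db="http://docbook.org/ns/docbook">
    <!-- Expected to emit <article> with no namespace instead of <db:article> -->
    <element in="db:article" namespace-ignore="true"/>
</processors>
```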

Shortcut notations

XML Template shortcut notations
  • <elements path="root/node/leaf"/> is equivalent to this:
                <element in="root">
                    <element in="node">
                        <element in="leaf"/>
                    </element>
                </element>
  • <elements-ignore path="root/node/leaf"/> is equivalent to this:
                <element-ignore in="root">
                    <element-ignore in="node">
                        <element-ignore in="leaf"/>
                    </element-ignore>
                </element-ignore>

XPath shortcut notations
  • * is equivalent to child::*
  • . is equivalent to self::node()
  • .. is equivalent to parent::node()
  • .//foo is equivalent to ./descendant-or-self::node()/child::foo, which selects the same nodes as descendant::foo