Lavoisier 2 - Processors

        <processors>
            <!-- standard notation -->
            <processor type="SelectProcessor" match="/library/video[@title='Bride of Chucky']">
                <parameter name="single_node">true</parameter>
            </processor>
            <!-- short notation -->
            <select match="/library/video[@title='Bride of Chucky']" single="true"></select>
        </processors>

All the following examples are based on the same input, shown below:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>

You can test all the examples by copying the views provided in the file attached at the end of this document into your etc/lavoisier-config.xml, then putting the code in the <processor> placeholder.

Overview of the processor short notations available in Lavoisier:

<processors xmlns="http://software.in2p3.fr/lavoisier/config.xsd">
    <select match="/absolute/xpath" depth="UNBOUNDED|1..N" single="FALSE|true">
        <group by="{relative/xpath}" sort="UNSORTED|text|number|date" descending="FALSE|true"></group>
        <!-- @match MUST BE a string (attribute, text or comment) -->
        <distinct sort="UNSORTED|text|number|date" descending="FALSE|true"></distinct>
    </select>
    <remove match="/absolute/xpath" depth="UNBOUNDED|1..N"></remove>

    <insert match="/absolute/xpath" nodes="relative/xpath" as="LAST-CHILD|first-child|preceding-sibling|following-sibling|attribute">
        <connector type=""></connector>
        <serializer type=""></serializer>
        <processors></processors>
    </insert>
    <replace match="/absolute/xpath" nodes="relative/xpath">
        <connector type=""></connector>
        <serializer type=""></serializer>
        <processors></processors>
    </replace>

    <rename match="/absolute/xpath" name="{qualified_name}"></rename>
    <insert-parent match="/absolute/xpath" name="{qualified_name}"></insert-parent>

    <move match="/absolute/xpath">
        <to-following-sibling name="*|{qualified_name}" predicate="NULL|{predicate}" as="LAST-CHILD|first-child|preceding-sibling|following-sibling|attribute"></to-following-sibling>
        <to-ancestor name="*|{qualified_name}" predicate="NULL|{predicate}" as="LAST-CHILD|following-sibling"></to-ancestor>
    </move>

    <aggregate match="/absolute/xpath" function="COUNT|sum|avg|min|max"><!-- as="last-child" -->
        <to-last-child name="{qualified_name}" values="relative/xpath"></to-last-child>
    </aggregate>
    <merge match="/absolute/xpath"></merge>
</processors>

Selecting nodes with short notation

Notation

Short notation: <select>
Standard notation: SelectProcessor / SelectGroupByProcessor / SelectDistinctProcessor

Example 1 : select <video> "Bride of Chucky"

Using the select processor is like applying a filter: only the data matched by the XPath expression is selected.

        <processors>
            <select match="/library/video[@title='Bride of Chucky']"></select>
        </processors>

Output:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
</e:entries>

Notice: the output is automatically wrapped in an entries element so that the XML stays well-formed when the result consists of several nodes. You can avoid this behaviour by setting the single attribute to 'true': <select match="/library/video[@title='Bride of Chucky']" single="true"></select>
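To picture the filtering and wrapping behaviour described above, here is a minimal sketch in plain Python (standard library only). It is not Lavoisier code: the select() helper, its single flag and the un-namespaced entries wrapper are illustrative assumptions, and the input is a shortened copy of the library document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the library input used throughout this page.
LIBRARY = """
<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"/>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details><country>United States</country></details>
    </video>
    <book id="B1" type="" title="The Flowers of Evil"/>
</library>
"""


def select(root, path, single=False):
    """Hypothetical helper: keep only the nodes matched by `path` (relative to root)."""
    matches = root.findall(path)
    if single:
        # Assumption: single="true" is only meaningful when exactly one node matches.
        if len(matches) != 1:
            raise ValueError("expected exactly one matching node")
        return matches[0]
    # Several nodes cannot be a document on their own, so wrap them in one root.
    wrapper = ET.Element("entries")
    wrapper.extend(matches)
    return wrapper


root = ET.fromstring(LIBRARY)
wrapped = select(root, "video[@title='Bride of Chucky']")
alone = select(root, "video[@title='Bride of Chucky']", single=True)
print(wrapped.tag, [video.get("id") for video in wrapped])
print(alone.tag, alone.get("id"))
```

The first call returns a wrapper element holding the matched video, the second returns the video element itself.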

Example 2 : select all videos, keeping only the first level

        <processors>
            <select match="/library/video" depth="1"></select>
        </processors>

Output:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky"></video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
</e:entries>

Example 3 : get the list of @genre values (without duplicates) sorted in descending order.

        <processors>
            <select match="/library/*/@genre">
                <distinct sort="text" descending="true"></distinct>
            </select>
        </processors>

Output:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <e:entry key="music"></e:entry>
    <e:entry key="horror"></e:entry>
    <e:entry key="comedy"></e:entry>
</e:entries>

Trick to remove the prefix

You can add a RemoveNamespaceProcessor just after the distinct selection to strip the namespaces from your output:

        <processors>
            <select match="/library/*/@genre">
                <distinct sort="text" descending="true"></distinct>
            </select>
            <processor type="RemoveNamespaceProcessor" match="//*"></processor>
        </processors>

Output:

<entries>
    <entry key="music"></entry>
    <entry key="horror"></entry>
    <entry key="comedy"></entry>
</entries>
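The distinct selection of Example 3 amounts to collecting the attribute values into a set and sorting them. The following Python sketch shows that logic only; distinct_attribute() is a hypothetical helper, not part of Lavoisier, and the input keeps just the attributes of the library items.

```python
import xml.etree.ElementTree as ET

# Library items reduced to their attributes.
LIBRARY = """
<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"/>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky"/>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"/>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"/>
    <book id="B1" type="" title="The Flowers of Evil"/>
    <book id="B2" type="IT" date="2001" title="Learning XML"/>
</library>
"""


def distinct_attribute(root, name, descending=False):
    """Hypothetical helper: distinct values of attribute `name` on the children of root."""
    # A set drops duplicates; items without the attribute are skipped.
    values = {child.get(name) for child in root if child.get(name) is not None}
    # Plain string comparison, the counterpart of sort="text".
    return sorted(values, reverse=descending)


genres = distinct_attribute(ET.fromstring(LIBRARY), "genre", descending=True)
print(genres)  # ['music', 'horror', 'comedy']
```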

Inserting nodes with short notation

Notation

Short notation: <insert>
Standard notation: InsertProcessor / InsertParentProcessor

Example 1 : insert 'author' as an attribute of book 'Learning XML'

        <processors>
            <insert match="/library/book[@title='Learning XML']" as="attribute" nodes="new_attribute('author', 'Erik T. Ray')"></insert>
        </processors>

Output:

<library>
    ...
    ...
    ...
    ...
    ...
    <book id="B2" type="IT" date="2001" title="Learning XML" author="Erik T. Ray"></book>
</library>

Example 2 : insert a short description as first child element for items which have a @genre attribute

        <processors>
            <insert match="/library/*[@genre]" as="first_child" nodes="new_element('description', concat(@type, ' ', @genre))"></insert>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason">
        <description>movie horror</description>
    </video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <description> horror</description>
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo">
        <description>movie comedy</description>
    </video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten">
        <description>documentary music</description>
    </video>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>

Example 3 : insert <book> elements in a <books> parent as last child of <library>

        <processors>
            <insert-parent match="/library/book" name="books"></insert-parent>
        </processors>

Output:

<library>
    ...
    ...
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
    <books>
        <book id="B1" type="" title="The Flowers of Evil"></book>
        <book id="B2" type="IT" date="2001" title="Learning XML"></book>
    </books>
</library>

Notice: this groups together all consecutive <book> elements.

Example 4 : joining 2 data views

This processor lets you join two XML documents. Only basic usage is covered here; please read the detailed documentation for more: Join 2 data views

Below is a document containing the details of each item of the library. Assume this document is available in Lavoisier through a view called 'library_details'.

<library_details>
    <details id="V1">
        <country>United States</country>
        <director>Ronny Yu</director>
        <producer>Sean S. Cunningham, Douglas Curtis, Stokely Chaffin, Robert Shaye et Renee Witt</producer>
    </details>
    <details id="V3">
        <country>United States</country>
        <director>Woody Allen</director>
        <producer>Robert Greenhut</producer>
    </details>
    <details id="V4">
        <director>Julien Temple</director>
    </details>
    <details id="B1">
        <country>France</country>
        <editor>Larousse</editor>
        <author>Charles Baudelaire</author>
    </details>
    <details id="B2">
        <country>United States</country>
        <editor>O'Reilly</editor>
    </details>
</library_details>

This example appends to each library item its details, as found in the 'library_details' view:

        <processors>
            <insert match="/library/*">
                <connector type="XMLConnector">
                    <parameter name="content" eval="document('input/library_details.xml')"></parameter>
                </connector>
                <processors>
                    <select match="/library_details/details[@id=parent_match()/@id]" single="true"></select>
                </processors>
            </insert>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason">
        <details id="V1">
            <country>United States</country>
            <director>Ronny Yu</director>
            <producer>Sean S. Cunningham, Douglas Curtis, Stokely Chaffin, Robert Shaye et Renee Witt</producer>
        </details>
    </video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo">
        <details id="V3">
            <country>United States</country>
            <director>Woody Allen</director>
            <producer>Robert Greenhut</producer>
        </details>
    </video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten">
        <details id="V4">
            <director>Julien Temple</director>
        </details>
    </video>
    <book id="B1" type="" title="The Flowers of Evil">
        <details id="B1">
            <country>France</country>
            <editor>Larousse</editor>
            <author>Charles Baudelaire</author>
        </details>
    </book>
    <book id="B2" type="IT" date="2001" title="Learning XML">
        <details id="B2">
            <country>United States</country>
            <editor>O'Reilly</editor>
        </details>
    </book>
</library>
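The join above can be read as "for each library item, look up the details entry with the same @id and append it". Here is a minimal Python sketch of that reading, on shortened inputs; append_details() is a hypothetical name and the code only mirrors the result, not how Lavoisier streams the data.

```python
import xml.etree.ElementTree as ET

# Shortened copies of the two documents being joined.
LIBRARY = """
<library>
    <video id="V1" title="Freddy vs. Jason"/>
    <video id="V2" title="Bride of Chucky"/>
    <book id="B1" title="The Flowers of Evil"/>
</library>
"""

DETAILS = """
<library_details>
    <details id="V1"><director>Ronny Yu</director></details>
    <details id="B1"><author>Charles Baudelaire</author></details>
</library_details>
"""


def append_details(library, details_doc):
    """Hypothetical helper: append to each item the <details> sharing its id."""
    for item in library:
        # Counterpart of details[@id=parent_match()/@id] in the example above.
        for details in details_doc.findall(f"details[@id='{item.get('id')}']"):
            item.append(details)
    return library


library = append_details(ET.fromstring(LIBRARY), ET.fromstring(DETAILS))
print(ET.tostring(library, encoding="unicode"))
```

Items without a counterpart in the details document (V2 here) are left untouched, as in the output above.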

Replacing nodes with short notation

Notation

Short notation: <replace>
Standard notation: ReplaceProcessor

Example : Set the empty attributes @type to 'not available'.

        <processors>
            <replace match="/library/*/@type[.='']" nodes="'not available'"></replace>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="not available" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    ...
    ...
    <book id="B1" type="not available" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>
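As a point of comparison, the same replacement written as a Python sketch: every item whose type attribute is present but empty gets the new value. The helper name replace_empty_type() is hypothetical and the input is a shortened library document.

```python
import xml.etree.ElementTree as ET

LIBRARY = """
<library>
    <video id="V1" type="movie" title="Freddy vs. Jason"/>
    <video id="V2" type="" title="Bride of Chucky"/>
    <book id="B1" type="" title="The Flowers of Evil"/>
    <book id="B2" type="IT" title="Learning XML"/>
</library>
"""


def replace_empty_type(root, replacement="not available"):
    """Hypothetical helper mirroring match="/library/*/@type[.='']"."""
    for item in root:
        # Only attributes that exist and are empty are rewritten.
        if item.get("type") == "":
            item.set("type", replacement)
    return root


root = replace_empty_type(ET.fromstring(LIBRARY))
types = {item.get("id"): item.get("type") for item in root}
print(types)
```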

Removing nodes with short notation

Notation

Short notation: <remove>
Standard notation: RemoveProcessor

Example : two in one, see the comments

        <processors>
            <!-- Example 1 : remove all items where attribute @genre is not present -->
            <remove match="/library/*[not(@genre)]"></remove>
            <!-- Example 2 : remove element <details> but not its content -->
            <remove match="/library/*/details" depth="1"></remove>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <country>United States</country>
        <director>Tom Holland</director>
        <producer>David Kirschner</producer>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
</library>
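The two removals behave differently: the first drops whole subtrees, the second (depth="1") drops only the <details> element and keeps its children in place. A Python sketch of both, under the assumption that this is an accurate reading of the output above; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

LIBRARY = """
<library>
    <video id="V1" genre="horror" title="Freddy vs. Jason"/>
    <video id="V2" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
        </details>
    </video>
    <book id="B1" title="The Flowers of Evil"/>
</library>
"""


def remove_without_genre(root):
    """Drop every child of root (with its whole subtree) lacking a genre attribute."""
    for item in list(root):
        if item.get("genre") is None:
            root.remove(item)


def unwrap_children(parent, tag):
    """Remove child elements named `tag` but keep their own children in place."""
    kept = []
    for child in parent:
        if child.tag == tag:
            kept.extend(child)  # promote the grandchildren
        else:
            kept.append(child)
    parent[:] = kept


root = ET.fromstring(LIBRARY)
remove_without_genre(root)
for item in root:
    unwrap_children(item, "details")

remaining_ids = [item.get("id") for item in root]
v2_children = [child.tag for child in root.find("video[@id='V2']")]
print(remaining_ids, v2_children)
```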

Merging nodes with short notation

Notation

Short notation: <merge>
Standard notation: MergeProcessor

Example 1 : merge the <country> text node into its parent <country> element.

        <processors>
            <merge match="/library/video/details/country"></merge>
        </processors>

Output:

<library>
    ...
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country country="United States"></country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    ...
    ...
    ...
    ...
</library>

Example 2 : merge the <details> element with its children

        <processors>
            <!-- first merge ALL text nodes of children of <details> node, then
                 move them to attributes : ex <producer producer="David Kirschner"/>-->
            <merge match="/library/video/details/*"></merge>
            <!-- Then merge <details> node with children -->
            <merge match="/library/video/details" count="4"></merge>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details country="United States" director="Tom Holland" producer="David Kirschner"></details>
    </video>
    ...
    ...
    ...
    ...
</library>
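The end result of Example 2 (child elements folded into attributes of <details>) can be reproduced with the short Python sketch below. It mirrors the output only: fold_children_into_attributes() is a hypothetical helper, it performs both merge steps at once, and it does not model the count parameter.

```python
import xml.etree.ElementTree as ET

VIDEO = """
<video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
    <details>
        <country>United States</country>
        <director>Tom Holland</director>
        <producer>David Kirschner</producer>
    </details>
</video>
"""


def fold_children_into_attributes(element):
    """Turn each <name>text</name> child into a name="text" attribute of element."""
    for child in list(element):
        element.set(child.tag, (child.text or "").strip())
        element.remove(child)
    element.text = None  # drop the leftover indentation whitespace
    return element


video = ET.fromstring(VIDEO)
details = fold_children_into_attributes(video.find("details"))
print(ET.tostring(details, encoding="unicode").strip())
```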

Aggregating nodes with short notation

Notation

Short notation: <aggregate>
Standard notation: AppendAggregateProcessor

Example : Count the number of items with a date in the 20th century

        <processors>
            <aggregate match="/library" function="count">
                <to-last-child name="XXst_century" values="*[substring(@date, 1, 2)='19']"></to-last-child>
            </aggregate>
        </processors>

Output:

<library>
    ...
    ...
    ...
    ...
    ...
    ...
    <XXst_century>2.0</XXst_century>
</library>

Notice: the result is placed at the end of the context; this is a constraint of streaming processing.

An aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning or measurement such as a set, a bag or a list (src: Wikipedia). The aggregate processor can be used with the following functions: avg, count, max, min, sum.
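The count of the example above can be checked by hand with a few lines of Python. This is only a sketch of the arithmetic: the element name XXst_century and the floating-point formatting are copied from the example output, everything else is illustrative.

```python
import xml.etree.ElementTree as ET

# Library items reduced to the attributes that matter here.
LIBRARY = """
<library>
    <video id="V1" date="2003"/>
    <video id="V2" date="1998"/>
    <video id="V3" date="1985"/>
    <video id="V4" date="2007"/>
    <book id="B1"/>
    <book id="B2" date="2001"/>
</library>
"""

root = ET.fromstring(LIBRARY)

# Counterpart of values="*[substring(@date, 1, 2)='19']" with function="count".
count = sum(1 for item in root if (item.get("date") or "").startswith("19"))

# The result is appended as the last child, after all items have been seen.
result = ET.SubElement(root, "XXst_century")
result.text = str(float(count))

print(root[-1].tag, root[-1].text)  # XXst_century 2.0
```

Because every item has to be read before the count is known, the result can only be written once the end of the matched context is reached.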

Moving nodes with short notation

Notation

Short notation: <move>
Standard notation: MoveProcessor

Example 1.1 : move all <video> elements after <book> B1

        <processors>
            <move match="/library/video">
                <to-following-sibling name="book" predicate="@id='B1'" as="following_sibling"></to-following-sibling>
            </move>
        </processors>

Output:

<library>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>

Example 1.2 : move all <video> elements to the end of <library>

        <processors>
            <move match="/library/video">
                <to-ancestor name="library" as="last_child"></to-ancestor>
            </move>
        </processors>

Output:

<library>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
</library>
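Moving nodes to the end of their ancestor preserves their relative order. A small Python sketch of Example 1.2 follows; move_to_end() is a hypothetical helper, and the input deliberately interleaves videos and books to make the reordering visible.

```python
import xml.etree.ElementTree as ET

# Illustrative input: videos and books are interleaved on purpose.
LIBRARY = """
<library>
    <video id="V1"/>
    <book id="B1"/>
    <video id="V2"/>
    <book id="B2"/>
</library>
"""


def move_to_end(parent, tag):
    """Detach every child named `tag` and re-append them, in order, as last children."""
    moved = [child for child in parent if child.tag == tag]
    for child in moved:
        parent.remove(child)
    parent.extend(moved)


root = ET.fromstring(LIBRARY)
move_to_end(root, "video")
order = [child.get("id") for child in root]
print(order)  # ['B1', 'B2', 'V1', 'V2']
```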

Example 2 : create a <books> element as last child of <library> and move all <book> elements inside it

        <processors>
            <insert match="/library" as="last_child" nodes="new_element('books')"></insert>
            <move match="/library/book">
                <!-- REM : as="last_child" is the default, you can also write : <to-following-sibling name="books" as="last_child"/> -->
                <to-following-sibling name="books"></to-following-sibling>
            </move>
        </processors>

Output:

<library>
    ...
    ...
    ...
    ...
    <books>
        <book id="B1" type="" title="The Flowers of Evil"></book>
        <book id="B2" type="IT" date="2001" title="Learning XML"></book>
    </books>
</library>

Notice: if the <book> elements are consecutive, prefer the <insert-parent> processor over <move>, as described in Example 3 of the <insert> documentation; it is more efficient.

Example 3 : move <details> children to <video> ancestor

        <processors>
            <move match="/library/video/details/*">
                <to-ancestor name="video"></to-ancestor>
            </move>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
        </details>
        <country>United States</country>
        <director>Tom Holland</director>
        <producer>David Kirschner</producer>
    </video>
    ...
    ...
    ...
    ...
</library>

Joining data views with short notation

Problem statement

The data views to be joined in the examples below are defined as follows.

View VA:

<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><RATE>r1</RATE></row>
    <row><HOST>h2</HOST><PATH>p2</PATH><RATE>r2</RATE></row>
</result>

View VB:

<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1</NB></row>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1_bis</NB></row>
    <row><HOST>h3</HOST><PATH>p3</PATH><NB>n3</NB></row>
</result>

All the solutions proposed in this section generate the same output data:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <row>
        <HOST>h1</HOST>
        <PATH>p1</PATH>
        <NB>n1</NB>
        <RATE>r1</RATE>
    </row>
    <row>
        <HOST>h1</HOST>
        <PATH>p1</PATH>
        <NB>n1_bis</NB>
        <RATE>r1</RATE>
    </row>
</e:entries>

However, the efficiency of these solutions may differ dramatically depending on the characteristics (i.e. size, number of selected nodes, availability of an index) of views VA and VB.

View VB post-filtered

The first solution retrieves the full data from view VB and then filters it for each row of view VA.

        <processors>
            <insert match="/result/row[*]">
                <connector type="XMLConnector">
                    <parameter name="content" eval="view('VB')"></parameter>
                </connector>
                <processors>
                    <select match="/result/row[HOST=parent_match()/HOST and PATH=parent_match()/PATH]"></select>
                    <insert match="/e:entries/row" nodes="parent_match()/RATE"></insert>
                </processors>
            </insert>
            <select match="/result/row/e:entries/row"></select>
        </processors>

It is the simplest solution, but also one of the least efficient with respect to I/O and CPU usage, because the full stream from view VB is browsed for each row of view VA.

View VB pre-filtered

The second solution retrieves data already filtered from view VB.

        <processors>
            <insert match="/result/row[*]">
                <connector type="XMLConnector">
                    <parameter name="content" eval="view(concat('VB/result/row[HOST=',quot(parent_match()/HOST),' and PATH=',quot(parent_match()/PATH),']'))"></parameter>
                </connector>
                <processors>
                    <insert match="/e:entries/row" nodes="parent_match()/RATE"></insert>
                </processors>
            </insert>
            <select match="/result/row/e:entries/row"></select>
        </processors>

The efficiency is the same as for the first solution if the filtering is done by browsing the full data, as in this example, but it can be significantly improved if the filtering relies on an index managed by a cache or a connector plug-in.

View VB built into memory for each row of view VA

The third solution builds an in-memory data structure of view VB for each row of view VA.

        <processors>
            <insert match="/result/row[*]">
                <connector type="XMLConnector">
                    <parameter name="content" eval="new_element('e:entries', view('VB')/result/row[HOST=parent_match()/HOST and PATH=parent_match()/PATH])"></parameter>
                </connector>
                <processors>
                    <insert match="/e:entries/row" nodes="parent_match()/RATE"></insert>
                </processors>
            </insert>
            <select match="/result/row/e:entries/row"></select>
        </processors>

This solution generates the expected result, but it is the least efficient solution with respect to I/O, CPU and memory usage. Using this solution is not recommended!

View VB built into memory

The fourth solution also builds an in-memory data structure of the full content of view VB, but it does so only once per execution.

        <processors>
            <insert match="/result/row[*]" nodes="view('VB')/result/row[HOST=match()/HOST and PATH=match()/PATH]"></insert>
            <merge match="/result/row/*[not(self::row)]"></merge>
            <merge match="/result/row" count="3"></merge>
            <insert match="/result/row/row" nodes="new_element('RATE', new_text(parent::row/@RATE))"></insert>
            <select match="/result/row/row"></select>
        </processors>

This solution is as inefficient as solution 3 (or almost any other XML-based tool) with respect to memory usage.

However, with respect to I/O and CPU usage, it is far more efficient than solutions 1 and 3 because it filters data without browsing it. It may also be more efficient than solution 2 if view VA contains a lot of rows.
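The trade-off between browsing VB for every row of VA and loading VB once can be sketched in plain Python. This is an analogy, not Lavoisier code: join_by_scanning() stands for the per-row browsing of solution 1, and join_with_index() assumes, for illustration only, that the in-memory structure of solution 4 behaves like a lookup table keyed on (HOST, PATH).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

VA = """
<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><RATE>r1</RATE></row>
    <row><HOST>h2</HOST><PATH>p2</PATH><RATE>r2</RATE></row>
</result>
"""

VB = """
<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1</NB></row>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1_bis</NB></row>
    <row><HOST>h3</HOST><PATH>p3</PATH><NB>n3</NB></row>
</result>
"""


def rows(xml_text):
    """Parse a view into a list of {field name: text} dictionaries."""
    return [{field.tag: field.text for field in row} for row in ET.fromstring(xml_text)]


def join_by_scanning(va, vb):
    """Browse all of VB once per row of VA: cost grows with len(va) * len(vb)."""
    joined = []
    for a in va:
        for b in vb:
            if (a["HOST"], a["PATH"]) == (b["HOST"], b["PATH"]):
                joined.append({**b, "RATE": a["RATE"]})
    return joined


def join_with_index(va, vb):
    """Load VB once into a lookup table, then probe it for each row of VA."""
    index = defaultdict(list)
    for b in vb:
        index[(b["HOST"], b["PATH"])].append(b)
    return [
        {**b, "RATE": a["RATE"]}
        for a in va
        for b in index.get((a["HOST"], a["PATH"]), [])
    ]


va, vb = rows(VA), rows(VB)
scanned = join_by_scanning(va, vb)
indexed = join_with_index(va, vb)
print(scanned == indexed, indexed)
```

Both functions return the two joined h1/p1 rows shown at the top of this section; the indexed version reads VB only once, at the price of holding it entirely in memory.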