Lavoisier 2 - Processors

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>

You can test all the examples by copying the views provided in the file attached at the end of this document into your etc/app/views/*.xml file, then putting the code of each example in the <processor> placeholder.

Selecting nodes

Processors

SelectProcessor / SelectGroupByProcessor / SelectDistinctProcessor

Example 1 : select <video> "Bride of Chucky"

Using the select processor is like applying a filter : only the data matched by the XPath expression is selected.

        <processors>
            <processor match="/library/video[@title='Bride of Chucky']" type="SelectProcessor"/>
        </processors>

Output:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
</e:entries>

Notice: the output is automatically wrapped in an entries tag so that the XML stays well-formed when the result is made of several nodes. You can avoid this behaviour by setting the parameter 'single_node' to 'true' :

        <processor match="/library/video[@title='Bride of Chucky']" type="SelectProcessor">
            <parameter name="single_node">true</parameter>
        </processor>
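
To preview which nodes a match expression targets before wiring it into a processor, you can evaluate it outside Lavoisier. The sketch below is only an illustration using Python's standard xml.etree.ElementTree module, which supports a subset of XPath; it is not part of Lavoisier, and the trimmed LIBRARY string is a shortened copy of the sample document above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample <library> document (assumption: trimmed for brevity).
LIBRARY = """
<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"/>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details><country>United States</country></details>
    </video>
    <book id="B1" type="" title="The Flowers of Evil"/>
</library>
"""

root = ET.fromstring(LIBRARY)  # root is the <library> element

# ElementTree paths are relative to the root element, so the expression
# /library/video[@title='Bride of Chucky'] is written as:
selected = root.findall("./video[@title='Bride of Chucky']")

selected_ids = [video.get("id") for video in selected]
print(selected_ids)
```

Only the V2 element matches, together with its <details> subtree, which is what the select processor keeps.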

Example 2 : select all videos, keeping only the first level

        <processors>
            <processor match="/library/video" type="SelectProcessor">
                <parameter name="depth">1</parameter>
            </processor>
        </processors>

Output:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky"></video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
</e:entries>

Example 3 : get the list of @genre values (without duplicates) sorted in descending order

        <processors>
            <processor match="/library/*/@genre" type="SelectDistinctProcessor">
                <parameter name="sort_type">text</parameter>
                <parameter name="sort_descending">true</parameter>
            </processor>
        </processors>

Output:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <e:entry key="music"></e:entry>
    <e:entry key="horror"></e:entry>
    <e:entry key="comedy"></e:entry>
</e:entries>
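
As a cross-check of the expected values, the same "distinct, then sort descending as text" logic can be sketched in plain Python. This is an illustration of the result only, not of how SelectDistinctProcessor is implemented; the LIBRARY string keeps just the attributes needed here.

```python
import xml.etree.ElementTree as ET

# Sample library reduced to the id and genre attributes (assumption: trimmed copy).
LIBRARY = """
<library>
    <video id="V1" genre="horror"/>
    <video id="V2" genre="horror"/>
    <video id="V3" genre="comedy"/>
    <video id="V4" genre="music"/>
    <book id="B1"/>
    <book id="B2"/>
</library>
"""

root = ET.fromstring(LIBRARY)

# Equivalent of /library/*/@genre : the genre attribute of every child that has one.
genres = [item.get("genre") for item in root if item.get("genre") is not None]

# Remove duplicates, then sort as text in descending order.
distinct_desc = sorted(set(genres), reverse=True)
print(distinct_desc)  # ['music', 'horror', 'comedy']
```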

Trick to remove prefix

You can add a ChangeNamespaceProcessor just after the SelectDistinctProcessor call to strip the namespace prefix from your output :

        <processors>
            <processor match="/library/*/@genre" type="SelectDistinctProcessor">
                <parameter name="sort_type">text</parameter>
                <parameter name="sort_descending">true</parameter>
            </processor>
            <processor type="ChangeNamespaceProcessor" match="//*"/>
        </processors>

Output:

<entries>
    <entry key="music"></entry>
    <entry key="horror"></entry>
    <entry key="comedy"></entry>
</entries>

Inserting nodes

Processors

InsertProcessor / InsertParentProcessor

Example 1 : insert 'author' as an attribute of the book 'Learning XML'

        <processors>
            <processor match="/library/book[@title='Learning XML']" type="InsertProcessor">
                <parameter name="nodes" eval="new_attribute('author', 'Erik T. Ray')"/>
            </processor>
        </processors>

Output:

<library>
    ...
    ...
    ...
    ...
    ...
    <book id="B2" type="IT" date="2001" title="Learning XML" author="Erik T. Ray"></book>
</library>

Example 2 : insert a short description as first child element for items which have a @genre attribute

        <processors>
            <processor match="/library/*[@genre]" type="InsertProcessor">
                <parameter name="destination_as">first_child</parameter>
                <parameter name="nodes" eval="new_element('description', concat(@type, ' ', @genre))"/>
            </processor>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason">
        <description>movie horror</description>
    </video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <description> horror</description>
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo">
        <description>movie comedy</description>
    </video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten">
        <description>documentary music</description>
    </video>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>
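
Note the leading space in the description of V2 : its @type is empty, so concat(@type, ' ', @genre) yields ' horror'. The books are untouched because they have no @genre. The Python sketch below reproduces this behaviour with the standard library only; it is an illustration under the assumption that a plain string concatenation behaves like the XPath concat() used above.

```python
import xml.etree.ElementTree as ET

# Sample library reduced to the attributes used by the example (assumption: trimmed copy).
LIBRARY = """
<library>
    <video id="V1" type="movie" genre="horror"/>
    <video id="V2" type="" genre="horror"><details/></video>
    <video id="V3" type="movie" genre="comedy"/>
    <video id="V4" type="documentary" genre="music"/>
    <book id="B1" type=""/>
    <book id="B2" type="IT"/>
</library>
"""

root = ET.fromstring(LIBRARY)

# Equivalent of match="/library/*[@genre]" with destination_as = first_child.
for item in root.findall("./*[@genre]"):
    description = ET.Element("description")
    description.text = item.get("type", "") + " " + item.get("genre")
    item.insert(0, description)

descriptions = {item.get("id"): item[0].text for item in root.findall("./*[@genre]")}
print(descriptions)
```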

Example 3 : insert <book> elements in a <books> parent as last child of <library>

        <processors>
            <processor match="/library/book" type="InsertParentProcessor">
                <parameter name="node_name">books</parameter>
            </processor>
        </processors>

Output:

<library>
    ...
    ...
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
    <books>
        <book id="B1" type="" title="The Flowers of Evil"></book>
        <book id="B2" type="IT" date="2001" title="Learning XML"></book>
    </books>
</library>

Notice: this groups all consecutive <book> elements together under a single <books> parent.

Example 4 : joining 2 data views

The InsertProcessor also lets you join two XML documents. Only basic usage is covered here; please read the detailed documentation for more : Join 2 data views

Below is a document containing the details of each item of the library. Imagine that this document is available in Lavoisier through a view called 'library_details'.

<library_details>
    <details id="V1">
        <country>United States</country>
        <director>Ronny Yu</director>
        <producer>Sean S. Cunningham, Douglas Curtis, Stokely Chaffin, Robert Shaye et Renee Witt</producer>
    </details>
    <details id="V3">
        <country>United States</country>
        <director>Woody Allen</director>
        <producer>Robert Greenhut</producer>
    </details>
    <details id="V4">
        <director>Julien Temple</director>
    </details>
    <details id="B1">
        <country>France</country>
        <editor>Larousse</editor>
        <author>Charles Baudelaire</author>
    </details>
    <details id="B2">
        <country>United States</country>
        <editor>O'Reilly</editor>
    </details>
</library_details>

This example appends the details found in the 'library_details' view to the matching library items :

        <processors>
            <processor match="/library/*" type="InsertProcessor">
                <parameter name="nodes" eval="view_path_post('insert_join_0', choose(path(false()), path(false()), ''), match(), arguments())"/>
            </processor>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason">
        <details id="V1">
            <country>United States</country>
            <director>Ronny Yu</director>
            <producer>Sean S. Cunningham, Douglas Curtis, Stokely Chaffin, Robert Shaye et Renee Witt</producer>
        </details>
    </video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo">
        <details id="V3">
            <country>United States</country>
            <director>Woody Allen</director>
            <producer>Robert Greenhut</producer>
        </details>
    </video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten">
        <details id="V4">
            <director>Julien Temple</director>
        </details>
    </video>
    <book id="B1" type="" title="The Flowers of Evil">
        <details id="B1">
            <country>France</country>
            <editor>Larousse</editor>
            <author>Charles Baudelaire</author>
        </details>
    </book>
    <book id="B2" type="IT" date="2001" title="Learning XML">
        <details id="B2">
            <country>United States</country>
            <editor>O'Reilly</editor>
        </details>
    </book>
</library>

Replacing nodes

Processors

ReplaceProcessor

Example : Set the empty attributes @type to 'not available'.

        <processors>
            <processor match="/library/*/@type[.='']" type="ReplaceProcessor">
                <parameter name="nodes" eval="'not available'"/>
            </processor>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="not available" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    ...
    ...
    <book id="B1" type="not available" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>
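
The predicate [.=''] only matches @type attributes that exist and are empty; it does not match items that have no @type at all. A minimal Python sketch of the same rule, given purely as an illustration and not as Lavoisier code :

```python
import xml.etree.ElementTree as ET

# Sample library reduced to the id and type attributes (assumption: trimmed copy).
LIBRARY = """
<library>
    <video id="V1" type="movie"/>
    <video id="V2" type=""/>
    <video id="V3" type="movie"/>
    <video id="V4" type="documentary"/>
    <book id="B1" type=""/>
    <book id="B2" type="IT"/>
</library>
"""

root = ET.fromstring(LIBRARY)

# Equivalent of match="/library/*/@type[.='']" : present but empty attributes only.
for item in root:
    if item.get("type") == "":
        item.set("type", "not available")

types = {item.get("id"): item.get("type") for item in root}
print(types)
```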

Removing nodes

Processors

RemoveProcessor

Example : two removals in one, see the comments

        <processors>
            <!-- Example 1 : remove all items where attribute @genre is not present -->
            <processor match="/library/*[not(@genre)]" type="RemoveProcessor"/>
            <!-- Example 2 : remove element <details> but not its content -->
            <processor match="/library/*/details" type="RemoveProcessor">
                <parameter name="depth">1</parameter>
            </processor>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <country>United States</country>
        <director>Tom Holland</director>
        <producer>David Kirschner</producer>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
</library>
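
The second removal, with depth set to 1, behaves like an "unwrap" : the <details> element disappears and its children take its place. The following Python sketch emulates both removals on a shortened copy of the sample document; it illustrates the expected result and makes no claim about how RemoveProcessor works internally.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample library (assumption: trimmed attributes).
LIBRARY = """
<library>
    <video id="V1" genre="horror"/>
    <video id="V2" genre="horror">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" genre="comedy"/>
    <video id="V4" genre="music"/>
    <book id="B1"/>
    <book id="B2"/>
</library>
"""

root = ET.fromstring(LIBRARY)

# Removal 1 : drop every child of <library> that has no genre attribute.
for item in list(root):
    if item.get("genre") is None:
        root.remove(item)

# Removal 2 : remove <details> itself but keep its children at the same position.
for item in root:
    for details in item.findall("details"):
        position = list(item).index(details)
        item.remove(details)
        for offset, child in enumerate(list(details)):
            item.insert(position + offset, child)

remaining_ids = [item.get("id") for item in root]
v2_children = [child.tag for child in root.find("./video[@id='V2']")]
print(remaining_ids, v2_children)
```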

Merging nodes

Processors

MergeProcessor

Example 1 : merge the <country> text node into its parent <country> element, as an attribute

        <processors>
            <processor match="/library/video/details/country" type="MergeProcessor"/>
        </processors>

Output:

<library>
    ...
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country country="United States"></country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    ...
    ...
    ...
    ...
</library>

Example 2 : merge the <details> element with its children

        <processors>
            <!-- First merge the text node of every child of <details> into an
                 attribute of that child, e.g. <producer producer="David Kirschner"/> -->
            <processor match="/library/video/details/*" type="MergeProcessor"/>
            <!-- Then merge the <details> node with its children -->
            <processor match="/library/video/details" type="MultiMergeProcessor">
                <parameter name="count">4</parameter>
            </processor>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details country="United States" director="Tom Holland" producer="David Kirschner"></details>
    </video>
    ...
    ...
    ...
    ...
</library>
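
The two steps of Example 2 can be followed more easily when written out as ordinary tree manipulation. The Python sketch below is an emulation of the documented result under that reading (text node to attribute, then child attributes folded into the parent); it does not model the 'count' parameter and is not Lavoisier's implementation.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample library (assumption: only V1 and V2 are kept).
LIBRARY = """
<library>
    <video id="V1"/>
    <video id="V2">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
</library>
"""

root = ET.fromstring(LIBRARY)

for details in root.findall("./video/details"):
    # Step 1 : each child's text node becomes an attribute named after the child,
    # e.g. <country>United States</country> -> <country country="United States"/>.
    for child in details:
        child.set(child.tag, child.text or "")
        child.text = None
    # Step 2 : fold the children's attributes into <details> and drop the children.
    for child in list(details):
        details.attrib.update(child.attrib)
        details.remove(child)

merged = root.find("./video/details")
details_attributes = dict(merged.attrib)
print(details_attributes)
```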

Aggregating nodes

Processors

AppendAggregateProcessor

Example : count the number of items with a date in the 20th century

        <processors>
            <processor match="/library" type="AppendAggregateProcessor">
                <parameter name="function">count</parameter>
                <parameter name="node_name">XXst_century</parameter>
                <parameter name="node_values" eval="*[substring(@date, 1, 2)='19']"/>
            </processor>
        </processors>

Output:

<library>
    ...
    ...
    ...
    ...
    ...
    ...
    <XXst_century>2.0</XXst_century>
</library>

Notice: the result is appended at the end of the context; this is a constraint of streaming processing.

An aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning or measurement such as a set, a bag or a list (source: Wikipedia). The aggregate processor can be used with the following functions : avg, count, max, min, sum
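
To make the five functions concrete, here is a small Python sketch that applies them to the @date values of the sample library. It is an illustration only : the mapping of function names to Python callables and the conversion of every result to a float (mirroring the 2.0 shown in the output above) are assumptions, not a description of AppendAggregateProcessor internals.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Sample library reduced to the id and date attributes (assumption: trimmed copy).
LIBRARY = """
<library>
    <video id="V1" date="2003"/>
    <video id="V2" date="1998"/>
    <video id="V3" date="1985"/>
    <video id="V4" date="2007"/>
    <book id="B1"/>
    <book id="B2" date="2001"/>
</library>
"""

# Hypothetical mapping from the documented function names to Python callables.
AGGREGATES = {"avg": mean, "count": len, "max": max, "min": min, "sum": sum}

root = ET.fromstring(LIBRARY)

# Equivalent of *[substring(@date, 1, 2)='19'] : items whose date starts with "19".
twentieth_century = [item for item in root if item.get("date", "").startswith("19")]
count_result = float(AGGREGATES["count"](twentieth_century))

# The other functions, applied to every available date.
dates = [float(item.get("date")) for item in root if item.get("date")]
results = {name: float(function(dates)) for name, function in AGGREGATES.items()}

print(count_result)
print(results)
```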

Moving nodes

Processors

MoveProcessor

Example 1.1 : move all <video> elements after <book> B1

        <processors>
            <processor match="/library/video" type="MoveProcessor">
                <parameter name="destination_axis">following-sibling</parameter>
                <parameter name="destination_name">book</parameter>
                <parameter name="destination_predicate" eval="@id='B1'"/>
                <parameter name="destination_as">following_sibling</parameter>
            </processor>
        </processors>

Output:

<library>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
</library>

Example 1.2 : move all <video> elements to the end of the library

        <processors>
            <processor match="/library/video" type="MoveProcessor">
                <parameter name="destination_axis">ancestor</parameter>
                <parameter name="destination_name">library</parameter>
                <parameter name="destination_as">last_child</parameter>
            </processor>
        </processors>

Output:

<library>
    <book id="B1" type="" title="The Flowers of Evil"></book>
    <book id="B2" type="IT" date="2001" title="Learning XML"></book>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
            <country>United States</country>
            <director>Tom Holland</director>
            <producer>David Kirschner</producer>
        </details>
    </video>
    <video id="V3" type="movie" date="1985" genre="comedy" title="The Purple Rose of Cairo"></video>
    <video id="V4" type="documentary" date="2007" genre="music" title="Joe Strummer: The Future Is Unwritten"></video>
</library>
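
The resulting order can be checked with a few lines of Python. This sketch simply detaches each <video> and re-appends it as last child of <library>; it illustrates the outcome of Example 1.2 and is not meant to describe how MoveProcessor operates on a stream.

```python
import xml.etree.ElementTree as ET

# Sample library reduced to element names and ids (assumption: trimmed copy).
LIBRARY = """
<library>
    <video id="V1"/>
    <video id="V2"/>
    <video id="V3"/>
    <video id="V4"/>
    <book id="B1"/>
    <book id="B2"/>
</library>
"""

root = ET.fromstring(LIBRARY)

# Equivalent of destination_as = last_child on the <library> ancestor.
for video in root.findall("video"):
    root.remove(video)
    root.append(video)

order = [item.get("id") for item in root]
print(order)  # ['B1', 'B2', 'V1', 'V2', 'V3', 'V4']
```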

Example 2 : create a <books> element as last child of <library> and move all <book> elements inside it

        <processors>
            <processor match="/library" type="InsertProcessor">
                <parameter name="destination_as">last_child</parameter>
                <parameter name="nodes" eval="new_element('books')"/>
            </processor>
            <processor match="/library/book" type="MoveProcessor">
                <parameter name="destination_axis">following-sibling</parameter>
                <parameter name="destination_name">books</parameter>
            </processor>
        </processors>

Output:

<library>
    ...
    ...
    ...
    ...
    <books>
        <book id="B1" type="" title="The Flowers of Evil"></book>
        <book id="B2" type="IT" date="2001" title="Learning XML"></book>
    </books>
</library>

Notice: if the <book> elements are consecutive, prefer the InsertParentProcessor to the MoveProcessor, as described in Example 3 of the 'Inserting nodes' section. It is more efficient.

Example 3 : move <details> children to <video> ancestor

        <processors>
            <processor match="/library/video/details/*" type="MoveProcessor">
                <parameter name="destination_axis">ancestor</parameter>
                <parameter name="destination_name">video</parameter>
            </processor>
        </processors>

Output:

<library>
    <video id="V1" type="movie" date="2003" genre="horror" title="Freddy vs. Jason"></video>
    <video id="V2" type="" date="1998" genre="horror" title="Bride of Chucky">
        <details>
        </details>
        <country>United States</country>
        <director>Tom Holland</director>
        <producer>David Kirschner</producer>
    </video>
    ...
    ...
    ...
    ...
</library>

Joining data views

Problem statement

The data views to be joined in the examples below are defined as follows.

View VA:

<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><RATE>r1</RATE></row>
    <row><HOST>h2</HOST><PATH>p2</PATH><RATE>r2</RATE></row>
</result>

View VB:

<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1</NB></row>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1_bis</NB></row>
    <row><HOST>h3</HOST><PATH>p3</PATH><NB>n3</NB></row>
</result>

All the solutions proposed in this section generate the same output data:

<e:entries xmlns:e="http://software.in2p3.fr/lavoisier/entries.xsd">
    <row>
        <HOST>h1</HOST>
        <PATH>p1</PATH>
        <NB>n1</NB>
        <RATE>r1</RATE>
    </row>
    <row>
        <HOST>h1</HOST>
        <PATH>p1</PATH>
        <NB>n1_bis</NB>
        <RATE>r1</RATE>
    </row>
</e:entries>

However, the efficiency of these solutions may differ dramatically depending on the characteristics (i.e. size, number of selected nodes, availability of an index) of views VA and VB.
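
The join itself can be stated independently of any of the solutions : for each row of VA, keep the rows of VB with the same HOST and PATH, and add the RATE of the VA row to them. The Python sketch below spells this out on the two sample views; the function name join_rows is hypothetical and the code only illustrates the expected output, not a Lavoisier mechanism.

```python
import xml.etree.ElementTree as ET

VA = """
<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><RATE>r1</RATE></row>
    <row><HOST>h2</HOST><PATH>p2</PATH><RATE>r2</RATE></row>
</result>
"""

VB = """
<result>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1</NB></row>
    <row><HOST>h1</HOST><PATH>p1</PATH><NB>n1_bis</NB></row>
    <row><HOST>h3</HOST><PATH>p3</PATH><NB>n3</NB></row>
</result>
"""


def join_rows(va_root, vb_root):
    """Join VA and VB rows on (HOST, PATH); hypothetical helper for illustration."""
    joined = []
    for a in va_root.findall("row"):
        for b in vb_root.findall("row"):
            same_host = a.findtext("HOST") == b.findtext("HOST")
            same_path = a.findtext("PATH") == b.findtext("PATH")
            if same_host and same_path:
                row = ET.Element("row")
                for tag in ("HOST", "PATH", "NB"):
                    ET.SubElement(row, tag).text = b.findtext(tag)
                ET.SubElement(row, "RATE").text = a.findtext("RATE")
                joined.append(row)
    return joined


rows = join_rows(ET.fromstring(VA), ET.fromstring(VB))
result = [[(child.tag, child.text) for child in row] for row in rows]
print(result)
```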

View VB post-filtered

The first solution retrieves the full data from view VB and then filters it for each row of view VA.

        <processors>
            <processor match="/result/row[*]" type="InsertProcessor">
                <parameter name="nodes" eval="view_path_post('join_1_0', choose(path(false()), path(false()), ''), match(), arguments())"/>
            </processor>
            <processor match="/result/row/e:entries/row" type="SelectProcessor"/>
        </processors>

It is the simplest solution, but it is also the least efficient one in terms of I/O and CPU usage, because the full stream from view VB is browsed for each row of view VA.

View VB pre-filtered

The second solution retrieves data already filtered from view VB.

        <processors>
            <processor match="/result/row[*]" type="InsertProcessor">
                <parameter name="nodes" eval="view_path_post('join_2_0', choose(path(false()), path(false()), ''), match(), arguments())"/>
            </processor>
            <processor match="/result/row/e:entries/row" type="SelectProcessor"/>
        </processors>

The efficiency is the same as for the first solution if the filtering is done by browsing the full data, as in this example, but it can be significantly improved if the filtering relies on an index managed by a cache or a connector plug-in.

View VB built into memory for each row of view VA

The third solution builds an in-memory data structure of view VB for each row of view VA.

        <processors>
            <processor match="/result/row[*]" type="InsertProcessor">
                <parameter name="nodes" eval="view_path_post('join_3_0', choose(path(false()), path(false()), ''), match(), arguments())"/>
            </processor>
            <processor match="/result/row/e:entries/row" type="SelectProcessor"/>
        </processors>

This solution generates the expected result, but it is the least efficient solution in terms of I/O, CPU and memory usage combined. Using this solution is not recommended!

View VB built into memory

The fourth solution also builds an in-memory data structure of the full content of view VB, but it does so only once per execution.

        <processors>
            <processor match="/result/row[*]" type="InsertProcessor">
                <parameter name="nodes" eval="view('VB')/result/row[HOST=match()/HOST and PATH=match()/PATH]"/>
            </processor>
            <processor match="/result/row/*[not(self::row)]" type="MergeProcessor"/>
            <processor match="/result/row" type="MultiMergeProcessor">
                <parameter name="count">3</parameter>
            </processor>
            <processor match="/result/row/row" type="InsertProcessor">
                <parameter name="nodes" eval="new_element('RATE', new_text(parent::row/@RATE))"/>
            </processor>
            <processor match="/result/row/row" type="SelectProcessor"/>
        </processors>

This solution is as inefficient as solution 3 (or almost any other XML-based tool) in terms of memory usage.

However, in terms of I/O and CPU usage, it is far more efficient than solutions 1 and 3 because it filters data without browsing it. It may also be more efficient than solution 2 if view VA contains a lot of rows.
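
The difference between browsing VB for each row of VA and reading VB once can be made tangible by counting how many VB rows are visited. The Python sketch below does this on the sample data. It is a rough model only : the dictionary index stands in for whatever in-memory structure Lavoisier actually builds, and the function names are hypothetical.

```python
from collections import defaultdict

# Sample views as (HOST, PATH, value) tuples.
VA_ROWS = [("h1", "p1", "r1"), ("h2", "p2", "r2")]
VB_ROWS = [("h1", "p1", "n1"), ("h1", "p1", "n1_bis"), ("h3", "p3", "n3")]


def join_by_scanning(va_rows, vb_rows):
    """Browse all of VB for each row of VA (the spirit of solutions 1 and 3)."""
    visited = 0
    joined = []
    for host, path, rate in va_rows:
        for vb_host, vb_path, nb in vb_rows:
            visited += 1
            if (vb_host, vb_path) == (host, path):
                joined.append((host, path, nb, rate))
    return joined, visited


def join_with_index(va_rows, vb_rows):
    """Read VB once, then look rows up by key (the spirit of solution 4)."""
    index = defaultdict(list)
    visited = 0
    for vb_host, vb_path, nb in vb_rows:
        visited += 1
        index[(vb_host, vb_path)].append(nb)
    joined = [
        (host, path, nb, rate)
        for host, path, rate in va_rows
        for nb in index.get((host, path), [])
    ]
    return joined, visited


scan_result, scan_visits = join_by_scanning(VA_ROWS, VB_ROWS)
index_result, index_visits = join_with_index(VA_ROWS, VB_ROWS)
print(scan_visits, index_visits)  # 6 3
```

Both functions return the same joined rows; the scan visits len(VA) x len(VB) rows of VB while the indexed version visits each VB row once, at the price of holding VB in memory.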