Lavoisier 2 - XPath

XPath

Overview

Setting parameters with XPath
Setting processor's attribute @match
Forcing Lavoisier to record future events ahead

XPath functions

The different contexts for XPath evaluation
Core XPath functions
EXSLT XPath functions
Lavoisier XPath functions
Custom XPath functions

XPath functions specific to Lavoisier

arguments( )
choose( BOOLEAN condition, OBJECT object1, [OBJECT object2] )
document( STRING url )
entries( [OBJECT objects...] )
entry( STRING key, STRING value )
new_attribute( STRING qualified-name, STRING value )
new_comment( OBJECT value )
new_element( STRING qualified-name, [OBJECT objects...] )
new_text( OBJECT value )
eval( STRING xpath, [NODESET nodes] )
info( )
path( [BOOLEAN required] )
property( STRING key )
reverse( NODESET nodes )
quot( STRING value )
apos( STRING value )
post( )
match( )
parent_match( )
string( OBJECT value )
user()
view( STRING view, [NODESET arguments] )
view_post( STRING view, ELEMENT post, [NODESET arguments] )

Recommendations for writing efficient XPath expressions

Introduction
Initial XPath expression
Minimize the size of built data structures
Minimize the number of data structures that need to be built
Remove the need to build any data structure
Fix the depth of selected nodes
Move predicates at the right place
Set the name (and namespace) of selected nodes

Overview

Setting parameters with XPath

Any XPath expression that is evaluated by an adaptor MUST be set in attribute @eval (or @match in the case of processor adaptors) of element <parameter> rather than in the text() node. Otherwise it is treated as a constant value rather than as an XPath expression.

Setting the expression in an attribute enables Lavoisier to hide the complexity induced by the 2 different evaluation contexts (i.e. one for the view invocation + one for each selected XML event). It also enables Lavoisier to optimize the execution of the XPath expression by splitting it into several expressions, which are then distributed between these 2 contexts. In particular, function view() needs to be invoked only once per view invocation, while relative paths must be evaluated for each selected node.

The @match attribute always expects an absolute path, and the default path is /*. The supported values of a parameter depend on the expected type declared for this parameter:

PARAMETER TYPE	absolute XPath parameter		relative XPath parameter		XPath expression parameter		non-XPath parameter	
ALLOWED CONTENT	@match='/...'	@match='eval()'	@eval	text()	@eval	text()	@eval	text()
/absolute	OK	/	/	/	/	/	/	/
./relative	/	/	OK	/	OK	/	/	/
(expr with ./relative)	/	/	/	/	OK	/	/	/
(expr w/o ./relative)	/	OK	/	/	OK	/	OK	/
constant	/	/	/	/	/	OK	/	OK

Setting processor's attribute @match

Attribute @match of <processor> aims at selecting nodes from the input stream of XML events.

In order to save memory and CPU usage, Lavoisier implements its own XPath engine, which is able to evaluate absolute XPath expressions without generating a huge data structure for the most common use cases.

Evaluating XPath on a stream of XML events rather than on a data structure enables processing large amounts of data, but it also implies some constraints.

Forcing Lavoisier to record future events ahead

When you are processing XML events, you cannot see the events that will be processed in the future. As a consequence, accessing a future event requires forcing the Lavoisier XPath engine to read these events before processing the selected node. You can do this by adding any predicate on a future event to the selected node or to one of its ancestors, as shown in the following example, where the predicate [*] on <child> forces its child events to be recorded:

    <view name="xpath_record">
        <connector type="XMLConnector">
            <parameter name="content"><![CDATA[
                <root>
                    <child><leaf id="?"/><id>one</id></child>
                    <child><leaf id="?"/><id>two</id></child>
                </root>
            ]]></parameter>
        </connector>
        <processors>
            <processor match="/root/child[*]/leaf/@id" type="ReplaceProcessor">
                <parameter name="nodes" eval="parent::leaf/following-sibling::id/text()"/>
            </processor>
        </processors>
    </view>

The execution plan for /root/child/leaf/@id shows that no data structure is built:

<XPath xmlns="http://www.w3.org/TR/xpath">
    <element depth="3" localName="leaf">
        <predicate>self::leaf/parent::child/parent::root[not(parent::*)]</predicate>
        <attribute localName="id"></attribute>
    </element>
</XPath>

The execution plan for /root/child[*]/leaf/@id shows that a data structure (<tree>) is built from node <child>:

<XPath xmlns="http://www.w3.org/TR/xpath">
    <tree nodes="self::child[child::*]/child::leaf/attribute::id" depth="2" localName="child">
        <predicate>self::child/parent::root[not(parent::*)]</predicate>
    </tree>
</XPath>
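
The constraint described above can be illustrated outside Lavoisier with a minimal Python sketch based on the standard `xml.sax` streaming parser. It is not the Lavoisier engine: the class `LookAhead` and its fields are hypothetical names chosen for this illustration. The sketch shows that when the `<leaf>` event arrives, the following `<id>` has not been read yet, and that waiting until the enclosing `<child>` is complete makes its value available.

```python
import xml.sax

SAMPLE = b"""<root>
  <child><leaf id="?"/><id>one</id></child>
  <child><leaf id="?"/><id>two</id></child>
</root>"""


class LookAhead(xml.sax.ContentHandler):
    """Hypothetical handler: defers work on <leaf> until <child> is closed."""

    def __init__(self):
        super().__init__()
        self.path = []             # names of the currently open elements
        self.pending_leaf = False  # a <leaf> was seen in the current <child>
        self.text = []             # text chunks of the current <id>
        self.current_id = None     # text of the last <id> read in this <child>
        self.seen_at_leaf = []     # what was known about <id> when <leaf> started
        self.resolved = []         # values available once <child> is complete

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.path == ["root", "child"]:
            self.pending_leaf = False
            self.current_id = None
        elif self.path == ["root", "child", "leaf"]:
            # The following sibling <id> is a future event at this point.
            self.seen_at_leaf.append(self.current_id)
            self.pending_leaf = True
        elif self.path == ["root", "child", "id"]:
            self.text = []

    def characters(self, content):
        if self.path == ["root", "child", "id"]:
            self.text.append(content)

    def endElement(self, name):
        if self.path == ["root", "child", "id"]:
            self.current_id = "".join(self.text)
        elif self.path == ["root", "child"] and self.pending_leaf:
            # All events of <child> have been read: the value is now known.
            self.resolved.append(self.current_id)
        self.path.pop()


handler = LookAhead()
xml.sax.parseString(SAMPLE, handler)
print(handler.seen_at_leaf)
print(handler.resolved)
```

Deferring the work to the end of `<child>` plays the same role here as the `<tree>` built from node `<child>` in the second execution plan: the events of the subtree are kept until all of them have been read.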

XPath functions

The different contexts for XPath evaluation

Although Lavoisier tries to offer a homogeneous configuration language by supporting the same query language (XPath) in all contexts, not every function is available in every context.

ALLOWED FUNCTIONS	@eval	@match	<pre-renderers>	XSLTConnector
Core functions	OK	OK in predicates	OK	OK
EXSLT functions	OK	OK in predicates	subset on server, not available on browser	subset
Lavoisier functions	OK	OK in predicates	/	/
Custom functions	OK	OK in predicates	/	/

Core XPath functions

EXSLT XPath functions

Lavoisier XPath functions

Lavoisier defines a set of XPath functions, which:

link data views together: view(), view_post()
link Lavoisier with the user: user(), path(), post(), arguments()
link Lavoisier with the environment: property(), document()
get data from the input XML stream: match(), parent_match()
set data to the output XML stream: new_element(), new_attribute(), new_text(), new_comment()

See chapter "XPath functions specific to Lavoisier" for more information about these functions.

Custom XPath functions

You can define your own function as an XML view:

    <view name="UTC-now" xmlns:date="http://exslt.org/dates-and-times">
        <argument name="format">yyyy-MM-dd HH:mm:ss z</argument>
        <variable name="decalage" eval="date:format-date(date:date-time(),'X')"/>
        <connector type="StringConnector">
            <parameter name="content" eval="date:format-date(date:add(
                date:date-time(),
                concat(choose(starts-with($decalage,'+'),'-',''),'PT',substring($decalage,2),'H')
            ), $format)"/>
        </connector>
        <serializer type="EncapsulateSerializer"/>
    </view>

This custom function can be invoked:

from another data view with XPath function view():
```
view('UTC-now/HH:mm:ss')
```
from outside with a simple URL: http://localhost:8080/lavoisier/UTC-now/HH:mm:ss?accept=txt
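
The `concat(...)` expression in the view above builds the duration that is added to the local time to obtain UTC. The Python sketch below mirrors that string construction only. The function name `offset_to_duration` is hypothetical, and the sketch assumes that variable `$decalage` (the time zone offset) holds a sign followed by hour digits, such as `+02`; other offset forms are not handled.

```python
def offset_to_duration(offset: str) -> str:
    """Mirror concat(choose(starts-with(o,'+'),'-',''),'PT',substring(o,2),'H').

    Assumption: 'offset' looks like '+02' or '-05' (sign, then hour digits).
    """
    # choose(starts-with(offset, '+'), '-', ''): invert a positive offset.
    sign = "-" if offset.startswith("+") else ""
    # XPath substring(offset, 2) starts at the second character (1-based).
    hours = offset[1:]
    return f"{sign}PT{hours}H"


print(offset_to_duration("+02"))
print(offset_to_duration("-05"))
```

With an offset of `+02` the duration is negative, so two hours are subtracted from the local time. With `-05` no sign is emitted, so five hours are added.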

XPath functions specific to Lavoisier

arguments( ): Returns the list of user arguments, as a list of <entry> elements. For example, the value of argument 'arg' is arguments()[@key='arg']/text().
choose( BOOLEAN condition, OBJECT object1, [OBJECT object2] ): Returns 'object1' if 'condition' is true, else returns 'object2'.
document( STRING url ): Returns the XML document loaded and parsed from 'url'.
entries( [OBJECT objects...] ): Returns the element node <entries> with child elements 'objects'. 'objects' must contain <entry> elements only.
entry( STRING key, STRING value ): Returns the element node <entry key='key'>value</entry>
new_attribute( STRING qualified-name, STRING value ): Returns the attribute node qualified-name="value".
new_comment( OBJECT value ): Returns the comment node <!--value-->.
new_element( STRING qualified-name, [OBJECT objects...] ): Returns the element node <qualified-name> with child nodes 'objects'.
new_text( OBJECT value ): Returns the text node 'value'.
eval( STRING xpath, [NODESET nodes] ): Returns the result of the evaluation of the expression 'xpath' on each node of the list 'nodes'.
info( ): Returns the XML tree extracted from the <info> section of the current data view.
path( [BOOLEAN required] ): Returns the path provided by user. If no path is provided and 'required' is true, then throws exception. If no path is provided and 'required' is false, then returns null.
property( STRING key ): Returns the property 'key' defined in file etc/app/app.properties or in system properties.
reverse( NODESET nodes ): Returns the provided nodes in reverse order
quot( STRING value ): Returns the string 'value' between double-quotes.
apos( STRING value ): Returns the string 'value' between single-quotes.
post( ): Returns the data sent by the user through an HTTP POST request, or sent by the calling data view.
match( ): Returns the current matching element node. Note that its child nodes are not available by default. In order to make them available, you must force Lavoisier to visit them before executing the plugin action, by adding a predicate on a future XML event, for example "[* or not(*)]".
parent_match( ): Returns the current matching element node of the parent plugin. This concerns plugins that invoke other plugins. This is allowed only with the <insert> or <replace> plugins. See chapter "Joining data views with short notation".
string( OBJECT value ): Returns the string converted from the object 'value' (explicit conversion).
user(): Returns the identifier of the authenticated user. This identifier can be the IP address of the user, the Distinguished Name of the user's X509 certificate, the user's login, the result of the CAS authentication or the OAuth2 Access Token.
view( STRING view, [NODESET arguments] ): Returns the XML data generated from the data view 'view' with arguments 'arguments'.
view_post( STRING view, ELEMENT post, [NODESET arguments] ): Returns the XML data generated from the data view 'view' with arguments 'arguments' and data 'post'.
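
The behaviour described for a few of these functions can be modelled with a short Python sketch based on the standard `xml.etree.ElementTree` module. This is an illustration of the descriptions above, not Lavoisier code: the Python functions only borrow the names of the XPath functions, and returning `None` from `choose` when 'object2' is omitted is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def choose(condition, object1, object2=None):
    # Assumption: nothing (None) is returned when object2 is omitted.
    return object1 if condition else object2


def quot(value):
    # The string 'value' between double-quotes.
    return '"' + value + '"'


def apos(value):
    # The string 'value' between single-quotes.
    return "'" + value + "'"


def entry(key, value):
    # <entry key='key'>value</entry>
    element = ET.Element("entry", {"key": key})
    element.text = value
    return element


def entries(*objects):
    # <entries> with the given <entry> elements as children.
    root = ET.Element("entries")
    root.extend(objects)
    return root


args = entries(entry("arg", "42"), entry("lang", "en"))
# Same kind of lookup as arguments()[@key='arg']/text(),
# written in the XPath subset supported by ElementTree.
arg_value = args.find("entry[@key='arg']").text
serialized = ET.tostring(args, encoding="unicode")
print(arg_value)
print(serialized)
```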

Recommendations for writing efficient XPath expressions

Introduction

Since Lavoisier does not (yet) have an optimizer to modify its execution plan, the way you write your XPath expression may have a strong impact on performance. In case of a performance issue, you may have to rewrite your XPath expression to make its evaluation more efficient.

You can reduce the memory usage of your XPath expression by trying to:

build <tree> at nodes that are as deep as possible, or build no <tree> at all.
avoid building <tree> needlessly by splitting predicates.

You can also reduce CPU usage of your XPath expression by trying to:

avoid useless object instantiations by setting fixed depth of all paths.
avoid useless predicate evaluations by providing node namespaces and names when possible.

The example below shows 7 different XPath expressions that give the same result, but they have different execution plans and of course different efficiencies. Fortunately, the simplest XPath is often the most efficient one as well.

Initial XPath expression

The initial non-optimized XPath expression:

//*[local-name()='element']/son[@id='1']/parent::*/@*[starts-with(parent::*/@foo,'bar') and local-name()='attr' and .>3]

The execution plan shows that a data structure is created from root element (depth=1):

<XPath xmlns="http://www.w3.org/TR/xpath">
    <tree nodes="/child::*[local-name() = 'element']/child::son[attribute::id = '1']/parent::*/attribute::*[starts-with(parent::*/attribute::foo,'bar') and local-name() = 'attr' and . > 3.0]" depth="1" localName="*"></tree>
</XPath>

Minimize the size of built data structures

Minimizing the size of built data structures enables reducing memory usage:

/*/*/*[local-name()='element' and son/@id='1']/@*[starts-with(parent::*/@foo,'bar') and local-name()='attr' and .>3]

The execution plan shows that smaller data structures are built (from nodes with depth=3):

<XPath xmlns="http://www.w3.org/TR/xpath">
    <tree nodes="self::*[local-name() = 'element' and child::son/attribute::id = '1']/attribute::*[starts-with(parent::*/attribute::foo,'bar') and local-name() = 'attr' and . > 3.0]" depth="3" localName="*">
        <predicate>self::*/parent::*/parent::*[not(parent::*)]</predicate>
    </tree>
</XPath>

Minimize the number of data structures that need to be built

Minimizing the number of data structures that need to be built enables reducing CPU usage:

//*[local-name()='element'][son/@id='1']/@*[starts-with(parent::*/@foo,'bar') and local-name()='attr' and .>3]

The execution plan shows that the data structure is built only when node name is 'element':

<XPath xmlns="http://www.w3.org/TR/xpath">
    <tree nodes="self::*[local-name() = 'element'][child::son/attribute::id = '1']/attribute::*[starts-with(parent::*/attribute::foo,'bar') and local-name() = 'attr' and . > 3.0]" depth="1" localName="*">
        <predicate>self::*[not(parent::*)]</predicate>
        <predicate>local-name() = 'element'</predicate>
    </tree>
</XPath>

Remove the need to build any data structure

Removing the need to build any data structure enables reducing memory usage. If the predicate [son/@id='1'] is always true, then we can remove it from the XPath expression in order to remove the need to build a data structure:

//*[local-name()='element']/@*[starts-with(parent::*/@foo,'bar') and local-name()='attr' and .>3]

The execution plan shows that no data structure is built anymore:

<XPath xmlns="http://www.w3.org/TR/xpath">
    <element depth="1" localName="*">
        <predicate>self::*[not(parent::*)]</predicate>
        <predicate>local-name() = 'element'</predicate>
        <attribute localName="*">
            <predicate>starts-with(parent::*/attribute::foo,'bar') and local-name() = 'attr' and . > 3.0</predicate>
        </attribute>
    </element>
</XPath>

Fix the depth of selected nodes

Fixing the depth of selected nodes enables reducing CPU usage:

/*/*/*[local-name()='element']/@*[starts-with(parent::*/@foo,'bar') and local-name()='attr' and .>3]

The execution plan shows that the depth of selected nodes is known. Thanks to this information, the engine will avoid useless object instantiations.

<XPath xmlns="http://www.w3.org/TR/xpath">
    <element depth="3" localName="*">
        <predicate>self::*/parent::*/parent::*[not(parent::*)]</predicate>
        <predicate>local-name() = 'element'</predicate>
        <attribute localName="*">
            <predicate>starts-with(parent::*/attribute::foo,'bar') and local-name() = 'attr' and . > 3.0</predicate>
        </attribute>
    </element>
</XPath>

Move predicates at the right place

Moving the predicates at the right place enables reducing CPU usage:

/*/*/*[local-name()='element' and starts-with(@foo,'bar')]/@*[local-name()='attr' and .>3]

The execution plan shows that the engine will not need to evaluate the attribute predicate for elements that do not match starts-with(@foo,'bar'):

<XPath xmlns="http://www.w3.org/TR/xpath">
    <element depth="3" localName="*">
        <predicate>self::*/parent::*/parent::*[not(parent::*)]</predicate>
        <predicate>local-name() = 'element' and starts-with(attribute::foo,'bar')</predicate>
        <attribute localName="*">
            <predicate>local-name() = 'attr' and . > 3.0</predicate>
        </attribute>
    </element>
</XPath>

Set the name (and namespace) of selected nodes

Setting the name (and namespace) of selected nodes enables reducing CPU usage:

/*/*/element[starts-with(@foo,'bar')]/@attr[.>3]

The execution plan shows that the engine will not need to evaluate the predicates of nodes that do not have the expected name.

<XPath xmlns="http://www.w3.org/TR/xpath">
    <element depth="3" localName="element">
        <predicate>self::element/parent::*/parent::*[not(parent::*)]</predicate>
        <predicate>starts-with(attribute::foo,'bar')</predicate>
        <attribute localName="attr">
            <predicate>. > 3.0</predicate>
        </attribute>
    </element>
</XPath>
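
To make the idea behind this last execution plan concrete, here is a minimal Python sketch that evaluates the same selection on a stream of events with the standard `xml.sax` parser, without building any data structure. It is not the Lavoisier engine: the class `AttrSelector`, the counter `predicate_checks` and the sample document are hypothetical and exist only for this illustration.

```python
import xml.sax

SAMPLE = b"""<root>
  <group>
    <element foo="bar1" attr="5"><son id="1"/></element>
    <element foo="baz" attr="9"><son id="1"/></element>
    <element foo="bar2" attr="2"><son id="1"/></element>
    <other foo="bar3" attr="7"/>
  </group>
  <element foo="bar4" attr="8"/>
</root>"""


def greater_than_3(value):
    # Mirrors [. > 3]: a non-numeric value does not match.
    try:
        return float(value) > 3
    except ValueError:
        return False


class AttrSelector(xml.sax.ContentHandler):
    """Streaming sketch of /*/*/element[starts-with(@foo,'bar')]/@attr[.>3]."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.selected = []         # values of the selected @attr nodes
        self.predicate_checks = 0  # how many times a predicate was evaluated

    def startElement(self, name, attrs):
        self.depth += 1
        # Cheap tests first: fixed depth, then element name.
        if self.depth != 3 or name != "element":
            return
        # Predicates are evaluated only for nodes with the expected name.
        self.predicate_checks += 1
        if not attrs.get("foo", "").startswith("bar"):
            return
        value = attrs.get("attr")
        if value is not None and greater_than_3(value):
            self.selected.append(value)

    def endElement(self, name):
        self.depth -= 1


selector = AttrSelector()
xml.sax.parseString(SAMPLE, selector)
print(selector.selected)
print(selector.predicate_checks)
```

In this sketch the depth and name tests reject `<son>`, `<other>` and the `<element>` at depth 2 before any predicate is evaluated, which is the effect the recommendations above aim for.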