XPath
XPath is a language used to query different parts of an XML document. There are several versions of the specification from the World Wide Web Consortium (W3C):
- XPath 1.0 was introduced in 1999 and is still the most widely implemented version.
- XPath 2.0 was introduced in 2007, with the second edition in 2010. It offered considerable revisions primarily around a richer type system and is a subset of version 1.0 of the more expansive XQuery language.
- XPath 3.0 was introduced in 2014 and is a subset of XQuery 3.0.
- XPath 3.1 was introduced in 2017 with additional data types to support JSON.
XPath is implemented via libraries for many different programming languages, such as Python, Java and JavaScript. If a language offers tools for any sort of XML processing, it will probably offer support for using some version of XPath to query XML documents.
Syntax
XPath treats an XML document as a tree structure, much like folders stored on a disk drive. In its simplest form, an XPath expression can resemble a file path. These file-path-like expressions are referred to as location paths. Each location path consists of a sequence of location steps separated by slashes (/), and each location step has three components:
- An axis specifying the “direction” in which to look for a node.
- A node test specifying what node we want to match.
- Zero or more predicates that restrict which nodes the expression will ultimately match (i.e. conditions the node has to satisfy in order to match).
A location step’s full syntax is:
axis-specifier::node-test[predicate]
The most common axes have abbreviated syntax, so the whole axis name doesn’t need to be written out. For example, child is the default axis, so it doesn’t require an axis specifier at all.
Axis Specifiers
XPath uses the following axis specifiers:
- ancestor
- ancestor-or-self
- attribute: @a is an abbreviation for attribute::a.
- child: The default axis. a is an abbreviation for child::a.
- descendant
- descendant-or-self: // is an abbreviation for /descendant-or-self::node()/.
- following
- following-sibling
- namespace
- parent: .. is an abbreviation for parent::node().
- preceding
- preceding-sibling
- self: . is an abbreviation for self::node().
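The abbreviation rules above can be sketched as a small string rewriter. This is only an illustration: expand_step and expand_path are hypothetical helper names, not part of any XPath library, and the sketch assumes paths without predicates.

```python
def expand_step(step: str) -> str:
    """Expand one abbreviated location step (no predicates) to its full form."""
    if step == "":
        return ""                      # empty piece produced by a leading "/"
    if step == ".":
        return "self::node()"
    if step == "..":
        return "parent::node()"
    if step.startswith("@"):
        return "attribute::" + step[1:]
    if "::" in step:
        return step                    # an axis is already spelled out
    return "child::" + step            # child is the default axis


def expand_path(path: str) -> str:
    """Expand a predicate-free location path (hypothetical helper)."""
    # "//" is short for "/descendant-or-self::node()/".
    path = path.replace("//", "/descendant-or-self::node()/")
    return "/".join(expand_step(step) for step in path.split("/"))


print(expand_path("A/B/D"))   # child::A/child::B/child::D
print(expand_path("A//@C"))   # child::A/descendant-or-self::node()/attribute::C
print(expand_path("../B"))    # parent::node()/child::B
```

A real XPath parser tokenizes the expression instead of splitting on slashes, since predicates and string literals may themselves contain slashes.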
Node Tests
Node tests consist of specific node names, or more general expressions. The more general expressions include:
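- node() matches any node at all.
- * matches any node of the axis’s principal node type: any element on most axes, or any attribute on the attribute axis.
- comment() matches an XML comment node.
- text() matches a text node (text inside child elements is not included).
- processing-instruction() matches XML processing instructions.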
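To make the node tests concrete, the following sketch emulates them on the child axis with Python's standard xml.dom.minidom module. In XPath 1.0, * on the child axis matches only element nodes, whereas node() matches nodes of any type. The helper names node_test_matches and select_children, and the small sample document, are assumptions made for this illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def node_test_matches(node, test):
    """Emulate an XPath node test for a node reached via the child axis."""
    if test == "node()":
        return True                                   # any node type
    if test == "*":
        return node.nodeType == Node.ELEMENT_NODE     # elements only
    if test == "text()":
        return node.nodeType == Node.TEXT_NODE
    if test == "comment()":
        return node.nodeType == Node.COMMENT_NODE
    if test == "processing-instruction()":
        return node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    # Otherwise treat the test as a specific element name.
    return node.nodeType == Node.ELEMENT_NODE and node.nodeName == test


# One comment, one text node, one element and one processing instruction.
doc = parseString("<A><!--note-->text<B/><?pi data?></A>")
children = doc.documentElement.childNodes


def select_children(test):
    """Return the DOM node names of the children of A that pass the test."""
    return [n.nodeName for n in children if node_test_matches(n, test)]


print(len(select_children("node()")))   # 4
print(select_children("*"))             # ['B']
print(select_children("text()"))        # ['#text']
```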
Predicates
A predicate is an expression that must evaluate to true for a node selected by the node test to be kept. There can be any number of predicates, and they are evaluated in order from left to right. For example, A[1][./@C = "1"] matches only if the first A element has an attribute C equal to “1”, whereas A[./@C="1"][1] matches the first A element that has an attribute C equal to “1”. A predicate is evaluated to a numeric or boolean value; any non-numeric result is converted to a boolean. A numeric value is equivalent to specifying the position of the node to match (i.e. A[5] is the same as A[position()=5]).
Location paths can be used inside predicates, and are evaluated relative to the node being tested, so the ./@C in A[./@C="1"] is looking to match the C attribute of the A element.
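The effect of predicate order can be sketched in plain Python by filtering a list step by step. The sketch applies the two orderings to the B elements of the sample document from the Examples section; keep_position and keep_attr are hypothetical helpers that imitate a positional and an attribute predicate, and ElementTree is used only to parse the XML.

```python
import xml.etree.ElementTree as ET

XML = """<A>
<B C="1"><D E="2"/></B>
<B C="3"><D F="4"/></B>
</A>"""

root = ET.fromstring(XML)
b_elements = root.findall("B")  # both B children, in document order


def keep_position(nodes, n):
    """Emulate [n]: keep the n-th node (1-based) of the current list."""
    return nodes[n - 1:n]


def keep_attr(nodes, name, value):
    """Emulate [./@name="value"]: keep nodes whose attribute equals value."""
    return [node for node in nodes if node.get(name) == value]


# B[1][./@C="3"]: take the first B, then require C="3". Nothing is left.
first_then_attr = keep_attr(keep_position(b_elements, 1), "C", "3")

# B[./@C="3"][1]: keep B elements with C="3", then take the first of those.
attr_then_first = keep_position(keep_attr(b_elements, "C", "3"), 1)

print(len(first_then_attr))                   # 0
print([b.get("C") for b in attr_then_first])  # ['3']
```

Each predicate sees only the nodes that survived the previous one, which is why a positional predicate gives different results depending on where it appears.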
The following operators can be used inside of predicates in XPath 1.0:
- The union operator | creates the union of two node-sets.
- The boolean operators: and, or.
- The boolean function not().
- Mathematical operators: +, -, *, div (divide), and mod.
- The comparison operators: =, !=, <, >, <=, and >=.
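The div and mod operators differ slightly from their usual Python counterparts. In XPath 1.0, div is floating-point division and mod returns the remainder of a truncating division, so the result takes the sign of the left operand. A minimal sketch, with xpath_div and xpath_mod as hypothetical names and zero divisors left out:

```python
import math


def xpath_div(a, b):
    """Floating-point division, as XPath's div (zero divisors not handled)."""
    return float(a) / float(b)


def xpath_mod(a, b):
    """Remainder of a truncating division, as XPath's mod."""
    return math.fmod(a, b)


print(xpath_div(7, 2))    # 3.5
print(xpath_mod(5, 2))    # 1.0
print(xpath_mod(-5, 2))   # -1.0
print(-5 % 2)             # 1 (Python's % follows the sign of the divisor)
```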
Several functions are also available to predicates in XPath 1.0. The following are some of the more commonly used ones:
- contains(string1, string2): Checks if string2 is contained within string1 and returns true or false.
- count(node-set): Returns the number of nodes in node-set.
- false(): Returns false.
- last(): Returns the position of the last match of the current node test.
- local-name(node-set): Returns the local name of the first node in node-set, or if node-set is omitted, the local name of the current node.
- normalize-space(string): Returns string with leading and trailing whitespace removed, and all sequences of whitespace characters replaced by a single space. If omitted, operates on the context node converted to a string.
- position(): Returns the numeric position of the current node.
- starts-with(string1, string2): Checks if string1 starts with string2 and returns true or false.
- sum(node-set): Converts the string values of the nodes in node-set into numbers and returns the sum of those numbers.
- true(): Returns true.
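As a rough guide to what these functions compute, here are approximate plain-Python counterparts for a few of them. The function names and the small sample document are assumptions for this sketch; real XPath functions operate on node-sets and apply the language's own type conversions.

```python
import re
import xml.etree.ElementTree as ET


def contains(string1, string2):
    """True if string2 occurs anywhere inside string1."""
    return string2 in string1


def starts_with(string1, string2):
    """True if string1 begins with string2."""
    return string1.startswith(string2)


def normalize_space(string):
    """Collapse runs of space, tab, CR and LF to one space and trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", string).strip(" ")


def xpath_sum(strings):
    """Convert each string value to a number and add them up."""
    return sum(float(s) for s in strings)


root = ET.fromstring('<A><B C="1"/><B C="3"/></A>')
c_values = [b.get("C") for b in root.findall("B")]  # string values of the C attributes

print(len(c_values))                       # 2, comparable to count()
print(xpath_sum(c_values))                 # 4.0
print(normalize_space("  a \n  b  "))      # a b
print(contains("XPath", "Pat"))            # True
print(starts_with("XPath", "Path"))        # False
```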
Examples
Consider the following XML document:
<A>
<B C="1">
<D E="2"/>
</B>
<B C="3">
<D F="4"/>
</B>
</A>
Example XPath expressions include:
- A/B/D: The most basic type of XPath expression. Each location step is separated by a slash. The default axis is the child axis, and A, B and D represent the element nodes we’re matching. This expression matches all D elements that are children of B elements that are children of A elements. It would match both D elements in the example. The expression can be written in expanded form as child::A/child::B/child::D.
- A//D: Also matches both D elements in the example, because the double slash (//) represents the descendant-or-self axis. This expression matches all D elements that are descendants of the A element. The expanded form of the expression is child::A/descendant-or-self::node()/child::D.
- A//D[1]: Adds a predicate (in the square brackets [...]) to the D node test. Using a number alone in the square brackets is a shorthand way of specifying the position of the element. Because the predicate belongs to the final child::D step, the position is counted among the D children of each parent, so this expression matches every D descendant of A that is the first D child of its parent. In the example, both D elements match. To select only the first D descendant of A in document order, parenthesize the path: (A//D)[1]. The expanded form of the expression is child::A/descendant-or-self::node()/child::D[position()=1].
- A//@C: The @ indicates we’re searching along the attribute axis. In this case, we’re looking for all C attributes belonging to A or A’s descendants. This matches the two C attributes in the B elements in the example. The expanded form of the expression is child::A/descendant-or-self::node()/attribute::C.
- //*[../@C="3"]: The * is a wildcard that matches any element. The .. means the parent axis, in this case the parent of the node being tested by the predicate. This expression matches any element in the document whose parent has a C attribute with a value of 3. In the above case that is the second D element. The expanded form of the expression is /descendant-or-self::node()/child::*[parent::node()/attribute::C="3"].
- //*[./@C="3"]: Similar to the prior example, except that . means the self axis. In this case that refers to the node being tested by the predicate. This expression matches any element in the document that has a C attribute with a value of 3. In the above case that is the second B element. The expanded form of the expression is /descendant-or-self::node()/child::*[self::node()/attribute::C="3"].
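Some of these expressions can be tried with Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. In this sketch the paths are written relative to the root element A (hence the leading .), and expressions that ElementTree cannot evaluate directly, such as attribute selection, are replaced by equivalent Python code. The variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<A>
<B C="1">
<D E="2"/>
</B>
<B C="3">
<D F="4"/>
</B>
</A>"""

root = ET.fromstring(XML)  # root is the A element

children_path = root.findall("./B/D")        # counterpart of A/B/D
descendants = root.findall(".//D")           # counterpart of A//D
first_per_parent = root.findall(".//D[1]")   # counterpart of A//D[1]
first_overall = descendants[:1]              # counterpart of (A//D)[1]
has_c_3 = root.findall(".//*[@C='3']")       # counterpart of //*[./@C="3"]

# //*[../@C="3"]: ElementTree has no parent test inside predicates, so the
# same elements are reached from the parent's side instead.
parent_has_c_3 = root.findall(".//*[@C='3']/*")

# A//@C: ElementTree cannot select attribute nodes, so read the values
# from every element that carries a C attribute.
c_values = [e.get("C") for e in root.iter() if "C" in e.attrib]

print(len(children_path), len(descendants))   # 2 2
print(len(first_per_parent))                  # 2 (each D is the first D of its parent)
print(first_overall[0].attrib)                # {'E': '2'}
print(has_c_3[0].tag, parent_has_c_3[0].attrib)  # B {'F': '4'}
print(c_values)                               # ['1', '3']
```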