
Using XPath in JavaScript

Suppose you have an HTML table in your application that looks like this:

Type    Name     Height (m)   Weight (kg)   Speed (km/h)
Cat     Fluffy   0.24         3.99          48.28
Dog     Fido     0.53         18.14         32.19
Horse   Mr. Ed   1.52         771.11        88.51
…etc.
Figure 1: Sample table in your application

You have a requirement to extract the records in which the Type is ‘Horse’. This table is coming in from a third-party application, so you can’t change the HTML structure of the table.

You take a look at the structure, and each record in the <tbody> looks like this:

<tr>
    <td>Cat</td>
    <td>Fluffy</td>
    <td>0.24</td>
    <td>3.99</td>
    <td>48.28</td>
</tr>

The easy solution would be to point a selector straight at the ‘Horse’ records with querySelector or querySelectorAll, but that won’t work here: selectors match on element names, attributes, and structure, not on the text inside an element. Your first solution then becomes this:

// assumes the table id is "animalTable"
const animalTable = document.getElementById("animalTable");
const records = [...animalTable.querySelectorAll("tbody tr")];
const horseRecords = records.filter(
    // firstElementChild skips the whitespace text node that precedes the first <td>
    record => record.firstElementChild.textContent.trim() === "Horse"
);

This solution works great – until you have thousands of records. Then it can make your application grunt.

There is another solution: Using XPath

What is XPath?

Put simply, XPath is a query language for locating nodes in an XML document. HTML is not strictly XML, but the browser parses both into the same kind of DOM tree, so an XML query language can be used to traverse an HTML document as well.

But why use XPath when methods like querySelector and querySelectorAll work so well? It’s all because of the structure of an HTML document. For example, suppose you have this HTML structure:

<p>
    My cat's name is Fluffy.
</p>

Visually, we see one node here, and from a selector point of view, there is only one node here. But that’s not really the case. Yes, there is one <p> node here, but within that <p> node there is another node – a textNode, and it contains the following:

  • A line break
  • Indentation whitespace (a tab or spaces)
  • A string containing “My cat’s name is Fluffy.”
  • A final line break

Selectors can’t see these textNodes, which is why we couldn’t target all the “Horse” records directly earlier. But XPath can see these nodes, and thus we’ll use XPath to get to those records.
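To see that text node for yourself, a minimal sketch like the one below lists every child node of an element. The helper name describeChildNodes is hypothetical, and the sketch assumes it is handed a DOM element such as the <p> above.

```javascript
// Hypothetical helper: summarize each child node of a DOM element.
// nodeType 3 is a text node and nodeType 1 is an element (standard DOM values).
function describeChildNodes(element) {
    return Array.from(element.childNodes, (child) => ({
        kind: child.nodeType === 3 ? "text" : child.nodeType === 1 ? "element" : "other",
        // JSON.stringify makes line breaks and indentation visible as \n and spaces
        content: JSON.stringify(child.textContent),
    }));
}

// Browser usage (assumes the page contains the <p> shown above):
// console.log(describeChildNodes(document.querySelector("p")));
```

For the <p> above, this should report a single child of kind "text" whose content begins with a line break and the indentation.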

Using XPath

This article will not go into all the details of using XPath. For that, I would suggest looking at the Mozilla XPath documentation. For our purposes, we’re going to show what it will take to extract those nodes that contain certain text.

As mentioned earlier, XPath is a query language, so we’ll use it to construct a query string that tells JavaScript which nodes to access. JavaScript does this through the following statement:

let xpathResult = document.evaluate(
    xpathExpression,
    contextNode,
    namespaceResolver,
    resultType,
    result
);

What are the arguments?

  • “xpathExpression” is the query string to point to the nodes you wish to locate.
  • “contextNode” is the starting point of the search. Normally this is “document” (the starting location of all HTML documents), but it can be any node in the document.
  • “namespaceResolver” is used to resolve namespace prefixes within the XPath. For HTML documents, it’s best to pass “null”.
  • “resultType” lets you describe the format of the XPathResult that is returned.
  • “result” is used if you have an existing XPathResult you’d like to reuse. Normally, this is kept to “null”.

Again, check out the Mozilla XPath Documentation for more information.
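As a small illustration of how “resultType” changes what comes back, here is a sketch that asks XPath for a number instead of nodes. The helper name countMatches is hypothetical, and it assumes a browser environment where document.evaluate and XPathResult are available.

```javascript
// Hypothetical helper: count the nodes that an XPath expression matches
// beneath contextNode. Assumes a browser-like `doc` with evaluate().
function countMatches(doc, contextNode, expression) {
    const result = doc.evaluate(
        `count(${expression})`,   // XPath's count() turns a node set into a number
        contextNode,
        null,                     // no namespaces in a plain HTML document
        XPathResult.NUMBER_TYPE,  // ask for a number rather than nodes
        null
    );
    return result.numberValue;
}

// Browser usage (assumes a table with id "animalTable"):
// const table = document.getElementById("animalTable");
// console.log(countMatches(document, table, ".//tbody/tr"));
```

With NUMBER_TYPE the answer is read from the numberValue property, so there is nothing to iterate over.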

Our XPath Solution

Which gets us to our solution. Remember, we have to extract all the “Horse” records based solely upon the textNodes that contain the value “Horse”.

First, here is our code:

const xpathQuery = `.//td[contains(text(), "Horse")]`;
const animalTable = document.getElementById("animalTable");
const resultType = XPathResult.ORDERED_NODE_ITERATOR_TYPE;
let xpathResult = document.evaluate(
    xpathQuery,
    animalTable,
    null,
    resultType,
    null
);
console.log(xpathResult);

There are two important pieces here:

  • “xpathQuery”. We’re telling document.evaluate() to search for any “td” nodes inside “animalTable” whose text content contains the value “Horse”. The leading “.” matters: a query that starts with “//” searches the whole document, no matter which context node you pass. Granted, we need to be a little more precise with our XPath query string, but for our example this works.
  • “resultType”. Here, we are specifying that we want the result to return the nodes themselves, in the same order they appear in the document, and that we want to step through them one at a time.
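If you do want a more precise query, one possible sketch is below. It matches the row itself when its first cell, with surrounding whitespace trimmed, is exactly the type you want. The helper names toXPathLiteral and buildTypeQuery are hypothetical, and the sketch assumes the Type is always the first “td” in each row.

```javascript
// Hypothetical helper: wrap a string as an XPath 1.0 string literal.
// XPath 1.0 has no escape character, so a value containing both quote
// styles has to be assembled with concat().
function toXPathLiteral(value) {
    if (!value.includes('"')) return `"${value}"`;
    if (!value.includes("'")) return `'${value}'`;
    const parts = value.split('"').map((part) => `"${part}"`);
    return `concat(${parts.join(`, '"', `)})`;
}

// Hypothetical helper: match rows whose first cell, with surrounding
// whitespace trimmed, is exactly `type`. The leading "." keeps the search
// inside the context node.
function buildTypeQuery(type) {
    return `.//tr[normalize-space(td[1]) = ${toXPathLiteral(type)}]`;
}

console.log(buildTypeQuery("Horse"));
// .//tr[normalize-space(td[1]) = "Horse"]
```

Because this query selects the “tr” rather than the “td”, each node in the result is already the record you are after.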

The document.evaluate() call above returns an XPathResult that we need to iterate over in order to create the same thing we got from the querySelectorAll example. Each item it hands back will be a “td”, so we have just a little more code to execute.

let node = null;
let horseRecords = [];
while (node = xpathResult.iterateNext()) {
    horseRecords.push(node.closest("tr"));
}

Our final “horseRecords” will now have the same information as the querySelectorAll example above.
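One caveat with the iterator result types: if the document changes while you are stepping through the result, the iterator becomes invalid and iterateNext() throws. A snapshot result type avoids that. The sketch below shows one way to read a snapshot; the helper name snapshotToArray is hypothetical, and the usage assumes the same “animalTable” as before.

```javascript
// Hypothetical helper: copy every node of a snapshot-type XPathResult
// into a plain array.
function snapshotToArray(xpathResult) {
    const nodes = [];
    for (let i = 0; i < xpathResult.snapshotLength; i++) {
        nodes.push(xpathResult.snapshotItem(i));
    }
    return nodes;
}

// Browser usage (assumes a table with id "animalTable"):
// const table = document.getElementById("animalTable");
// const snapshot = document.evaluate(
//     `.//td[contains(text(), "Horse")]`,
//     table,
//     null,
//     XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
//     null
// );
// const horseRecords = snapshotToArray(snapshot).map((td) => td.closest("tr"));
```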

But is it faster/better/more efficient?

Well, yes and no. I haven’t run any benchmarks (that makes a good exercise). In principle, the XPath method should be a little faster: the browser builds the initial result internally, and you build the final array from only the matching nodes. In the querySelectorAll example, you build the result by filtering every row in JavaScript. It seems to me that building the array from fewer nodes beats checking each element in turn.

That being said, my understanding is that querySelectorAll is optimized for HTML, while XPath is not, so individual operations may take longer using XPath. A good rule of thumb: if you are searching for nodes that a selector can find, use querySelectorAll. Otherwise, as in our text example above, XPath is a good alternative.

So the answer is, again, yes and no. It’s best you do your own work and test it out.
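If you want a starting point for that test, here is a rough timing sketch. The helper name averageMs is hypothetical, and filterWithSelectors and filterWithXPath stand in for functions you would write yourself around the two approaches above.

```javascript
// Hypothetical helper: run fn `runs` times and return the average
// duration in milliseconds. performance.now() is available in browsers
// and in recent versions of Node.js.
function averageMs(fn, runs = 20) {
    const start = performance.now();
    for (let i = 0; i < runs; i++) {
        fn();
    }
    return (performance.now() - start) / runs;
}

// Browser usage (assumes both approaches are wrapped in functions you
// supply, here called filterWithSelectors and filterWithXPath):
// console.log("querySelectorAll:", averageMs(filterWithSelectors));
// console.log("XPath:", averageMs(filterWithXPath));
```

Run it against a table of realistic size, since results on a handful of rows will mostly be noise.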

Conclusion

Using XPath is a great way to search your HTML document for nodes that cannot normally be found by using standard selectors. It’s a great tool to put in your arsenal. Go to the MDN links mentioned above and start diving in.

Categories: HTML Javascript


thevirtuoid

Web Tinkerer. No, not like Tinkerbell.

Creator of the game Virtuoid. Boring JavaScript. Visit us at thevirtuoid.com
