Introduction to XPath.

Brendan Ferris
May 9 · 6 min read

Similar to regular expressions, XPath can be thought of as a language for finding information in an XML/HTML document. It has many uses, but personally I use it most for developing web crawlers and grabbing information from websites. We're going to go over the basics of the language and how to grab the content you need from a document. To follow along with this tutorial, you can use the console in your Chrome Developer Tools (any browser's developer tools will do) or your favorite web scraping framework. If you want to use your developer tools, navigate to the console and start every expression with $x('YOUR XPATH EXPRESSION'):

Querying XPath in Chrome Developer Tools

Different web scraping frameworks have different syntax. For this tutorial, I will use the Scrapy shell. You can install Scrapy by running:

pip install scrapy

After you install the framework, run scrapy shell 'www.website.com' to open an interactive shell that will allow you to query the XPath of a specific page using the syntax response.xpath('XPATH EXPRESSION').extract().
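If you prefer not to use the browser console or the Scrapy shell, the same expressions can be tested in a plain Python script. Here is a minimal sketch using requests together with parsel (the selector library Scrapy uses under the hood); the URL is just an example:

import requests
from parsel import Selector  # installed with Scrapy, or via: pip install parsel

# Fetch a page to experiment with (any URL will do).
html = requests.get('https://en.wikipedia.org/wiki/XPath').text

# Wrap the raw HTML in a Selector so it can be queried with XPath,
# just like response.xpath(...) in the Scrapy shell.
sel = Selector(text=html)

# Quick sanity check: grab the page title.
print(sel.xpath('//title/text()').get())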

A note on writing/copying XPath.

It is possible to easily copy the XPath of an element using your browser's developer tools. All you have to do is open the developer tools, inspect the HTML, right click on the element you want to locate, and hit Copy XPath.

For the above example, this would give you the following result:

//*[@id="firstHeading"]

Or, if you copied a full XPath:

/html/body/div[3]/h1

For web scraping purposes, the shorter (more general) first option is always preferable to the second 'full' XPath, because the full XPath is far more likely to break as the website layout changes.

//*[@id="firstHeading"] ----> Starting from the root of the tree, select every node with the id of 'firstHeading'.

/html/body/div[3]/h1 ----> Start at the html element, find the body element underneath it, find the 3rd div element underneath that, and select the h1 tag.

If another div element is added in the website layout, the second expression will not return the correct results. This is why the more general first expression is preferable.
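A quick sketch (using parsel with a made-up toy layout) makes the difference concrete: once an extra div is inserted, the id-based expression still works while the positional one silently stops matching.

from parsel import Selector

# A toy layout before and after an extra div is inserted near the top of the body.
before = '<html><body><div></div><div></div><div><h1 id="firstHeading">Title</h1></div></body></html>'
after = '<html><body><div></div><div></div><div></div><div><h1 id="firstHeading">Title</h1></div></body></html>'

for html in (before, after):
    sel = Selector(text=html)
    print(sel.xpath('//*[@id="firstHeading"]/text()').get(),  # 'Title' both times
          sel.xpath('/html/body/div[3]/h1/text()').get())     # 'Title', then None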

A note on indexes in XPath.

Unlike indexes in many programming languages, keep in mind that when writing XPath expressions, the indexes start at 1, NOT 0. So div[3] is the third div element.
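For example (a small parsel sketch with made-up markup):

from parsel import Selector

sel = Selector(text='<ul><li>first</li><li>second</li><li>third</li></ul>')

print(sel.xpath('//li[1]/text()').get())  # 'first', not 'second'
print(sel.xpath('//li[3]/text()').get())  # 'third'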

Now let’s get to some helpful XPath tips.

Predicates.

Consider the following XPath expression:

//a

The // in the above expression is known as the axis, and it describes the set of nodes you want to select from; in this case, that is every node descended from the root of the tree. The a is the node test: an expression that decides whether a node should be selected or not. In this case, the test is whether or not the element is an a tag.
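In other words, //a walks the whole tree and matches a tags at any depth. A small sketch with invented markup:

from parsel import Selector

html = '''
<body>
  <a href="/top">top-level link</a>
  <div><p><a href="/nested">deeply nested link</a></p></div>
</body>
'''
sel = Selector(text=html)

# Both links are returned, no matter how deeply they are nested.
print(sel.xpath('//a/@href').getall())  # ['/top', '/nested']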

This type of direct selection is useful, but you will often run into cases where the element you want to select is not so neatly organized in the DOM. For these cases, you can use an expression such as:

//tag[@attribute='value']

For example, on the XPath Wikipedia page, the reference links located at the end of the article have the following structure:

There are many links on the page, so we cannot just select all of the a tags and try to figure out which ones are the references. However, after inspecting the HTML we will see that all of the references are enclosed in a span with the class "reference-text". Knowing this, we can follow the elements after that span to the information we want.

response.xpath('//span/cite/a/@href').extract()

The above expression is saying: select all of the href attributes from the a tags that are children of cite tags, which are in turn children of span tags. That's kind of a mouthful, but working forward through the expression you can also think of it as saying: "look for all the spans that have a cite child, and an a tag under that cite, then grab the href attributes."

The result will be:

All of the reference links on a Wikipedia page.

We can get the same information with a different expression using something called a predicate.

response.xpath('//span[@class="reference-text"]/cite/a/@href').extract()

In the above expression, //span returns all of the span nodes in the document. The syntax in the brackets is used to filter those span tags based on any attribute you specify; in our case we only want the span tags with the class of reference-text. This is incredibly useful, as it allows you to grab any tag you want based on its attributes. The result of the above expression is the same as the result of the previous example: all of the reference section direct links are returned.
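To see exactly what the predicate is doing, here is a small self-contained sketch; the markup below is a made-up stand-in for Wikipedia's reference structure, not the real page:

from parsel import Selector

# One reference-style span and one unrelated span.
html = '''
<body>
  <span class="reference-text"><cite><a href="https://example.com/ref1">ref 1</a></cite></span>
  <span class="sidebar"><cite><a href="https://example.com/other">not a reference</a></cite></span>
</body>
'''
sel = Selector(text=html)

# Without the predicate, both spans match.
print(sel.xpath('//span/cite/a/@href').getall())
# ['https://example.com/ref1', 'https://example.com/other']

# With the predicate, only the reference-text span survives the filter.
print(sel.xpath('//span[@class="reference-text"]/cite/a/@href').getall())
# ['https://example.com/ref1']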

Note: The @href at the end of the expression is just saying "select the href attribute". You can substitute any attribute name in place of href, for example: @id, @class, @title, @rel, @role, etc.
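As a quick illustration (again with made-up markup), swapping out the final step of the expression changes what comes back:

from parsel import Selector

sel = Selector(text='<a href="/page" title="Example link" id="nav-home">Home</a>')

print(sel.xpath('//a/@href').get())   # '/page'
print(sel.xpath('//a/@title').get())  # 'Example link'
print(sel.xpath('//a/@id').get())     # 'nav-home'
print(sel.xpath('//a/text()').get())  # 'Home'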

Selecting Dynamic Content.

Many sites dynamically label some attributes, so it is impossible to simply select all of the elements you want by class name, because the class names are different for each node. Let's say that you want to select an element from the NewsNow results page. Let's pretend that there is no common class that all of the headlines share (in reality, they all share the hl class). They all also share a data-id attribute, which is different for each item on the page. For this particular page, the data-id attributes all start with 107:

Although each data-id value is unique, they all share a common prefix, and that shared substring is what we can latch onto. In order to grab these elements in XPath we can use the contains() function to select divs based on the data-id attribute containing a certain substring.

response.xpath("//div[contains(@data-id, '10')]/div/a/text()").extract()

The expression above will display all of the headlines on the page.
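If you want to play with contains() without hitting a live page, here is a minimal sketch with invented data-id values; starts-with() is a stricter alternative when you only care about the prefix:

from parsel import Selector

# Invented markup mimicking result items whose data-id values share a prefix.
html = '''
<body>
  <div data-id="10701"><div><a>Headline one</a></div></div>
  <div data-id="10702"><div><a>Headline two</a></div></div>
  <div data-id="55555"><div><a>Unrelated item</a></div></div>
</body>
'''
sel = Selector(text=html)

# contains() matches the substring anywhere in the attribute value.
print(sel.xpath("//div[contains(@data-id, '107')]/div/a/text()").getall())
# ['Headline one', 'Headline two']

# starts-with() only matches at the beginning of the value.
print(sel.xpath("//div[starts-with(@data-id, '107')]/div/a/text()").getall())
# ['Headline one', 'Headline two']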

Going further.

As with most things in life, the best way to get better at using XPath is to practice. While learning the language, I found it difficult to find easily digestible documentation on how to use XPath, so I have linked some good sources below if you want to learn more. Mastering XPath will greatly increase your ability to scrape data from the web and solve the problems associated with finding relevant content on a page. Think of each expression as a challenge, and dive in!

Happy Coding!

Check out my website 💻!

References:

Analytics Vidhya

