Similar to regular expressions, XPath can be thought of as a language for finding information in an XML/HTML document. It has many uses, but personally I use it most for developing web crawlers and grabbing information from websites. We’re going to go over the basics of the language, and how to grab the content you need from a document. In order to follow along with this tutorial, you can use the console in your Chrome Developer Tools (any browser's developer tools will do) or you can use your favorite web scraping framework. If you want to use your developer tools, navigate to the console and wrap every expression like this:
$x('YOUR XPATH EXPRESSION')
Different web scraping frameworks have different syntax. For this tutorial, I will use Scrapy Shell. You can install Scrapy by running:
pip install scrapy
After you install the framework, run
scrapy shell 'www.website.com'
to open an interactive shell that will allow you to query a specific page with XPath using the syntax
response.xpath('XPATH EXPRESSION').extract()
A note on writing/copying XPath.
It is possible to easily copy the XPath of an element using your browser's developer tools. All you have to do is open the developer tools, inspect the HTML elements, right click on the element you want to locate, and hit Copy XPath.
For a page's main heading, this would give you the following result:
//*[@id="firstHeading"]
Or, if you copied a full XPath:
/html/body/div[3]/h1
For web scraping purposes, the shorter (more general) first option is generally preferable to the second, ‘full’ XPath. This is because the full XPath is likely to break first as the website layout changes.
//*[@id="firstHeading"] ----> Starting from the root of the tree, select every node with the id of 'firstHeading'.
/html/body/div[3]/h1 ----> Start at the html element, find the body element underneath it, find the 3rd div element underneath that, and select the h1 tag.
If another div element is added ahead of it in the website layout, the second expression will not return the correct results. This is why the more general first expression is preferable.
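To make the difference concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports a small subset of XPath. Both page snippets are made up for illustration, and because ElementTree paths are relative to the root html element, the two expressions are spelled slightly differently than in the browser or Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the heading sits inside the third div under body.
ORIGINAL = """
<html><body>
  <div>banner</div>
  <div>navigation</div>
  <div><h1 id="firstHeading">XPath</h1></div>
</body></html>
"""

# The same hypothetical page after a redesign adds one extra div at the top.
REDESIGNED = """
<html><body>
  <div>cookie notice</div>
  <div>banner</div>
  <div>navigation</div>
  <div><h1 id="firstHeading">XPath</h1></div>
</body></html>
"""

# ElementTree paths start at the root <html> element, so these correspond
# to //*[@id="firstHeading"] and /html/body/div[3]/h1 respectively.
SHORT_PATH = ".//*[@id='firstHeading']"
FULL_PATH = "./body/div[3]/h1"


def heading_text(markup, path):
    """Return the text of the first match for path, or None if nothing matches."""
    match = ET.fromstring(markup).find(path)
    return None if match is None else match.text


for label, markup in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(label, heading_text(markup, SHORT_PATH), heading_text(markup, FULL_PATH))
```

On the redesigned snippet the short expression still finds the heading, while the full path now points at a div that has no h1 and comes back empty.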
A note on indexes in XPath.
Unlike indexes in many programming languages, keep in mind that when writing XPath expressions, the indexes start at 1, NOT 0. So div[3] is the third div element, not the fourth.
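A quick way to convince yourself is to compare XPath positions with Python list indexes. This is a small sketch using the standard-library xml.etree.ElementTree on made-up markup; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with three sibling div elements.
body = ET.fromstring(
    "<body><div>first</div><div>second</div><div>third</div></body>"
)

# XPath positions are 1-based: div[1] is the first div, div[3] the third.
first_by_xpath = body.find("./div[1]").text
third_by_xpath = body.find("./div[3]").text

# Python lists are 0-based, so the same elements sit at indexes 0 and 2.
divs = body.findall("./div")
first_by_list = divs[0].text
third_by_list = divs[2].text

print(first_by_xpath, third_by_xpath)
```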
Now let’s get to some helpful XPath tips.
Consider the following XPath expression:
//a
The // in the above expression is known as the axis and describes the set of nodes you want to select from. In this case that would be all of the descendant nodes below the root of the tree. The a is the node test, an expression that decides whether a node should be selected or not. In this case, the test is whether or not the element is an a tag.
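Here is a small sketch of the same idea with Python's standard-library xml.etree.ElementTree, run against a made-up document. ElementTree writes the expression as .//a because its paths are relative to the element you search from.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with links nested at different depths.
page = ET.fromstring("""
<html><body>
  <a href="/home">Home</a>
  <div>
    <p>Read the <a href="/docs">docs</a>.</p>
    <ul><li><a href="/about">About</a></li></ul>
  </div>
</body></html>
""")

# ".//a" is ElementTree's spelling of //a: search every level below the
# root (the // part) and keep only elements named "a" (the node test).
links = page.findall(".//a")
link_targets = [a.get("href") for a in links]

# "./body/a" has no //, so it only matches a tags directly under body.
direct_links = [a.get("href") for a in page.findall("./body/a")]

print(link_targets)
print(direct_links)
```

The first search picks up all three links regardless of nesting, while the second only sees the one link that is a direct child of body.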
This type of direct selection is useful, but you will often run into cases where the element you want to select is not so neatly organized in the DOM. For these cases, you can use an expression such as:
For example, on the XPath Wikipedia page, the reference links located at the end of the article have the following structure:
There are many links on the page, so we cannot just select all of the a tags and try to figure out which are the references. However, after inspecting the HTML we will see that all of the references are enclosed in a span with the class “reference-text”. Knowing this, we can follow the elements after that span to the information we want.
The above expression is saying: select all of the href attributes from the a tags which are descendants of the cite tags which are descendants of the span tags. That’s kind of a mouthful, but working logically forward you can also think of it as saying: “look for all the spans that have a cite child tag, and an a tag under the cite tag, then grab the href attribute.”
The result will be a list of the URLs that the references link to.
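As a rough sketch of this kind of path-based selection, the snippet below walks span, cite and a tags with Python's standard-library xml.etree.ElementTree. The markup is a simplified, made-up stand-in for a references section and the URLs are placeholders. ElementTree cannot return attributes directly, so the href is read in Python; in Scrapy or the browser console the whole thing would be a single expression along the lines of //span//cite//a/@href, which is my assumption based on the description above.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a references section.
page = ET.fromstring("""
<div>
  <p><a href="/wiki/XML">an ordinary link in the article body</a></p>
  <ol>
    <li><span class="reference-text"><cite><a href="https://example.org/spec">Spec</a></cite></span></li>
    <li><span class="reference-text"><cite><a href="https://example.org/guide">Guide</a></cite></span></li>
  </ol>
</div>
""")

# Walk span -> cite -> a at any depth, then read href in Python because
# ElementTree cannot return attribute values from a path.
reference_links = [a.get("href") for a in page.findall(".//span//cite//a")]

print(reference_links)
```

The ordinary link in the article body is skipped because it does not sit under a span and a cite.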
We can get the same information in a different expression using something called a predicate.
In the above expression //span is returning all of the span nodes in the document. The syntax in the brackets is used to filter all of the span tags based on any attribute you specify. In our case we only want the span tags with the class of reference-text. This is incredibly useful, as it allows you to grab any tag you want based on attributes of that tag. The result of the above expression is the same as the result of the previous example: all of the reference section direct links are returned.
The @href at the end of the expression is just saying “select the href attribute”. You can substitute any attribute name in place of href. For example:
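A minimal sketch of a predicate, again with Python's standard-library xml.etree.ElementTree on made-up markup. The class names, URLs and title values are placeholders. Since ElementTree cannot end a path with /@href or /@title, the attribute is read with .get() after the predicate has filtered the spans.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one unrelated span and two reference spans.
page = ET.fromstring("""
<div>
  <span class="mw-headline"><a href="/wiki/Help" title="Help">Help</a></span>
  <span class="reference-text"><cite><a href="https://example.org/spec" title="The spec">Spec</a></cite></span>
  <span class="reference-text"><cite><a href="https://example.org/guide" title="A guide">Guide</a></cite></span>
</div>
""")

# The predicate [@class='reference-text'] keeps only the matching spans.
anchors = page.findall(".//span[@class='reference-text']//a")

hrefs = [a.get("href") for a in anchors]    # like ending the path with /@href
titles = [a.get("title") for a in anchors]  # like ending the path with /@title

print(hrefs)
print(titles)
```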
Selecting Dynamic Content.
Many sites dynamically label some attributes, so it is impossible to simply select all of the elements you want by class name, because the class names are different for each node. Let’s say that you want to select an element from the NewsNow results page. Let’s pretend that there is no common class that all of the headlines share (in reality, they all share the hl class). They all also share a data-id attribute which is different for each item on the page. For this particular page, the data-id attributes all start with 107:
I have bolded one attribute that might be useful for selecting all of the results: the data-id value changes from result to result, but the string “107” is present in every one. In order to grab these elements in XPath we can use the contains() function to select divs based on the data-id attributes containing a certain substring. Such an expression will return all of the headlines on the page.
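The sketch below shows the substring idea on made-up markup. The data-id values and page layout are invented for illustration, and divs_with_data_id_containing is a hypothetical helper. Python's standard-library xml.etree.ElementTree does not implement contains(), so the same test is applied in Python; in Scrapy or the browser console the equivalent would be something like //div[contains(@data-id, "107")].

```python
import xml.etree.ElementTree as ET

# Hypothetical results page; only the first two divs are headlines.
page = ET.fromstring("""
<main>
  <div data-id="1071234"><a href="/story-1">First headline</a></div>
  <div data-id="1075678"><a href="/story-2">Second headline</a></div>
  <div data-id="2040000"><a href="/advert">Sponsored link</a></div>
  <div><a href="/footer">Footer link</a></div>
</main>
""")


def divs_with_data_id_containing(root, fragment):
    """Emulate //div[contains(@data-id, fragment)] with a Python substring test."""
    # A div with no data-id is treated as having an empty value, which is
    # also how contains() behaves when the attribute is missing.
    return [div for div in root.iter("div") if fragment in div.get("data-id", "")]


headlines = [
    div.find("a").text for div in divs_with_data_id_containing(page, "107")
]

print(headlines)
```

Because every value here starts with 107, the XPath starts-with() function would be a slightly stricter alternative to contains().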
As with most things in life, the best way to get better at using XPath is to practice it. While learning this language I have found it difficult to find easily digestible documentation on how to use XPath, so below I have linked to some good sources if you want to learn more. Mastering XPath will greatly increase your ability to scrape data from the web and solve the problems associated with finding relevant content on the page. Think of each expression as a challenge, and dive in!
$x('//div//p//*') == $('div p *'), $x('//[@id="item"]') == $('#item'), and many other XPath examples. · One-page guide…
What can XPath do for me? Maintained by David J. Birnbaum. As we discussed in our…