Introduction to XPath.
--
Similar to regular expressions, XPath can be thought of as a language for finding information in an XML/HTML document. It has many uses, but personally I use it most for developing web crawlers and grabbing information from websites. We’re going to go over the basics of the language and how to grab the content you need from a document. To follow along with this tutorial, you can use the console in your Chrome Developer Tools (any browser's developer tools will do), or you can use your favorite web scraping framework. If you want to use your developer tools, navigate to the console and wrap every expression in $x(), like this: $x('YOUR XPATH EXPRESSION').
Different web scraping frameworks have different syntax. For this tutorial, I will use the Scrapy shell. You can install Scrapy by running:
pip install scrapy
After you install the framework, run scrapy shell 'www.website.com' to open an interactive shell that lets you run XPath queries against a specific page using the syntax response.xpath('XPATH EXPRESSION').extract().
A note on writing/copying XPath.
It is easy to copy the XPath of an element using your browser's developer tools. Open the developer tools, inspect the HTML elements, right-click the element you want to locate, and choose Copy XPath.
For the above example, this would give you the following result:
//*[@id="firstHeading"]
Or, if you copied a full XPath:
/html/body/div[3]/h1
For web scraping purposes, the shorter (more general) first option is usually preferable to the second ‘full’ XPath. The full XPath depends on the exact position of every ancestor element, so it is likely to break first when the website layout changes.
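To make the difference concrete, here is a minimal sketch that runs both kinds of expression against a page before and after a layout change. It is an illustration under stated assumptions, not part of the Scrapy workflow: the two page snippets are made up, and it uses Python's standard-library xml.etree.ElementTree, which only supports a limited subset of XPath and evaluates paths relative to the root element, so the two expressions are written in their closest relative form.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so the standard library can parse it.
ORIGINAL = """<html><body>
  <div>nav</div>
  <div>banner</div>
  <div><h1 id="firstHeading">XPath</h1></div>
</body></html>"""

# The same hypothetical page after a redesign adds one more <div> before the heading.
REDESIGNED = """<html><body>
  <div>nav</div>
  <div>banner</div>
  <div>cookie notice</div>
  <div><h1 id="firstHeading">XPath</h1></div>
</body></html>"""

# ElementTree evaluates paths relative to the root <html> element, so these
# stand in for //*[@id="firstHeading"] and /html/body/div[3]/h1 respectively.
SHORT_XPATH = ".//*[@id='firstHeading']"
FULL_XPATH = "./body/div[3]/h1"


def heading_text(markup, xpath):
    """Return the text of the first element matching xpath, or None if nothing matches."""
    root = ET.fromstring(markup)
    match = root.find(xpath)
    return match.text if match is not None else None


for name, markup in [("original", ORIGINAL), ("redesigned", REDESIGNED)]:
    print(name, heading_text(markup, SHORT_XPATH), heading_text(markup, FULL_XPATH))
# prints:
# original XPath XPath
# redesigned XPath None
```

The id-based expression keeps finding the heading after the extra element appears, while the positional path now points at a div that has no h1 inside it.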