XPaths and their relevance in Web Scraping

XPath (XML Path Language) is a syntax for defining parts of an XML document.

XPath is a query language for identifying and selecting nodes or elements in an XML document using a tree-like representation of the document. XPath was defined by the World Wide Web Consortium (W3C).

XPaths are one of the few ways to select content from a big blob of XML or HTML (properly structured HTML has the same kind of tree structure as an XML document). An XPath tells you the location of an element, just like a catalog card does for books.

For a web scraper, a web page is the “library”, the element you are looking for (say, Best Books of the Month) is the “book”, and the XPath is the “catalog card”.

In the web scraping world, an XPath is a selector: something that is used to “select” an element or a bunch of elements in an HTML or XML page for scraping.

For example, suppose we wanted to find the heading of an Amazon best seller list. The XPath for the heading would be something like

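//h1/text() OR //h1/b/text()

Either of these can work, depending on the page's markup. //h1/text() says: go to the h1 tag wherever you can find it and get the text directly inside it. //h1/b/text() says: go to the h1 tag wherever you find it, go to the <b> tag inside it, and get the text inside that. Note that text() only returns text sitting directly inside the selected tag, so if the heading is wrapped in <b>, only the second expression returns it.

In either case, the text we are after is: Best Books of the Month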

Another simple example is //title/text(), which gives you the text of the title tag of an HTML page.
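To make these expressions concrete, here is a minimal sketch in Python. It rests on a few assumptions: the page snippet is made up and deliberately well-formed, and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (paths are written ".//h1" rather than "//h1", and there is no text() step, so the text is read from the .text attribute instead). A third-party library such as lxml accepts the full expressions shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for the best seller page.
PAGE = """
<html>
  <head><title>Best Sellers</title></head>
  <body>
    <h1><b>Best Books of the Month</b></h1>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# ".//h1" is ElementTree's spelling of "//h1": an h1 anywhere below the root.
h1 = root.find(".//h1")
# ".//h1/b" mirrors "//h1/b": a <b> element directly inside that h1.
bold = root.find(".//h1/b")
# ".//title" mirrors "//title".
title = root.find(".//title")

direct_text = h1.text    # roughly //h1/text(): text directly inside <h1>
bold_text = bold.text    # roughly //h1/b/text(): text inside the <b>
title_text = title.text  # roughly //title/text()

print(repr(direct_text))
print(bold_text)
print(title_text)
```

In this made-up snippet the heading is wrapped in a <b> tag, so the h1 has no text of its own and only the path through <b> returns the heading.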

XPath vs CSS Selectors

Another selector used in web scraping is the “CSS selector”. It is similar to XPath, but it comes from the syntax web developers and designers use to target elements in a page's CSS styles. XPaths are more powerful than CSS selectors because you can put a lot of logic into a single XPath expression, such as conditions on attributes or text and steps to a parent or sibling element, to precisely identify the right element on a web page.
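As a rough illustration of putting logic into a single expression, the sketch below combines an attribute condition with a text condition, and then steps from a link up to its parent element. The listing markup and its class names are invented for the example, and the code again assumes Python's standard-library xml.etree.ElementTree, whose XPath subset happens to cover these particular predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing; the class names and URLs are invented.
LISTING = """
<ul>
  <li class="book"><a href="/b/1">First Title</a><span>In stock</span></li>
  <li class="book"><a href="/b/2">Second Title</a><span>Sold out</span></li>
  <li class="ad"><a href="/promo">Sponsored</a><span>In stock</span></li>
</ul>
"""

root = ET.fromstring(LISTING.strip())

# One expression, two conditions: li elements whose class is "book" AND
# that have a <span> child whose text is "In stock"; then take their links.
links = root.findall(".//li[@class='book'][span='In stock']/a")
in_stock_titles = [a.text for a in links]

# ".." steps up to the parent: the li that contains a particular link.
parent = root.find(".//a[@href='/b/2']/..")

print(in_stock_titles)
print(parent.tag, parent.get("class"))
```

Filtering on an element's text and walking up to a parent are the kinds of steps that CSS selectors have traditionally had no direct way to express.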

Website Changes, XPaths and Scraper Maintenance

If you have worked with scrapers or web scraping services, you might already know about the “maintenance” for scrapers. You may already have heard developers complaining:

“The website changed their design and the XPaths have changed”.

XPaths change when a website changes the way its HTML is structured. It is just like rearranging a library: every time a library is rearranged, the location of a book might change, so the librarians have to update the location on the catalog card or else no one will be able to find the book.

Similarly, the web scraper code has to be updated every time a website changes its structure. It is also possible that the structure of the website stays the same but the visual design changes. In that case the XPaths may not change and the scraper code may still work as before.
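The difference between a visual change and a structural change can be sketched in a few lines. Everything here is illustrative: the three page variants are invented, extract_heading is a hypothetical helper, and the path is written in the XPath subset of Python's standard-library xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# The "catalog card" this hypothetical scraper relies on.
HEADING_PATH = ".//h1/b"


def extract_heading(page):
    """Return the heading text, or None if the path no longer matches."""
    node = ET.fromstring(page).find(HEADING_PATH)
    return node.text if node is not None else None


original = "<html><body><h1><b>Best Books of the Month</b></h1></body></html>"
# Visual redesign only: a new style attribute, same tag structure.
restyled = (
    '<html><body><h1 style="color: navy">'
    "<b>Best Books of the Month</b></h1></body></html>"
)
# Structural change: the heading is now an h2 inside a div, with no <b>.
restructured = (
    "<html><body><div><h2>Best Books of the Month</h2></div></body></html>"
)

results = {
    "original": extract_heading(original),
    "restyled": extract_heading(restyled),
    "restructured": extract_heading(restructured),
}
print(results)
```

The restyled page still yields the heading because the tags are where the path expects them, while the restructured page yields None until the path is updated. That update is the maintenance work described above.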

The scope of this article is limited to explaining what XPaths are and why they are relevant in web scraping. We will be covering CSS selectors and other scraping-related terms in the future.


Originally published at learn.scrapehero.com on April 27, 2016.
