What Is XPath and How to Use It in Octoparse?

Octoparse
Published in DataSeries
12 min read · Apr 14, 2020

XPath plays an important role when you use Octoparse to scrape data. Rewriting an XPath can help you deal with missing pages, missing data, or duplicates. While XPath may look intimidating at first, it need not be. In this article, I will briefly introduce XPath and, more importantly, show you how to use it to build accurate tasks that fetch the data you need.

1. What is XPath

XPath (XML Path Language) is a query language for selecting elements from an XML/HTML document. It can help you find an element anywhere in the document precisely and quickly.

Web pages are generally written in a language called HTML. If you load a web page in a browser (Chrome, Firefox, etc.), you can inspect the corresponding HTML by pressing the F12 key to open the developer tools. Everything you see on the web page can be found within the HTML: images, blocks of text, links, menus, and so on.

XPath is one of the most commonly used languages for locating an element in an HTML doc. You can think of it as the “path” that leads to the target element within the HTML doc.
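To make the “path” idea concrete, here is a minimal sketch in Python. It is only an illustration and makes several assumptions: the HTML fragment, class names, and product values are hypothetical, and it uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. Octoparse evaluates XPath with its own engine, so treat this as a way to see how a path maps to elements, not as how Octoparse works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML fragment (not taken from a real page).
HTML = """
<html>
  <body>
    <div class="product">
      <h2>Blue Kettle</h2>
      <span class="price">$24.99</span>
    </div>
    <div class="product">
      <h2>Red Toaster</h2>
      <span class="price">$39.50</span>
    </div>
  </body>
</html>
"""

# The parsed root is the <html> element.
root = ET.fromstring(HTML)

# Walk the tree level by level: html -> body -> div -> h2.
titles = [h2.text for h2 in root.findall("./body/div/h2")]

# Search at any depth, keeping only <span> elements whose class is "price".
prices = [span.text for span in root.findall(".//span[@class='price']")]

print(titles)
print(prices)
```

The first expression spells out every step from the root, while the second uses `//` to skip the intermediate levels and an attribute filter to narrow the match. Both styles identify elements by their position and attributes in the HTML tree.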

To further explain how XPath works, let’s look at an example.

The image shows part of an HTML doc.
