Each year, more companies adopt web scraping tools as part of their business intelligence and analytics, which helps them become more competitive and profitable.
You should always check whether you can extract data from a website before scraping it. Here is a checklist containing 5 things to consider before doing web scraping.
So you’ve found a website that you can scrape. Most likely, you will want to extract data from certain HTML elements, or from elements with specific classes or IDs.
Advanced locator strategies such as CSS selectors and XPath can both find almost any HTML element on a web page.
Cascading Style Sheets (CSS) is a style sheet language used for describing the look and formatting of a document written in HTML or XML.
CSS Selectors are patterns used to select the element(s) you want to style.
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. Locating elements with XPath works very well and offers a lot of flexibility.
XPath uses path expressions to navigate through elements and attributes in an XML document.
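To make the idea of path expressions concrete, here is a minimal Python sketch using only the standard library. The sample catalog markup is made up for illustration, and xml.etree.ElementTree implements only a subset of XPath, so this shows the flavour of the syntax rather than the full language.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
DOC = """
<catalog>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>
""".strip()
root = ET.fromstring(DOC)

# Child steps: every <title> under a <book> directly under the root.
titles = [t.text for t in root.findall("./book/title")]

# Attribute predicate: the <book> whose id attribute equals "b2".
second = root.find("./book[@id='b2']")

# Positional predicate: the first <book> child (XPath positions start at 1).
first = root.find("./book[1]")

print(titles, second.get("id"), first.get("id"))
```

Each step in the path narrows the selection: first by element name, then by an optional predicate in square brackets.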
Let’s look at the following HTML code.
<div><p class="dataflowkit expandable">Some text in Paragraph</p></div>
To match the <p> tag with a CSS selector, we can write something like this:
div > p.dataflowkit.expandable
A corresponding XPath locator looks like:
//div/p[@class="dataflowkit expandable"]
CSS selectors are better to use when dealing with classes, IDs and tag names. They are shorter and easier to read.
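As a rough illustration of why the CSS form is convenient for classes, here is a minimal Python sketch on the <p class="dataflowkit expandable"> snippet. It assumes the snippet parses as XML. The standard library has no CSS engine, so the class selector p.dataflowkit is emulated by hand with a hypothetical helper, has_class.

```python
import xml.etree.ElementTree as ET

HTML = '<div><p class="dataflowkit expandable">Some text in Paragraph</p></div>'
root = ET.fromstring(HTML)

# XPath-style locator: compares the whole class attribute string exactly.
xpath_hits = root.findall(".//p[@class='dataflowkit expandable']")


def has_class(element, name):
    """Hypothetical helper: mimic a CSS class selector, which matches one
    whitespace-separated token of the class attribute."""
    return name in element.get("class", "").split()


# Emulates the CSS selector p.dataflowkit (ElementTree has no CSS engine).
css_like_hits = [p for p in root.iter("p") if has_class(p, "dataflowkit")]

print([p.text for p in xpath_hits])
print([p.text for p in css_like_hits])
```

The exact-string comparison stops matching if the class order changes or another class is added, while the token-based check keeps working. That is one reason class selectors tend to be shorter and sturdier in CSS.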
Let’s look at another HTML snippet.
<p> First </p><p> Second </p><p> Third. Some text in Paragraph </p>
The XPath locator for getting the content of the third <p> tag is:
//p[contains(text(), 'Some text in Paragraph')]
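To see what this expression selects, the following Python sketch reproduces it with the standard library. xml.etree.ElementTree supports only a limited XPath subset and does not implement contains(), so the text test is written as an ordinary substring check. The wrapping <body> element is an assumption, added only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# The three paragraphs from the example, wrapped in one root element
# (an assumption made so the fragment parses as XML).
HTML = (
    "<body><p> First </p><p> Second </p>"
    "<p> Third. Some text in Paragraph </p></body>"
)
root = ET.fromstring(HTML)

NEEDLE = "Some text in Paragraph"

# Stand-in for //p[contains(text(), 'Some text in Paragraph')]:
# visit every <p>, then keep those whose own text contains the needle.
matches = [p for p in root.iter("p") if NEEDLE in (p.text or "")]

for p in matches:
    print(p.text.strip())
```

A full XPath engine, such as the one built into browsers, evaluates the original expression directly without this workaround.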
How can we achieve the same result with a CSS selector?
It is not possible to match the content inside a <p> tag with a pure CSS selector.
There are no content selectors in the CSS3 specification. We can match on an element, the name of an attribute in the element, and the value of a named attribute in an element. There is nothing for matching content within an element, though.
But what if we need a complex query that takes the content of the target element into account? With standard CSS selectors alone, there is no way to do it, and XPath is the usual answer.
CSS selectors combined with jQuery extensions can be a good substitute for XPath in this case.
To get the content of the third <p> tag from the last example, we can use the jQuery :contains() selector:
p:contains('Some text in Paragraph')
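One detail worth knowing is that jQuery documents :contains() as case-sensitive and as matching text that appears directly in the element or in any of its descendants. The Python sketch below imitates that behaviour with a hypothetical helper, jquery_contains, on made-up sample markup, and compares it with a check on the element's own leading text only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: in the second paragraph the text sits in a child <b>.
HTML = (
    "<body>"
    "<p>Some text in Paragraph</p>"
    "<p>Intro: <b>Some text in Paragraph</b></p>"
    "</body>"
)
root = ET.fromstring(HTML)

NEEDLE = "Some text in Paragraph"


def jquery_contains(element, needle):
    """Hypothetical helper imitating :contains(): case-sensitive substring
    search over the element's text and all of its descendants' text."""
    return needle in "".join(element.itertext())


# Imitates p:contains('Some text in Paragraph').
contains_hits = [p for p in root.iter("p") if jquery_contains(p, NEEDLE)]

# Checks only the text that comes before any child element.
own_text_hits = [p for p in root.iter("p") if NEEDLE in (p.text or "")]

print(len(contains_hits), len(own_text_hits))
```

Because descendant text counts, :contains() can match more elements than a check on an element's own text, so it is worth keeping the rest of the selector as specific as possible.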
Alternatively you can consider
Brief side-by-side comparison of CSS3 Selectors and XPath Expressions.
The table below is adapted from this article.
Use CSS Selectors for simple queries based on the attributes of the element. CSS Selectors tend to be faster and more reliable than XPath in most browsers.
For more complex queries, such as matching on an element’s content, which CSS Selectors cannot do, use XPath or jQuery selectors.
Dataflow kit Selectors
Dataflow kit is a web data extraction service that requires no coding skills. We use CSS Selectors + jQuery to specify the HTML elements to scrape data from. In most cases, it is enough to point at and select the needed elements on a loaded page to scrape data.
Useful resources related to XPath and CSS Selectors.
http://cssify.appspot.com/ Online XPath to CSS translator
https://css2xpath.github.io/ CSS to XPath Online converter
ChroPath — Edit, inspect & verify XPath & CSS selectors in devtools panel