Screen Grab of the Mozenda Scraper Refining Captured Text with Regex

For developers working on getting business data from the web, there is almost always a need to perform data parsing on web content.

In this post, I want to share my benchmarking of the two basic techniques used in web scraping to parse the scraped data: Regex* and XPath.

*Regex works as a pattern applied to any text (incl. html) to fetch the matching pieces of content, while XPath (similar to a CSS path) traverses the html document’s DOM to select and fetch the matching nodes.
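To make the difference concrete, here is a minimal sketch that pulls the same links out of a tiny fragment with both techniques. The fragment and the variable names are made up for illustration and are not taken from the real page.

```php
<?php
// Tiny hypothetical fragment, not taken from the real page.
$html = '<ul><li><a href="/a.html">First</a></li><li><a href="/b.html">Second</a></li></ul>';

// Regex: the HTML is treated as plain text and a pattern picks out the matches.
preg_match_all('~<a href="([^"]+)">([^<]*)</a>~', $html, $m);
$regexLinks  = $m[1];  // captured href values
$regexTitles = $m[2];  // captured link texts

// XPath: the HTML is parsed into a DOM tree first, then nodes are selected.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$xpathLinks = [];
foreach ($xpath->query('//li/a/@href') as $attr) {
    $xpathLinks[] = $attr->nodeValue;
}

print_r($regexLinks);
print_r($xpathLinks);
```

On a fragment this clean, both approaches return the same two links. The differences show up on real pages, where the markup is noisy.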

We will parse sample data server-side with PHP, look at how complex each technique is to apply, and compare their time cost.

Let’s take a simple mobile.de VW search result page. Its raw html, though hardly readable, is here: view-source:http://suchen.mobile.de/auto/volkswagen.html

REGEX TECHNIQUE

Suppose we want to get title, link, and price for each item.

Let’s look at the page’s html and through developer tools (F12 or Ctrl+Shift+I) find an html content piece pertaining to a single list item. For the sake of simplicity I’ve distilled the real html into the following snippet:

<div class="vehicleDetails vehicleDetailsMain" data-id="201841567" data-href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1">
<div class="topOfPageTitle"> <a class="infoLink detailsViewLink" href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&negativeFeatures=EXPORT" rel="nofollow" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">Volkswagen Golf VI &quot; STYLE&quot; 1.6 TDI AHK, SHZ, GRA, Klima AL</a> </div>
<div class="topAdDesc"> <span class="commercial">Top Ad</span> </div>
<div class="imageWrapper extraMargin">
<div class="image"> <span></span> <a href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&action=topOfPage&negativeFeatures=EXPORT" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')"> <img alt="Vehicle picture" width="150" src="http://i.ebayimg.com/00/s/NDgwWDY0MA==/z/wh4AAOSwuMFUZ3~J/$_18.JPG" /> </a> </div>
<div class="stackBoxOne"></div> <div class="stackBoxTwo"></div> <div class="video"><span>Video</span></div>
</div>
<div class="description">
<div class="list"> <span title="Saloon / Used vehicle German edition New HU Manual gearbox 77&nbsp;kW (105&nbsp;PS), Diesel">Saloon / Used vehicle German edition New HU…</span> </div>
<div class="fuelConsumption"> <div>Fuel consumption combined:<br/>ca 4.7 l/100 km **</div> <div>CO<sub>2</sub>-Emissions combined:<br/>ca 123 g/km **</div> </div>
<div class="dealerdata"> <div class="address"> Autohaus Gartner GmbH &amp; Co.KG,83549 Eiselfing </div> <div class="phoneNumber"> Phone&nbsp;+49 (0)8071 92030 </div> </div>
<div class="rightSideColumns"> <div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div> <div class="priceSecondaryCountryOfSale priceNet"></div> <div class="pricePrimaryCountryOfOrigin"></div>

Here goes the PHP code to apply regex for each of the item attributes:

<?php

// The raw page HTML goes here (elided in this post). A nowdoc (<<<'EOD') is used
// so that "$" sequences inside the HTML, such as "$_18.JPG", are not interpolated.
$html = <<<'EOD'
<!DOCTYPE html>
EOD;

// Item links: the data-href attribute that follows data-id
echo '<h2>All links </h2>';
$patternHref = '/data-id="\d+" data-href="([^"]+?)"/';
preg_match_all($patternHref, $html, $links);
$i = 1;
foreach ($links[1] as $link) {
    echo '<br>', $i++, '. ', $link;
}

// Item titles: the text of the details link
echo '<h2>All titles </h2>';
$patternTitle = '~<a class="infoLink detailsViewLink"[^>]*?>(.*?)</a>~';
preg_match_all($patternTitle, $html, $titles);
$i = 1;
foreach ($titles[1] as $title) {
    echo '<br>', $i++, '. ', $title;
}

// Item prices: the text of the gross price div
echo '<h2>All prices </h2>';
$patternPrice = '~<div class="pricePrimaryCountryOfSale priceGross">([^<]*?)</div>~';
preg_match_all($patternPrice, $html, $prices);
$i = 1;
foreach ($prices[1] as $price) {
    echo '<br>', $i++, '. ', $price;
}

The main problem with Regex parsing is inconsistency. Of the 3 parsed sets, the last one (prices) includes the prices of 3 extra items that come from an ad inserted into the result page. While the values of the first 2 sets (columns) might line up with each other, the values of the third set can hardly be matched to particular items.

I’ve benchmarked the code for 5000 iterations, and the whole run took just over 9 seconds. The time cost of a single item parse is fair: about 0.002 second.
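For readers who want to repeat the measurement, a timing loop along these lines would do. This is only a sketch: the benchmark() helper is a hypothetical name of mine, not a built-in function, and the input is a stand-in string rather than the real page.

```php
<?php
// Hypothetical helper: runs $task $iterations times and returns
// [total seconds, average seconds per run].
function benchmark(callable $task, int $iterations): array
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }
    $total = microtime(true) - $start;
    return [$total, $total / $iterations];
}

// Stand-in input; a real run would use the full page HTML.
$html = str_repeat('<div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div>', 20);
$pattern = '~<div class="pricePrimaryCountryOfSale priceGross">([^<]*)</div>~';

[$total, $average] = benchmark(function () use ($html, $pattern) {
    preg_match_all($pattern, $html, $prices);
}, 5000);

printf("total: %.3f s, per parse: %.6f s\n", $total, $average);
```

The numbers depend on the machine and on the size of the page, so they are only comparable when both techniques are timed on the same input.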

As for complexity: as an experienced regex composer, it took me about 1.5–2 hours to analyze the scraped html, compose the patterns, and test them.

XPath TECHNIQUE

Now we come to the XPath technique. It is applied to XML/XHTML docs, so we first need to clean up the raw html.

Preparing html as a strict XML

For parsing raw data, the XPath technique is far more advanced than Regex. Strictly speaking, XPath is applied to XML/XHTML docs, so here are the steps to use it with raw html:

  • remove the broken pieces from raw html content
  • make html content a DOM structure (XML document)

See the following code, which performs the two above-mentioned steps:

<?php

// The raw page HTML goes here (elided in this post); the nowdoc avoids "$" interpolation
$html = <<<'EOD'
<!DOCTYPE html>
EOD;

// Here we remove unwanted html entities and tags
$html = str_replace('&nbsp;', ' ', $html);
$html = str_replace('<br/>', ' ', $html);
$html = str_replace('<noindex>', '', $html);
$html = str_replace('</noindex>', '', $html);
$html = str_replace('noindex', '', $html);

// We suppress libxml warnings about malformed markup
libxml_use_internal_errors(true);

// Load the text into a DOMDocument for further parsing
$DOM = new DOMDocument('1.0', 'UTF-8');
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />' . $html);

// Create the DOMXPath object and allow PHP functions inside XPath expressions
$xpath = new DOMXPath($DOM);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
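Suppressing the libxml errors keeps the output clean, but while tuning the cleanup step it can help to look at what was suppressed. Here is a small sketch on a deliberately malformed, made-up fragment. Which messages appear, if any, depends on the libxml version installed.

```php
<?php
// Deliberately malformed, hypothetical fragment (unknown tag, unclosed <p>).
$html = '<div><noindex><p>11,400 EUR</noindex></div>';

// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer before parsing the next page.
libxml_clear_errors();
```

If a tag keeps showing up in this list, it is a candidate for the str_replace cleanup step.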

Traversing with XPath

Let’s use the developer tools to find the XPath notations for the item attributes.

The title XPath notation is:

//div[@class="listEntryTitle"]/a[@class="infoLink detailsViewLink"]/text()

The XPath notation for the href attribute of the ‘a’ tag is:

//div[@class="listEntryTitle"]/a[@class="infoLink detailsViewLink"]/@href

The price nodes XPath notation is:

//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()

XPath selects all the nodes that match a given notation, so with one query we get all of them and are able to store them. The best way to fetch them in a structured way is to iterate over them. See the following function, which does that:

<?php

function get_cars_from_xpath_object($xpath) {
    $prices = $xpath->query('//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()');
    $titles = $xpath->query('//div[@class="listEntryTitle"]/a[@class="infoLink detailsViewLink"]/text()');
    $hrefs = $xpath->query('//div[@class="listEntryTitle"]/a[@class="infoLink detailsViewLink"]/@href');

    $cars = array();

    // Pair the three node lists by position; stop at the shortest list
    // so that item($i) never returns null
    $count = min($prices->length, $titles->length, $hrefs->length);
    for ($i = 0; $i < $count; $i++) {
        $cars[] = array(
            'title' => $titles->item($i)->nodeValue,
            'href'  => $hrefs->item($i)->nodeValue,
            'price' => $prices->item($i)->nodeValue,
        );
    }

    return $cars;
}
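The function above pairs the three node lists by position, which works as long as every item has all three attributes. An alternative is to select each item container once and then query inside it. The sketch below uses simplified, made-up markup: the class names mirror the ones seen on the page, but the nesting is my assumption.

```php
<?php
// Hypothetical, simplified markup: the class names mirror the ones used
// on the page, but the nesting is an assumption made for this sketch.
$html = <<<'EOD'
<div class="vehicleDetails">
  <div class="listEntryTitle"><a class="infoLink detailsViewLink" href="/car-1.html">Golf VI</a></div>
  <div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div>
</div>
<div class="vehicleDetails">
  <div class="listEntryTitle"><a class="infoLink detailsViewLink" href="/car-2.html">Polo</a></div>
</div>
EOD;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cars = [];
// Select each item container once, then query relative to it (note the leading dot).
foreach ($xpath->query('//div[@class="vehicleDetails"]') as $item) {
    $cars[] = [
        'title' => $xpath->evaluate('string(.//div[@class="listEntryTitle"]/a)', $item),
        'href'  => $xpath->evaluate('string(.//div[@class="listEntryTitle"]/a/@href)', $item),
        // contains() tolerates extra classes on the price element
        'price' => $xpath->evaluate('string(.//div[contains(@class, "priceGross")])', $item),
    ];
}

print_r($cars);
```

Here the second item has no price, so its price comes back as an empty string instead of shifting the remaining prices up by one position.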

If we want to benchmark XPath parsing, we need to iterate over the whole process of (1) removing the broken pieces from the raw html, (2) building the DOM structure, and (3) traversing it by XPath notation to fetch the nodes.

The 500 iterations, each including the removal of unwanted html content and the conversion to a DOM structure, took 10.5 seconds, or about 0.02 seconds for each single item info parse.

The result is rather unexpected, but we have to include the html preparation and the DOM building in each iteration: in real web scraping, a scraper goes over paginated data and has to repeat these steps for every page.

Composing and testing the XPath notations is no more complex than composing and testing the regex patterns. The XPath technique is therefore the most common way of getting scraped data into business directories: because it selects nodes from a structured document, it almost always returns values that are related to each other and thus reliable.

Conclusion

The Regex and XPath benchmarking has shown the XPath technique to be superior to the Regex technique for parsing scraped data in that it is more precise and follows the document structure, even though it was about ten times slower per parse in this test (0.02 vs. 0.002 second). Indeed, it is XPath rather than Regex that is widely used for parsing scraped data. The only caveat is that, when using XPath, the html content must be for the most part well structured so that the XML library is able to read and understand it.

The average time for a single item parse (seconds)

Regex: 0.002
XPath: 0.02

For more information on Web Scraping visit my company Mozenda.com