Detecting and blocking ParseHub scrapers

My last post concentrated on how to detect when someone using dexi.io is trying to scrape your site. While looking at alternatives to dexi.io, I found a scraping tool called ParseHub. While dexi is a web application, ParseHub works as a Firefox add-on. It seems to be a powerful tool that supports complex sites with heavy JavaScript use.

After playing with ParseHub for a bit, I learned the following about how it works:

  1. ParseHub works as a Firefox extension. Firefox extensions provide a lot of control over the browser.
  2. It doesn’t seem to inject JavaScript code into the web page. All of its event handling seems to be done at the extension level. Style modifications to the page (for example, when you select items to extract) also appear to be done outside the page. This means there is no injected code or styling in the page that we could look for.
  3. ParseHub blocks all user-generated events in the page, because it needs them to drive the scraper creation. Not only clicks are blocked but also mouseovers, keyboard input, etc.

Point 3 means that if we know the user is actively using the page (for example, moving the mouse over the body) but no events are firing, then the events are being blocked.

But how can we detect that the user has moved the mouse into the body of the page if events are being blocked? The answer is CSS. In particular, the “:hover” selector can be used: http://www.w3schools.com/cssref/sel_hover.asp

This selector works independently of JavaScript events. The strategy to detect ParseHub is the following (note that this assumes your site uses JavaScript to retrieve or display content; if not, there’s not much point in using it):

  1. Set a “:hover” rule on the body of the page that applies a style (something that barely changes how the page looks but can be identified later).
  2. Set a function to run at a regular interval. This function checks the computed styles of the body to see whether the style applied by the hover rule is present. If it is, the mouse has entered the body, so set a variable that records that the body was hovered to true.
  3. Set a mouseover event handler on the body. As we already know, ParseHub will prevent this event. If the event does fire, set a second variable to true.
  4. With ParseHub, at some point the variable tied to the CSS hover will be true while the variable tied to the mouseover event is still false. This won’t happen in a normal browser.
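
The four steps above can be sketched roughly as follows. This is only an illustration, not the code of the proof-of-concept page: every name in it (MARKER_RULE, createDetector, installInBrowser) is hypothetical, the choice of outline-color as the marker style is an assumption (with no outline style set it should not change how the page looks), and the 200 ms interval and the grace period of three checks are arbitrary values picked to reduce false positives.

```javascript
// Hypothetical marker: an outline colour that is only applied while the
// body is hovered. With no outline-style set, nothing visible should change.
const MARKER_RULE = 'body:hover { outline-color: rgb(1, 2, 3); }';
const MARKER_VALUE = 'rgb(1, 2, 3)';

// Steps 2 to 4: track the two flags and decide when they disagree.
function createDetector() {
  const state = { cssHovered: false, mouseoverFired: false, ticksSinceHover: 0 };
  return {
    // Called on every interval tick with the body's computed marker style.
    recordComputedStyle(value) {
      if (value === MARKER_VALUE) state.cssHovered = true;
      if (state.cssHovered) state.ticksSinceHover += 1;
    },
    // Called from the mouseover handler (never reached if events are blocked).
    recordMouseover() {
      state.mouseoverFired = true;
    },
    // True when CSS saw a hover but no mouseover arrived within graceTicks checks.
    isSuspicious(graceTicks = 3) {
      return state.cssHovered &&
        !state.mouseoverFired &&
        state.ticksSinceHover >= graceTicks;
    },
  };
}

// Browser wiring (steps 1 to 3). Call this once the body exists.
function installInBrowser(onSuspicious) {
  const style = document.createElement('style');
  style.textContent = MARKER_RULE;
  document.head.appendChild(style);

  const detector = createDetector();
  document.body.addEventListener('mouseover', () => detector.recordMouseover());

  const timer = setInterval(() => {
    detector.recordComputedStyle(getComputedStyle(document.body).outlineColor);
    if (detector.isSuspicious()) {
      clearInterval(timer);
      onSuspicious();
    }
  }, 200); // polling interval is an assumption
}
```

On a page you would call something like installInBrowser(function () { /* stop loading content */ }) after the document has loaded. What to do once the mismatch is detected is up to the site.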

As a proof of concept, I made the following page, which is able to detect when you use it with ParseHub: