Detecting and blocking ParseHub scrapers
After playing with ParseHub for a bit, I found out the following about how it works:
- ParseHub works as a Firefox extension. Firefox extensions have a lot of control over the browser.
- ParseHub prevents every human-triggered event on the page, because it needs those events itself to drive the scraper creation. Not only clicks are prevented, but also mouseovers, keyboard input, etc.
The last point means that if we know the user is actively using the page (for example, moving the mouse over the body) but no events are being triggered, the events must be getting blocked.
But how can we detect that the user has moved the mouse into the body of the page if events are blocked? The answer is CSS selectors. In particular, the “hover” CSS selector (the :hover pseudo-class) can be used: http://www.w3schools.com/cssref/sel_hover.asp
- Add a “hover” rule for the body of the page that sets a style (something that barely changes how the page looks but can be identified later).
- Set a function to run at a regular interval. This function checks the computed styles of the body to see whether the particular style set by the hover rule is applied. If it is, the user has moved the mouse into the body, so set a variable recording that the body was hovered to true.
- Set a mouseover event listener on the body. As we already know, ParseHub will prevent this event. If the event does fire, set another variable recording it to true.
- At some point while using ParseHub, the variable for the CSS hover will be true while the variable for the mouseover event is still false. This won’t happen in a normal browser.
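The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code of the proof-of-concept page: the marker style (an outline-color on a body that has no outline drawn), the names createHoverDetector and installDetector, and the 500 ms polling interval are all assumptions chosen for the example.

```javascript
// Hypothetical marker: a style that only applies while the body is hovered.
// The outline colour is not drawn when there is no outline, but it still
// shows up in the computed styles.
const HOVER_MARKER_CSS = 'body:hover { outline-color: rgb(1, 2, 3); }';
const HOVER_MARKER_VALUE = 'rgb(1, 2, 3)';

// Keeps the two variables from the steps above and compares them.
function createHoverDetector() {
  const state = { cssHovered: false, mouseoverFired: false };
  return {
    // Called by the polling function with the body's computed outline-color.
    recordComputedStyle(value) {
      if (value === HOVER_MARKER_VALUE) state.cssHovered = true;
    },
    // Called by the mouseover listener (never runs if the event is blocked).
    recordMouseover() {
      state.mouseoverFired = true;
    },
    // True when CSS saw the hover but the mouseover event never arrived.
    isSuspicious() {
      return state.cssHovered && !state.mouseoverFired;
    },
  };
}

// Browser wiring (assumed names): injects the rule, listens for mouseover,
// and polls the computed style at a regular interval.
function installDetector(onSuspicious, intervalMs = 500) {
  const style = document.createElement('style');
  style.textContent = HOVER_MARKER_CSS;
  document.head.appendChild(style);

  const detector = createHoverDetector();
  document.body.addEventListener('mouseover', () => detector.recordMouseover());

  const timer = setInterval(() => {
    detector.recordComputedStyle(getComputedStyle(document.body).outlineColor);
    if (detector.isSuspicious()) {
      clearInterval(timer);
      onSuspicious();
    }
  }, intervalMs);
  return detector;
}
```

In a page this would be started with something like installDetector(() => console.log('events look blocked')). A real check would probably want the mismatch to persist for several ticks before reacting, since a page can load with the pointer already resting over the body before any mouseover event has fired.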
As a proof of concept, I made the following page, which is able to detect when you open it with ParseHub: