Prevent scraping with just CSS and plaintext

Beto Ayesa
3 min read · Apr 23, 2016


With enough time, proxies, fake headers, etc., someone will eventually be scraping my content every day. Natzar.co is a 1 page = 1 city site: with one request you get all the content for one day. You don’t need to crawl 1,000 pages; the entire site is one page per city. For the moment I only show events from today and tomorrow.

We want them to spend time finishing a scraper, then having to start over again and again… until they give up or get blacklisted.

Adding confusing extra content

To dissuade scrapers, I will generate perfectly plausible fake events inside spans on the same line, hidden from humans thanks to CSS, but not from robots.

Now: <span class="event-title">Jazz Trio</span>

After:

<span class="event-title x1">Jazz Trio</span><span class="event-title x2">Salsa Lessons</span><span class="event-title x3">Techno live set</span>

Which one is the real one? :D

The x* classes will be randomly generated, and served from randomly generated CSS files.

Everything will keep changing: the CSS files, or the position of the CSS inside the page, so a robot will have to swallow them all without knowing which span carries the valid, correct information.
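To make the idea concrete, here is a minimal sketch in Python (my own illustration, not Natzar’s actual code; the class names and decoy titles are invented). It shuffles the real title in among decoys, gives every span a random class, and emits CSS that hides only the fakes:

import random
import string

def random_class():
    # Meaningless random name, so a scraper cannot key on the class
    return "x" + "".join(random.choices(string.ascii_lowercase, k=6))

def obfuscate(real_title, decoys):
    # Returns (html, css): one visible span with the real title,
    # plus hidden spans carrying plausible decoy titles.
    titles = [(real_title, True)] + [(d, False) for d in decoys]
    random.shuffle(titles)  # the real span lands at a random position
    html_parts, css_rules = [], []
    for title, is_real in titles:
        cls = random_class()
        html_parts.append('<span class="event-title %s">%s</span>' % (cls, title))
        if not is_real:
            # Only the decoys get display:none, in a separately served CSS file
            css_rules.append(".%s { display: none; }" % cls)
    return "".join(html_parts), "\n".join(css_rules)

html, css = obfuscate("Jazz Trio", ["Salsa Lessons", "Techno live set"])

Serving the CSS under a rotating file name means the hidden classes cannot be cached or hard-coded by a bot.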

It seems like a pretty straightforward solution, simple and easy to implement, yet I haven’t seen it on any popular site.

It’s true that if a scraper automatically fetches the CSS, wherever it lives, and checks which span classes are set to display: none, it will find the trick. But it forces everyone to spend more time crafting a bot specifically for Natzar, which should dissuade anyone without good skills.
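To be fair, that counter-move is not hard. Here is roughly what a determined scraper would write, a sketch assuming the decoys are hidden with plain display: none rules:

import re

def visible_titles(html, css):
    # Any class that some CSS rule sets to display:none marks a decoy
    hidden = set(re.findall(r"\.([\w-]+)\s*\{[^}]*display\s*:\s*none", css))
    real = []
    for classes, title in re.findall(r'<span class="([^"]+)">([^<]+)</span>', html):
        if not hidden.intersection(classes.split()):
            real.append(title)
    return real

Fed the html and css from the sketch above, it returns ['Jazz Trio']. That is exactly why the classes and file names have to keep rotating: the goal is not to make scraping impossible, just tedious.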

Detect bot activity

  1. IP blacklist. If a bot/scraper is detected, its IP is banned and no more content is served. Log IPs together with connection times to spot cronjobs (a sketch of both checks follows this list).
  2. Filter by user agent, and set a honeypot (anti-noobs).
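A minimal sketch of that detection layer, assuming a Flask app (my illustration; the agent list and thresholds are made-up placeholders):

import time
from collections import defaultdict, deque
from flask import Flask, abort, request

app = Flask(__name__)

BAD_AGENTS = ("curl", "wget", "python-requests", "scrapy")
blacklist = set()
hits = defaultdict(lambda: deque(maxlen=10))  # recent request times per IP

def looks_like_cronjob(times):
    # Near-identical spacing between requests suggests a scheduled job;
    # the 5-request minimum and 1-second tolerance are invented thresholds
    if len(times) < 5:
        return False
    ts = list(times)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return max(gaps) - min(gaps) < 1.0

@app.before_request
def detect_bots():
    ip = request.remote_addr
    if ip in blacklist:
        abort(403)  # banned: no more content is served
    agent = (request.headers.get("User-Agent") or "").lower()
    if any(bad in agent for bad in BAD_AGENTS):
        blacklist.add(ip)
        abort(403)
    hits[ip].append(time.time())
    if looks_like_cronjob(hits[ip]):
        blacklist.add(ip)
        abort(403)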

Dissuade bot developers

  1. Random combinations of HTML, JS, and CSS: random file names, tags, and CSS classes (don’t bother scraping by ids and classes).
  2. Combinations stay online for varying periods. We want them to spend time finishing a solution, then having to start over again and again… rotate assets at different frequencies: 1 week, 3 days, 1 month (see the rotation sketch after this list).
  3. Load all content with JavaScript (that alone will dissuade a bunch).
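Item 2 can be done without storing anything, by deriving every class name from a server-side secret plus the current rotation window. A sketch; SECRET and the period are placeholders:

import hashlib
import time

SECRET = "change-me"  # hypothetical server-side secret

def rotated_class(label, period_seconds=3 * 24 * 3600):
    # Same name for the whole rotation window (3 days here),
    # then every derived name changes at once when the window rolls over
    window = int(time.time() // period_seconds)
    digest = hashlib.sha256(("%s:%s:%d" % (SECRET, label, window)).encode()).hexdigest()
    return "x" + digest[:8]

Because the names are deterministic within a window, every page render and every generated CSS file agree on them without a database, and a scraper’s hard-coded selectors break on schedule.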

Most commonly used techniques to prevent scraping:

  • Setting up robots.txt: the least effective method for preventing scraping.
  • Filtering requests by user agent: merely stops new bots written by inexperienced scrapers, and only for a few hours.
  • Blacklisting the IP address: less than 2% of scraping bots were detected for one of our customers when we did a trial run.
  • Throwing up a CAPTCHA: annoying for users.
  • Honeypot or honey trap: honeypots are a brilliant trap mechanism for catching new bots (scrapers who are not well versed in the structure of every page). But search engine bots may visit these links and get trapped accidentally; search engines interpret them as dead, irrelevant, or fake links, and with enough such traps the ranking of the website drops considerably. In short, honeypots are risky business and must be handled very carefully (a safer version is sketched below).
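The search-engine risk is manageable if the trap URL is disallowed in robots.txt, so well-behaved crawlers never follow it, while the link stays hidden from humans with CSS (something like <a href="/specials-today" style="display:none">). A sketch, again assuming Flask; the trap URL is invented:

from flask import Flask, abort, request

app = Flask(__name__)
trapped = set()

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers respect the Disallow and never hit the trap,
    # so the honeypot cannot hurt search rankings
    body = "User-agent: *\nDisallow: /specials-today\n"
    return body, 200, {"Content-Type": "text/plain"}

@app.route("/specials-today")  # linked only from a CSS-hidden anchor
def honeypot():
    trapped.add(request.remote_addr)
    abort(403)

@app.before_request
def block_trapped():
    if request.remote_addr in trapped and request.path != "/robots.txt":
        abort(403)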

To summarize, the prevention strategies listed above are either weak or require constant monitoring and regular maintenance to stay effective. In practice, bots are far more challenging than they seem.

I have scraped all the big websites, and there seems to be no way to protect content from automated scraping techniques. When you use proxies and the right delays between requests, it is almost impossible to be detected.

About natzar.co

Natzar is the answer to “What to do today in Barcelona?”. It is a BETA version: I just built the first iteration, focusing on quality rather than quantity of events in Barcelona. When Barcelona is 100% complete, I want to continue with Madrid and Berlin. www.natzar.co

Thanks for reading, and I would love to hear your thoughts about using the content itself as an anti-scraping technique.
