Digital Creepy Crawlies are Not as Innocent as We Think!
If you have never analyzed your web server or website logs before, you might not be aware that the majority of all visitors to your website are creepy crawlies. By creepy crawlies, we mean web spiders and crawlers that scrape content from your website. Today, it is widely reported that almost 60% of all traffic on your website comes from creepy crawlies. We have been working on web traffic hacking over the last 6 months and we are starting to strongly agree with those reported numbers. However, what most average net citizens don't really know is that these creepy crawlies are not as innocent as we think.
In the good old days, web developers assumed that these creepy crawlies were necessary for the good of their website. Companies like Google, Yahoo, Microsoft and many others have convinced the entire Internet that these web spiders and crawlers are a necessary evil. They exist to make content searchable and indexable so that we can find the information we want more easily. Perhaps there was some truth in that propaganda, but it is definitely no longer true today. A well-known example of this ancient propaganda still exists today: the robots.txt file. Many today believe this to be a web standard for web servers or websites. But we really think this is an example of bad and outdated technology that needs to be eradicated. What the robots.txt file does is tell spiders and crawlers to stay away from certain content on your web server or website. However, there is nothing in this 'web standard' that enforces compliance. Hence, as an analogy, it is like sticking a big notice on the front of your house telling anyone entering that your door is not locked, but if you do enter, please stay away from the piggy bank upstairs in your 2nd bedroom. :(
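To make the point concrete, here is a minimal sketch of how a *well-behaved* crawler consults robots.txt, using Python's standard library `urllib.robotparser`. The key takeaway: compliance is entirely voluntary, since a hostile bot simply never runs this check. The rules and bot name below are illustrative, not from any real site.

```python
# Sketch: how a polite crawler consults robots.txt before fetching.
# Nothing in the protocol forces a crawler to run this check at all.
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: asks every bot to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks permission; an impolite one just fetches anyway.
print(parser.can_fetch("MyBot", "/public/page.html"))   # True
print(parser.can_fetch("MyBot", "/private/piggybank"))  # False
```

Note that `Disallow` is a request, not a lock: the server serves `/private/piggybank` to anyone who asks for it regardless of what the parser says.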
Web content or data is like the new digital crude oil. Everyone wants to crawl and archive it for their own benefit or, more specifically, profit $$$. Creepy crawlies have since evolved into web traffic spam and are often used to gather intelligence about web servers, the technology stack in use and website content. Some examples of what crawled site content is used for are ghost spamming, looking for vulnerabilities, stealing data, digital stalking, email spamming, forum spamming, gaming search engines (SEO) and the list goes on. This necessary evil needs to be contained before it becomes an epidemic. Currently, the only defense that web servers or sites have against these creepy crawlies is simple manual filters that block access.
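For readers who have never seen one, a manual filter of the kind mentioned above is usually just a blunt substring blocklist matched against the request's User-Agent header. The sketch below shows the idea; the bot names are made-up placeholders, not a vetted blocklist, and real bad bots routinely forge their User-Agent anyway.

```python
# Minimal sketch of a manual User-Agent blocklist filter.
# The agent substrings are illustrative placeholders only.
BLOCKED_AGENTS = ("scrapybot", "badspider", "contentharvester")

def is_blocked(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches the blocklist."""
    ua = user_agent.lower()
    return any(bad in ua for bad in BLOCKED_AGENTS)

print(is_blocked("Mozilla/5.0 (compatible; ScrapyBot/1.0)"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64)"))     # False
```

This is exactly why such filters are a weak defense: they only stop crawlers honest enough to identify themselves.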
This is an area that we are researching and developing with MB. Even though MB already manages to catch 98% of all web traffic spam found on our network, much more needs to be done to automatically identify and disarm these creepy crawlies before they can create trouble on the Internet. Our final bit of advice for this post is that you should never allow these creepy crawlies free rein over your web content and please, don't depend on robots.txt to help you enforce this important task, because you will become a big joke that everyone will talk about for years to come!
Originally published at blog.malleablebyte.org on August 18, 2015.