Are you truly confidential?

Unraveling Web Mining

Looking for a gaming laptop? The world knows about it!

Looking for weight gain measures? The world knows about it!

Then it suddenly hits you: only last night you searched, or rather “Googled”, those very topics. So who are the prime suspects? You never told anyone, yet related advertisements are popping up in every app and website you open. If you are a newbie to the awesome world of computer science and don’t know how our beloved World Wide Web works, then the villain goes incognito and you may never identify him! In this article, we will try to find the culprit.

“World” here refers to the world of internet also known as World Wide Web.

They gather your data, and lots of it: from the websites you visit, the searches you make, the apps you install, the questions you ask Siri, Cortana or Google Assistant, and the questions you post on various forums. A question arises: who are “they”? They are servers, set up by various companies that provide you with the so-called experience of personalized, fast browsing. You are engulfed in an armada of servers that serve many purposes, one of which is mentioned above. The whole process of tracking you and storing your information is known as Web Mining.

Each and every single activity you do on web can be monitored

Now let’s look at the different areas where you get “spied” on. We will cover the basic elements of Web Mining and how it is done, both the technical and non-technical aspects: more about these servers, which languages are used for the data transmission, who carries this data, and much more. Let’s begin.

Internet Ads and Junk Mails

This is the most common effect of Web Mining that we see in our daily life. Companies would have us believe that they are converting “what we see” into “what we want to see”.

This is done by placing a small text file known as a Cookie on our computers. The server has the power to retrieve it later, combine it with other cookies, and assemble a jigsaw picture of our online presence.
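The round trip is easier to see in code. Here is a minimal sketch using Python’s standard library; the cookie name `visitor_id` and its value are made-up examples:

```python
# A server attaches an identifier to its response via a Set-Cookie header;
# the browser stores it and sends it back with every later request.
# "visitor_id" is a made-up example name.
from http.cookies import SimpleCookie

# The server creates the cookie and hands it to the browser:
response = SimpleCookie()
response["visitor_id"] = "abc123"
response["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # remember for a year
print(response.output())  # emitted as a Set-Cookie response header

# On the next visit, the browser sends the value back in a Cookie header,
# and the server reassembles it to recognise you:
request = SimpleCookie("visitor_id=abc123")
print(request["visitor_id"].value)
```

That small string, matched against the logs of many sites, is what lets the jigsaw be assembled.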

Cookies are very efficient and simple to handle

Now, this paves the way for targeted, personalized ads, which are far more expensive than random ads, so websites earn a lot from them. This is the major reason most websites remain free. The same cookie is used to send you the emails that you very conveniently transfer straight to junk.

In a nutshell, here is what cookies give you:


  1. You get to see what you want to see.
  2. You get more choices and knowledge of similar things you wish to pursue.
  3. Perhaps the major reason that your favorite websites are free.

Who Uses this?

  1. All the major online shopping sites like Amazon, Flipkart, eBay, etc.
  2. Websites that let you watch videos, listen to songs and otherwise entertain you.

Should it be stopped?

Yes, if you want to pay for your favourite websites instead of selling your privacy.

Yes, if you would rather see random ads for weird stuff that you never seek.

Social Networking Sites

Social networks provide the richest data sets possible. Gnip, the largest provider of social data, stores more than 150 billion social activities each month. It offers companies a vast range of data from Twitter, Facebook, Tumblr, WordPress, YouTube, G+ and others.

As a matter of fact, Facebook and most other social networking sites never sell your personal information directly. The web miners are the ones scraping your not-so-private data and selling it to third-party companies, who love to use it to pop various kinds of ads and junk at you. Google keeps the data its various services extract about you in separate logs. One holds your G+ account information: name, age, email address, locations, and so on. The other log is associated with your computer and is anonymized every nine months: your search history, Chrome browser data, Google Maps requests, and everything its trackers and ad agencies (DoubleClick, AdSense, AdMob) collect as you browse other sites. Google says its motto is to organize the world’s information and make it universally accessible and useful (an awesome idea, but the harsh fact is that your information is also part of the world’s information).

How is this vast amount of data organised?

Data flows from social networks and internet companies to data brokers (yes, you heard that right: “brokers”, those creatures we hate dealing with). The broker combines this data to create valuable lists; this is the first step of organising it. The lists may be titled:

  1. Planning to buy a laptop.
  2. Searching for a vacation plan in Switzerland.
  3. Interested in Taylor Swift.
  4. Interested in Donald Trump.

You get the gist.

One of the largest data brokers, Acxiom, uses more than 500 phone directories to build a profile of you by pairing your name, phone number and address with census and housing information.

Facebook recently banned a data mining company because it mined people’s user IDs and tracked their interests to file them under its so-called lists.

Each of these pieces of information (and misinformation) about you is sold for about two-fifths of a cent to advertisers, who then deliver you an internet ad, send you a catalog or mail you a credit-card offer. This data is collected in many ways: tracking devices (like cookies) on websites that allow a company to identify you as you travel around the web, and apps you download on your phone that look at your contact list and location. You know how everything has seemed free for the past years? It wasn’t. It’s just that no one told you that instead of paying with money, you were paying with your personal information.

Who purchases my Data?

The destination of all this collecting, analysing and trading is the data user. These are mostly marketers and advertisers, but they may include fundraisers or non-profits. They buy or rent lists for a better understanding of their targets, segmented demographically by traits such as ethnicity, income, property value and hobbies. Data users don’t target you specifically; instead, they analyse a list to build what they call “profiles”. Marketers blanket a target area with advertising, or open a new store in an up-and-coming neighbourhood. These are expensive decisions, made all the easier by data.

Sounds interesting? I’ll tell you how it’s done.

Okay, let’s dive deep into the technical aspect.

These kinds of crawlers are hard to create unless you know exactly how the internet works and how to code. A crawler can log and store every bit of data on its way, and can even target specific data. Now, what’s a crawler? It’s basically a bot, and a bot is a program. Every search engine has one. These bots are the most popular data mining tools. Think of a bot as a different kind of browser: instead of grabbing a web page from a server and displaying the HTML on a user’s screen, the bot finds a page, grabs the information and logs it to a database.
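To make that concrete, here is a minimal sketch of what a bot does with a page once it has it: parse the HTML and collect every outgoing link to visit next. The page content and URLs below are canned examples; a real bot would have fetched the page from a server.

```python
# A toy crawler step: extract every link from a page's HTML.
# The page snippet and URLs are made-up examples.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href target of every anchor tag it sees."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/laptops">Laptops</a> <a href="https://example.com/deals">Deals</a>'
bot = LinkExtractor("https://shop.example.com/")
bot.feed(page)
print(bot.links)
```

Each discovered link goes into a queue, and the bot repeats the same fetch-and-parse step on it; that loop is the whole crawl.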

A bot usually runs on some triggering action. You can run the program manually, but most data mining bots run on a fixed schedule. You can schedule it for certain times of day, or trigger it on events such as finding a new website or link. You can also let your website’s visitors trigger the bot. Let me clarify with an example: a user signs up on your e-commerce page, and the request carries a referral URL showing how they found your site. After storing this URL, you can crawl it to mine data from it.

Using this information well is just as important as collecting it. A good database design preserves data integrity and avoids redundancy. It also affects performance, so unless you want your reports rendered in hours and your videos in weeks, make sure your database design is normalised and indexed.
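A sketch of what “normalised and indexed” might look like for crawl data, using SQLite from the standard library. The table and index names are made-up examples:

```python
# Pages and links live in separate tables (normalisation: a URL is stored
# once, not repeated on every link row), and an index speeds up lookups.
# All names here are hypothetical examples.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages (
    id  INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL              -- one row per page: no redundancy
);
CREATE TABLE links (
    source_id INTEGER NOT NULL REFERENCES pages(id),
    target_id INTEGER NOT NULL REFERENCES pages(id)
);
CREATE INDEX idx_links_source ON links(source_id);  -- fast "links from page X"
""")
```

Without the index, a query like “every link leaving page X” would scan the whole links table; with it, the lookup stays fast as the crawl grows.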

A bot’s complexity varies, but you should think of it the same way you think of your browser. First, the bot looks up the URL using DNS. DNS servers translate friendly domain names into IP addresses. The bot can then use the IP address to “find” the web server and website on the internet. Next, the bot can view and store the server headers. Server headers are set by your host, but if you have a dedicated server or VPS, you can set custom server headers.
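Both steps can be sketched with the standard library. The header text below is a canned example rather than a live server response, and the lookup uses `localhost` so the sketch works offline:

```python
# Step one: resolve a hostname to an IP address with a DNS lookup.
# Step two: parse the raw headers a server would send back.
# The header block is a made-up example response.
import socket
from email import message_from_string

ip = socket.gethostbyname("localhost")   # DNS: friendly name -> IP address
print(ip)                                # typically 127.0.0.1

raw_headers = """\
Server: nginx/1.18.0
Content-Type: text/html; charset=UTF-8
"""
headers = message_from_string(raw_headers)
print(headers["Server"])                 # hints at the host's software stack
```

A real bot would resolve the target site’s domain instead, then open a connection to that IP and read the headers off the actual response.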

Server headers tell you a few things about a site. First, they reveal the server’s OS. Second, they carry the server’s response codes. There are several response codes: for instance, “200” means the page was returned without any error, “404” means the page was not found, and “503” means the service is unavailable (usually scheduled downtime such as maintenance). These are some of the most common responses, but your bot needs to account for every server response.
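“Accounting for every response” usually means a branch per status family. The policy below is a made-up example of how a bot might react, not a standard:

```python
# A hypothetical dispatch on server response codes: what the bot should
# do next with the URL it just requested.
def handle_response(status):
    if status == 200:
        return "parse"             # page returned fine: go extract data
    if status == 404:
        return "drop"              # page not found: remove URL from the queue
    if status == 503:
        return "retry-later"       # service unavailable: back off, try again
    if 300 <= status < 400:
        return "follow-redirect"   # page moved: chase the new location
    return "log-and-skip"          # anything else: record it for inspection

print(handle_response(200))
print(handle_response(503))
```

The catch-all branch matters: a bot that only handles the codes its author expected will crash or loop on the first unusual server it meets.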

After you get the server response codes, you can grab the HTML. The HTML contains the markup and links, plus references to JavaScript and CSS files. You will need to crawl the HTML if you want any information about the frontend screen that a user sees in the browser.

One more issue to be aware of: site owners sometimes watch for bots and anonymous browsing. If you use too many of a website’s resources, the site owner or the host might block your bot. Be considerate with a bot and honour the robots.txt file. This file contains directives for bots: it is always stored in the site’s root, and it tells bots which directories and files the webmaster doesn’t want crawled. If you don’t honour it, you may be blocked by user agent or by IP address (so crawlers are not all-powerful; even they have to respect someone or risk punishment).
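The standard library ships a parser for exactly this file. The rules below are a made-up example, parsed from a string instead of fetched from a real site:

```python
# Checking robots.txt before crawling a URL. The rules and the bot name
# "MyBot" are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/public/page"))
print(parser.can_fetch("MyBot", "https://example.com/private/data"))
```

A polite bot calls `can_fetch` before every request and simply skips any URL the file disallows.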

Even if you have permission to crawl a site, you still shouldn’t push much traffic at it. Some hosts don’t have the resources to handle a large number of requests, and normal user traffic may suffer.

When a user clicks onto a website, a “session” begins. A session tracks you from the first page you open until you leave the site. Your IP address can give owners your approximate location, along with your computer’s hardware and software specs. Still, cookies provide a more complete profile of a user’s preferences.
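At its simplest, a session is just an ordered log of page views keyed by a visitor identifier (often the cookie from earlier). The identifiers and paths here are made-up examples:

```python
# A toy session tracker: every click is appended to the visitor's session,
# from first page to last. All names and paths are hypothetical.
sessions = {}

def record_click(visitor_id, page):
    """Start a session on the first click, then append each page view."""
    sessions.setdefault(visitor_id, []).append(page)

record_click("abc123", "/home")
record_click("abc123", "/laptops")
record_click("abc123", "/checkout")
print(sessions["abc123"])   # the visitor's full path through the site
```

That ordered path is what analytics tools mine: which pages lead to a purchase, and which pages make visitors leave.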

So, it seems you are in a lose-lose situation? What can you do now?

Instead of fighting them how about joining them?

Most people are actually willing to sell their data for rewards or money, and such is the pace of the market that some websites are now popping up to give you the necessary tools for this.

Datacoup is a New York based start-up that pays to access your data, including your social networks and credit card statements.

The company’s CEO, Matt Hogan, says that personal information is like valuable diamonds, created every day through purchases and internet usage. Such a system would help companies store everything from phone numbers to health records, all of which would help shape a cognitive world. Existing personal data vaults already offer this service; in future, data vaults could include features to open your data to companies, for a price of course.

In much the same way that companies gauge the strength of their branding by monitoring how you watch TV, the way you travel through the web is analysed and tabulated into statistical data. This data allows businesses large and small to develop new products, discover the shopping habits of their target markets and make important marketing decisions. On one hand, without access to this information, companies would struggle to determine the interests of their mainstream online audience. On the other hand, having your browsing monitored can make you feel as though your personal privacy is being invaded. However you feel about it, there are distinct ways companies collect your private information when you browse online, and it is important to know how they work.

In this article, I tried my best to give a jargon-free overview of various data mining processes and their importance. In future, I will cover the best crawler tools to actually start scraping a website, with lots of interesting stuff to follow. Leave a clap if you liked it, and follow me for updates on the next article.

Thanks for reading.