Data harvesting — what and why?

Social Media 2018
Social Media Writings
5 min read · Dec 11, 2019

We live in a society that is increasingly connected to the internet. Compared to the 1980s, when personal computers were only starting to reach homes and mobile phones would not become commonplace for more than a decade, we now voluntarily give away huge amounts of information through all sorts of devices. Currently this happens mostly through browsing the internet and using our mobile phones, but as technology advances and new devices emerge, more or less everything we do will get recorded by someone, usually a company or a government.

We give out a lot of information about ourselves while surfing the internet. While some of it is information we fill in ourselves, such as website account credentials, most of it is information we’re not consciously providing. And even when we are logged in to Facebook, for example, we probably aren’t actively approving the combining of our identity with the rest of our browsing data. Browsing data means more or less everything recordable that we do online, and this is the data companies are most interested in harvesting. As enough data builds up, and particularly when it is combined with some personal information, it can be used to, in a sense, “know” a person. That may sound weird, but I’ll try to explain. We’re going to have to establish some things first, though.

Data harvesting means, simply put, the gathering of online data. Online data isn’t an official term as far as I know, but it works well for describing any data that is available on the internet. This data doesn’t need to be about humans to be harvested; it could be anything that is available online. However, from a human standpoint, the data that is collected about us is the most interesting, and it already has a significant influence on our lives.

As most of us are connected throughout the day (who doesn’t have a mobile phone?), we are, for example, constantly allowing different companies to access and thus store our location. Who gets the data depends on the apps on our phones, but many applications keep access to our phones’ contents, such as files, contacts and location, once we have granted it, unless we specifically restrict those apps’ permissions. What the apps do with that access is largely up to the developers who created them.

When we open a web page, be it on a computer or a mobile phone, we connect to a server somewhere and send a request saying “Hey, I would like to see this page A you have there”. Most often we are then granted access and see the page on our screen. To display properly, the page uses quite a bit of information, such as the resolution of our screen, our browser name and so on. The page also sees whether the site has set any HTTP cookies, small files stored on our computer. The point is that any page we visit has a lot of information about us and our device by default.
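To make this a bit more concrete, here is a minimal Python sketch of what a server could read out of a single page request. The request text, the host `example.com`, the browser name and the cookie values are all made up for illustration, and `summarize_request` is a hypothetical helper, not part of any real server. Note that details such as screen resolution are typically read by scripts running on the page rather than sent in the request itself, so they don't appear here.

```python
from http.cookies import SimpleCookie

# A made-up raw HTTP request, roughly as a server might receive it.
RAW_REQUEST = (
    "GET /page-a HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0\r\n"
    "Accept-Language: fi-FI,fi;q=0.9,en;q=0.8\r\n"
    "Cookie: session_id=abc123; theme=dark\r\n"
    "\r\n"
)


def summarize_request(raw: str) -> dict:
    """Extract what a server learns from one plain page request."""
    # Everything before the first blank line is the request line plus headers.
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, path, _version = request_line.split(" ")

    # Header names are case-insensitive, so store them in lower case.
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Cookies previously set by this site come back with every request.
    cookies = SimpleCookie()
    cookies.load(headers.get("cookie", ""))

    return {
        "method": method,
        "path": path,
        "browser": headers.get("user-agent"),
        "languages": headers.get("accept-language"),
        "cookies": {key: morsel.value for key, morsel in cookies.items()},
    }


info = summarize_request(RAW_REQUEST)
print(info)
```

Even this tiny request reveals the page we asked for, our browser and operating system, our preferred languages and an identifier the site stored on an earlier visit.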

Now, none of that is a problem in itself. On the contrary, most of it is necessary for us to be able to browse the internet at all. However, the internet is full of tracking code whose purpose is, as the name suggests, to track us. By tracking I mean collecting anything we do, such as mouse movements over ads, time spent on certain pages and browsing history, to list a few. This data may at first seem insignificant, but continuously monitoring user behaviour for months and even years produces huge amounts of data, which can later be combined with other data sources and used to predict user behaviour.
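As a rough illustration of how small tracking events add up, the sketch below collects a few made-up events into a per-user profile. The `Event` fields, the user identifier and the page names are assumptions of mine; real trackers record far more and do it at a very different scale.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    user_id: str            # hypothetical identifier, e.g. read from a cookie
    page: str
    seconds_on_page: float
    hovered_ad: bool


def build_profiles(events):
    """Aggregate raw tracking events into one summary per user."""
    profiles = defaultdict(lambda: {"pages": defaultdict(float), "ad_hovers": 0})
    for event in events:
        profile = profiles[event.user_id]
        profile["pages"][event.page] += event.seconds_on_page
        profile["ad_hovers"] += int(event.hovered_ad)
    return profiles


# Three made-up events from the same visitor.
events = [
    Event("user-1", "/sports/nhl", 120.0, True),
    Event("user-1", "/news", 15.0, False),
    Event("user-1", "/sports/nhl", 300.0, True),
]

profiles = build_profiles(events)
pages = profiles["user-1"]["pages"]
top_page = max(pages, key=pages.get)  # the page this user spent most time on
print(top_page, profiles["user-1"]["ad_hovers"])
```

Each event says very little on its own, but the summary already hints at what this visitor cares about.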

I want to note at this point that although the above example is about surfing the web, data harvesting can happen with anything that is online: smart watches, smart cars, smart homes, smartphones and so on. Only the data that is sent, and thus the information that is collected, will be different. These devices may not have third-party trackers, at least not yet, but the companies providing the services can collect the data their devices produce. For example, if a smart watch tracks sleep, the company providing the service is technically able to use that data.

Let us assume that we are a company serving ads to people. Data from a single user alone doesn’t necessarily provide anything useful: knowing that person A supports the New Jersey Devils NHL team probably isn’t all that actionable. However, we have data collected from millions of users, and in that data there are thousands of other people who also support the Devils. For the sake of this example, our data sample happens to include 60 individuals who live in Finland and have bought Devils merchandise. We also know that all of these people have clicked on Viaplay adverts they were shown earlier. Person A also lives in Finland and has bought Devils merchandise, but has no advert history available. We don’t need to know why all the other people clicked on the Viaplay advert to predict, with reasonable confidence, that person A will click on it too. As humans we know (at least now) that they are all interested in Viaplay because Viaplay broadcasts NHL games in Finland, but the computer doesn’t need to know that; the strong correlation is enough. In this case, then, it would most likely be profitable to show person A an ad for Viaplay.
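The reasoning in this example can be sketched in a few lines of Python. Everything here is invented for illustration: the user records, the field names, the `click_rate` helper and the 0.8 threshold are my assumptions, and a real ad system would use far more elaborate models.

```python
def click_rate(users, country, team, ad):
    """Share of similar users (with ad history) who clicked the given ad.

    Returns None when no similar user has any ad history to learn from.
    """
    similar = [
        u for u in users
        if u["country"] == country
        and team in u["merch"]
        and u["clicked_ads"] is not None
    ]
    if not similar:
        return None
    clicked = sum(1 for u in similar if ad in u["clicked_ads"])
    return clicked / len(similar)


# 60 made-up Finnish Devils fans who all clicked a Viaplay ad ...
users = [
    {"country": "FI", "merch": {"Devils"}, "clicked_ads": {"Viaplay"}}
    for _ in range(60)
]
# ... plus 40 unrelated users who did not.
users += [
    {"country": "SE", "merch": {"Rangers"}, "clicked_ads": set()}
    for _ in range(40)
]

# Person A matches the group but has no advert history at all.
person_a = {"country": "FI", "merch": {"Devils"}, "clicked_ads": None}

rate = click_rate(users, person_a["country"], "Devils", "Viaplay")
show_ad = rate is not None and rate >= 0.8  # 0.8 is an arbitrary cut-off
print(rate, show_ad)
```

Notice that the code never knows why the fans clicked. It only counts how often people who look like person A did.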

So, in essence, all that data can be used to sell stuff. And sadly, that is what most data harvesting is currently about: more efficient advertising. Another, somewhat nastier example is the now quite famous Cambridge Analytica case. In essence, American and British voters were bombarded with millions of microtargeted ads, which likely helped the Trump and Brexit campaigns to win. The ads were targeted based on Facebook behaviour and a psychological model, which, coupled with a lot of other data, made for exceptionally efficient campaigns. It is a fascinating and terrifying story, and I strongly recommend reading the translation of the original investigation from 2016 on Vice.

The advert example I provided is massively simplified, but still valid. Add in hundreds of parameters and millions of data points, and a random internet user is no longer that random. With enough data every single one of us can be modeled, and the missing pieces of our puzzle can be calculated based on thousands of near-identical people who already have those pieces filled in. Such predictions may not always be correct, but each wrong guess is simply more feedback, which means the algorithms, the magical mathematical formulas driving all the conclusions, just got a bit better.
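One simple way to picture “filling in the missing pieces” is a nearest-neighbour vote: find the people who look most like us and copy the answer most of them share. The sketch below is only that, a toy version under my own assumptions; the attributes, the service names and the `predict_missing` helper are invented, and I’m not claiming any particular company works this way.

```python
from collections import Counter


def predict_missing(target, others, missing_key, k=3):
    """Guess target[missing_key] by majority vote among the k most similar users."""
    # Only compare on attributes we actually know about the target.
    known_keys = [
        key for key, value in target.items()
        if value is not None and key != missing_key
    ]

    def similarity(other):
        # Count how many known attributes match exactly.
        return sum(1 for key in known_keys if other.get(key) == target[key])

    candidates = [o for o in others if o.get(missing_key) is not None]
    if not candidates:
        return None
    nearest = sorted(candidates, key=similarity, reverse=True)[:k]
    votes = Counter(o[missing_key] for o in nearest)
    return votes.most_common(1)[0][0]


# Made-up users whose streaming service is already known.
others = [
    {"country": "FI", "team": "Devils", "age_group": "25-34", "streaming": "Viaplay"},
    {"country": "FI", "team": "Devils", "age_group": "35-44", "streaming": "Viaplay"},
    {"country": "FI", "team": "Devils", "age_group": "25-34", "streaming": "Viaplay"},
    {"country": "US", "team": "Devils", "age_group": "25-34", "streaming": "OtherService"},
    {"country": "SE", "team": "Rangers", "age_group": "45-54", "streaming": "OtherService"},
]

# The user whose streaming preference is the missing piece.
target = {"country": "FI", "team": "Devils", "age_group": "25-34", "streaming": None}

guess = predict_missing(target, others, "streaming")
print(guess)
```

With hundreds of attributes and millions of users in place of these five rows, the same idea becomes much more precise.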

All in all, data harvesting in itself isn’t useful; it is merely resource gathering. Because of the advances in computing power in recent years, however, that data can now be used in unprecedented ways. I hope this text gives some basic tools for understanding this complex concept, so that the Vice article, for instance, becomes easier to follow. There’s also a great Wikipedia article on Internet privacy, which goes into detail on everything I’ve covered here and much more.

This is a shared account used by the students of Aalto University Social Media course. The students use this account if their blog post requires anonymity.