
Series introduction
This year, spending on digital advertising is expected to exceed £250 billion worldwide, with Google and Facebook estimated to account for at least 50% of the global market [1]. That share rises further once you account for the fact that many of their closest market rivals, including Amazon and Instagram, can also be considered part of their vast third-party ad networks. With this in mind, it seems appropriate to spend my next series of posts trying to demystify some of the methods these tech giants use to collect and manipulate the ever-growing amounts of data generated online, and to consider some of the potentially harmful implications this can have for modern society.
Part 1: How do they collect our data?
Over the past decade, the emergence of smart devices and social media has made it easier than ever for people to stay connected with each other throughout the day. In fact, it is now estimated that 1 in 5 of us spends over four and a half hours a day on our mobile phones alone, amounting to an astounding 68 days each year [2]. As we spend more of our waking hours mindlessly navigating our way around the digital world, we are openly placing ever more intimate details about our personal lives in the hands of a small number of global superpowers.

How does Google collect our data?
Google collects data on its users from their activity on its own services and on third-party websites. These include websites that are powered by the Google custom search engine, such as Amazon and YouTube, and other websites that are signed up to the Google AdSense network [3]. It is estimated that the AdSense network spans roughly two million websites, allowing its ads to reach around 90% of all internet users globally [4].

The data collected by Google from user activity on its own services can be bucketed into three categories: web and app activity, location history and YouTube history. Clicking on each of these categories within the activity controls section of your Google account will reveal a series of searches made using the Google search engine and YouTube, a map showing the places you have visited whilst logged into Google Maps and a list of voice commands recorded using Google’s personal assistant [5]. It is important to note that whilst Google is completely transparent about the data it has collected from its own services, it appears to be much less explicit about what information it has inferred from third-party websites.

How does Facebook collect our data?
The methods adopted by Facebook to collect data on its users are very similar to those outlined above. However, one significant factor that separates Facebook from Google is the way in which its users interact with its various platforms. For example, whilst Google is predominantly used for browsing the internet, Facebook allows its users to like, comment on and share posts made by other users, send direct messages to friends within their network and join pages based on topics related to their personal interests and political affiliations.

Another thing that appears to set Facebook apart from Google is the option for advertisers to provide direct information about their own customers [6]. This may include personal information, such as phone numbers and email addresses, that customers were required to give when signing up to the advertiser's website. Although Facebook claims that this information is anonymised immediately after it has been transferred over to them, this doesn't stop them from linking it with the corresponding Facebook profile of the user in order to target them directly.
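
To make the point about linking concrete, here is a minimal sketch of how anonymised contact details could still be matched to profiles. It assumes the anonymisation is done by hashing normalised email addresses with SHA-256, which is a common approach but is an assumption on my part and not something stated above. The function names, profile IDs and email addresses are all hypothetical.

```python
import hashlib


def hash_identifier(value: str) -> str:
    """Normalise an email string and return its SHA-256 hex digest."""
    normalised = value.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def match_customers(advertiser_hashes, platform_profiles):
    """Return the profile IDs whose hashed email appears in the upload.

    platform_profiles maps a hypothetical profile ID to the email on file.
    """
    uploaded = set(advertiser_hashes)
    return sorted(
        profile_id
        for profile_id, email in platform_profiles.items()
        if hash_identifier(email) in uploaded
    )


# Hypothetical data: the advertiser uploads only hashes, never raw emails.
advertiser_upload = [
    hash_identifier(email)
    for email in ["Alice@example.com ", "carol@example.com"]
]
profiles = {"p1": "alice@example.com", "p2": "bob@example.com"}
matched = match_customers(advertiser_upload, profiles)
```

Because both sides hash the same normalised address, the hashes line up even though the raw email is never exchanged. Under this assumption, the anonymisation hides the address from anyone reading the upload but does nothing to prevent the match itself.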

Finally, a potentially exciting development for Facebook in the future could be the use of image recognition algorithms to classify the hundreds of millions of photos posted by its users each day. Whilst many state-of-the-art deep networks have already been trained to recognise large samples of pre-labelled images with accuracies of up to 95%, the biggest challenge Facebook faces in implementing this technology on its own platforms is scale: nobody can hand-label photos in those quantities. To tackle this, in 2018 Facebook trained a deep network using one billion Instagram photos, along with 15,000 hashtags to act as proxies for human-generated labels. This was reported to have achieved an accuracy of around 85.4%, which, whilst reasonably modest when compared with fully supervised deep networks, far exceeds any other attempt made under similar constraints [7].
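
The idea of using hashtags as proxies for labels can be sketched in a few lines. This is only an illustration of the general technique and not Facebook's actual pipeline: the file names, hashtags and function names are hypothetical, and a real system would feed the resulting vectors into a deep network, which is left out here.

```python
from collections import Counter


def build_vocabulary(posts, size):
    """Keep the `size` most frequent hashtags as the label vocabulary."""
    counts = Counter(tag.lower() for _, tags in posts for tag in tags)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts, key=lambda tag: (-counts[tag], tag))
    return ranked[:size]


def to_proxy_labels(tags, vocabulary):
    """Build a multi-hot vector: 1 where the post carries that hashtag."""
    present = {tag.lower() for tag in tags}
    return [1 if tag in present else 0 for tag in vocabulary]


# Hypothetical posts: (image file, hashtags typed by the user).
posts = [
    ("photo_001.jpg", ["#dog", "#beach"]),
    ("photo_002.jpg", ["#Dog", "#sunset"]),
    ("photo_003.jpg", ["#beach", "#dog", "#holiday"]),
]
vocabulary = build_vocabulary(posts, size=2)
training_set = [
    (name, to_proxy_labels(tags, vocabulary)) for name, tags in posts
]
```

No human ever labels an image here. The hashtags that users have already typed become the training targets, which is what makes the approach feasible at this scale, at the cost of noisier labels.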
References
[1] https://www.emarketer.com/content/global-digital-ad-spending-2019
[2] https://www.theguardian.com/lifeandstyle/2019/aug/21/cellphone-screen-time-average-habits
[3] https://www.twinword.com/blog/adwords-google-search-partners-list/
[4] https://www.practicalecommerce.com/how-google-collects-data-to-personalize-ads
[5] https://www.wired.com/story/google-tracks-you-privacy/
[6] https://www.eff.org/deeplinks/2019/01/guided-tour-data-facebook-uses-target-ads
[7] https://www.engadget.com/2018/05/02/facebook-trained-image-recognition-ai-instagram-pics/
