Are trackers the new backbone of the Web?

You don’t need cash to search Google or to use Facebook, but they’re not free. We pay for these services with our attention and with our data.
-Tim Wu

Today, the creation of content on the web is fueled by revenue from ad networks that insist on increasingly precise information about you in order to personalize ads.

As a result, the ways in which users are tracked on the Web have increased in sophistication and prevalence to match those demands. Tracking code aimed at re-identifying users across different web pages is a ubiquitous feature of modern websites. This sort of bulk profiling returns maximum value for adtech intermediaries, sometimes unbeknownst even to content creators. What an average user may not realize is that the hyperlinks that take them from one site to the next are created by advertising software.

We’ve already made some scary discoveries about the diversity of browsing behavior. A very early examination of a small group of Firefox users who voluntarily shared their browsing history suggested that up to 40% of total Web browsing page views can be attributed to only 65 top-level domains. In fact, just five sites (Google, Facebook, Amazon, Yahoo, and Reddit) make up 22% of all traffic for the small group observed. Next, we’re going to try to gain a better understanding of the influence that advertising networks have on navigating the web.

It appears as though browsing traffic is increasingly funneled along the large shipping lanes of the Web. A move to steer users towards particular sources may be motivated by the revenue model that determines who profits from traffic to particular pages. We’ve recently collaborated with the Information and Language Processing Systems group at the University of Amsterdam to investigate the way that Web content is organized in the context of the prevailing business model of content websites that directly or indirectly draw revenue from page traffic.

As one of many projects under Mozilla’s Context Graph initiative, we have set out to study the relationship between tracking, advertising, and hyperlinking on today’s Web.

The project included a survey of third-party code execution on a crawl of 355,000 pages and employed community detection algorithms founded in principles of graph theory to mine for associations between pages based exclusively on the co-occurrence of cookies and third-party executed JavaScript. A seed set of web pages, selected for their presumed content diversity, was examined to find which third-party cookies and JavaScript were being set or executed on these pages and where these resources were hosted.
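To make the co-occurrence idea concrete, here is a minimal sketch of how pages could be linked by the third-party hosts they share. It is not the code used in the study: the `crawl` data, the host names, and the `cooccurrence_edges` helper are all hypothetical, and a real crawl would record far more detail per resource.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical crawl output: page -> set of third-party hosts observed on it.
crawl = {
    "news.example/a": {"tracker-one.example", "cdn.example"},
    "blog.example/b": {"tracker-one.example"},
    "shop.example/c": {"tracker-two.example", "cdn.example"},
    "forum.example/d": {"tracker-two.example"},
}


def cooccurrence_edges(pages):
    """Return {(page_a, page_b): number of third-party hosts both pages load}."""
    # Invert the mapping: which pages load each third-party host?
    pages_by_host = defaultdict(set)
    for page, hosts in pages.items():
        for host in hosts:
            pages_by_host[host].add(page)

    # Every pair of pages sharing a host gets an edge; the weight counts shared hosts.
    weights = defaultdict(int)
    for members in pages_by_host.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return dict(weights)


edges = cooccurrence_edges(crawl)
for (a, b), weight in sorted(edges.items()):
    print(a, "<->", b, "shared hosts:", weight)
```

Note that hyperlinks and page content play no part in this graph; two pages are connected only because they load resources from the same third party.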

Representing the Web as a network of nodes, we performed community detection using Louvain Modularity, which groups nodes within a network that are more densely connected to one another than to other nodes. The network’s edges were third-party resources, not hyperlinks or content. Having established a few tightly knit communities with a high enough number of members to examine (and having excluded Google Analytics, which is virtually everywhere), we sought to explore the relationship between these pages in terms of their hyperlinking behavior as well as the types of domains that are linked via third-party resources.
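The Louvain method works by greedily searching for a partition with high modularity. The sketch below does not implement that search; it only computes the modularity score Q for a partition that is already given, on a hypothetical toy graph of two triangles joined by a single edge. The function name and data are illustrative assumptions, not part of the study.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, weighted graph.

    edges: iterable of (node_a, node_b, weight)
    community_of: dict mapping each node to a community label
    """
    total_weight = 0.0
    inside = {}   # community -> weight of edges with both ends inside it
    degree = {}   # community -> summed degree of its member nodes
    for a, b, w in edges:
        total_weight += w
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0.0) + w
        degree[cb] = degree.get(cb, 0.0) + w
        if ca == cb:
            inside[ca] = inside.get(ca, 0.0) + w
    # Q = sum over communities of (share of edges inside) - (expected share).
    return sum(
        inside.get(c, 0.0) / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Toy graph: triangles a-b-c and d-e-f, joined by the single edge c-d.
toy_edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
scrambled = {"a": 0, "d": 0, "b": 1, "e": 1, "c": 2, "f": 2}

q_triangles = modularity(toy_edges, by_triangle)
q_scrambled = modularity(toy_edges, scrambled)
print(q_triangles > q_scrambled)  # True: the triangle split scores higher
```

Grouping each triangle together scores higher than a split that cuts across them, which is the sense in which community members are "more densely connected to one another than to other nodes".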

We were curious to examine whether web pages within the same ad network seemed to link to each other more often than to pages outside ad networks. In a first-pass analysis we took into account the following aggregate quantities for a subset of manually selected communities:

  • The frequency with which shared tracking code can be detected between hyper-linked and non-hyper-linked pages
  • The frequency of hyperlinks between pages belonging to advertisement and/or tracking communities (as detected by the Louvain Modularity criterion) compared to the frequency of hyperlinks to arbitrary pages on the Web
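As a rough illustration of the second quantity, the following sketch computes, for each community, the share of outgoing hyperlinks that land on a page in the same community. The page names, community labels, and `internal_link_share` helper are hypothetical; the study's actual measurement may have been defined differently.

```python
def internal_link_share(hyperlinks, community_of):
    """For each community, the fraction of outgoing hyperlinks that stay inside it.

    hyperlinks: iterable of (source_page, target_page)
    community_of: dict mapping page -> community label (pages outside any
                  detected community are simply absent)
    """
    inside, total = {}, {}
    for src, dst in hyperlinks:
        community = community_of.get(src)
        if community is None:
            continue  # source page is not in any detected community
        total[community] = total.get(community, 0) + 1
        if community_of.get(dst) == community:
            inside[community] = inside.get(community, 0) + 1
    return {c: inside.get(c, 0) / total[c] for c in total}


# Hypothetical data: pages p1 and p2 share a tracking community, p3 is in another.
communities = {"p1": "A", "p2": "A", "p3": "B"}
links = [("p1", "p2"), ("p1", "p3"), ("p2", "p1"), ("p3", "elsewhere")]

shares = internal_link_share(links, communities)
print(shares)
```

Comparing these per-community shares against the same ratio for arbitrary pages in the crawl gives a first indication of whether linking is concentrated inside tracking communities.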

The findings showed that sites in the same ad network linked to each other more often than to sites outside it. For smaller ad networks, a hyperlink was around twice as likely to point inside the network as outside it. For a company like Google, with tracking code deployed across 25,342,946 unique domains, this number can be even higher. One study reports that in 2015, 78.07% of websites in the Alexa top million initiated third-party HTTP requests to a Google-owned domain.

The graphs shown below present edges based on third-party relationships in purple; these are tracking relationships independent of the actual page content. The visible islands indicate pages that are linked by third-party domains, presumably advertisement or social media networks, in such a way that navigation between them predominantly occurs via those third parties. This is suggestive of a new paradigm of navigation, one not necessarily tied to links placed intentionally by content creators but rather defined by adtech within pages that share a particular tracking network.

Network of pages (purple/blue nodes) linked by social widgets deployed on otherwise weakly hyperlinked pages

These visible islands centered around third-party resource providers can’t all be explained in terms of page functionality. In some examples, we observed substantially fewer hyperlinks between pages belonging to separate tracking communities than between pages tracked by a single source of third-party resources. This is a potentially alarming observation, as more content providers choose to outsource tracking and advertisement delivery to only a handful of companies. We chose the examples shown here due to a high internal linking density in terms of third-party resource calls. The actual hyperlinking degree seen in the pages examined was not significantly different from the average observed for the entire crawl; only the proportion of link endpoints inside the tracking network was notably high.

The central role of tracker nodes in our dataset places them at the center of content silos, as can clearly be seen in the graph figures below. As the density of edges linking to tracker nodes increases, the potential for traffic retargeting also increases, meaning that people browsing the Web are increasingly at risk of being funneled towards content chosen for the revenue it generates for advertisers.

Community Analysis of Tracking and cookie based community detection by Louvain Modularity

A huge amount of manual work is required to judge whether a piece of third-party code provides legitimate utility on the page or acts purely as an information aggregator. Upon manually inspecting the top 50 most frequently observed third-party scripts in our data, we found that a large number appeared not to be strictly necessary for the rendering of page content. Even more troubling, pages that load code from several of the most frequently encountered third-party sources actually lose functionality when that tracking code is prevented from executing. In a recent Test Pilot study, we found that many of the pages where users reported page element breakage were pages making third-party calls to frequently encountered tracker nodes.

Rate of user-reported problem pings (bars) and % of pings indicating incorrectly functioning page elements (orange dots) for pages executing third-party scripts from the listed domains.

Some very serious caveats apply to the results presented here. These analyses were carried out using a web crawl, which introduces several artifacts as well as a general bias as to what part of the web we were able to observe. And while many explanations may account for the central role of tracking nodes in the crawl data analyzed here, the possibility that these links are driven by adtech for monetization and not utility warrants further investigation.

I want a web where content discovery is not dictated by the same parties that benefit from specific content consumption. As the discussion around net neutrality heats up, I’ve encountered a plethora of analogies for discussing the state of the Web, including Nick Nguyen’s old growth forest analogy. But I think that my own concerns have already been articulated very nicely by Peter Sunde in an interview.

I don’t want to ride in a self driving car that can’t drive me to a certain place because someone has bought or sold an illegal copy of something there.

-Peter Sunde

The business of online advertising is more complicated than simply advertisers versus users. In fact, journalists are increasingly held hostage by disruptive adtech as the interests of software developers, advertisers, and consumers of web content continue to diverge in incompatible directions. The current tracking-driven advertisement model also raises serious concerns for net neutrality. With recent announcements from both Google and Apple of active changes to the way their browsers handle the execution of third-party scripts and cookies from third parties, we were concerned about how the deployment of tracking code relates to the overall shape of the web in terms of hyperlinks between pages and content discovery. After this study, we are not only concerned but have many more questions to answer.

Mozilla is committed to further exploring the nature and structure of the World Wide Web, the health of the internet, and the technologies being employed online. That’s why Mozilla is launching the Firefox Pioneer initiative; we are hoping to find a group of community pioneers to help us better understand the nature of the Web, the way Firefox users interact with it, how people find content, and what kinds of technologies users encounter in their day-to-day browsing. We hope to apply what we learn to build a better boat, but also to improve Internet Health for everyone.

Several research projects are mentioned in this post and I would like to acknowledge the contributions of Jelmer Neeven, Mees Kalf, Luke Crouch, Don Marti, and Toby Elliott.