Field guide to tag hunting: The hidden world of Internet data collection

Part 1: What’s going on and why it’s getting worse

Published in B2B product · 12 min read · May 9, 2018

The above picture is among my most prized possessions, a snapshot of how wild and woolly the world of Internet data collection has become.

Beginning in 2013, I’ve spent thousands of hours scanning millions of websites to better understand how data is collected and visitors are tracked. This journey started when I launched my data start-up Mezzobit, but it soon became an obsession, consuming many nights and weekends researching the companies behind these activities, resulting in a database of nearly 4,000 firms.

Facebook and its missteps with Cambridge Analytica may represent an extreme case of collecting and tracking, but that’s only the tip of the iceberg. About $250B in global digital advertising spend and nearly $3 trillion in e-commerce revenue are powered by these activities. But implementation of GDPR in Europe at month’s end — combined with the Sturm und Drang surrounding Facebook — has started exposing what’s really happening behind the scenes.

The worst scan that I’ve ever seen is shown in that graphic, with each dot representing one or more JavaScript tags, tracking pixels, ads and other third-party calls in a single page. The unlucky publisher, whose identity I conceal out of pity, is a well-known U.S. news site with a fairly promiscuous attitude — 30x worse than average — towards bringing external code into its pages.

Lest you start reaching for pearls to clutch, data collection and tracking aren’t going away, nor should they. The Internet that we know and love, plus trillions of dollars of created value, wouldn’t be possible without them.

But the industry, consumers and regulators are rapidly realizing that a line has been crossed, as this data feeding frenzy degrades user experience, causes compliance violations, and sucks untold value from unsuspecting online trading partners.

It’s time to rethink and reboot, to find a happy medium where profits can continue in parallel with enhanced transparency, accountability and respect for consumer privacy. As the saying goes, pigs get fat but hogs get slaughtered.

But first, what’s really going on

For simplicity’s sake, we’ll refer to site operators as publishers, but the same thing happens to a degree for sites run by e-commerce companies, brands, governments, non-profits and anyone else with a digital presence. Similarly, we’ll use the word tag to refer to any object called by a webpage, even though there are hundreds of different types of code that could be involved.

A web page may appear monolithic to the visitor — everything you see appears to be coming from the servers of the site operator — but that’s rarely how it works.

Think about the typical functions of the average webpage and nearly all of them have some third-party involvement. Harnessing the high-tech industry has been liberating for website operators, allowing them to quickly tap into rich capabilities at low or no cost, but it also transforms every web page into an Internet flash mob.

Here are some different ways that third-party companies can get enmeshed in your average media or e-commerce website (with examples of each company type in parentheses):

Content (text, photos, videos) is created by the publisher but is often handled by cloud-based software (WordPress, Squarespace) with some involvement from content delivery networks or CDNs (Akamai, Fastly).
Analytics (Adobe, Chartbeat) that keep track of visitor actions, location and type of browser/OS/machine.
Social sharing involves social networks (Facebook, Twitter) but also tools that tie into multiple services (AddThis, ShareThis).
Commenting and review platforms allow visitors to discuss articles (Disqus, Livefyre), as well as provide feedback on products and services on e-commerce and listing sites (Bazaarvoice, TrustPilot).
Ads of all types: search (Google, Microsoft), banner ads on desktop (AppNexus, OpenX) and mobile (Yieldmo, AdColony), video ads (FreeWheel, Adap.tv), ads that blend into surrounding content (TripleLift, Nativo), plus thousands of companies that provide the behind-the-scenes plumbing (Tapad, Moat, Integral Ad Science, DoubleVerify).
Systems that power a majority of e-commerce sites on the Internet (Magento, Shopify).
Data management platforms (Krux, Lotame) that collect user data and funnel it to other systems to better target ads or personalize pages, as well as companies that collect and sell data (Acxiom, Bombora) and determine your location based on IP address (MaxMind, Cuebiq).
Customer relationship management systems that help businesses keep track of all customer interactions (Salesforce, Marketo).
Tools that personalize content and presentation for each user (Monetate, Bloomreach) and perform testing on different page designs to enhance engagement (Optimizely, Maxymiser).
Content recommendation services (Taboola, Outbrain) provide those “Recommended for you” boxes at the end of many news site articles that also generate revenue for site operators.
Payment systems that handle money for all types of sites (Braintree, Stripe) as well as content payment (paywall) platforms (Piano, Clickshare).
Tools for engineers, such as system monitoring (Catchpoint, Cedexis), error detection (Sentry, Bugsnag), tag management (Tealium, Ensighten), security (PerimeterX, Proofpoint), and data handling (Keen IO, Alooma).
A myriad of other vendors such as customer service (Zendesk, LiveChat), video platforms (JW Player, Kaltura), and survey providers (Qualaroo, SurveyMonkey).
There are even companies that track all of this and who’s who (Datanyze, BuiltWith).

Most of the data collected on a web page is anonymous and doesn’t contain personally identifiable information (PII) such as name, email address, or identification numbers, although the definition of PII in Europe is expanding with GDPR.

Your web browser loads up the HTML page associated with the URL — basically, a big text file — and starts parsing it line by line to build the page and associated functionality.

A lot of the code relates to page content, but every so often, the browser hits something like this:

This is the bootstrap JavaScript code for Google’s DoubleClick ad manager (called GPT). It requires a call to Google’s servers (googletagservices.com) to retrieve the script, which then is executed by the browser.
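To make the idea concrete, here is a minimal Python sketch of how a scanner might spot third-party script tags in a page's HTML. The sample page, the `example-news.com` host, and the helper names are assumptions for illustration. Only the googletagservices.com host comes from the text above. Note that this finds only tags declared in the HTML itself, not the ones those scripts load later.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptSourceCollector(HTMLParser):
    """Collects the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def third_party_hosts(html, page_host):
    """Return hosts of external scripts that are not served by page_host."""
    collector = ScriptSourceCollector()
    collector.feed(html)
    hosts = []
    for src in collector.sources:
        host = urlparse(src).hostname
        # Relative URLs have no hostname, so they are treated as first-party.
        if host and host != page_host and not host.endswith("." + page_host):
            hosts.append(host)
    return hosts


# Hypothetical page: one first-party script plus a GPT loader tag.
SAMPLE_PAGE = """
<html><head>
<script src="/static/site.js"></script>
<script async src="https://www.googletagservices.com/tag/js/gpt.js"></script>
</head><body><p>Article text</p></body></html>
"""

print(third_party_hosts(SAMPLE_PAGE, "example-news.com"))
```

A real scan would have to run the page in a browser and watch network traffic, since most of the interesting calls are made by scripts after they execute.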

In turn, Google makes cascading calls to other ad technology providers to perform other functions, and these companies often make their own calls to even more companies. While it’s different from site to site, in the end, you could get something that looks like this:

All logos and brands are property of their respective owners and are for identification purposes only

Read this from top to bottom: the DoubleClick box represents the GPT tag and every line that is connected to it is a separate tag loaded onto the page by Google or another vendor. So in this example, one tag resulted in 15 additional companies being brought into the user’s browser.

While the site operator deliberately placed Google’s code in the webpage, the vendors downstream from Google often have no legal relationship with the company operating the website. A TV network executive once told me that more than 500 companies interacted with his 100+ million monthly visitors, but he had contracts with only 30 of them.

Here’s one example that we saw on a major publisher’s site that was 15 layers deep:

Sometimes this can go on for 40+ iterations in a single pageview, with some vendors well known and others obscure. For the visitor to that website, this means your browser retrieves the page from the publisher and then places 40 separate, sequential point-to-point calls, one to each vendor, bypassing the site operator.
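One way to reason about these chains is to treat the tag calls as a graph of who loaded whom. The sketch below is an illustration only: the vendor names and the `CALLS` mapping are hypothetical, and it is not the tooling used for the scans described here. It counts how many vendors a single tag ultimately brings in and how deep the longest chain runs.

```python
def downstream(calls, root):
    """Return every vendor reachable from root, excluding root itself."""
    seen = set()
    stack = [root]
    while stack:
        vendor = stack.pop()
        for child in calls.get(vendor, []):
            if child not in seen and child != root:
                seen.add(child)
                stack.append(child)
    return seen


def chain_depth(calls, root, path=()):
    """Number of calls in the longest chain that starts at root."""
    path = path + (root,)
    # Skip vendors already on the current path so call loops cannot recurse forever.
    children = [c for c in calls.get(root, []) if c not in path]
    if not children:
        return 0
    return 1 + max(chain_depth(calls, c, path) for c in children)


# Hypothetical call graph: each key loads the tags listed as its values.
CALLS = {
    "publisher-page": ["ad-server"],
    "ad-server": ["exchange-a", "exchange-b"],
    "exchange-a": ["bidder-1", "bidder-2"],
    "bidder-1": ["data-vendor"],
}

print(len(downstream(CALLS, "ad-server")))  # vendors brought in by one tag
print(chain_depth(CALLS, "ad-server"))      # layers below that tag
```

In this toy graph, the single ad server tag pulls in five more vendors across three layers. The publisher placed only the first one.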

It’s like throwing a party and inviting your friends, who in turn invite their friends, and so forth until an intimate Saturday night gathering ends up filling a football stadium. Some partygoers add spice to the gathering and are welcome additions while others drink all of your booze and throw up in your closet.

Similarly, most tag vendors (including the ones listed above) provide valuable services to the site operator and the visitor, while others are up to no good, with activities ranging from delivering malware to cookie licking (sounds fun, but it isn’t — more later).

It’s easy to figure out most of the companies’ identities by examining domain registrations that link technical calls to corporate names, such as adnxs.com for AppNexus or w55c.net for Dataxu. But about 3% of all tag calls purposely use private domain registration to obscure this connection.
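The domain-to-company mapping can be sketched as a simple lookup table. This is a minimal illustration, assuming a hand-built `KNOWN_OWNERS` table seeded with the domains mentioned in this post. The URLs and the function name are hypothetical, and a real database would be built from registration records rather than typed in.

```python
from urllib.parse import urlparse

# Seeded from domains named in this article; a real table would be far larger.
KNOWN_OWNERS = {
    "adnxs.com": "AppNexus",
    "w55c.net": "Dataxu",
    "googletagservices.com": "Google",
}


def owner_of(url):
    """Return the company behind a tag URL, or None if the domain is unknown."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Try the full host first, then progressively shorter parent domains.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_OWNERS:
            return KNOWN_OWNERS[candidate]
    return None


# Illustrative URLs; the paths are made up.
print(owner_of("https://ib.adnxs.com/some/path"))
print(owner_of("https://tags.example.net/t.js"))
```

Calls that come back as `None` are the ones that need manual research, and privately registered domains tend to stay in that bucket.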

That doesn’t mean there’s something nefarious, but masking creates opacity that makes it even more difficult for site operators and consumers to understand already complex transactions. Sometimes, this is done for defensive purposes, like one ad tech vendor that tries to confuse consumer ad blocking software by using comically random domain names such as SummerHamster.com, AtticWicket.com, and SilkenThreadiness.com.

Regardless of how they arrived in the page, most tags also have unfettered access to hundreds of data elements about users and pages: what content is being viewed, what actions are taken, where users just came from, their technology set-up, and the like. Again, this usually doesn’t include PII, but there’s nothing to stop the tag provider from scraping a form to gather names or email addresses.

Often, the data is associated with an anonymized user ID stored in a cookie, which can help build up detailed profiles of user behavior across every site where a tag is present. You don’t need to be Facebook or Google to do this; there are dozens of companies, many of which consumers have never heard of, that have tags on 10%+ of the world’s top websites.
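The mechanics of cookie-based profiling are simple enough to simulate in a few lines. The sketch below is a toy model under stated assumptions: the `Tracker` class, the site names, and the pages are all hypothetical, and a real vendor would receive far more fields per call. It shows how one anonymous ID ties together visits to unrelated sites.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Toy model of a tag vendor that sees the same cookie on many sites."""

    def __init__(self):
        self.profiles = defaultdict(list)

    def on_tag_fired(self, cookie_id, site, page):
        """Record one tag call and return the cookie ID the browser keeps."""
        if cookie_id is None:
            # First sighting of this browser: mint an anonymous ID.
            cookie_id = uuid.uuid4().hex
        self.profiles[cookie_id].append((site, page))
        return cookie_id


# Hypothetical browsing session across three unrelated sites that carry the tag.
tracker = Tracker()
cookie = None
for site, page in [
    ("news.example", "/politics/story"),
    ("shop.example", "/running-shoes"),
    ("travel.example", "/flights/nyc-to-lisbon"),
]:
    cookie = tracker.on_tag_fired(cookie, site, page)

print(tracker.profiles[cookie])
```

No name or email address appears anywhere, yet the vendor ends up with one profile covering news, shopping, and travel behavior.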

Such data access and tracking is required to make the Internet hum, but when is enough enough? At a recent industry event I attended, a panelist from a large data company was asked how much information he’d like to be able to collect from site visitors. “All of it,” was his response, with only a hint of irony.

As an aside, data collection from mobile apps takes a completely different route, and while keeping track of the trackers is an issue for app publishers, it’s not nearly as Wild West as desktop and mobile web.

Which vendors call which vendors on 100 pageviews of a global entertainment site. This diagram shows how eXelate arrives in the page, with the green lines representing parent tags (who calls eXelate) and red lines showing the tags that are called by eXelate.

Why it’s getting worse

A founding father of the ad tech industry once told me about making the rounds of publishers in the 1990s, asking them to insert a tag from his start-up into their pages to enable his ad server.

Paraphrasing a typical response: “We’d be crazy to put third party code on our site where it could spy on visitors and have its way with them and our pages.”

Jumping forward just a few years to the dawn of Web 2.0 (a term now as quaint as an eight-track cartridge), the architecture of the web began to change. Static flat HTML pages became enlivened with JavaScript that regularly communicated to remote servers, enabling site operators to more easily use third-party services. Combined with CSS and other emerging technologies, the web became a much more dynamic place. But also messier.

Then came the rise of behaviorally targeted advertising (using data to select which users see which ads) and programmatic advertising (which conducts automated auctions for every ad impression), powered by reams of data collected by thousands of companies. Throw in the cloud computing revolution, where Amazon Web Services (and later, others) made it easier and cheaper for start-ups to launch without big capital outlays, and it led to even more complexity:

The Internet obeys the Second Law of Thermodynamics, as entropy only increases. Each dot represents a tag in a typical website from three different eras.

After analyzing thousands of sites, I’ve seen a few patterns that contribute to this growing problem:

Poor housecleaning

Websites are like Christmas trees, and sometimes publishers lose track of the ornaments hung on them. I worked for a news industry consortium that had tags on thousands of top media sites. The consortium folded, but a year later, I could still find its tags on dozens of name-brand sites, pinging away into nothingness. More recently, I audited another site and found a third-party tag that was deployed as a monthlong test three years ago. It had accidentally remained live ever since (due to staffing changes), legally collecting data on billions of pageviews.
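A basic housecleaning audit compares what a scan actually finds against what the publisher believes it has deployed. The sketch below assumes a hypothetical inventory that maps each vendor to an optional end date. The vendor names, dates, and function name are invented for illustration.

```python
from datetime import date


def audit_tags(observed, inventory, today):
    """Split observed vendors into unknown ones and ones past their end date.

    inventory maps vendor name -> end date, or None for open-ended deals.
    """
    unknown = sorted(v for v in observed if v not in inventory)
    expired = sorted(
        v
        for v in observed
        if v in inventory and inventory[v] is not None and inventory[v] < today
    )
    return unknown, expired


# Hypothetical inventory: one ongoing vendor and one test that should have ended.
inventory = {
    "analytics-co": None,
    "test-widget": date(2015, 6, 30),
}
observed = {"analytics-co", "test-widget", "mystery-pixel"}

unknown, expired = audit_tags(observed, inventory, today=date(2018, 5, 9))
print("No record of:", unknown)
print("Past end date:", expired)
```

The forgotten monthlong test would show up in the second list, and the defunct consortium's tag in the first, provided someone runs the comparison at all.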

Ad partner promiscuity

Programmatic advertising auctions can lead to dozens — sometimes hundreds — of ad tech companies participating in the bidding process. But in order to determine the price to bid for a specific ad shown to a specific visitor, they must inspect whether that user has been seen elsewhere on the internet (cookie IDs, no real names or PII) and whether s/he lines up with any valuable audience segments desired by advertisers.

This cookie-syncing process creates a pattern of tag calls we’ve labeled a starburst because it resembles fireworks. A single auction for a single ad unit may have several starbursts, each consisting of 10–50 companies, and one page may have multiple ad units. Depending on how publishers configure their ad tech, they may have little knowledge or control of these downstream calls. Here’s an example of a typical series of calls and the resulting starbursts for a single page, resulting in more than 50 companies.

Tag “starbursts” in a series of ad calls, with Amazon and PulsePoint loading dozens of other ad and data vendors to perform cookie syncing.
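In a log of parent-to-child tag calls, a starburst shows up as a single vendor fanning out to many others. The following is a rough sketch, not the method behind the diagram. The call log and vendor names are hypothetical, and the threshold of 10 is an assumption taken from the low end of the 10–50 range mentioned above.

```python
def find_starbursts(call_log, min_fanout=10):
    """Return {parent: distinct child count} for parents at or above min_fanout.

    call_log is an iterable of (parent, child) pairs, one per observed tag call.
    """
    children = {}
    for parent, child in call_log:
        children.setdefault(parent, set()).add(child)
    return {p: len(c) for p, c in children.items() if len(c) >= min_fanout}


# Hypothetical pageview: an ad server calls a sync hub, which calls 12 partners.
call_log = [("page", "ad-server"), ("ad-server", "sync-hub")]
call_log += [("sync-hub", f"partner-{i}") for i in range(12)]

print(find_starbursts(call_log))
```

Repeated calls to the same partner count once, so the number reflects distinct companies rather than raw request volume.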

Sometimes a single ad spot is instantly resold once the initial auction closes, which creates even more starbursts. Major ad exchanges frown upon this because it obfuscates the transaction, may bring in banned buyers, and creates even more starbursts.

This can be complicated even further depending on how aggressively publishers monetize their pages and restrictions on the types of bidders in programmatic auctions. Flinging the doors open wide means that bottom-feeding ad tech companies and their less-than-reputable clients come calling. We saw a network of sites that did this, resulting in a high incidence of ads with unwanted redirects, offers to install toolbars and applications (read: adware and maybe malware) and a generally crappy user experience. Some of these C-list advertisers get around fraud filtering by using hundreds — sometimes thousands — of domain names that map to the same page.

Unexpected visitors and actions

What’s acceptable in terms of tag actions is contextual to the vendor and relationship with the site operator. While a video provider is expected to change the user interface to insert a player, an analytics vendor shouldn’t affect the page’s visual presentation. At a major news publisher, we saw a longtime widget vendor suddenly begin sniffing the Facebook API and grabbing user data, a violation of their terms of service (although infrequently enforced, as recent events have shown). It ended up being triggered by a bug in the vendor’s code — an innocent mistake, but unnoticed by the publisher for months.

More frequent than these missteps are unusual “ridealong” tags from seemingly unrelated companies. For instance, embedding the below image from a leading GIF provider gets you a pretty kitty plus a handful of analytics, application monitoring and data collection tags. The company uses this to understand who’s seeing the GIF, but it has the effect of gathering a complete set of audience data from the publisher’s site. Totally legal, but usually a surprise to site operators when it’s discovered.

Ridealong tags that accompany a simple GIF.

Why everyone should care

To borrow from Kierkegaard, the Internet can often only be understood backwards, but it progresses forwards regardless. This sentiment as it pertains to digital media is aptly expressed by Tony Pace, former chairman of the Association of National Advertisers, which represents top US brands.

“When you look at the digital media supply chain, who the hell would have ever constructed this thing this way if they wanted it to make sense and to have longevity?”

A similar case can be made for the Internet world beyond publishing and ad tech, creating concerns for consumers and digital enterprises alike. Where there’s mystery, there’s also margin and mayhem.

In my next post, I’ll delve into why this is a bad deal for almost everyone — including those making money today — and discuss some paths forward to encourage more sanity and responsibility.

Joseph Galarneau is a longtime digital leader who founded a New York-based start-up, Mezzobit, that focused on data awareness before being acquired by OpenX, one of the world’s largest ad exchanges. Previously, he was COO and digital GM at Newsweek and chief product officer for a news industry analytics consortium. Certified as a privacy technologist by the International Association of Privacy Professionals, he’s also taught digital strategy and analytics at New York University and Yale University, consulted on data leakage for the major digital publisher and advertiser trade groups, and conducted workshops on socially responsible data collection at SxSW Interactive and the Mozilla Festival.