Court docs show how Google slices users into “millions of buckets”

The online giant probably knows more about you than the NSA — including things you might not even tell your mother

The first law of selling is to know your customer. This simple maxim has made Google into the world’s largest purveyor of advertisements, bringing in more ad revenue this year than all the world’s newspapers combined. What makes Google so valuable to advertisers is that it knows more about their customers — that is to say, about you — than anyone else.

Where does Google get this knowledge? Simple. It watches most everything you do and say online — reads your email (paying special attention to purchase confirmations), peers over your shoulder while you browse, knows what you watch on YouTube, and — by tracking your devices — even knows where you are at this very moment. Then it assembles all these bits of information into a constantly updated profile that tells advertisers when, where and what you may hanker to buy.

Your Google profile contains far more than basic facts such as age, gender and product categories you might be interested in. It also makes statistically plausible guesses about things you didn’t voluntarily disclose. It estimates how much you earn by looking up IRS income data for your zip code. It knows if you have children at home — a trick it performs by surveying hundreds of thousands of parents, observing their online behavior, then extrapolating to millions of other users. Google also offers advertisers over 1,000 “interest-based advertising” categories to target users by their web browsing habits. When advertisers are ready to buy ads they can review all these attributes in a convenient browser interface and select exactly the users they want to target.
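The survey-then-extrapolate trick described above can be illustrated with a toy nearest-neighbor vote. This is only a sketch under loud assumptions: the signal names, the six "surveyed" users, and the function `nearest_label` are all invented for illustration, and nothing in the court documents says Google uses this particular method.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two sets of behavioral signals (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_label(surveyed, features, k=3):
    """Guess a label for an unsurveyed user by majority vote among the
    k surveyed users whose signals overlap most with `features`."""
    ranked = sorted(surveyed, key=lambda item: jaccard(item[0], features),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical survey: (observed signals, "has children at home?")
surveyed = [
    ({"toy_store", "school_calendar", "minivan_reviews"}, True),
    ({"toy_store", "pediatrician_search"}, True),
    ({"school_calendar", "kids_recipes"}, True),
    ({"nightclub_listings", "sports_car_reviews"}, False),
    ({"backpacking_gear", "nightclub_listings"}, False),
    ({"sports_car_reviews", "backpacking_gear"}, False),
]

# A user who never answered the survey still gets a statistical guess.
guess = nearest_label(surveyed, {"toy_store", "kids_recipes"})
```

The point of the sketch is that the unsurveyed user disclosed nothing; the guess comes entirely from resemblance to people who did answer.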

But these explicit attributes only scratch the surface. The online ad giant knows much more about you than it can put into a form easily understandable by humans. Just how much it knows came to light last year, when a federal judge ordered the publication of some remarkable internal Google emails discussing how Gmail data mining works. Google’s lawyers fought the disclosure tooth and nail, but they were ultimately overruled. The emails reveal that Gmail can sort users not just into a few thousand demographic and interest categories, but into literally millions of distinct “buckets.” A “bucket” is just a cluster of users, however small, who share some feature that might interest advertisers.

Gmail profiling is performed by a mysterious device known as the Content OneBox, or COB for short. The COB’s inner workings are shrouded in mystery — the court documents describing it are heavily redacted. But we know that it deploys many distinct data mining methods to place users into these “buckets”. In addition to exploiting demographic or interest-based observables, the COB tries to understand the actual meaning of email messages with advanced “machine learning” algorithms.

The COB lets Google peer into the most intimate aspects of your life. To see why, think about the math. Gmail has a billion users who collectively receive several trillion messages per year (not counting spam). The COB analyzes every one of those messages, even the ones it classifies as spam and the ones you delete before opening. With such a vast sea of data pouring into its algorithms every day, Google can make incredibly fine-grained distinctions among users. Inevitably some of these distinctions will correspond to sensitive personal attributes that few of us would voluntarily disclose. Most “buckets” that Google creates have no names — they are just nodes in a vast network of associations. But Google can also sort advertising messages into the same buckets and then match them to similar users, without the need for overt labels. All it takes to connect an advertiser to the right customer is a fleeting pattern of affinity between some group of users and a particular sales pitch.
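To make the idea of nameless buckets concrete, here is a deliberately simplified sketch in which users and ads are sorted by the same opaque key and matched without any human-readable label. Everything in it is an assumption made for illustration: the signal names, the exact-match bucketing rule, and the helpers `bucket_id`, `build_buckets` and `audience_for` are hypothetical, and the redacted court documents do not describe how the COB actually does this. A real system would presumably rely on statistical similarity rather than exact matches.

```python
import hashlib
from collections import defaultdict


def bucket_id(signals):
    """Map a set of signals to an opaque identifier with no readable name."""
    joined = "|".join(sorted(signals))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


def build_buckets(users):
    """Group users by the opaque bucket their signals fall into."""
    buckets = defaultdict(list)
    for name, signals in users.items():
        buckets[bucket_id(signals)].append(name)
    return buckets


def audience_for(ad_signals, buckets):
    """Return the users who share a bucket with an ad, if any."""
    return sorted(buckets.get(bucket_id(ad_signals), []))


# Hypothetical users described only by anonymous signals.
users = {
    "user_a": {"signal_17", "signal_42"},
    "user_b": {"signal_42", "signal_17"},
    "user_c": {"signal_03"},
}

buckets = build_buckets(users)

# The ad lands in the same bucket as user_a and user_b; nobody ever
# had to say what that bucket "means".
audience = audience_for({"signal_17", "signal_42"}, buckets)
```

Note that the bucket key is just a string of hex digits. The match between ad and audience works even though no one, including the advertiser, can read off what attribute the bucket corresponds to.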

Google doesn’t allow advertisers to target explicitly on sensitive categories such as race, religion, health status or sexual preference. But nothing prevents the COB’s matching algorithms from performing such targeting implicitly. When you have millions of target buckets and can perform thousands of statistical experiments every day, you can achieve the same result. Want to attract young gay men of color in certain tough urban neighborhoods to your new line of expensive athletic shoes? Google won’t let you target those keywords directly, but millions of buckets can do it for you — implicitly. Want to offer fast food discount coupons to overweight single women with children, heart surgery to middle-aged men with chest pain and high incomes, or steroid pills to teenage body builders? Millions of buckets can do that too.

In the ten years since it was launched, Gmail has morphed from a simple email service into a gigantic user profiling machine. The power of its profiling algorithms increases each year as new ideas from machine learning research are made operational. At the same time, the amount of data the algorithms have to work on swells constantly. Google execs announced in February that Gmail had reached one billion users.

Today the market for such data-mined personal profiles is essentially unregulated. Neither Congress nor the FTC knows what to do about it. A few state legislatures, notably California, have put stakes in the ground trying to protect consumers and especially vulnerable populations such as school children. The European Union is also debating new, stricter regulations. At the same time, surveys show that consumers care about privacy but are easily lured into giving it up for free services.

This wild west of unrestrained online profiling can’t go on indefinitely. It is particularly ironic that the National Security Agency — despite all the recent controversy — is subject to far tighter legal oversight than online advertisers like Google or Facebook. Sooner or later regulation must come to online profiling. Europe is likely to lead the way, though not necessarily in a manner that meshes well with American views on innovation and freedom of expression. In the United States, considering Congressional gridlock and the risk of inconsistency among state legislatures, the best near-term hope for regulation that is both reasonable and effective may lie with the FTC or perhaps even the White House. Time will tell, but time is growing short in a world where machines grow more intelligent every day.