The Natural History of Gmail Data Mining

Gmail isn’t really about email — it’s a gigantic profiling machine

A court case reveals a trove of documents about Gmail’s inner workings

[Updated in November 2014 to include additional facts about Gmail data mining, particularly the disclosure of Google internal documents revealing that Gmail is capable of classifying users into “millions of buckets”.]

In late 2010 a pair of obscure trial lawyers in Texarkana, Texas, who had apparently not been following the latest news from Mountain View, made what was to them a momentous discovery: ads in Gmail are correlated with keywords contained in the emails. The seeming naïveté of this pair regarding modern Internet business models did not, however, extend into the courtroom. This being Texas, and these being trial lawyers, they knew exactly what to do. Within weeks they had filed a class action lawsuit charging that Google’s data mining business model was a form of illegal wiretapping. Hundreds of millions of users of Gmail and Google Apps, they said, had never consented to having their emails intercepted and scanned for purposes of targeted advertising. If the Texas trial lawyers prevailed, Google would have to pay billions in compensation.

The Texarkana case may have started small, but it eventually morphed into the most serious legal challenge Google has ever faced over its core business model of analyzing online user behavior for profit. The legal intricacies of the case have been widely covered elsewhere and there is no need to recount them here. Indeed the real interest of the case lies not in the legal proceedings, but in what it tells us about the hitherto hidden inner workings of Gmail. Scattered through the thousands of pages of the case’s public record are never-before-seen details that reveal how and why Google conducts its data mining enterprise and how its methods have evolved over time. Later in this article we will let those gems tell their story.

Tech and media industry insiders often struggle to believe that there are still people who don’t understand how Google makes money. But the surprising reality is that vast numbers of well-informed people outside this insider elite have no clue that Google is the world’s largest advertising company. They do not imagine that after only 15 years in business Google makes more money from ads than all the world’s newspapers combined. They still see it as just a handy search engine that offers free email on the side and puts nice maps on their phones. They are more likely to have heard about driverless cars than to understand the term “data mining”.

After several years of sparring, the Texarkana case was transferred at Google’s request to Federal District Court in San Jose, where things began to heat up. Along the way it gathered numerous other lawyers and their plaintiffs, all seeking a piece of the pie. In California the case wound up in the hands of Judge Lucy Koh. The daughter of Korean immigrants, Koh grew up in Mississippi, where her parents owned a sandwich shop. After Harvard College and Harvard Law, she rose to a seat on the Federal bench at the young age of 42.

In just four years Judge Koh has become a superstar of high stakes Silicon Valley litigation. In addition to the Gmail case, she has presided over the Apple-Samsung patent battle and the case of the Google-Apple-Intel wage fixing cartel. In the courtroom her demeanor is low key, but her written rulings display a deft and confident intellect. Whether in writing or on the bench, she is swift to slap down lawyers who fail to take her measure. When an Apple lawyer in the Samsung patent case sought at the last minute to call a large number of additional witnesses, she famously retorted that he “must be smoking crack” if he thought she would grant such a request.

After a long string of courtroom defeats, including a stinging rejection by Judge Koh of its claim to have obtained adequate user consent for targeting ads based on email content, Google now appears to have regained the upper hand. In late May, after the case failed to win the class certification that would have entitled the lawyers to claim damages on behalf of hundreds of millions of users, all but one of the plaintiffs settled out of court for an undisclosed payment. The case of the lone remaining plaintiff, a 16-year-old Gmail user, is currently scheduled for trial in Judge Koh’s San Jose courtroom in October.

[Update: the last plaintiff settled in July 2014, thus ending the case, except for a skirmish between Google’s lawyers and a group of leading news organizations over Google’s attempts to keep most of the case’s record under seal. Judge Koh finally settled this issue in August, authorizing most but not all of the redactions demanded by Google. One document she did allow to see the light of day is a fascinating email exchange between Google employees about Gmail’s targeting capabilities (see discussion below).]

Gmail’s early history

Gmail was launched on April Fools’ Day 2004. Its true beginning though dates to the middle of 2001, when a Google engineer named Paul Buchheit started work on a web email project he had been thinking about since college. Web email wasn’t a new idea even then. Yahoo and Microsoft’s Hotmail had been doing it since the 90s. But Gmail introduced several innovations that would end up having a transformative effect on the whole category.

First, the product would leverage Google’s greatest strength — search. None of the other web email services had sophisticated search functions, nor did Microsoft’s Outlook email client software. But search implied something to search on, and this led to the second innovation — a vast amount of storage space per user. Hotmail at the time offered only a miserly two megabytes — not even enough to store a small PowerPoint slide deck with a few graphics. Gmail would offer a stunning one gigabyte (10 years later the figure has risen to 15 GB).

Finally, Gmail would offer an interactive user interface based on JavaScript. Gmail would be far more than a static web page updated by hitting the refresh button. It would be a true web application whose locally executed code offered functionality formerly available only in client-based software like Outlook. As such, it became one of the first and most influential examples of a new style of web design known as “Web 2.0”. While the buzzword itself is now mostly forgotten, the style lives on in nearly every widely used web service today.

From its earliest days Gmail was intended to be a money-making product. Like existing web email services, it would be free to users and earn revenue through advertising. But this being Google, the method of serving ads in email would be very different. Instead of relying on demographic information users provided about themselves at sign up, Gmail would attempt to grasp the actual meaning of user messages and target ads accordingly.

Beginning in September 2002, Buchheit, another Google engineer named Georges Harik and several of their colleagues filed a series of patent applications for this idea. In one patent filed in June 2003, when Gmail’s public launch was still nearly a year in the future, they described a lengthy series of “internal” and “external” message attributes that could be used in any combination to extract the meaning of an email and select the best ads to match it (see table). The examination of these attributes reveals much about the scale of Google’s ambitions.

Gmail’s limitless data mining ambitions

The most striking thing about the early Gmail patents is how exhaustive they were in attempting to anticipate every conceivable attribute of an email message that might one day be exploited for ad targeting purposes. In many cases it would be years before Google was actually able to make these ideas operational in Gmail. The first version of ad serving in Gmail exploited only concepts directly extracted from message texts and did little or no user profiling — this method would only be put into practice much later. Some attributes have still not been implemented today and perhaps never will be. For example, as far as I know, Google does not reach into your PC’s file system to examine other files residing in the same directory as the file you attach to a Gmail message, even though the patents explicitly describe this possibility.

Gmail doesn’t make much money from ads

The Gmail patents were more than just a theoretical exercise. When Gmail was finally released to the public in April 2004, its ad serving system used a sophisticated data mining algorithm known as PHIL, the subject of another Google patent filed by Georges Harik and a colleague. Already implemented the previous year in Google’s AdSense program that serves ads to web sites operated by third party publishers, PHIL stands for Probabilistic Hierarchical Inferential Learner. Despite the forbidding name, the basic idea is straightforward.

Words in documents such as emails occur not randomly but in certain clusters. When allowed to crunch through a vast number of such documents, simple software algorithms can identify clusters that are more or less likely to occur and group them together as “concepts”. For example, PHIL can learn to distinguish the entirely different meanings of two concepts such as “ski resort” and “lender of last resort” without being tripped up by the fact that the term “resort” occurs in both.
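To make the idea concrete, here is a minimal sketch of how word clusters can separate the two senses of “resort”. It is not PHIL: the clusters below are written by hand, whereas PHIL is described as learning them from a vast corpus, and the names `CONCEPTS` and `best_concept` are invented purely for illustration.

```python
# Hypothetical word clusters. A system like PHIL would learn these from a
# large corpus; here they are hand-written for illustration only.
CONCEPTS = {
    "ski resort": {"ski", "snow", "lift", "slopes", "lodge", "resort"},
    "lender of last resort": {"bank", "lender", "liquidity", "crisis", "resort"},
}


def best_concept(text):
    """Return the concept whose word cluster overlaps the text the most."""
    words = set(text.lower().split())
    scores = {name: len(words & cluster) for name, cluster in CONCEPTS.items()}
    return max(scores, key=scores.get)


print(best_concept("Book a lodge near the slopes at this resort"))
print(best_concept("The central bank acted as lender of last resort in the crisis"))
```

Both messages contain the word “resort”, yet the surrounding words pull each one toward a different cluster. That is the whole trick, scaled down from billions of documents to two sentences.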

In AdSense, PHIL matched concepts derived from sets of keywords provided by advertisers with concepts extracted from the web pages where publishers wanted Google to place ads. The idea was that the better the match, the more likely a visitor to the publisher’s site would be to click on the ad, which was the revenue generating event for Google. Launched in March 2003 — 13 months before Gmail — AdSense quickly grew to become Google’s second largest business after search itself, reaching more than $1 million a day by 2004 and $13 billion a year by 2013.

PHIL bears an obvious and striking resemblance to the “prototype theory” developed in the 1970s by Berkeley cognitive psychologist Eleanor Rosch. Google’s Georges Harik, the primary author of the PHIL patent, earned a PhD in artificial intelligence at the University of Michigan in the 1990s and likely would have been familiar with Rosch’s work or that of her disciples such as Berkeley linguist George Lakoff. PHIL also resembles clustering algorithms developed decades earlier by market researchers interested in grouping consumers into segments, for example as used in the Nielsen PRIZM system discussed below.

After seeing the early AdSense results, using PHIL for monetization in Gmail must have seemed like a no-brainer to the Google managers planning the fledgling email service’s launch in early 2004. But things did not work out as hoped. While Gmail was a wild success with the public from day one and quickly saw its user numbers surge first into the millions and then tens of millions and ultimately hundreds of millions, its revenue story has been far less glorious. Google regularly announces revenues from AdWords (search advertising) and AdSense in its quarterly Wall Street conference calls, but it has never disclosed a number for Gmail.

Since Gmail is easily the best known Google-branded product after search itself, this reticence is surprising. But there appears to be good reason for it. Analyst estimates put Gmail revenues for 2014 at barely $400 million, or less than 1% of Google’s total revenue. Yet Gmail on its 10th Anniversary in April 2014 was estimated to have over 500 million users worldwide. In other words, the average Gmail user produces less than $1 in revenue per year. However fabulous the scale and scope efficiencies of Google’s vast data centers, it is exceedingly difficult to believe that the cost of maintaining the average Gmail user falls below $1 per year.

Consider for example the cost of storage alone. In March 2014 Google, engaging in a vigorous price war with Amazon Web Services, cut its price on cloud storage to 31 cents per year per gigabyte. If the average Gmail user consumes only 20% of their nominally allotted 15 gigabytes, Google’s retail price for this amount of storage would be 93 cents, more than the revenue it gets from one Gmail user. Yet providing a service like Gmail requires far more in the way of infrastructure than just storage — bandwidth, compute power and the amortized cost of the software (though not user assistance) are also part of the mix.
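The arithmetic behind these two figures is simple enough to check in a few lines. The sketch below uses only the numbers quoted above, which are analyst estimates and published retail prices rather than anything Google has disclosed about Gmail; the 20% usage share is the same assumption made in the text.

```python
# Figures quoted in the article (estimates, not Google disclosures).
gmail_revenue_usd = 400e6   # analyst estimate of 2014 Gmail revenue
gmail_users = 500e6         # estimated users in April 2014
price_per_gb_year = 0.31    # Google cloud storage retail price, March 2014
quota_gb = 15               # nominal storage allotment per user
share_used = 0.20           # assumed fraction of the quota actually used

revenue_per_user = gmail_revenue_usd / gmail_users
storage_cost_per_user = quota_gb * share_used * price_per_gb_year

print(f"revenue per user per year:  ${revenue_per_user:.2f}")
print(f"storage at retail per year: ${storage_cost_per_user:.2f}")
```

Retail price is of course not the same as Google’s internal cost, so the comparison is indicative rather than exact.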

[Update: in October 2014 Google’s Sundar Pichai announced that Gmail had reached 750 million users worldwide and was “on its way to one billion users”. However, he said nothing about the service’s revenue. $1 per user per year continues to look like the high end of plausible Gmail revenue estimates.]

Conclusion: it is a near mathematical certainty that Gmail as a direct revenue proposition loses money for Google, probably hundreds of millions — or perhaps even a few billion — dollars per year.

Why is revenue generation in Gmail so much weaker than for search or AdSense? The widespread view in the online advertising world is that people are inherently less receptive to advertising when they are doing email. When users search the web they are by definition looking for specific information. This information seeking behavior predisposes them to respond favorably to ads that are related to their quest. The same is true to a lesser but still significant extent when users view content on a publisher’s web site. But when they are reading or writing email, they are not looking for information. Instead they are focusing on expressing their thoughts or learning those of their correspondents.

Google is now trying to improve Gmail revenue by delivering ads in the form of email directly to customer inboxes, rather than placing them as snippets of text in the surrounding window. This approach was made possible by the 2013 introduction of the tabbed inbox, which uses sophisticated data mining to recognize certain emails as commercial offers and shunt them into a separate “Promotions” tab. Categorizing email in this way has the key advantage for Google of allowing it to show ads to mobile Gmail users, something that was previously impossible due to the limited real estate available on mobile web clients. It also has the convenient consequence of funneling commercial emails that compete with Google’s advertising into a single place where Google can insert its own revenue-bearing messages.

The revenue impact of this innovation is not yet known and is in any case not likely to be disclosed by Google. But as we shall see, Google may already have solved its Gmail revenue problem by entirely different means. For it turns out that the vast river of emails flowing incessantly into the inboxes of half a billion users carries with it countless nuggets of information that can be mined to produce value in places other than in Gmail itself.

From ads to user profiles

Beginning in 2003 and continuing to the present, Google has filed a range of patents describing the use of various kinds of user profiles to improve both ad response rates and — perhaps even more significantly — the quality of search results.

The idea of exploiting information about users to improve ad targeting is not new or exclusive to Google. Indeed it is much older than web advertising itself. Companies like Nielsen have for decades used market research and data from many other sources such as the U.S. Census to create elaborate geographic, demographic and psychographic segmentations of consumers. Systems like Nielsen PRIZM, for example, originally developed in the 1970s, use sophisticated clustering algorithms not unlike Google’s own data mining techniques to divide Americans into such marketer-relevant buckets as Upper Crust, Blue Blood, Young Digerati, Beltway Boomers, Rustic Elders, Back Country Folks and Hard Scrabble, among dozens of others.

The idea behind all these systems is that tailoring messages to specific audiences increases the advertiser’s return on investment. Whether the messages are delivered in prime time TV ads, late night infomercials, traditional direct mail or highly targeted snippets of online text, they all cost money and advertisers naturally have a powerful incentive to optimize their spend. Knowing your customer is the key to successful selling. While you might imagine that Upper Crust or Young Digerati audiences are inherently more desirable than Rustic Elders or Hard Scrabble, this is not always the case. What may be true for BMW or Apple is not necessarily true for KFC or Coke.

How does Google’s online profiling work? At its core are the same patented PHIL clustering and concept extraction methods described above. A user (or group of users) can be described by various kinds of clusters. The simplest kind are clusters of terms used in documents created or viewed by the user. Another kind derives from the URLs of documents the user has viewed or perhaps forwarded to others by email or social media. A third kind — the most comprehensive — consists of the concept or category clusters extracted by the PHIL algorithm from documents the user has viewed (web pages, inbound emails) or created (outbound emails, social media posts).

Inbound emails to Gmail users are of particular value to Google in creating user profiles that make its targeting more effective. Consider the emails you receive in a typical day. They obviously include messages from family and friends, social media notifications, newsletters you subscribe to and whatever commercial offers have made it through your spam filter settings. They also typically include a large amount of data-rich correspondence from institutions — banks, utilities, schools, tax authorities, cable TV companies and — last but not least — online merchants such as Amazon, eBay or travel reservation sites where you have made purchases. Taken together, these inbound messages discriminate you from other users with a high degree of granularity, in much the same way that the number and quality of inbound links serve in Google’s famous PageRank algorithm to compute the relative rank of web pages.

Assuming conservatively that the average Gmail user receives just 10 non-spam emails per day, the annual flux of inbound Gmail probably approaches and may well surpass two trillion messages per year. That is a lot of content to subject to the discriminating eyes of PHIL and Google’s other data mining algorithms.
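The estimate is easy to reproduce. The sketch below takes the 10 messages per day assumed above and applies it to the two user counts cited in this article (500 million in April 2014 and 750 million in October 2014); the function name is made up for the example.

```python
def annual_inbound(users, per_day=10, days=365):
    """Messages received per year across all users, at a flat daily rate."""
    return users * per_day * days


for users in (500_000_000, 750_000_000):
    trillions = annual_inbound(users) / 1e12
    print(f"{users:,} users: about {trillions:.2f} trillion messages a year")
```

At half a billion users the total falls just short of two trillion; at 750 million it is comfortably above it.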

By building and continually updating a vast database of individual user profiles, Google can discern that one particular user who enters the word “blackberry” into her browser is likely interested in the fruit, while another user who types the same word is looking for a certain kind of phone. It can then choose which ads are appropriate to display with the search results. It can also choose which search results to display first. Google’s PageRank algorithm, which computes an aggregate statistical view of each web page’s relative importance based on many signals unrelated to individual users, will likely rank the phone “blackberry” above the fruit “blackberry”, and thus will return results that are irrelevant for the first user. This is a missed opportunity for Google not only to capture potential ad revenue, but even more importantly to satisfy a search user and thereby dissuade her from abandoning Google in favor of alternative methods of searching.
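A toy version of this disambiguation might look like the following. Everything in it is hypothetical: the profile weights, the concept names and the `pick_sense` helper are invented to illustrate the principle, and nothing here is drawn from Google’s actual profile format.

```python
# Hypothetical interest profiles: a weight per concept, as might be
# accumulated from a user's mail and browsing. All values are invented.
PROFILES = {
    "alice": {"gardening": 0.7, "baking": 0.5, "smartphones": 0.1},
    "bob": {"smartphones": 0.8, "gadgets": 0.6, "gardening": 0.05},
}

# Candidate senses of an ambiguous query, each tied to related concepts.
SENSES = {
    "blackberry": {
        "fruit": ["gardening", "baking"],
        "phone": ["smartphones", "gadgets"],
    }
}


def pick_sense(user, query):
    """Choose the sense whose related concepts weigh most in the user's profile."""
    profile = PROFILES[user]
    scores = {
        sense: sum(profile.get(concept, 0.0) for concept in concepts)
        for sense, concepts in SENSES[query].items()
    }
    return max(scores, key=scores.get)


print(pick_sense("alice", "blackberry"))
print(pick_sense("bob", "blackberry"))
```

The same query string yields a different sense for each user, which is exactly what an aggregate ranking that ignores the individual cannot do.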

Retaining and satisfying search users is without doubt Google’s most important business objective, because search continues to account for the lion’s share of its revenues and virtually all of its profit. Consequently, data mining methods that optimize the relevance of search results for users are of great strategic value to Google. The fact that Gmail even after 10 years in business continues to lose a significant amount of money on the back of very low revenue is therefore not necessarily of great concern to Google. The trillions of inbound documents that Gmail feeds into the maw of Google’s vast user profiling machinery more than make up for these losses.

[Update: Over the objections of Google’s lawyers, in August Judge Koh ordered the publication of a series of emails by Google employees in 2009 discussing the extraordinary power of Gmail’s user profiling algorithms. At the time AdSense’s “interest-based advertising” (IBA) was able to group users into only 700 or so categories based on the web sites they browsed. But Gmail, using classification algorithms such as PHIL and no doubt others as well, would divide users into literally “millions of buckets”. The value of such “Gmail user profile extraction” for “user based targeting” in Google’s other ad-based services such as AdSense was obvious to these employees, who eagerly discussed the new targeting opportunities it would make possible. Below are copies of this correspondence (obtained from PACER).]

Gmail profiles all users, even if they don’t see ads

We are now ready to turn to the court disclosures about Gmail’s data mining methods mentioned in the first part of this article. These details are contained in several key documents from the court proceedings, all of which are published on the PACER online public access system for U.S. Federal court documents. In many cases the published versions of these documents have been redacted — that is, censored — at Google’s demand. The documents contain long passages that describe the purposes, mode of operation and evolution over time of Gmail’s complex data mining apparatus. But often key phrases or entire sentences are blacked out, making it difficult to assemble a coherent vision of the whole.

Google’s lawyers argue that the redactions are necessary to prevent its competitors from gaining insight into “sensitive aspects of Google’s proprietary systems and internal decision-making processes”. They also claim that purveyors of malware and spam could use the information to circumvent Google’s systems for countering these threats. These arguments are difficult to take at face value. While the documents describe Gmail data mining in some detail, they remain at a high level and provide no hint as to what the millions of lines of software code that make up this system actually look like.

The true aim of Google’s redactions is almost certainly to prevent the public at large from understanding how its business really works. The redactions have been challenged by a group of news organizations including the New York Times, the Washington Post, NPR, Politico, Forbes, Gannett, and McClatchy. At the time of writing it is unknown when or how Judge Koh will rule on the unsealing motion submitted by these organizations. Nevertheless, in an important concession, the court has allowed open discussion of the documents by lawyers during the case’s public hearings. While the transcripts of these public hearings published on PACER are themselves redacted, observers who were present at the hearings are able to fill in the blacked out sections. As one such observer, I can thus offer my own reconstruction of the hitherto hidden inner workings of the Gmail data mining machine.

In fact, the documents paint a picture of Gmail’s block architecture that would likely be obvious to competitors and spammers, who already have extensive practical experience in the design of such systems, but to no one else. Do malware filtering and first-cut spam filtering happen before concept extraction? Yes; this sequencing at least is not surprising. But could messages be passed through spam filters more than once, perhaps both before and after concept extraction? Although not self-evident, this is in fact what happens. Can message content be exploited to make inferences about users (for example, to divine their age and various demographic and psychographic traits) even when messages are not analyzed for ad serving purposes? Yes: since it turns out that these two actions are implemented by Google in entirely distinct software modules, nothing stands in the way of separating them in this manner.

Does user profiling happen after a message is delivered to the user’s inbox or before? Actually both options are possible. Where you put user profiling in the email delivery process depends on why you are doing it. If you are using it just to target ads in the immediate message context, you can do it at the same time as ad selection, which in Gmail is initiated when a message is opened. But if you want to maintain profiles of every Gmail user for broader purposes — even of users who do not see Gmail ads — then it is better to perform the profiling upstream of the ad serving process.

One Box to rule them all

Perhaps the most significant revelation in the court documents concerns precisely this question of where user profiling is located in the Gmail pipeline. It turns out that this location is highly strategic and has undergone a fundamental change over the course of Gmail’s history. The precise timing of Gmail’s shift from a purely ad-based business model to one that combines ads and user profiling is unknown. The dates of the user profiling patent applications don’t necessarily tell us when Google actually implemented these ideas. But we know that some time prior to September 2010 Google introduced the first of a series of user profiling processes that it ultimately grouped together in an umbrella process known by the robotic-sounding yet oddly evocative name of “Content OneBox” (often abbreviated as “COB”).

The court documents tell us that COB is a master server process with distinct sub-components that performs multiple kinds of content analysis on Gmail messages, including but not limited to user profiling. It is in effect “one box to rule them all”. Among its functions are the PHIL-based extraction of message concepts described previously, updating the “user model” that Google maintains of each user, and attaching “smart labels” to messages that indicate their type (receipt, personal, social, promotion, etc.)

Content OneBox was originally located in what the documents refer to as Gmail’s storage area, which is distinct from its upstream message delivery process. COB operated after a message was delivered to a user’s inbox and opened by the user. It is crucial to understand that the content analysis performed by COB is not the same as ad selection. The latter process is performed by an entirely separate entity known as the “CAT2 Mixer”, also located in Gmail’s storage area.

Ad selection in Gmail is a dynamic process that occurs each time a message is opened, because the ads best matched to a particular message can vary over time. The CAT2 Mixer is thus triggered by the message opening event. It operates by comparing metadata that COB extracts from the message with metadata obtained from advertiser keywords and perhaps other signals pertaining to candidate ads. It is the same ad matching process used by Google’s AdSense program for third party publishers, which is not part of Gmail.

At some point, probably in early 2010, Google realized that large numbers of inbound Gmail messages were escaping the all-important Content OneBox meaning extraction and user profiling process due to shifts in the way people used Gmail. By then Gmail had hundreds of millions of users. Vast numbers of them — certainly many tens of millions — were accessing Gmail from iOS or Android devices. Since these mobile clients could not display ads due to their limited surface area, the CAT2 Mixer did not trigger, and consequently neither did COB. Even for users logged into Gmail from conventional web browsers, emails that were deleted without being read or simply never opened also did not trigger COB.

Last but not least, Google had by this time acquired tens of millions of users of the institutional versions of Gmail — Google Apps for Education, Government and Business. To entice such customers to sign up in large numbers — and in the case of Government and Business to pay real money for the service — Google found that it had to promise to keep ad serving turned off by default. Many millions of high value users were thus slipping through Google’s data mining net without being profiled.

Google concluded that something needed to be done to address the large and growing gaps in Content OneBox’s coverage. During the course of September and October 2010, it took the strategic step of moving COB from Gmail’s storage area to a position upstream in the delivery pipeline — that is, before rather than after messages arrive in user inboxes. This fundamental revision in Gmail’s data mining architecture apparently took two months to complete. Curiously, the months and year during which the change was made have been redacted in most of the published case documents. However, they were spoken aloud by the plaintiff lawyers and Judge Koh during a February 2014 public hearing and are thus no longer secret.

Just why Google sought to obfuscate these dates is unclear. But the fact that it did strongly suggests that the change in COB’s location was a major milestone in Gmail’s history. After September-October 2010, the content of every email would be data mined and every Gmail user would be profiled, regardless of whether ads were served and regardless of whether the user was an ordinary consumer, a middle school student, a government employee or a Fortune 500 CEO. The transition of Gmail from a business whose value to Google depended solely on advertising to one based equally or perhaps primarily on user profiling was now complete.
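The effect of the move can be sketched in a few lines. The trigger conditions below are a simplified reading of the account given above (before the move, content analysis ran only when ad selection was triggered by opening a message in an ad-capable client with ads enabled); the `Message` class and both function names are invented for the example and do not correspond to anything in the court documents.

```python
from dataclasses import dataclass


@dataclass
class Message:
    opened: bool       # did the user ever open the message?
    client: str        # "web" or "mobile"
    ads_enabled: bool  # False for accounts with ad serving turned off


def profiled_before_move(msg):
    """COB in the storage area: runs only when ad selection is triggered."""
    return msg.opened and msg.client == "web" and msg.ads_enabled


def profiled_after_move(msg):
    """COB upstream in the delivery pipeline: runs on every message."""
    return True


inbox = [
    Message(opened=True, client="web", ads_enabled=True),    # consumer, web
    Message(opened=False, client="web", ads_enabled=True),   # never opened
    Message(opened=True, client="mobile", ads_enabled=True), # mobile client
    Message(opened=True, client="web", ads_enabled=False),   # ads turned off
]

before = sum(profiled_before_move(m) for m in inbox)
after = sum(profiled_after_move(m) for m in inbox)
print(f"profiled before the move: {before} of {len(inbox)}")
print(f"profiled after the move:  {after} of {len(inbox)}")
```

In this toy inbox only one message in four would have been analyzed under the old arrangement. The three that escaped correspond to the three gaps described above: unopened mail, mobile clients and accounts with ads turned off.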

[Figure: Sequence of events in the life of a Gmail message]

Google profiles not just individual users, but whole classes of users

Google is the largest advertising company in the world, generating over $60 billion this year from the discovery that users will click on small text or banner ads when these are presented in a favorable context, such as search results, editorial content or (to a lesser extent) email. Yet the remarkable fact is that the vast majority of users — by some estimates 70% to 80% — never click on ads. According to online ad consultants, a typical Google ad must be seen dozens or hundreds of times by the minority of users who do occasionally click on ads to garner a single click. In short, ad clicks are rare events.

Think of Google’s business as a giant spreadsheet with hundreds of thousands of advertisers aligned across the top and hundreds of millions of users down the left side. The entire spreadsheet contains many trillions of cells. Each time a user clicks on an ad, imagine putting a number in the corresponding cell of the grid representing the revenue that Google earns from the transaction. If you step back and view the spreadsheet from afar, you will see that it is almost entirely empty. In fact, more than 99.99% of the cells have nothing in them. All of Google’s billions in revenue comes from the vanishingly small fraction of cells where a user has actually clicked on an ad. The technical term for this phenomenon is sparsity.

Sparsity is a problem for Google. Computing the likelihood of an ad click for every possible pair of users and ads in the grid may be an impossibly burdensome task, even for a firm with the unrivalled compute resources of Google. Individual user profiles, however accurate, are not enough to overcome the sheer scale of the problem. Therefore anything that reduces the sparsity of value generating events by clustering similar users (or ads) together will be of great value to Google. It is far more efficient for Google to target relevant market segments than individuals.
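A scaled-down version of the spreadsheet shows why grouping helps. The sketch below uses eight made-up users, four made-up ads and three made-up clicks; the two segments are assigned arbitrarily, whereas a real system would derive them from profile data.

```python
users = [f"u{i}" for i in range(8)]
ads = ["a0", "a1", "a2", "a3"]

# Hypothetical click log: (user, ad) -> number of clicks. Most cells are empty.
clicks = {("u0", "a0"): 1, ("u5", "a2"): 1, ("u6", "a3"): 1}


def density(cells, n_rows, n_cols):
    """Fraction of grid cells that contain at least one click."""
    return len(cells) / (n_rows * n_cols)


# Arbitrary illustrative segmentation: first four users in s0, the rest in s1.
segment_of = {u: ("s0" if i < 4 else "s1") for i, u in enumerate(users)}

# Roll the per-user grid up into a per-segment grid.
segment_clicks = {}
for (user, ad), n in clicks.items():
    key = (segment_of[user], ad)
    segment_clicks[key] = segment_clicks.get(key, 0) + n

user_density = density(clicks, len(users), len(ads))
segment_density = density(segment_clicks, 2, len(ads))
print(f"user-level grid:    {user_density:.1%} of cells filled")
print(f"segment-level grid: {segment_density:.1%} of cells filled")
```

The same three clicks fill a far larger share of the smaller grid, so each cell carries more evidence about what that group of users is likely to click on.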

We thus encounter again the question of how an online advertising firm can divide its users into operationally useful clusters. One obvious method would be for Google to enrich its internally generated user profiles with information acquired from outside data brokers like Datalogix, Acxiom or Epsilon. These brokers, which offer their services to advertisers and marketers, are well-known for trawling public and private data sources to assemble astonishingly detailed portraits of millions of individual consumers and their purchasing behaviors. And indeed the Wall Street Journal reports that Google as well as other online firms such as Facebook, Yahoo, Twitter and Microsoft are working with data brokers to correlate exposure to online ads with actual retail purchases.

It’s possible that Google also uses third-party data to enhance its internally generated user profiles. But as a practical matter, it would be very expensive for Google to do this for half a billion users. What’s more, data broker files on consumers can’t by themselves solve Google’s user clustering and segmentation problem, since enriching individual profiles only makes the task of computing matches between users and ads more demanding.

The ideal solution for Google would be to somehow reconstruct user demographic and psychographic segments directly from the metadata streaming out of Content OneBox and Google’s other profiling algorithms. A remarkable research paper published by Google data scientists in 2011 suggests that the company may have found just such a method. Entitled “A Tale of Two (Similar) Cities”, the paper describes an experiment in which the researchers compared millions of search queries from thousands of U.S. cities with published U.S. Census data for the same cities.

“The goal of our work is to extend techniques and data sources that have commonly been used for online single-user (or small group) understanding to extremely large groups (up to millions of users) that are usually only taken on by large studies by the Census. We want to determine whether the query stream emanating from groups of users — the inhabitants of 13,377 cities across the United States — is a good representation for the interests of the city’s inhabitants, and therefore a useful characterization of the city itself.”

The researchers report that a relatively simple mathematical analysis of city query streams allowed them to characterize individual cities and group them into clusters with much the same results as could be obtained by analyzing the hundreds of data points provided by the Census. In other words, by looking at nothing more than its own query streams Google was able to reconstruct essentially the same market segmentation as the massive and vastly more expensive operation of the once-in-a-decade U.S. Census.
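The article does not say which mathematical analysis the researchers used, so the following is only a sketch of the general idea under a stated assumption: each city is represented by a vector of query counts, and cosine similarity stands in for whatever measure the paper actually employs. The cities are real place names, but the queries and counts are invented for illustration.

```python
import math

# Hypothetical query counts per city. The numbers are invented.
city_queries = {
    "Aspen": {"ski pass": 50, "snow report": 30, "tide table": 1},
    "Park City": {"ski pass": 40, "snow report": 35, "surf forecast": 2},
    "Santa Cruz": {"surf forecast": 60, "tide table": 25, "ski pass": 3},
    "Huntington Beach": {"surf forecast": 45, "tide table": 30, "snow report": 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse query-count vectors (dicts)."""
    dot = sum(count * b.get(query, 0) for query, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(city):
    """Return the other city whose query stream looks most like this one's."""
    others = (c for c in city_queries if c != city)
    return max(others, key=lambda c: cosine(city_queries[city], city_queries[c]))

for city in city_queries:
    print(f"{city} -> {most_similar(city)}")
```

With this toy data the two ski towns pair up and the two surf towns pair up, without any census-style attribute being supplied. The same comparison works unchanged if the keys are individual users rather than cities.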

The scope and power of this method are far more general than the city-to-city comparison experiment described in the paper. Although the researchers do not mention email as a data source, a similar comparative method could clearly be applied to Gmail messages. Most importantly, the method could obviously also be applied to cluster individuals rather than whole cities into segments. The potential of the method to aid Google in both ad targeting and search quality was not lost on the authors. They write in their conclusion that:

“We show that by effectively combining location information (at the city level) with search engine query logs, we can ascertain the similarity of cities — even those that may not be geographically close. Finding similar cities provides a valuable signal to advertisers and content creators. Once success (or failure) is determined for the advertiser/content creator in one city, this analysis provides a basis for setting expectations for similar cities — thereby providing advertisers and content creators new cities to target (or avoid). Additionally, knowledge of the interests inherent in a city’s population provides important information for tailoring search-engine results to deliver results with a relevant local focus.”

How might Google apply this method in practice? Assuming that it does not want to go to the trouble of buying files on millions of individual users from data brokers, there are many freely available sources of data it could turn to. The most valuable such sources for segmentation purposes may not be the consumer portraits created by the data brokers, but aggregate data collections that describe relevant and commercially valuable groups of consumers. For example, the IRS publishes mean income data by ZIP code, and real estate sites like Zillow aggregate published data on home values in hundreds of cities. These two sources alone provide an accurate map of household income and net worth that can easily be correlated with Google’s vast store of user geolocation data.
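As a rough illustration of how little machinery such a correlation requires, the sketch below joins a ZIP-level income table to per-user locations. Everything in it is hypothetical: the ZIP codes, income figures, user identifiers and tier thresholds are invented, and it is not a description of any system Google is known to run.

```python
# Hypothetical aggregate table: mean income by ZIP code (invented values).
mean_income_by_zip = {
    "00001": 48_000,
    "00002": 131_000,
}

# Hypothetical mapping from user to the ZIP code where they are most often seen.
user_zip = {"u1": "00001", "u2": "00002", "u3": "00003"}

def income_tier(zip_code, table=mean_income_by_zip):
    """Map a ZIP code to a coarse income tier; thresholds are arbitrary."""
    income = table.get(zip_code)
    if income is None:
        return "unknown"  # no published figure for this ZIP
    if income >= 100_000:
        return "high"
    if income >= 50_000:
        return "middle"
    return "lower"

# Attach an estimated tier to every user via a simple lookup.
tiers = {user: income_tier(zip_code) for user, zip_code in user_zip.items()}
print(tiers)
```

Note that the tier describes the neighborhood, not the person: it is an estimate inherited from aggregate data, which is exactly what makes it cheap to compute for very large numbers of users.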

[Update: the preceding paragraphs were speculative when I first published them in June 2014, but subsequent research demonstrates that they are completely factual. For example, Google now explicitly allows advertisers to target users based on household income estimates derived from published IRS data about income by ZIP code. Even more interestingly, as of June 2014 it also lets advertisers target for parental status, i.e. whether the user has children at home. It’s worth quoting Google’s explanation of how it derives this attribute for its user model:

“How does parental targeting work? We used surveys to find hundred of thousands of respondents who are self-declared parents with children in their household. Our algorithm takes that data and find other users across our network that have similar Google Display Network and YouTube content consumption patterns and characteristics.”

In short, Google has operationalized the techniques developed by its researchers for correlating real-world user attributes with features extracted from their online actions. The power of this method is obvious. For example, Google can now tell its advertisers very precisely that “about 14% of users on the Internet are Moms with children in the household”. But more than that, it can tell them exactly which users are Moms with kids.]
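The approach Google describes for parental targeting, starting from surveyed users and finding others with similar consumption patterns, is commonly called look-alike modeling. Google does not disclose its algorithm, so the sketch below is only one simple way such a step could work: average the seed users' profiles into a centroid and flag any other user within a chosen distance of it. The categories, profiles and threshold are all invented.

```python
import math

# Hypothetical content categories; each profile lists the share of a user's
# viewing time per category, in this order. All values are invented.
categories = ["kids_videos", "recipes", "gaming"]
profiles = {
    "seed1": [0.60, 0.30, 0.10],  # self-declared parent (survey respondent)
    "seed2": [0.50, 0.40, 0.10],  # self-declared parent (survey respondent)
    "u1": [0.55, 0.35, 0.10],
    "u2": [0.00, 0.10, 0.90],
}
seed_ids = ["seed1", "seed2"]
max_distance = 0.2  # arbitrary cut-off chosen for this toy example

def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

seed_centroid = centroid([profiles[s] for s in seed_ids])

# Users outside the seed set whose profile lies close to the seed centroid.
lookalikes = [
    user for user, vector in profiles.items()
    if user not in seed_ids and math.dist(vector, seed_centroid) <= max_distance
]
print(lookalikes)
```

A production system would presumably use a trained classifier and far richer features, but the principle is the same: a small labeled sample is enough to assign the label to a much larger unlabeled population.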

Schools and school districts are another possible source of valuable segmentation data. They publish aggregate data on student test scores, income levels and ethnicity, and this data correlates well with other geographically tagged data sources (e.g. by ZIP code or Census district). Google says that its Google Apps for Education (GAFE) service has 30 million users worldwide [update: 40 million as of October 2014], of which many millions are in the United States. GAFE thus gives it a vast pool of users whose profiles it can compare with external data sources.

It would be a straightforward extension of the method described in “A Tale of Two (Similar) Cities” to compare user clusters derived from GAFE with the data published by schools and school districts. Once calibrated by comparison with external data in this way, Google’s clustering algorithms would no longer need to access that data, which has the disadvantage of being cumbersome to manage and static. Instead, the algorithms could dynamically extract valuable segments of young consumers directly from the stream of email flowing into GAFE student accounts. The resulting data could be used to target ads, improve search results or even provide Google advertisers with insights into purchasing trends among fine-grained segments of this population. For example, Google could tell brands in real time what the latest shoe buying trends are among urban teenage boys in selected cities, or which retail fashion brands are preferred by teenage girls whose families fall in a given income bracket and geographical region.

Google’s (partial) clarification of data mining in Gmail

Mining user emails for such purposes is of course not uncontroversial. In recent years Google’s data mining has faced increasing scrutiny from regulators and the media. The fiercest recent controversy concerns Google Apps for Education and was provoked by the Gmail data mining court case discussed above. In that case Google was forced to admit that, contrary to its promises to educators, it was in fact mining student emails in GAFE on a systematic basis for ad targeting and profiling purposes, and had been doing so for years. After a storm of criticism, Google announced in April 2014 that it would halt these practices. It now promises on its web site that “Google Apps for Education services do not collect or use student data for advertising purposes or create advertising profiles”.

A blog post by a Google executive also announced that it would permanently remove the longstanding option to turn ad serving back on in GAFE. Although the statement is not explicit, it likely means that the CAT2 Mixer — as discussed above, the process that selects Gmail ads once a mail has been opened — will be permanently excluded from the GAFE pipeline. However, in practice this is not a significant change, because most Google education accounts had not enabled the ad serving option anyway.

More striking is the fact that Google has not publicly said whether it will stop using Content OneBox to analyze GAFE emails. The carefully worded promise to stop using student data to create “advertising profiles” does not rule out the possibility that it will continue creating profiles that help it to optimize search results or identify valuable clusters of users.

A large-scale user profiling system like Gmail could identify many fine-grained and dynamically defined market segments of great interest to marketers without overtly adding sensitive attributes such as income, ethnicity, education level, social class or sexual preference to the metadata files of individual users, thus sidestepping many privacy issues. By analyzing millions of inbound emails from ecommerce sites and perhaps incorporating location data gathered from mobiles, such a system could enrich these segments with deep real-time insights into the actual purchasing behaviors of the underlying populations. By tracking inbound social media notifications to Gmail users, it could also allow Google to observe usage trends for competitors like Facebook and Twitter.

Significantly, the sweeping revision of Google’s privacy policy introduced in March 2012 added new language that expressly allows it to share such aggregate user data with its customers:

“We may share aggregated, non-personally identifiable information publicly and with our partners — like publishers, advertisers or connected sites.”

We cannot know for certain what Google is doing with the output of its vast and highly sophisticated email data mining machinery. But everything in its history and corporate DNA suggests that it has never encountered a body of data it does not want to analyze. When you have a truly gigantic hammer, the temptation to view everything as a nail may be irresistible.

Gmail users have “no legitimate expectation of privacy”

What makes Google’s user profiles and market segments different from traditional forms of consumer profiling is that they are generated by observing people’s private behavior in ways that the subjects themselves, although nominally informed by click-through privacy policies, often do not fully understand. In these circumstances, the issue of what constitutes consent — how explicit and unambiguous it must be to be judged authentic — becomes paramount. It is not the profiling itself that is objectionable. On the contrary, technology that lets consumers voluntarily disclose information about themselves to marketers in exchange for desirable services, and to do so effortlessly at vast scale, is of great benefit to all parties involved. Profiling only becomes objectionable — and subject to legitimate demands for regulation — when the “voluntary” part drops out of the formula.

Here we encounter a key difference between the two distinct populations of users that Gmail serves. On the one hand, there are consumers — ordinary individuals who voluntarily use an email service offered to them without charge, and who pay for it, knowingly or not, with information. On the other hand, there are members of organizations (schools, corporations, non-profit associations, government agencies, and so forth) who are provided with an email account by their organization, and who have little or no choice in the matter. The privacy rights of individuals in these two cases are fundamentally different. But Google — in what is arguably its greatest misstep in the controversy over its data mining practices — refuses to take this difference seriously.

Consumers have rights. The United States and the European Union in particular have laws and regulations that, although differently worded, make clear that a company cannot read your private email without your consent. However, the rules about how much consent is enough vary between the U.S. and Europe. EU data protection law is currently undergoing revision, and is likely headed toward a strong requirement for explicit, freely given, unambiguous consent for the kinds of personal data use that Gmail relies on.

The definition of consent under U.S. law is more controversial. In the California Gmail data mining case, Google argued that implicit user consent to data mining was sufficient, because even non-Gmail users whose emails are data mined when addressed to Gmail subscribers must:

“…impliedly consent to Google’s practices by virtue of the fact that all users of email must necessarily expect that their emails will be subject to automated processing.”

In other words, even non-Gmail users who have never seen, much less consented to, Google’s Terms of Service or Privacy Policy must nonetheless know that all email is subject to “automated processing”.

How can we be sure users know? Perhaps some of them read David Pogue’s column in the New York Times explaining how Gmail works. Google’s lawyers actually cite this example. But even they recognize that this is not enough. Here they engage in a transparent bit of subterfuge that Google marketers also frequently fall back on when groping for ways to justify Gmail data mining. The lawyers say that when users consent to email scanning for one reason — perhaps a perfectly uncontroversial one like spam filtering or malware detection — they thereby consent to scanning for any reason, even for activities like advertising or marketing that have nothing to do with the delivery of the service itself. Citing a previous case, Google’s lawyers make the preposterous claim that once users turn their email over to a third party service provider they no longer have any “legitimate expectation of privacy”. This is because:

“…the automated processing of email is so widely understood and accepted that the act of sending an email constitutes implied consent to automated processing as a matter of law.”

In short, because — according to this extraordinary doctrine — all forms of automated scanning are equivalent, once you agree to route your email through Google’s servers, Google is entitled to perform any manner of scanning it pleases.

These arguments were forcefully rejected by Judge Koh in pre-trial proceedings, but since the case has not yet gone to trial we cannot know for sure what their final disposition will be. Given this uncertainty and the many ongoing debates in Washington over possible revisions to U.S. privacy laws, it is hard to say where the U.S. will eventually come down on the question of what constitutes adequate user consent to data mining of consumer emails.

Gmail’s biggest privacy problem is in organizations

Nevertheless, in both the U.S. and Europe it seems clear enough that privacy and data protection laws do not in themselves present insuperable obstacles to Google’s “free services in exchange for personal information” business model. Targeted online advertising is the economic basis for much of today’s consumer Internet, and is practiced by most of Google’s largest competitors, including Facebook, Microsoft, Yahoo, and Twitter, as well as by countless smaller firms. The regulators are not saying they want to shut this model down. If they did, they would likely face a revolt from consumers and politicians. Provided that Google agrees to jump through the user notification and consent hoops that regulators want, it should be free to pursue its business model in the consumer market. Whether it wants to meet those conditions remains an open question.

But members of organizations also have rights. Data mining in email services provided to users by their organizations raises different and more difficult issues than in consumer email. When users have no say in whether to use an email service that profiles them for its own commercial purposes — as is the case when their organization subscribes to Google Apps — the regulatory and legal barriers become much higher on both sides of the Atlantic. In Europe, the data protection authorities have ruled that consent for data mining must be “freely given, specific and constitute an informed indication of the data subject’s wishes”, and have also held that consent given under constraint (such as exists in an employer-employee relationship) is not valid. The arguments about implied consent advanced by Google’s lawyers in California clearly cannot meet these European requirements.

In the United States a broad array of sector-specific privacy and data protection laws create even greater obstacles to institutional use of email services that rely on data mining. Federal statutes such as FERPA in education and HIPAA in health care, as well as the FBI’s CJIS regulations in law enforcement, unambiguously exclude the intrusive, commercially motivated deep analysis of user content and behavior that lies at the heart of all versions of Gmail.

The future of Gmail data mining and the need for transparency

Despite these concerns — or perhaps because of them — Google continues to wage a legal battle in Judge Koh’s San Jose courtroom to prevent the media and the public from learning the full extent of user profiling in Gmail. Its effort to redact all revealing details from the public record of this case is in striking contrast to the justly famous mission statement posted on its web site:

“Google’s mission is to organize the world’s information and make it universally accessible and useful.”

It is not hard to understand why Google might be reluctant to spell out exactly how Content OneBox and Gmail’s many other data mining procedures do their work. Machine learning is a rapidly evolving discipline where new and improved algorithms are discovered every day, while both the volume and the types of data that Google can subject to its algorithms are constantly expanding. Why concede to regulators the right to supervise and perhaps place inconvenient restrictions on your data mining practices when you expect the scope and power of those practices to grow exponentially for years to come? One of Google’s leading data scientists, Ray Kurzweil, famously predicts to anyone who will listen that software algorithms will surpass human intelligence within another decade or two at the most. Google’s leaders may feel that it makes more sense to stall for time by distracting regulators with obsolete technologies like cookies or endless debates over the meaning of privacy policies that few users will ever read.

Yet at a time when Google is calling on governments around the world to disclose and limit their surveillance practices, its relentless secrecy about its own highly intrusive user profiling is especially paradoxical. U.S. intelligence agencies are today subject to far more scrutiny from the courts, Congress and the media than Google itself. But what is good for the goose should surely be good for the gander. Perhaps it is time for Google to embrace the same transparency about data mining it wishes to see in others.