Google Mines Gmail for Big Data Gold

Google’s new Inbox app isn’t your father’s email

Nov 12, 2014

[Note: for a more detailed look at how data mining works in Gmail, see my previous piece on Medium.]

Email is in the midst of a radical transformation. What used to be simple unstructured communication is evolving into something very different – highly structured data that can be mined and exploited on a vast scale by advanced machine learning algorithms. The driving force behind this transformation is the ad-supported free email business model first launched by Hotmail in the 1990s and since perfected by Gmail. Google is now launching an entirely new email client, Inbox, which makes the transformation of email into big data visible to the naked eye.

It is no exaggeration to say that email is now big data. Gmail, the undisputed kingpin in this category, boasts 750 million users worldwide and, according to Google Senior VP Sundar Pichai, is well on its way to becoming a billion-user product. If we assume that a typical user receives just 5 to 10 non-spam messages per day, this means Gmail handles somewhere between 1.4 and 2.7 trillion inbound messages per year. This rushing torrent of data is not simply dumped into millions of inboxes to sit passively until users take notice. Rather, it undergoes an extraordinary sequence of distinct data mining operations before it ever sees the light of a user inbox. Previously shrouded in secrecy, the existence of these processes has recently been revealed in a landmark class action lawsuit against Gmail. Their bizarre and colorful names are worthy of a spy thriller: Content OneBox, ICEbox, Nemo, Moonshine, Monarch, Borgmon, Starbox, Colossus, Panopticon, HappyHour, and Tigress, among others.
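
The volume estimate above is simple arithmetic, and it is easy to check. Here is a rough sketch in Python that uses only the figures quoted in the paragraph (750 million users, 5 to 10 non-spam messages per user per day). The 365-day year and the names in the code are my own assumptions for illustration, not anything Google has published.

```python
# Back-of-the-envelope check of Gmail's yearly inbound message volume.
# Inputs are the figures quoted in the article; names are illustrative.
USERS = 750_000_000      # Gmail users worldwide, as cited in the article
DAYS_PER_YEAR = 365      # assumption: mail arrives every day of the year


def yearly_messages(per_user_per_day: int) -> int:
    """Total inbound non-spam messages per year across all users."""
    return USERS * per_user_per_day * DAYS_PER_YEAR


low = yearly_messages(5)    # low end: 5 messages per user per day
high = yearly_messages(10)  # high end: 10 messages per user per day

print(f"{low / 1e12:.2f} to {high / 1e12:.2f} trillion messages per year")
```

The range runs from about 1.4 trillion messages at the low end to about 2.7 trillion at the high end, so the midpoint sits close to 2 trillion.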

We don’t know exactly what every piece in this vast data mining machine does. But we know that its overarching purpose is to extract information from email content in order to build persistent user profiles and let Google target its ads with ever greater precision. Interestingly, the court documents reveal that mining operations which Google often claims are indissociable – such as scanning email for malware or spam on the one hand, and scanning for user profiling or ad targeting on the other – are in fact completely separate processes carried out at different times in the lifecycle of a Gmail message. According to one remarkable exchange between Google managers published in the court record, Gmail’s algorithms classify users into literally “millions of buckets”. In the same exchange, which dates from 2009, a Gmail product manager explains that “the key is to filter the daily routine communications and extract the more commercial user interests”.

Five years later Google continues to improve its ability to extract commercial meaning from your email. Its new Inbox app (for Android, iPhone and the Chrome browser) takes the concept of email categories introduced with 2013’s tabbed inbox to the next level. Instead of staring at a scrolling list of email subject lines, with Inbox you see bundles of related information nuggets that have been extracted from messages and grouped into convenient categories. For example, your bank statements will appear together, as will your Amazon purchase confirmations, UPS delivery notifications, travel and restaurant reservations, and so forth. Google considers that you may often never even need to see the old-fashioned email messages in which these nuggets arrived. It conveniently surfaces actionable items for you such as a button to confirm a reservation, check in for a flight or track a package, or maybe a map to show you how to get from the airport to your hotel.

Of course Inbox is only an interface. It doesn’t do data mining itself, but merely taps the underlying algorithms of Gmail. Nevertheless the implications of this new kind of email are astounding. Google can now peer into the stream of billions of merchant transactions flowing into Gmail inboxes, and understand what it all means in real time. Armed with this knowledge, which its controversial privacy policy allows it to combine with everything else it knows about users – their search queries, the web pages they browse, the YouTube videos they watch, where they live, how much they earn, and even whether they have children – Google can predict the buying behavior of the world’s consumers. And it can do so with a precision and speed, and on a scale, never before dreamed of.

Should commercial firms have such power, which perhaps exceeds that of even the world’s biggest intelligence agencies? If so, can this power be made compatible with users’ legitimate expectation that the firms will admit openly what they are doing and ask permission before they do it? These questions cannot be avoided. They will inevitably be raised by privacy advocates and regulators around the world as they take the measure of Gmail’s new capabilities. Innovation is not only good, but arguably unstoppable – I’ve even installed Inbox on my own iPhone to try it out. However, there is no reason to assume, as too many in the tech community unwisely do, that it’s OK for innovation to trample the rights of users. A happy medium must be found.

No one is saying that business models based on data mining and user profiling should be banned. But despite the insistence of tech industry insiders that “everyone understands how Gmail works”, the reality is that most consumers don’t. People have a right to be told honestly — in simple, direct language that they will not misunderstand — that everything they say or do online will be watched and tracked. Google’s notorious privacy policy is a particularly egregious example of language that has been deliberately designed to obfuscate what is really happening:

“We collect information to provide better services to all of our users – from figuring out basic stuff like which language you speak, to more complex things like which ads you’ll find most useful or the people who matter most to you online.”

No, Google, this is not an accurate explanation of why you collect information. Figuring out what language we speak or who matters most to us online is entirely secondary to your primary aim of profiling and tracking us so that you can hit us with the ads most likely to generate revenue for you. That is in my view a perfectly legitimate business model. But you need to come right out and say it in a way that everyone understands, and you need to give people a chance to opt out if they want to.

Written by Jeff Gould