Allo there — What big data and deep learning mean for privacy

Vijay Pandurangan
Oct 3, 2016 · 7 min read

I recently joined Benchmark as an Entrepreneur in Residence. My current areas of interest include machine learning, privacy, distributed systems, and telepresence. If you’re interested in one of these areas (or something else interesting) and want to chat, drop me a line!

A few weeks ago, Google released Allo, a new messaging application with a built-in assistant. When Allo was pre-announced, the team pledged not to store conversations by default, a key privacy feature. The subsequent decision to backtrack on this pledge was quickly criticized by a number of privacy advocates including Thomas Drake and Edward Snowden, causing major PR headaches during its launch. Google, the industry leader in machine learning (ML) technology, values these recorded conversations highly because of its belief that they will help feed data-hungry ML algorithms that power its AI agents, ad services, and many other products.

The power of data to optimize businesses and governments is nothing new — even in the 19th century this was widespread. In the past, our ability to process truly vast amounts of data and uncover subtle links was limited, the types of data being created and collected were generally not as intrusive, and simple encryption techniques allowed us to retain privacy where required. Recently, though, the availability of massive quantities of very precise data about all aspects of our lives, together with techniques to collect and effectively analyse them, has far outstripped our ability to maintain privacy. As companies continue to collect even more data about us, the lack of a clean solution allowing modern ML techniques to co-exist with individual data privacy will present increasing challenges to businesses in the area. While research in differential privacy may point the way to a solution, more work is required.

Historical big (medium?) data

Almost overnight, the advent of the electric telegraph in the 1800s reduced communication latency from months to seconds. Armed with the ability to instantly get information from and send instructions to agents in disparate locations, businesses were now able to operate far more efficiently. For instance, multinationals now had access to real-time global pricing information: a manager sitting in London could examine the prices in markets in New York, London, and Bombay and (with a rough estimate of shipping costs), buy goods from the optimal place.

The impact on government was also profound. The British Empire built a fully-owned network of undersea cables between all of its colonies. This gave it large advantages: bureaucrats in London were able to communicate with their counterparts in other Dominions in secrecy, while others who wanted instant communication often had to use British-owned links. Spying on these links (essentially a precursor to the NSA’s tapping of Internet traffic) provided the British with large troves of data which, when analysed, yielded invaluable intelligence.

At the time, it was well-known that messages could be intercepted at will, so most businesses and individuals who cared about privacy quickly adopted encryption/encoding techniques that were used at both ends to ensure privacy. At various times, telegraph operators in Europe attempted to prevent the use of ciphers (at one point they even gave out a list of acceptable words), but these efforts eventually failed.

In the latter part of the 20th century, retail businesses began analyzing a small subset of data from the “edges” of the retail network — i.e. consumer purchase data — to increase efficiency. Circuit City and WalMart used a centralized database to track stock, make predictions, avoid shortages, and make loans to customers. Tesco in the UK and 7-Eleven in Japan made great use of these data to determine which branches should stock which goods. The collection and analysis of personally identifiable behavioural data (though limited to interactions in individual stores) began to compromise individual privacy, if only slightly.

Current trends

In the last fifteen years, advances in the processing power of GPUs, deep learning algorithms, data collection and communication have allowed computers to “learn” all kinds of tasks, from classification (e.g., what breed of dog is in this photo) and regression, to more creative tasks such as music composition and driving vehicles. While deep learned models now outperform human experts on many tasks, the data-hungry algorithms rely on extremely large corpora of data to be effective. When models are trained on detailed data about individuals (e.g., web browsing history, location information, etc.), privacy considerations become important — both because of the scale of data held in one place as well as the potential for leaks from the models.

Continued improvements in deep learning hardware and algorithms are increasing both the amount of data we can process and the information we can extract from them. This trend disproportionately increases the value of large datasets which, in turn, increases the demand for faster and more efficient computation. Since state-of-the-art deep learning technology is widely available, the primary strategic advantages in this space are now large, deeply personal datasets and top technical talent. The ensuing “data arms race” is one of the main reasons that many of the largest technology companies (such as Google, Facebook, Uber, and Amazon) accumulate massive troves of data, which they jealously guard.

In some domains, large datasets with strict privacy or secrecy requirements are commonplace. The European Union takes a much more stringent approach towards data privacy: it requires explicit consent for any data sharing and collection, insists on a right to be forgotten, and enforces other restrictions. Although the United States has comparatively lax regulations outside the healthcare and educational domains, those rules are likely to be tightened as the largest companies accumulate increasing amounts of data about already-skittish consumers.

In response to a freedom of information request, NYC released a record of 173 million trips taken in yellow cabs in 2013. Each record consisted of the starting location and time, the ending location and time, and an (improperly) anonymised plate and driver’s licence number. (An article I wrote in 2014 has a better description of the data and of why the numbers were improperly anonymised.) The release of these data, combined with de-anonymisation techniques, allowed enterprising journalists to retroactively discover all kinds of private information about celebrities’ trips (essentially, they read the plate number in a photo of someone exiting a taxi and identified the trip in the database).
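The 2014 article linked above has the details; the core problem is that hashing a value drawn from a small, known keyspace is trivially reversible. Here is a minimal sketch, assuming an unsalted MD5 hash and one of the standard medallion formats (one digit, one letter, two digits) — both simplifications for illustration:

```python
import hashlib
import string
from itertools import product

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def build_rainbow_table():
    """Enumerate every plate matching the assumed format and hash it."""
    table = {}
    digits, letters = string.digits, string.ascii_uppercase
    for d1, l, d2, d3 in product(digits, letters, digits, digits):
        plate = f"{d1}{l}{d2}{d3}"        # e.g. "5X55"-style medallion
        table[md5_hex(plate)] = plate
    return table

if __name__ == "__main__":
    table = build_rainbow_table()
    target = md5_hex("7Y42")              # a hash as it might appear in the dump
    print(table.get(target))              # -> "7Y42": the "anonymised" value is recovered
```

Enumerating all ~26,000 possibilities takes well under a second on a laptop, so the “anonymised” column is effectively plaintext.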

These data aren’t just useful for knowing what club Justin Bieber attended. They can also be used for immense social good. Governments can use this information to improve traffic throughput, plan new transit lines, or locate emergency services. (Check out Sidewalk Labs in NYC if you’re interested in technology applied to cities.) Let’s say we wanted to predict the time it would take to travel between two points in NYC by taxi so that we could recommend the fastest mode of transport. We could train an ML model with these data, a historical feed of weather conditions, an event database, and potentially even logs from emergency services. We would also want to protect privacy, preventing the kind of analysis described above. In 2015, Fredrikson et al. showed that it was indeed possible to invert and extract data from black-box models if appropriate countermeasures weren’t taken.
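To make the prediction task concrete, here is a minimal sketch of the kind of travel-time regression one might train with scikit-learn. The input file, column names, and feature set are hypothetical stand-ins for a real pipeline that would join trip records with weather and event feeds:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical pre-processed table: one row per trip, already joined with weather data.
trips = pd.read_csv("taxi_trips_2013_with_weather.csv")
features = ["pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon",
            "hour_of_day", "day_of_week", "precip_mm", "temp_c"]

X_train, X_test, y_train, y_test = train_test_split(
    trips[features], trips["duration_minutes"], test_size=0.2, random_state=0)

model = GradientBoostingRegressor()
model.fit(X_train, y_train)
print("MAE (minutes):", mean_absolute_error(y_test, model.predict(X_test)))
```

A model like this is exactly the sort of artifact that can leak information about individual trips if trained and shared without countermeasures.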

Privacy and Deep Learning

As datasets contain more personal information, privacy can suffer. What steps can we take to safeguard user privacy? A technique known as differential privacy allows the publication of a slightly altered dataset with guarantees that aggregate statistical properties are maintained without leaking any new information about any individual whose data is contained in the original corpus. Granting access to these altered datasets will have much more limited privacy implications (in fact, NYC is holding a hackathon to ensure their new anonymisation process is sound).
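As a concrete illustration (not NYC’s actual process), the Laplace mechanism is the standard building block behind such releases: add noise calibrated to how much any one individual can change the statistic being published. A minimal sketch for an ε-differentially-private count:

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Release a count satisfying epsilon-differential privacy.

    One individual's record can change the true count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon masks
    any single person's contribution.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy example: how many trips ended in a given neighbourhood, with epsilon = 0.1.
trip_destinations = ["soho", "midtown", "soho", "harlem"]
print(dp_count(trip_destinations, lambda d: d == "soho", epsilon=0.1))
```

Smaller values of ε add more noise and give a stronger guarantee; an analyst who wants to publish many statistics has to budget ε across all of them.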

Recently, differential privacy research has been extended to deep learning: in June 2016, Abadi et al. published research describing the use of differential privacy in conjunction with deep learning. The authors demonstrate and evaluate a method to train Convolutional Neural Networks (CNNs) on a centralized corpus of data while adhering to certain privacy guarantees. While this results in a model with reduced accuracy, it offers substantially better privacy and accuracy than previous work in the area.
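The core idea of the paper’s training procedure (commonly called DP-SGD) is to bound each example’s influence by clipping its gradient, then add Gaussian noise before the parameter update. The sketch below is a heavily simplified NumPy version of that step; it omits the moments accountant the paper uses to track cumulative privacy cost, and the variable names are my own:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One simplified step in the spirit of DP-SGD (Abadi et al., 2016).

    per_example_grads has shape (batch_size, n_params): one gradient per
    training example. Clipping bounds each example's influence; Gaussian
    noise (std = noise_mult * clip_norm) hides any single example.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    clipped = np.stack(clipped)

    noise = np.random.normal(0.0, noise_mult * clip_norm, size=params.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(clipped)
    return params - lr * noisy_mean_grad
```

The accuracy loss mentioned above comes directly from the clipping and the injected noise; the privacy guarantee depends on the noise multiplier, the batch size, and the number of training steps.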

Specifically, the kind of differential privacy advocated by the paper promises that, by observing Bradley Cooper emerging from a taxi, we would be unable to make any definitive statements about where he originated, even with unfettered access to the ML model; the contribution of his trip record to the model would make him no worse off from a privacy standpoint. To be clear, while we may not learn anything from Bradley’s own trip, aggregate data might still leak information about him. For instance, if an Islanders game finished 20 minutes before Bradley got out of a taxi at a location our model predicts is 20 minutes from Barclays Center, we might conclude that his trip originated at the arena (especially if we know a priori that he’s a hockey fan). Though the model is working as expected, we can see that there is a philosophical debate about whether releasing aggregate statistics is permissible. Of course, this approach still requires us to trust one organization to hold all these data in one place.

In cases where amassing such a large trove of sensitive data (imagine health records or school records instead of taxi data) is unacceptable and storing datasets treated with differential privacy is not useful, we might be able to train a model over a number of different closely-held sets of data; previous work by Shokri et al. details one approach.
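Roughly, the idea is that each data holder trains locally and shares only parameter updates with a coordinator, so raw records never leave their silos. The sketch below uses a toy linear model and plain averaging of updates; Shokri et al. additionally select and perturb which parameters are shared, so treat this as my own illustrative simplification rather than the paper’s exact protocol:

```python
import numpy as np

def local_update(global_w, X, y, lr=0.01):
    """One data holder computes a gradient step on its own records for a
    shared linear model; only the update (never X or y) leaves the silo."""
    preds = X @ global_w
    grad = X.T @ (preds - y) / len(y)     # least-squares gradient
    return -lr * grad

def federated_round(global_w, silos):
    """The coordinator averages the updates received from each silo."""
    updates = [local_update(global_w, X, y) for X, y in silos]
    return global_w + np.mean(updates, axis=0)

# Toy example: two data holders, a 3-feature linear model, 50 rounds.
rng = np.random.default_rng(0)
silos = [(rng.normal(size=(100, 3)), rng.normal(size=100)) for _ in range(2)]
w = np.zeros(3)
for _ in range(50):
    w = federated_round(w, silos)
```

Even here, the shared updates can leak information about the underlying records, which is why later work combines this setup with differential privacy.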

Conclusions

Companies and governments are accumulating increasingly large datasets about us. Using ML, they’re profiting from the results: optimized ad algorithms, AI agents, self-driving cars, voice recognition, and a number of other innovations wouldn’t be possible without these data. Despite promising new research, there are (to my knowledge) currently no production-ready deep learning systems that offer satisfactory performance while also making useful privacy guarantees. Consequently, companies must either prioritize privacy and circumscribe the benefits of state-of-the-art AI, or — as Google did with Allo — ignore privacy concerns and forge ahead with large-scale data collection and analysis.

Bridging the privacy/data gap is a very important area for future work: a productionised, effective system that can operate at scale while making privacy guarantees will finally allow us to apply the benefits of modern deep learning techniques to privacy-sensitive datasets.

Thanks to Russell Power and Frank McSherry for answering some of my differential privacy questions!

Vijay Pandurangan

EIR @Benchmark. Formerly: Eng Director & NY Eng Site Lead @Twitter. Founder @MitroCo, TL/M @Google. www.vijayp.ca