By Côme Dartiguenave and Charlotte Labauge, Associates at Partech
Ariadne’s thread is not only a famous Greek myth but also a logical thinking process that people use when faced with complex problems, such as a maze or a puzzle, that can only be solved by exploring different paths and iterating.
Processing exponentially growing volumes of data confronts our society with a dilemma: how to protect privacy while still extracting valuable insights from that data. Finding the right balance will require collective thinking, iterations, and technology.
This article is meant to qualify the risks related to data privacy and share some of the promising solutions to this difficult problem, as a modest contribution to this new Ariadne’s thread :)
Data is the heart of the new economy
It is not news that the amount of data generated has grown exponentially over the past few years, with 2.5 quintillion bytes generated a day.
This unprecedented volume of data, combined with the development of more sophisticated analytical tools and artificial intelligence, has allowed businesses, organizations, and startups alike to adopt an increasingly data-driven approach. Whichever industry you work in, or whatever your job, you have almost certainly heard a story about how data is transforming our lives and the way organizations operate. Data is used in countless ways and by vast numbers of people: by scientists to find new cures; by companies to boost revenues, optimize costs, and build personalized products; and by governments for public service and security purposes.
Much of the data collected today, such as web data, is not sensitive. This is public-facing data (e.g. data on what companies are selling, public government databases, weather forecasts, etc.). Leveraging this data can be extremely valuable for businesses and research programs, and it does not create a major risk for consumers.
However, many businesses and organizations also collect vast amounts of personal or transactional data. Personal data is any information relating to an individual. It includes not only information that directly identifies people (such as name, social security number, email address, home address, passport number, credit card number, and driver’s license number) but also information that, combined with other personal information, can identify an individual (such as country, state, postcode, age, gender, name of the school one attended, and workplace).
Many social media and e-commerce companies (Facebook, YouTube, Instagram, Amazon, etc.) use personal information to recommend content and advertising you might appreciate, based on your own activity or on what people similar to you like. Personal data is also not always used directly by the company that gathers it. Quite often, it is monetized by being aggregated, anonymized, and then sold to other companies, mostly for advertising and competitive research purposes.
Sensitive personal data is thus inherently and increasingly at risk
Personal data is very valuable to hackers and other people with malicious intent, who can leverage it to demand ransoms, access financial credentials, or damage reputations.
There have been massive data breaches in recent years, resulting in large amounts of personal data being leaked. One of the most recent was reported by Marriott in November 2018: an estimated 383 million guests had personal data exposed. In 2016, Yahoo reported the largest breach in history, with 3 billion accounts stolen back in 2013. All in all, the Privacy Rights Clearinghouse, a nonprofit dedicated to raising consumers’ awareness of privacy matters and protecting privacy rights, reports that 11.6bn records have been exposed in 8,804 data breaches since 2005. That is an average of 1.7 data breaches and 2.2 million records exposed per day.
Sometimes personal data is leaked unintentionally by the companies themselves. Despite their best efforts, companies are not bulletproof and lose control of some data. In April 2019, security researchers reported that data from 540m Facebook accounts had been publicly exposed on Amazon’s servers. These leaks came from two third-party developers, Cultura Colectiva and At The Pool. Last year, Google declared it had exposed the personal data of 500,000 Google+ accounts due to a software bug, leading to the official shutdown of Google+.
The first victims of these breaches are individuals, who face identity theft, financial fraud or other serious consequences. Awareness is rising among individuals: ‘Have I Been Pwned’, a website enabling people to check if their personal data has been compromised by data breaches, was created by Troy Hunt in 2013 and reached 2m subscribers in June 2018.
The second victims are the companies that suffer the data breaches, as they usually face serious consequences.
The financial impact can be significant and includes lost revenue, stolen money, and fines. According to IBM, the global average cost of a data breach in 2018 was $3.86 million, up 6.4% over 2017. The average cost of each lost or stolen record containing sensitive and confidential information also increased by 4.8% year over year, to $148. When it comes to fines specifically, GDPR penalties can reach €20 million (about £17 million) or 4% of global annual turnover, whichever is higher.
Reputational damage is also substantial because companies lose people’s trust and, eventually, customers. For instance, in 2015, TalkTalk reported a data breach in which the bank account information of about 15,000 customers was stolen. The company lost 95,000 subscribers, costing it £60 million.
Additionally, a data breach can result in a loss of IP, increased legal and investigation fees, escalating insurance premiums, and public relations damage-control costs. The numbers speak for themselves: Equifax recently announced it had spent roughly $1.4bn on data breach costs. The company was hacked in 2017, and the breach compromised the personal information of roughly 150m people.
We strongly believe the answer to these rising data privacy challenges cannot be found without combining clear regulatory guidelines and better technology solutions.
Data privacy is rapidly becoming a priority for regulators
Following this dramatic surge in data breaches, many questions and concerns around data privacy have emerged, resulting in a strong regulatory push. In Europe, the GDPR (General Data Protection Regulation) became enforceable in May 2018. In the US, California passed the CCPA (California Consumer Privacy Act) in June 2018; it takes effect in January 2020.
Although the GDPR is the more comprehensive of the two, both aim to improve personal data protection. The objectives of these laws include:
- Companies that collect, process and store personal data should be transparent about the type of data they collect, how they use it and how / with whom they share it.
- People should have control over their personal data. This translates into several rights, including the rights to know, to access, to opt out (or say no), and to deletion.
- Relevant and solid security measures to protect personal data should be implemented, and companies should be able to document and prove it.
However, anonymization is not as easy as it seems: better technologies will be needed to combine privacy protection and data opportunities
Usually, when people think about the protection of personal data, they think about anonymization, also called de-identification. They often understand anonymization as simply obfuscating directly identifying data, such as the user’s name.
However, academics have long highlighted the shortcomings of such anonymization techniques. In 2000, Sweeney’s research showed that “87% of individuals in the 1990 U.S. census could be uniquely identified by just three pieces of information: their birth date (day, month, and year), gender and their five-digit postal code”. This example demonstrated that sensitive data can easily be uncovered by correlating a dataset from which personally identifiable data has been removed or masked with another dataset that still contains such data. Similarly, a recent study from MIT and other research published in Scientific Reports highlighted the risks related to location data, which is highly specific to individuals and can easily be used to re-identify them.
Anonymization is not as easy as it may initially seem. Most traditional anonymization techniques (e.g. obfuscating, masking, etc.) generally lead to only partial anonymity, because companies often store and share “grouped” data that can be analyzed and matched against other datasets to be linked back to you individually. Thus, a better way to think about effective de-identification is that no one should be able to uniquely identify an individual within a dataset, regardless of what additional information they may have about that individual.
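To make this kind of linkage concrete, here is a minimal sketch in Python. It is not Sweeney’s actual method or data: the records, names, and the `link_records` helper are all made up for illustration. It simply joins a “de-identified” dataset with a public one on the three attributes mentioned above.

```python
from collections import defaultdict

# Attributes that are not identifying on their own but are in combination.
QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")

def link_records(deidentified, public):
    """Return {name: diagnosis} for rows matching exactly one public record."""
    index = defaultdict(list)
    for person in public:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index[key].append(person["name"])

    matches = {}
    for row in deidentified:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match means the row is re-identified
            matches[names[0]] = row["diagnosis"]
    return matches

# Hypothetical "anonymized" medical records: names removed, nothing else.
deidentified = [
    {"birth_date": "1961-07-14", "gender": "F", "zip": "02138", "diagnosis": "asthma"},
    {"birth_date": "1975-03-02", "gender": "M", "zip": "02139", "diagnosis": "diabetes"},
    {"birth_date": "1980-11-23", "gender": "M", "zip": "02140", "diagnosis": "flu"},
]

# Hypothetical public register that still carries names.
public = [
    {"name": "Alice Example", "birth_date": "1961-07-14", "gender": "F", "zip": "02138"},
    {"name": "Bob Example", "birth_date": "1975-03-02", "gender": "M", "zip": "02139"},
    {"name": "Carl Example", "birth_date": "1980-11-23", "gender": "M", "zip": "02140"},
    {"name": "Dan Example", "birth_date": "1980-11-23", "gender": "M", "zip": "02140"},
]

reidentified = link_records(deidentified, public)
print(reidentified)  # {'Alice Example': 'asthma', 'Bob Example': 'diabetes'}
```

In this toy example, two of the three “anonymous” patients are re-identified. The third is not, only because two people in the public register happen to share the same birth date, gender, and postcode.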
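One common way to make this idea measurable is k-anonymity: every combination of quasi-identifying attributes must be shared by at least k records. The Python sketch below is only an illustration under our own assumptions (the rows, the `smallest_group` helper, and the `generalize` rule are hypothetical), and k-anonymity is a weaker guarantee than the ideal stated above, since it does not protect against every kind of background knowledge.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")

def smallest_group(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return k: the size of the smallest group sharing the same quasi-identifiers."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values(), default=0)

def generalize(row):
    """Coarsen a row: keep only the birth year and the first 3 postcode digits."""
    return {**row, "birth_date": row["birth_date"][:4], "zip": row["zip"][:3] + "**"}

# Hypothetical records, invented for this example.
rows = [
    {"birth_date": "1980-11-23", "gender": "M", "zip": "02140", "diagnosis": "flu"},
    {"birth_date": "1980-02-05", "gender": "M", "zip": "02141", "diagnosis": "asthma"},
    {"birth_date": "1961-07-14", "gender": "F", "zip": "02138", "diagnosis": "asthma"},
    {"birth_date": "1961-01-30", "gender": "F", "zip": "02139", "diagnosis": "diabetes"},
]

k_raw = smallest_group(rows)
k_generalized = smallest_group([generalize(r) for r in rows])
print(k_raw, k_generalized)  # 1 2
```

With exact birth dates and postcodes, every record is unique (k = 1). After generalization, each record is indistinguishable from at least one other (k = 2), at the price of less granular data.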
Being able to seamlessly and safely share sensitive data is critical for both private and public organizations, but they currently face two major challenges. First, organizations must preserve the privacy of users, including against the potential re-identification attacks mentioned previously. Second, to benefit from the “data opportunity”, organizations must try to maintain the usability and granularity of the data to ensure accurate analytical outcomes.
Today there is an ongoing effort from researchers and companies (e.g. Dataguise, BigID, Protegrity, and Aircloak, to name just a few) to come up with new techniques to maintain privacy in light of increasingly sophisticated attacks. Back in 2017 at Partech, we invested in Privitar, which provides privacy engineering software. Its customers, such as the NHS (the UK National Health Service), are able to safely use and share sensitive data by using strong anonymization techniques such as k-anonymity and differential privacy.
Of course, there is no one-size-fits-all solution when it comes to privacy: each solution has its pros and cons. As Tyler Elliot Bettilyon recently analyzed in a great post, there is usually a trade-off between the quality and granularity of the insights you can get from a dataset and the level of anonymization and privacy protection you can ensure. Solving this complex dilemma is the new Ariadne’s thread.
As investors, we are deeply convinced that the good use of data is crucial for our society, whether it is to gain insights, build better products, or develop new cures. However, we strongly believe that there is an urgent need for better education regarding the risks associated with heavy data usage and for better tools to actually mitigate those risks. Finding and helping the startups that will provide the thread to get out of the maze is our job. This is why we invested early in Privitar, and we are thrilled to welcome Accel on board for the Series B to help us support the team in their journey toward providing better data privacy.
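The trade-off can be seen in a few lines of Python with differential privacy, one of the techniques mentioned above. The sketch below uses the textbook Laplace mechanism on a simple counting query; it is an illustration under our own assumptions (the function names, the count of 1,000, and the epsilon values are made up), not a description of how any particular product works.

```python
import random

def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1)."""
    scale = 1.0 / epsilon  # smaller epsilon = stronger privacy = more noise
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centered on 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average distance between the noisy answer and the true count."""
    rng = random.Random(seed)
    total = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials

# Hypothetical query: "how many patients have asthma?" with a true answer of 1,000.
strong_privacy_error = mean_abs_error(1000, epsilon=0.1)
weak_privacy_error = mean_abs_error(1000, epsilon=1.0)
print(strong_privacy_error > weak_privacy_error)  # True
```

A lower epsilon gives individuals stronger protection but makes each published answer less accurate, which is exactly the tension between privacy and granularity described above.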
Whether you are an industry expert, a startup or an investor, we would love to hear your thoughts: please feel free to ping us directly (Côme and Charlotte) if you would like to further discuss this exciting yet challenging topic!