The CCPA Challenge: Hunting For Inferred Data

Arjun Dutt
Towards Application Data Monitoring
3 min read · Jun 23, 2020

Power of inference

In a recent poll, we asked people whether they thought IP addresses should be considered personal information. 40% of respondents strongly believed they should; the other 60% did not. Fittingly, as we write this, there are reports of class action lawsuits over users’ online activity being tracked even while their browsers were in “private” mode. The reality is that an IP address, or a sequence of browser clicks, while appearing banal, can reveal a lot about an individual.

Most consumers believe they are engaging in a contract to exchange their personal data for value delivered by the apps and services they use every day. However, very few understand what they are actually providing, or the implications of providing it. If you hand over your date of birth and zip code, what harm could that really do?

The literature is rife with examples of how seemingly innocuous bits of information can be predictors of identity or circumstance. Inferring pregnancy from unscented lotion purchases or being able to identify ~90% of the US population with three relatively benign data elements is old news at this point. What is newer is the slew of privacy protections and related missteps that could disrupt the perceived value exchange equation if mishandled.

Protecting inferences

Europe’s GDPR turns two this year, and California’s CCPA is in its first. Where the GDPR stopped short of declaring inferences part of the scope of personal information, the CCPA explicitly includes them, defining an inference as “the derivation of information, data, assumptions, or conclusions from facts, evidence, or another source of information or data.” That definition puts your predicted Netflix match for a show, no matter how banal, in a protected category.

While there is robust discussion of data anonymization and re-identification risks and techniques, a key question remains for most enterprises dealing with large amounts of data: how do you figure out where your inferred data lives and where it is generated? Do you ask every engineer and data scientist to tell you? If so, how often do you ask them for updates? What are the organizational cost and cash outlay to achieve this, and what value do you actually get? How do you assess and measure accuracy? And where does responsibility lie if the powers that be deem your efforts insufficient?

Identifying inferred data early

While it is possible to crawl your many data stores and identify most types of personal data using machine learning algorithms, the challenge becomes more complex when dealing with inferences.

One can imagine inferring preferred products from basic personal information. It’s also easy to imagine making second-order inferences from customer actions when they are presented with product recommendations. As organizations become increasingly sophisticated in their analysis and use of customer data, the chain of inferences can become deep, broad, and convoluted. A specific nuance to consider: if crawling your data stores surfaces a set of stores holding hashed inferences, how do you practically figure out what they are, or whether they should be a priority?

This is where a different set of approaches is needed. Classifying and tagging data as it enters your system, and tracing its evolution and refinement as it is handled by different services and teams, is a key step toward greater clarity. It’s a bit like a radioactive dye test that reveals what is happening in the body, as opposed to invasive dissection. The good news is that techniques like Application Data Monitoring exist to help privacy engineers get their arms around where PII lives in their systems, and to track the lineage of inferred data fields through the services that generate them, back to the source, visually and without ambiguity.
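To make the idea concrete, here is a minimal sketch of tag-at-ingestion with lineage propagation. All names here (`TaggedValue`, `classify`, `derive`, the `pii:`/`inferred:` tag scheme) are hypothetical illustrations, not any particular product’s API: the point is simply that an inferred value inherits the tags and lineage of every input it was derived from, so it can be traced back to the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedValue:
    """A value carrying classification tags and a lineage trail."""
    value: object
    tags: frozenset   # e.g. {"pii:zip_code", "inferred:age_segment"}
    lineage: tuple    # ordered (operation, name) steps back to ingestion


def classify(value, source_field):
    """Tag a raw value at ingestion using a toy rule-based classifier."""
    rules = {"zip_code": "pii:zip_code", "dob": "pii:date_of_birth"}
    tags = frozenset({rules[source_field]}) if source_field in rules else frozenset()
    return TaggedValue(value, tags, (("ingest", source_field),))


def derive(name, fn, *inputs):
    """Compute an inference; the result inherits all input tags and lineage."""
    tags = frozenset().union(*(i.tags for i in inputs)) | {"inferred:" + name}
    lineage = tuple(s for i in inputs for s in i.lineage) + (("derive", name),)
    return TaggedValue(fn(*(i.value for i in inputs)), tags, lineage)


# Usage: an inferred segment stays linked to the PII it was derived from.
zip_code = classify("94107", "zip_code")
dob = classify("1985-03-14", "dob")
segment = derive("age_segment", lambda z, d: "sf_millennial", zip_code, dob)
print(segment.tags)     # carries both source PII tags plus its own inferred tag
print(segment.lineage)  # traceable back to the ingestion of each input
```

Because tags and lineage travel with the data rather than being reconstructed after the fact, a downstream crawl that finds `segment` can answer both CCPA questions at once: what it is (an inference) and where it came from.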

The goal is to treat all data, collected and inferred, as a critical priority whose safe handling is ensured by early detection and classification.

Thanks to Nishant Bhajaria for his contributions to this post. Image credit: undraw.co

Arjun Dutt

Co-founder and CEO of Layer 9, the Application Data Monitoring company.