The Evolution of “Privacy-Tech”: Data Collaboration

Sidharth Ganesh
6 min read · May 17, 2023


This is the second of a four-part series on “Privacy-tech”. You can read the introductory section here.


Secure Data Collection

Most applications today use encryption to securely transmit data from our devices to central servers. With end-to-end encryption, the data can be decrypted only by the end-users, not by the central server. For instance, WhatsApp chats are readable only by those involved in the conversation, not by WhatsApp's servers. While encryption ensures data is securely collected and transmitted, it doesn't allow central servers to perform any computation on it. For messaging this may be the desired behavior, but if you wanted to understand which emoji is used the most to inform a new feature, you wouldn't be able to perform that analysis, even in aggregate. This is where the techniques below, from private data collection to data vaults, come in.
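
To make that limitation concrete, here is a minimal Python sketch with symmetric encryption standing in for a full end-to-end protocol. (Real messengers negotiate per-conversation keys with schemes like the Signal protocol; the pre-shared key below is a simplifying assumption.)

```python
from cryptography.fernet import Fernet

# Both chat participants hold this shared key; the relay server never sees it.
shared_key = Fernet.generate_key()

sender = Fernet(shared_key)
ciphertext = sender.encrypt("loved the demo 👍".encode())

# The server can store and forward `ciphertext`, but with no key it cannot
# read messages or count emoji usage, even in aggregate.
receiver = Fernet(shared_key)
print(receiver.decrypt(ciphertext).decode())  # only an end-user can do this
```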

On a related note, with recent advances in quantum computing, legacy encryption technologies are at risk. Competitions are underway to develop quantum-safe encryption algorithms for storing data more securely, with efforts led by NIST (the National Institute of Standards and Technology) in the US. When the winning algorithms are finalized in 2024, mainstream commercial adoption can be expected soon after¹. This new wave of post-quantum cryptographic techniques is expected to be computationally expensive, and hardware advances may also be required for mainstream adoption.

Irrespective of the type of encryption, once data is collected and decrypted on a server, a breach still violates privacy. This is where private data collection comes in.

Private Data Collection

Today, most websites and applications capture user behavior at extreme granularity. Users are assured that analytics performed on their data will have personally identifiable information (PII) anonymized, and that their data is therefore safe. However, attackers have in the past been able to de-anonymize such datasets by matching them against non-anonymized datasets, as happened with the Netflix Prize data².

A popular method to overcome this is federated learning, in which models train on private data directly on client devices³. Only the model updates (such as weights or gradients), not the raw data, are sent to the server, limiting what is shared. However, AI models often encode more information than necessary and can inadvertently memorize private or identifiable information. By querying the model, an attacker could determine whether a particular record was in the training data, a so-called membership-inference attack⁴.
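
Here is a toy federated-averaging round in Python to show the mechanics. It is a sketch, not a production setup: real frameworks add client sampling, secure aggregation, and privacy accounting on top.

```python
import numpy as np

rng = np.random.default_rng(42)

def local_update(global_w, X, y, lr=0.1, steps=20):
    """One client's training pass on its private data (linear model, MSE loss)."""
    w = global_w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Simulate five clients, each holding private data it never uploads.
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(5):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(3)
for _ in range(10):
    # The server receives only model weights and averages them (FedAvg).
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(updates, axis=0)

print(global_w)  # converges toward [1.0, -2.0, 0.5]
```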

Apple collects information from its devices, on popularly used emojis, websites that crash Safari, and similar data points, using a technique called differential privacy. Perhaps the largest roll-out of privacy-preserving data collection was the collaboration between Apple and Google on COVID contact tracing⁵.

Differential privacy (DP) systematically inserts noise into collected data so that aggregate calculations remain accurate while individual entries reveal little on their own⁶. Although still gaining adoption, differential privacy is a robust method of ensuring data privacy not just at the source but through downstream computations too: any processing of differentially private data, including machine learning, remains differentially private (the post-processing property). Many proponents posit that DP can make confidential data widely available and obviate the need for clean-rooms or data curators⁷. However, data utility does decrease somewhat under differential privacy, and current research focuses on reducing that loss to minimal levels.
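
One of the simplest local differential privacy mechanisms is randomized response, sketched below for the emoji example. This is illustrative only; Apple's deployed mechanisms are considerably more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)
P_TRUTH = 0.75  # probability of answering honestly (tunes the privacy level)

def randomized_response(uses_emoji: bool) -> bool:
    """Each device reports the truth with probability P_TRUTH, else a coin flip."""
    if rng.random() < P_TRUTH:
        return uses_emoji
    return rng.random() < 0.5

def estimate_true_rate(reports) -> float:
    """Server unbiases the noisy reports: E[observed] = p*true + (1-p)/2."""
    observed = np.mean(reports)
    return (observed - (1 - P_TRUTH) / 2) / P_TRUTH

truth = rng.random(100_000) < 0.3            # 30% of users really use the emoji
reports = [randomized_response(t) for t in truth]
print(f"{estimate_true_rate(reports):.3f}")  # ≈ 0.300
```

Because any single report might just be a coin flip, no individual answer can be held against a user, yet the aggregate estimate stays accurate.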

Data Vaults and Clean-Rooms

Data vaults create a safe space for data to be stored and accessed on an as-and-when-required basis. They help ensure that only the right data is selectively shared, addressing what is commonly referred to as the "bundling" problem. Popular use-cases involve securely storing PII, payments data (for PCI compliance), and healthcare data (for HIPAA compliance).
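
A minimal sketch of the core vault pattern, tokenization, is below. The `DataVault` class and its methods are hypothetical stand-ins for commercial offerings that add access control, auditing, and encrypted storage.

```python
import secrets

class DataVault:
    """Hypothetical minimal vault: swaps sensitive values for opaque tokens."""

    def __init__(self):
        self._store = {}  # token -> raw value, kept inside the vault boundary

    def tokenize(self, sensitive_value: str) -> str:
        token = "tok_" + secrets.token_urlsafe(16)
        self._store[token] = sensitive_value
        return token

    def detokenize(self, token: str) -> str:
        # A real vault would enforce access policies and audit logging here.
        return self._store[token]

vault = DataVault()
token = vault.tokenize("4111 1111 1111 1111")  # raw card number stays in the vault
print(token)  # the application database stores only this harmless token
```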

A data clean-room is similar to a data vault, used when multiple parties decide to collaborate. In a clean-room, computations are performed without any party revealing its data to, or learning the data of, the others. Fully homomorphic encryption (FHE), secure multi-party computation (MPC), and trusted execution environments (TEEs) are a few technologies that have garnered interest in this space.

Data clean-rooms are being set up by large brands to share marketing data with partners. Advertising players such as Facebook, Google, and Amazon also share data with their clients through clean-rooms. Gartner predicts that by 2025, 50% of large organizations will default to using privacy-enhancing technologies (PETs) for their data, so this is a large market⁷.

Homomorphic encryption and secure multi-party computation are cryptographic methods for performing computations without revealing the data's contents. Trusted execution environments are isolated, hardware-encrypted enclaves within a processor's memory that protect data while it is in use.
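
A toy example of the MPC idea, additive secret sharing over a finite field, is sketched below. Production MPC protocols handle multiplication, malicious parties, and networking, none of which appear in this illustration.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic happens modulo a public prime

def share(value: int, n_parties: int = 3):
    """Split a value into n random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Two companies secret-share their private revenue figures.
alice_shares = share(25)
bob_shares = share(17)

# Each compute party holds one share from each company and adds them locally;
# any individual share is a uniformly random number that reveals nothing.
summed = [(a + b) % PRIME for a, b in zip(alice_shares, bob_shares)]

# Only the recombined result is revealed: the total, never the inputs.
print(sum(summed) % PRIME)  # 42
```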

Clean-rooms and vaults depend on trust in the entity that sets them up. This works well for a single brand's marketing ecosystem, but it is inadequate for a large ecosystem of hundreds or thousands of entities. This is where decentralization and trustless data platforms come in.

Trustless Data Platform

Data clean-rooms allow multiple parties (or stakeholders in the same company) to collaborate on data in a privacy-preserving manner. However, the Fundamental Law of Information Recovery states that "overly accurate answers to too many questions will destroy privacy in a spectacular way". Privacy in a clean-room therefore depends on the governance rules set by an administrator who enforces them on every query. For example, you might want to share your medical records with a screening provider just to run a cancer test, without them being able to run any other computation that could, for instance, determine other diseases you might have or, worse, reveal your identity based on a rare condition.

Decentralization stores data and performs computations across multiple nodes, allowing for a trustless architecture. It also reinforces the platform's security: to breach the data, an attacker would need to compromise multiple nodes at the same time. Blockchain provides the foundation for decentralized computation; however, data stored on popular chains such as Ethereum is public by default rather than encrypted or private.

Storing private data on a blockchain poses its own challenges: technologies such as homomorphic encryption are difficult to adapt to run in a decentralized manner. Multi-party computation and TEEs are better suited to decentralized systems, and there are projects at various stages building this future. Large organizations such as financial institutions are still piloting blockchain, so the days of privacy-preserving blockchains are still ahead of us.

If computations and transactions happen on a decentralized platform, verifiability is another important consideration. We need to ensure that (a) inputs originated from the entities we believe they did, and (b) computations were performed correctly. To solve the former, enterprises and startups alike are building a "self-sovereign identity" future, in which certifying entities issue "verifiable credentials" that cryptographically prove the validity of one's identity. For the latter, zero-knowledge proofs, a cryptographic technique, validate that a computation was performed correctly without revealing the data it was run on.
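
To give a flavor of how a proof can convince without revealing, here is a toy Schnorr-style proof of knowledge in Python. It is a textbook sketch with illustrative parameters, a far cry from the SNARK/STARK systems used on blockchains, which prove arbitrary computations rather than knowledge of a single secret.

```python
import hashlib
import secrets

# Toy non-interactive Schnorr proof (Fiat-Shamir heuristic).
# Parameters are illustrative, not a vetted production group.
p = 2**255 - 19   # a large prime modulus
g = 5             # public generator

witness = secrets.randbelow(p - 1)   # the prover's secret
public = pow(g, witness, p)          # everyone knows g^witness mod p

# Prover: commit to randomness, hash it into a challenge, respond.
r = secrets.randbelow(p - 1)
commitment = pow(g, r, p)
challenge = int.from_bytes(
    hashlib.sha256(f"{commitment}:{public}".encode()).digest(), "big") % (p - 1)
response = (r + challenge * witness) % (p - 1)

# Verifier: one equation confirms the prover knows `witness`,
# while the transcript reveals nothing about its value.
assert pow(g, response, p) == (commitment * pow(public, challenge, p)) % p
print("proof accepted")
```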

The implications of zero-knowledge proofs (ZKPs) for blockchain are immense. In a decentralized network where smart contracts are executed by multiple nodes to certify accuracy, zero-knowledge proofs improve speed and scalability. This promise of scaling Ethereum has attracted significant venture capital into building new proving algorithms and hardware acceleration.

In the next section we will talk about some emerging use-cases for these technologies.


Sidharth Ganesh

Sidharth writes about technology. He has worked in product roles at multiple high-growth consumer startups and is currently pursuing an MBA at Kellogg.