Glossary of Common Privacy & Technology Terms

lourdes.turrecha
Privacy & Technology
9 min read · Feb 12, 2021

As I teach my experiential Privacy & Technology course at Santa Clara Law’s leading privacy program this semester, certain privacy and technology terms and concepts come up. I previously wrote about the course here, here, and here.

One of my course goals is to introduce my students to common nomenclature intersecting privacy and technology, a step to bridging the legal-technical gap in this cross-functional space.

Thus, we’ve created this glossary, which we will be periodically updating throughout the semester.

Anonymization. Anonymization is the process of rendering data anonymous in such a way that the data subject is not or no longer identifiable. Once personal data has been fully anonymized, it is no longer personal data, and subsequent uses of the data are no longer regulated by the GDPR. If anonymization fails or is not possible, the data must continue to be treated as personal data.

Anonymized data. Anonymized data is data that does not identify or no longer identifies a data subject.

Data Privacy. Data privacy is the subset of privacy focusing on the ability to make decisions about one’s personal data.

Differential Privacy. Differential privacy provides a mathematically rigorous definition of privacy that quantifies risk. Because this definition is hard to comprehend, the following example might be more illuminating:

“In the simplest setting, consider an algorithm that analyzes a dataset and computes statistics about it (such as the data’s mean, variance, median, mode, etc.). Such an algorithm is said to be differentially private if by looking at the output, one cannot tell whether any individual’s data was included in the original dataset or not. In other words, the guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset — anything the algorithm might output on a database containing some individual’s information is almost as likely to have come from a database without that individual’s information. Most notably, this guarantee holds for any individual and any dataset. Therefore, regardless of how eccentric any single individual’s details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This gives a formal guarantee that individual-level information about participants in the database is not leaked.”

Differential privacy’s main premise is that introducing carefully calibrated “noise” (random and meaningless data) can mask a user’s personal data. Differential privacy requires a “privacy budget,” a quantitative measure of how much an individual’s privacy risk may increase due to the inclusion of that individual’s data in the input.

Differential privacy addresses privacy during collection and disclosure. Its goal is to limit the amount of personal information that is leaked from a database by releasing aggregate computing results.
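
To make this concrete, here is a minimal sketch of the classic Laplace mechanism in Python (the function private_count, the toy ages, and the epsilon value are illustrative, not taken from any particular library): a counting query is answered with calibrated random noise whose scale is set by the privacy budget.

```python
import numpy as np

def private_count(values, predicate, epsilon):
    """Release a count query under epsilon-differential privacy.

    A counting query has sensitivity 1 (one person joining or leaving the
    dataset changes the count by at most 1), so adding Laplace noise with
    scale 1/epsilon satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many people in the dataset are over 40?
ages = [23, 35, 41, 52, 29, 67, 44]
print(private_count(ages, lambda age: age > 40, epsilon=0.5))
```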

Differential privacy can be central, local, or, more recently, hybrid. Central differential privacy applies noise at the server level, whereas local differential privacy applies it at the device or user level. Local differential privacy randomizes data before sending it from the device to the server, so the server never receives raw data.

Until very recently, the choice was binary: either accept a much larger level of noise (local DP), or collect raw data and add noise only at the server (central DP). But hybrid differential privacy implementations are developing.
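
One simple way to picture local differential privacy is randomized response, sketched below under illustrative assumptions (the helper names and the truth-telling probability are invented for the example): each device randomizes its own answer before reporting it, so the server never sees raw data, yet the population-level rate can still be estimated.

```python
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """Local DP via randomized response: with probability p report the truth,
    otherwise report a fair coin flip."""
    if random.random() < p:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(reports, p: float = 0.75):
    """Correct for the noise to recover an unbiased estimate of the true rate."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) * 0.5) / p

# Each user randomizes locally; only the noisy bits ever reach the server.
truths = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t) for t in truths]
print(estimate_true_rate(reports))  # approximately 0.3
```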

Edge Computing. Edge computing involves distributing computing intelligence across an entire network, closer to where things and people produce or consume information (the “edge”), instead of centralizing computing in the cloud. While there’s a misconception that edge computing will replace the cloud, it’s important to note that they function in conjunction with the cloud.

Edge computing is still new, and innovators are still determining how to implement it. That said, it appears to be well suited for the Internet of Things (IoT). Current uses include smart speakers like Amazon’s Echo and Google’s Home.

Edge computing’s limitations include additional attack vectors and the need for more local hardware.

Federated Analytics. Federated analytics allows data scientists to generate analytical insights from the combined information in distributed datasets without requiring all the data to move to a central location, and while minimizing the amount of data movement in the sharing of intermediate results. Instead of pulling all the data from user devices to a centralized server for analysis, the algorithm is sent to the data to learn from it. When it finishes, a summary of the new knowledge is sent back to the company’s server; the data itself never leaves the device.
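
As a rough illustration of the idea (the toy readings and helper names are invented for the example), the sketch below computes a global mean from per-device summaries; only aggregates, never the raw values, leave each device.

```python
import numpy as np

# Each "device" holds its own data; raw values never leave the device.
device_data = [
    np.array([3.2, 4.1, 5.0]),        # device 1
    np.array([2.8, 3.9]),             # device 2
    np.array([4.5, 4.7, 5.2, 3.3]),   # device 3
]

def local_summary(values):
    """Computed on-device: return only an aggregate (sum and count), not raw data."""
    return values.sum(), len(values)

# The server combines the summaries to obtain the global mean.
summaries = [local_summary(d) for d in device_data]
total, count = map(sum, zip(*summaries))
print("federated mean:", total / count)
```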

Pharmaceutical companies use federated learning to maintain privacy during drug trials. Banks also use federated learning to perform analytics on data securely.

Federated learning is still in its early stages. It requires more power and memory to train models, increases latency, and slows down the learning process.

Generative Adversarial Network. A Generative Adversarial Network is a deep neural network framework capable of learning from a set of training data and generating new data with the same characteristics as the training data. For example, a generative adversarial network trained on photographs of human faces can generate realistic-looking faces which are entirely fictitious.

Generative adversarial networks consist of two neural networks, the generator and the discriminator, which compete against each other. The generator is trained to produce fake data, and the discriminator is trained to distinguish the generator’s fake data from real examples. If the generator produces fake data that the discriminator can easily recognize as implausible, such as an image that is clearly not a face, the generator is penalized. Over time, the generator learns to generate more plausible examples.
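
For a flavor of how the two networks compete, here is a deliberately tiny sketch using PyTorch (the model sizes, toy data, and hyperparameters are illustrative, not a recommended recipe): the generator learns to mimic samples from a simple distribution while the discriminator learns to flag its fakes.

```python
import torch
import torch.nn as nn

# Toy GAN: the generator learns to mimic samples drawn from N(4, 1).
latent_dim = 8

generator = nn.Sequential(
    nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1)
)
discriminator = nn.Sequential(
    nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid()
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(64, 1) + 4.0                  # "real" training data
    fake = generator(torch.randn(64, latent_dim))    # generator's attempts

    # Train the discriminator to tell real from fake.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

print("generated sample mean:", generator(torch.randn(1000, latent_dim)).mean().item())
```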

Generative adversarial networks have been used in medical imaging, bioinformatics, cybersecurity, and photo editing and translation.

Disadvantages include unreliable evaluation metrics, unstable training, no formal density estimation, and no straightforward inversion.

Homomorphic Encryption. Homomorphic encryption is a cryptographic technique that allows computation on encrypted data and generates an encrypted result which, when decrypted, matches the result of the same operations performed on the unencrypted data. With conventional encryption, computations performed on ciphertext produce meaningless results when decrypted. With homomorphic encryption, the decrypted results are the same as those of computations conducted on the data without encryption.
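
As a toy illustration of an additively homomorphic scheme, the sketch below implements textbook Paillier encryption with deliberately small primes (real deployments use large keys and vetted libraries): multiplying two ciphertexts produces a ciphertext that decrypts to the sum of the two plaintexts.

```python
import math, random

# Toy Paillier cryptosystem (additively homomorphic). Small primes for
# illustration only; real systems use large primes and vetted libraries.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
L = lambda x: (x - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:       # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Multiplying two ciphertexts (mod n^2) adds the underlying plaintexts.
c1, c2 = encrypt(20), encrypt(22)
assert decrypt((c1 * c2) % n2) == 42
print("20 + 22 computed on ciphertexts:", decrypt((c1 * c2) % n2))
```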

Homomorphic encryption is widely cited as a fit for healthcare and finance, which carry strict privacy obligations. It could also potentially be used by data storage providers to enable their customers to perform computations on their encrypted data.

That said, homomorphic encryption has several drawbacks: it slows down computations, requires significant memory, and has questionable overall encryption strength. In other words, it raises scalability issues that would need to be addressed for widespread adoption.

K-Anonymization. K-anonymization is a data generalization technique that ensures indirect identifiers match a specific number of other records, making it difficult to identify individuals within a dataset (the required number of matching records is referred to as “k,” hence the name). For example, in data that has been k-anonymized with k set to 10 and with race and age as the indirect identifiers, every combination of race and age would appear in at least 10 records. The higher k is set, the harder it will be to use indirect identifiers to find the record of any specific individual.
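
The sketch below (the age bands, column names, and sample records are invented for the example) shows one way to generalize ages into bands and then check whether every combination of quasi-identifier values appears at least k times.

```python
from collections import Counter

def generalize_age(age, band=10):
    """Generalize an exact age into a coarser band, e.g. 34 -> '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [{"race": r, "age_band": generalize_age(a)}
           for r, a in [("A", 34), ("A", 37), ("B", 52), ("B", 58), ("A", 31)]]
print(is_k_anonymous(records, ["race", "age_band"], k=2))  # True
```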

Local Processing. Local processing involves shifting computing intelligence from the cloud to the device level. While there’s a misconception that local processing will replace the cloud, it’s important to note that it functions in conjunction with the cloud.

Edge computing and local processing are still new and innovators are still determining how to implement them. That said, they appear to be well suited for the Internet of Things (IoT). Their current uses include smart speakers like Amazon’s Echo & Google’s Home.

Privacy Engineering. Privacy engineering is a discrete discipline or field of inquiry and innovation that uses engineering principles and processes to build controls and measures into processes, systems, components, and products, enabling the authorized, fair, and legitimate processing of personal data.

Privacy engineering is also the gathering and application of privacy requirements with the same primacy as other traditional feature or process requirements and then incorporating, prioritizing, and addressing them at each stage of the development lifecycle, whether for a process, project, product, system, app, or other.

Pseudonymization. Pseudonymization is the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.
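
One common implementation is a keyed hash, sketched below under illustrative assumptions (the key, field names, and truncation length are invented for the example): the secret key is the “additional information” and must be stored separately from the pseudonymized records.

```python
import hmac, hashlib

# The secret key is what could link pseudonyms back to individuals;
# it must be kept separately from the pseudonymized dataset.
SECRET_KEY = b"keep-this-somewhere-else"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable, keyed pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"email": "alice@example.com", "purchase": "book"}
record["email"] = pseudonymize(record["email"])
print(record)  # the same email always maps to the same pseudonym
```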

Secure multi-party computation. Secure multi-party computation (SMPC) is a subfield of cryptography concerned with the problem of different parties who wish to conduct joint analysis on combined data. SMPC allows computation or analysis on combined data without the different parties revealing their own input data. The parties learn nothing about each other’s inputs beyond what is revealed by the output, which is shared with all parties.

SMPC may be used when two or more parties want to carry out analyses on their combined data but, for legal or other reasons, they cannot share data with one another. SMPC can, for example, be leveraged in smart meters. Individual power consumption data can be used to extract detailed information on individuals’ activities, such as whether they are at home, which appliances they’re using, etc. SMPC can be used to address user privacy without sacrificing smart meter load management and verifiable billing interests.
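
A minimal sketch of one SMPC building block, additive secret sharing, is below (the number of parties, the modulus, and the readings are invented for the example): each household splits its reading into random-looking shares, and only the combined partial sums reveal the total.

```python
import random

PRIME = 2**31 - 1  # all arithmetic is done modulo a shared prime

def share(secret, n_parties):
    """Split a secret reading into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three households want the total consumption without revealing their own.
readings = [12, 30, 7]
all_shares = [share(r, 3) for r in readings]

# Each computing party receives one share from every household and sums only those.
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

# Combining the partial sums reveals the total, and nothing else.
print(sum(partial_sums) % PRIME)  # 49
```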

SMPC drawbacks include the need for new computing architectures and methods, complex functionality, additional infrastructure, slower run times, and higher communication costs.

Small Data. By definition, Small Data systems involve far less data than their Big Data counterparts. Small Data systems are easily attainable and easier and more efficient to work with than big data; they can be used for remote monitoring and allow us to predict, detect, and address issues more readily.

Companies can build smarter, more efficient, and more privacy-friendly AI by leveraging small data sets that currently go unused.

Small data sets are still dependent on the human component; it’s necessary to have the “right” people and teams figuring out how to leverage them.

Synthetic Data. Synthetic data is data that is artificially created instead of being collected from the real world. It is generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data.
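
As a bare-bones illustration of the sampling approach (the distribution and the ages column are invented for the example), the sketch below fits a simple statistical model to “real” values and samples new, artificial ones with similar summary statistics.

```python
import numpy as np

# "Real" data we want to mimic (e.g., ages of customers).
rng = np.random.default_rng(0)
real_ages = rng.normal(loc=42, scale=9, size=1_000).clip(18, 90)

# Fit a simple model of the real data's distribution, then sample from it.
mean, std = real_ages.mean(), real_ages.std()
synthetic_ages = rng.normal(loc=mean, scale=std, size=1_000).clip(18, 90)

print("real:     ", round(real_ages.mean(), 1), round(real_ages.std(), 1))
print("synthetic:", round(synthetic_ages.mean(), 1), round(synthetic_ages.std(), 1))
```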

Synthetic data has several common use cases including self-driving vehicles, security, robotics, fraud protection, and healthcare.

That said, synthetic data has several limitations. It often lacks outliers, which occur naturally in real data; while outliers are often dropped from training datasets, their presence can be necessary for reliable machine learning models. Synthetic data also depends on the quality of the input data and is therefore subject to bias, so it requires quality control, such as checking against human-annotated or otherwise authentic data. Like actual data, synthetic data is not free and takes time and effort to create. It may not be accepted as valid by users unfamiliar with its benefits.

Trusted Execution Environments. Trusted Execution Environments (TEEs) provide secure computation capabilities through a combination of special-purpose hardware and software built to use that hardware. The hardware (e.g., a chipset) allows a process to run on a processor while keeping its memory invisible to any other process on the processor, including the operating system or other code.

Computation in TEEs is not done on encrypted data; instead, the hardware secures the data through enclaves, which protect memory space from access. An attestation that the enclave is genuine and that the code running in the enclave is what is expected can be issued to a process that needs to trust the enclave.

TEEs are well-suited for verification, biometric authentication, mobile wallets, digital payments, and copyrighted works like books, movies, and music.

TEE limitations include exploitable attack surfaces and a lack of industry consensus about the most secure or efficient way to create TEEs, with various hardware manufacturers creating fundamentally different implementations.

Zero Knowledge Proofs. A Zero Knowledge Proof (ZKP) is a cryptographic method that allows one party (the prover) to prove to another (the verifier) that it knows a statement to be true without sharing any additional information. The notion of ‘zero knowledge’ was first proposed by MIT researchers Shafi Goldwasser, Silvio Micali, and Charles Rackoff in the 1980s. They were working on problems related to theoretical systems in which a prover exchanges messages with a verifier to convince the verifier that some mathematical statement is true.

Zero Knowledge Proofs have three salient properties: completeness, soundness, and zero knowledge. Completeness means that if the statement is true and both prover and verifier follow the protocol, the verifier will accept the statement as true. Soundness means that if the statement is false, a cheating prover cannot convince the verifier that it is true, except with negligible probability. Zero knowledge means that if the statement is true and the prover follows the protocol, the verifier will not learn any confidential information aside from the truthfulness of the verified statement.
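
To make the prover/verifier exchange concrete, here is a toy sketch of the Schnorr identification protocol (the small parameters and variable names are illustrative, and real deployments need safeguards this omits): the verifier becomes convinced that the prover knows the secret exponent x without ever learning it.

```python
import random

# Toy Schnorr identification: the prover shows it knows the secret x behind
# the public key y = g^x mod p, without revealing x.
p, q = 2039, 1019        # small safe prime p = 2q + 1, for illustration only
g = 4                    # generator of the order-q subgroup

x = random.randrange(1, q)     # prover's secret
y = pow(g, x, p)               # prover's public key

# 1. Prover commits to a random nonce.
r = random.randrange(1, q)
t = pow(g, r, p)

# 2. Verifier issues a random challenge.
c = random.randrange(1, q)

# 3. Prover responds; the response mixes the nonce and the secret.
s = (r + c * x) % q

# 4. Verifier checks the proof without ever learning x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("proof accepted")
```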

ZKPs can be used for payments, digital identities, and internet infrastructure.

ZKP drawbacks include a lack of global standards and scalability challenges.


Founder & CEO @PIX_LLC @PrivacyTechRise | Privacy & Cybersecurity Strategist & Board Advisor | Reformed Silicon Valley Lawyer | @LourdesTurrecha