Secure Customer Privacy with Differential Privacy and Versatile Data Kit (VDK)

Understanding Differential Privacy: A Practical Guide to Implementation

Mr. Ånand
Versatile Data Kit
8 min read · Dec 14, 2023


Photo by FLY:D on Unsplash

Introduction

In today’s data-driven world, where corporations gather and leverage vast quantities of personal information, the importance of customer privacy cannot be overstated. Preserving the confidentiality of clients is not solely a legal requirement but also a fundamental ethical obligation.

Differential privacy is a method of sharing data that describes patterns in a dataset while hiding personally identifiable information. For example, an organisation may release statistical or demographic information, but thanks to differential privacy it is hard to determine the precise contribution of any given person. The idea is that a researcher would get roughly the same query answer regardless of whether a particular person's data was included in the dataset. If a data scientist cannot identify a specific person from the data, the system has differential privacy.

In this article, we will learn more about differential privacy and how to safeguard customer privacy by integrating Differential Privacy with Versatile Data Kit (VDK).

Understanding the Need for Differential Privacy

In the modern digital world, we need more data than ever to make and validate business decisions. Without data, no business will thrive in the digital landscape: data lets us train machine learning models, predict user choices, and even serve advertisements. At the same time, users do not feel safe with the level of access businesses have to their private data. To gather large amounts of data from users, we need to provide better privacy guarantees and meet regulatory standards (e.g. HIPAA, GDPR) to assure users that their data will be safeguarded.

When people try to hide or disguise personally identifiable information (PII), they may unintentionally leave behind other identifying elements in the data. These additional elements are known as quasi-identifiers. So, merely obfuscating PII may not be sufficient to protect privacy if these quasi-identifiers (see image below) can still be used to identify individuals.

Representation of quasi-identifiers

Differential privacy arrives as a better solution to this problem. It is a mathematical framework that offers robust assurances about the privacy of an individual's data, even when that data is integrated into extensive datasets or merged with other sources. The technique introduces random noise into the data before analysis or sharing, making it significantly harder for an attacker to pinpoint any individual's data. The noise is incorporated strategically to maintain statistical accuracy, so the outcomes of the analysis remain meaningful and valuable despite the added noise.
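Before looking at concrete mechanisms, it helps to see noise addition in miniature. Below is a minimal sketch (not part of VDK) of the classic Laplace mechanism applied to a counting query; the function name and the epsilon value are illustrative choices:

```python
import numpy as np


def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1.

    A counting query changes by at most 1 when any single person is
    added or removed, so Laplace(1/epsilon) noise hides that person.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)


# The released value stays statistically useful, but no individual's
# presence or absence can be inferred from it.
noisy = private_count(true_count=1000, epsilon=0.5)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means less noise and more accuracy.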

There are three main actors in differential privacy: the curator, the owner, and the analyst of the data. As you can see in the image below, in global differential privacy we trust the curator but not the analyst, whereas in local differential privacy we trust no one.

Global and Local Differential Privacy

Techniques to Implement Differential Privacy

  • Random Response: It is used when the data type we are trying to obfuscate is Boolean. When a query about a Boolean attribute (such as whether an individual has a given characteristic) is made using this approach, the result is randomised to introduce uncertainty.

Check here or see example below:

import numpy as np


class DifferentialPrivateRandomResponse:
    def __init__(self, random_response_frequency: int):
        self._random_response_frequency = random_response_frequency

    def privatize(self, value: bool) -> bool:
        # first coin flip
        if np.random.randint(0, self._random_response_frequency) == 0:
            # answer truthfully
            return value
        else:
            # answer randomly (second coin flip)
            return np.random.randint(0, 2) == 0

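To see the effect at an aggregate level, here is a usage sketch that re-declares the class so it runs standalone; the `random_response_frequency` of 2 is an assumed value that makes the coin flips fair:

```python
import numpy as np


class DifferentialPrivateRandomResponse:
    def __init__(self, random_response_frequency: int):
        self._random_response_frequency = random_response_frequency

    def privatize(self, value: bool) -> bool:
        # first coin flip: answer truthfully with probability 1/frequency
        if np.random.randint(0, self._random_response_frequency) == 0:
            return value
        # second coin flip: answer uniformly at random
        return np.random.randint(0, 2) == 0


# Privatize 60 honest "not a smoker" answers, mirroring the article's dataset.
rr = DifferentialPrivateRandomResponse(random_response_frequency=2)
noisy_answers = [rr.privatize(False) for _ in range(60)]
apparent_smokers = sum(noisy_answers)  # ~15 in expectation, all from noise
```

With fair coins, a True answer appears with probability 1/4 even though every input is False, which is exactly the noise pattern discussed later in the article.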
  • Unary Encoding: It is used to add noise when we want privacy for Enum-type (categorical) data. Unary encoding represents categorical data as a vector of bits, one position per possible value.

Check here or see example below:

import numpy as np
from typing import List

# Methods of the unary-encoding class: _p is the probability of keeping
# a 1-bit, _q the probability of flipping a 0-bit to 1.

def _perturb(self, encoded_response: List[int]) -> List[int]:
    return [self._perturb_bit(b) for b in encoded_response]

def _perturb_bit(self, bit: int) -> int:
    sample = np.random.random()
    if bit == 1:
        if sample <= self._p:
            return 1
        else:
            return 0
    elif bit == 0:
        if sample <= self._q:
            return 1
        else:
            return 0
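To put those perturbation methods in context, here is a self-contained sketch of a unary encoder; the class name `UnaryEncoder` and the probabilities p=0.75, q=0.25 are illustrative choices, not necessarily what the VDK plugin uses internally:

```python
import numpy as np
from typing import List


class UnaryEncoder:
    """Illustrative unary-encoding mechanism for categorical data."""

    def __init__(self, domain: List[str], p: float = 0.75, q: float = 0.25):
        self._domain = domain
        self._p = p  # probability a 1-bit stays 1
        self._q = q  # probability a 0-bit flips to 1

    def encode(self, value: str) -> List[int]:
        # one-hot vector: a single 1 in the position of the true value
        return [1 if v == value else 0 for v in self._domain]

    def _perturb_bit(self, bit: int) -> int:
        sample = np.random.random()
        if bit == 1:
            return 1 if sample <= self._p else 0
        return 1 if sample <= self._q else 0

    def privatize(self, value: str) -> List[int]:
        return [self._perturb_bit(b) for b in self.encode(value)]


encoder = UnaryEncoder(domain=["A", "B", "AB", "O"])
report = encoder.privatize("B")  # a noisy one-hot vector, e.g. [0, 1, 1, 0]
```

Each reported vector may contain extra or missing 1-bits, so no single report reveals the patient's true blood type, yet aggregate counts remain recoverable.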

Implementing Differential Privacy Using VDK

With the increasing complexity of data management, the open-source Versatile Data Kit (VDK) empowers organizations to handle and secure sensitive data. Leveraging VDK’s capabilities, we can address the challenges of implementing Differential Privacy. Learn more about Versatile Data Kit here!

We will dig deeper into each and every step required to implement Differential Privacy. As an example, we will work with a patient dataset commonly used by researchers to understand the implementation of Differential Privacy using VDK.

Data Ingestion: VDK provides a clean interface for ingestion. Here, the ingestion method is SQLITE.

Enabling Differential Privacy: VDK is modular and highly extensible; it has the concept of plugins that can be installed like any other Python package. Once installed, we can plug them into a VDK job with a quick config change. To enable differential privacy, we need to intercept data at the pre-ingestion step so we can add noise before we sync it. VDK plugins can intercept data at many different points in the data streaming lifecycle.

Random Response Plugin: Let’s see an example of how we can configure and add random-response noise to a Boolean data type. Consider a study conducted by researchers to determine the influence of smoking on cancer. They must study data from various patients while also protecting the patients’ privacy through the use of differential privacy and VDK. Check code here.

To install and configure our new plugin Random Response we need to run:

pip install vdk-local-differential-privacy

After installing the Random Response plugin, we need to update the config file:

# update config
[vdk]

ingest_method_default=SQLITE

#add preprocessing step
ingest_payload_preprocess_sequence=random_response_differential_privacy

#set property specific to this plugin
differential_privacy_randomized_response_fields='{"patient_details": ["is_smoker"]}'

The code above shows how to add a preprocessing step in the config file. In the next step, we set the property specific to this plugin: the field we want to randomize is located in the “patient_details” table, in the column named “is_smoker”.

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # 60 people who are not smokers
    for _ in range(60):
        obj = dict(str_key="str", is_smoker=False)

        # send each record for ingestion
        job_input.send_object_for_ingestion(
            payload=obj, destination_table="patient_details", method="memory"
        )

As you can see in the script above, we take 60 patients who are not smokers and save their information to the database. The script generates a dictionary and sends it for ingestion into a table named “patient_details” using the “memory” method. Since no one in the dataset is a smoker (is_smoker=False), any “smoker” that appears in the ingested data is pure noise.

Histogram of Noisy Data

As you can see in the histogram of the noisy, randomized data, there are ~45 non-smokers and ~15 smokers.

Because we are using the random response plugin, the genuine value is reported with a specific probability, whereas a false value is reported with the complementary probability.

Understandable data: VDK helps create noisy data, but we can also move from the noisy data back to the actual distribution that existed before the noise was added. Moving from noisy data to understandable data involves managing and filtering out the noise.

To achieve this, a few points need to be considered:

- Approximately half of the data consists of pure noise.
- About one-fourth of the data is composed of “yes” responses generated from random noise.

To determine the number of real “yes” responses in the data, you subtract the noise-generated “yes” responses from the total number of “yes” responses. Mathematically, this is expressed as:

real yeses = total number of yeses - (1/4 × dataset size)

Since half of the data was discarded as noise, it’s necessary to adjust for this loss when estimating the actual count of real “yes” responses. Doubling the number of real “yes” responses compensates for the elimination of half the data:

Actual Count of Real Yeses = 2 × Real Yeses

This process helps in obtaining a more accurate representation of the true positive responses within the dataset, accounting for the presence of noise and ensuring a better understanding of the underlying information.
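The arithmetic above can be expressed directly in code; this helper is a sketch, not part of VDK:

```python
def estimate_real_yeses(total_yeses: float, dataset_size: int) -> float:
    """Recover the true 'yes' count from coin-flip randomized-response data,
    where half the answers are truthful and a quarter are noise-made yeses."""
    noise_yeses = dataset_size / 4              # "yes" answers produced by noise
    truthful_yeses = total_yeses - noise_yeses  # yeses among the truthful half
    return 2 * truthful_yeses                   # scale up for the random half


# With the article's numbers: ~15 apparent smokers among 60 patients
# gives 2 * (15 - 15) = 0 real smokers, i.e. all 60 are non-smokers.
print(estimate_real_yeses(15, 60))  # 0.0
print(estimate_real_yeses(45, 60))  # 60.0 real non-smokers
```

Plugging in the histogram values from the smoking study recovers the original distribution of 60 non-smokers and 0 smokers.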

By following these steps we can recover the real distribution. Check the right histogram in the image below: we recover the number 60, with little to no margin of error.

Data with noise and after removing it

This can help the researchers complete their study on the impact of smoking on cancer more easily. It also helps preserve individual privacy, because only half of the responses in the database are honest responses; the rest are generated by noise. Know more here.

Unary Encoding Plugin: A method similar to random response will be used to implement differential privacy with the Unary Encoding VDK plugin.

To install and configure our new plugin Unary Encoding we need to run:

pip install vdk-local-differential-privacy

After installing the Unary Encoding plugin, we need to update the config file:

# update config
[vdk]

ingest_method_default=SQLITE

#add preprocessing step
ingest_payload_preprocess_sequence=unary_encoding_differential_privacy

#set property specific to this plugin
differential_privacy_unary_encoding_fields='{"patient": {"blood": ["A","B","AB","O"]}}'

The code above shows how to add the preprocessing step “unary_encoding_differential_privacy” in the config file. We then set the plugin-specific property: unary encoding is applied to the “patient” table’s “blood” column, with the blood groups as Enum values.

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # 50 patients with blood type "B"
    for _ in range(50):
        obj = dict(str_key="str", blood_type="B")

        job_input.send_object_for_ingestion(
            payload=obj, destination_table="patient", method="memory"
        )

In the script above, we have 50 patients with blood type “B” and save them to the database. Implementing differential privacy via unary encoding with the Versatile Data Kit plugin is quite similar to the random response plugin method.
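Just as with random response, aggregate counts can be recovered from unary-encoded reports. A standard unbiased estimator can be sketched as follows; the probabilities p=0.75 and q=0.25 are illustrative assumptions, not the plugin's actual values:

```python
def estimate_category_count(ones: float, n: int,
                            p: float = 0.75, q: float = 0.25) -> float:
    """Unbiased estimate of how many of n unary-encoded reports truly had
    a 1 in this position, given `ones` observed 1-bits after perturbation.

    Each true 1-bit survives with probability p and each true 0-bit flips
    to 1 with probability q, so E[ones] = count*p + (n - count)*q; solving
    for count gives the formula below.
    """
    return (ones - n * q) / (p - q)


# If 38 of 50 perturbed reports show a 1 in the "B" position:
print(estimate_category_count(38, 50))  # 51.0, close to the 50 true "B" patients
```

The estimate fluctuates around the true count because of the injected randomness, but it converges as the number of reports grows.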

Basically, random response introduces randomness to protect privacy in statistical data, while unary encoding is a binary representation method commonly used for categorical data.

Conclusion

In concluding this article on safeguarding consumer privacy with Differential Privacy and the Versatile Data Kit (VDK), we emphasise the critical importance of ethical data practices. Balancing privacy and creativity needs collaborative effort, and the combination of these tools provides a strong framework for responsible data management. As organisations negotiate this changing world, they must embrace openness, adapt to legislation, and prioritise privacy.

The integration of Differential Privacy and VDK not only protects client privacy but also lays the groundwork for a trustworthy and responsible digital future. VDK is also working on providing support for differential privacy in SQL queries and global differential privacy.

This article is coauthored by Astrodevil and Paul Murphy combining their expertise to provide a well-rounded perspective on the topic.

Additional Resources

💡Check Versatile Data Kit GitHub Repo: https://github.com/vmware/versatile-data-kit

💡Check YouTube Video Tutorial: https://youtu.be/Z9zCtkbKOgU?si=Fuk3dU_PGwPBAPTo

💡Check files related to VDK plugins: https://github.com/vmware/versatile-data-kit/pull/2670/files

💡Check the Getting Started guide of VDK to learn more: https://github.com/vmware/versatile-data-kit/wiki/Getting-Started
