Getting Started with Scalable Differential Privacy Tools on the Cloud
This article is the final article in a four-part series on Differential Privacy by GovTech’s Data Privacy Protection Capability Centre (DPPCC). Click here to check out the other articles in this series.
Reviewed by: Damien Desfontaines (Tumult Labs, Staff Scientist) and Vadym Doroshenko (Google, Tech Lead of PipelineDP).
Introduction
In our previous articles, we explored how differential privacy is a powerful tool for data analysis that protects the privacy of individuals. We provided guidance on adopting the latest Python tools (frameworks and libraries) to implement differential privacy, and shared our analysis of these tools' performance on various metrics (e.g. accuracy and execution time) in standalone and distributed environments.
Building on this foundation, we now aim to address a critical need in the field: how to set up scalable differential privacy on the cloud. Real-life datasets often exceed typical memory limits due to their size, requiring distributed computing resources for processing. Scalability hence becomes essential to handle arbitrary-sized datasets. While open-source Python libraries like OpenDP and DiffPrivlib offer differential privacy solutions, they may not support distributed computing, limiting their scalability. For this article, we have focused on Tumult Analytics and PipelineDP, which are specifically designed for distributed computing, allowing us to scale resources quickly and efficiently based on data size and complexity.
Since their introduction at the PEPR’22 conference, these open-source Python frameworks, developed by renowned institutions and researchers, have made it possible to overcome the limitations of applying differential privacy at a larger scale. This article provides a practical guide on how potential adopters of differential privacy can implement these frameworks on Amazon Web Services (AWS), with the goal of helping readers get started with using these scalable frameworks in their own data projects.
There are various AWS services that can be used for working with differential privacy, but in this article, we will focus on two key ones: AWS Glue and Amazon Elastic Map Reduce (EMR). AWS Glue is a fully managed service that supports distributed computing with Spark, enabling the preparation and loading of data for analysis, while Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark. By discussing both services, we aim to give readers the flexibility to choose the service that best suits their needs.
In the following walkthrough, we will focus on running Python scripts to calculate the population mean for the US Census 2017 Demographic Data using differential privacy, and we will also show the time taken for each process. The diagram below demonstrates what we will be executing in this article.
Note: while the purpose of this example is to demonstrate the use of distributed computing, typically for larger datasets, the current example uses a smaller dataset stored on GitHub for practicality and accessibility. You may wish to substitute the sample dataset used with a larger one to fully appreciate the power of distributed computing.
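Before setting anything up in the cloud, it helps to recall what these frameworks compute under the hood. The sketch below is a minimal, illustrative implementation of a differentially private mean using the Laplace mechanism in pure Python. It is not the actual algorithm used by Tumult Analytics or PipelineDP (both perform additional steps such as contribution bounding), and the toy values and clamping bounds are assumptions for illustration only.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling of the Laplace distribution with the given scale.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def dp_mean(values, low, high, epsilon, rng):
    """Differentially private mean via the Laplace mechanism.

    Values are clamped to [low, high]; the privacy budget is split
    equally between a noisy sum and a noisy count.
    """
    clamped = [min(max(v, low), high) for v in values]
    # Sensitivity of the clamped sum is max(|low|, |high|); of the count, 1.
    sum_scale = max(abs(low), abs(high)) / (epsilon / 2)
    count_scale = 1.0 / (epsilon / 2)
    noisy_sum = sum(clamped) + laplace_noise(sum_scale, rng)
    noisy_count = len(clamped) + laplace_noise(count_scale, rng)
    return noisy_sum / max(noisy_count, 1.0)


rng = random.Random(42)
populations = [4000, 5200, 3100, 6100, 4800]  # toy census-tract populations
print(dp_mean(populations, low=0, high=10000, epsilon=1.0, rng=rng))
```

Note that on a tiny dataset like this the noise dwarfs the signal; differential privacy shines on large datasets, which is exactly why distributed computing matters.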
You may click on the links below to navigate to the relevant sections of this guide:
Implementing Differential Privacy with AWS Glue
A. Getting Started
B. Glue Job for Tumult Analytics
C. Glue Job for PipelineDP
Implementing Differential Privacy with Amazon EMR
A. Getting Started
B. Executing Differential Privacy Scripts with Tumult Analytics and PipelineDP
Note: This guide assumes that you have an existing AWS account. To streamline your understanding and experimentation with differential privacy, feel free to refer to our git repository. Additionally, you will find frequent references to the repository throughout the article.
Implementing Differential Privacy with AWS Glue
AWS Glue is a fully managed, serverless ETL service that makes it easy even for beginners to prepare and load data for analysis. With Glue, we do not need to worry about managing servers or scaling resources and can focus instead on writing and refining our scripts. The serverless architecture of Glue also means that users only pay for the computing resources they use, making it a cost-effective solution when we are doing experiments.
A. Getting Started
Before we can start using Glue, there are some prerequisite steps we need to do, such as creating IAM roles and initialising an S3 bucket to store our data.
Create an IAM Role
To create an IAM role that allows our Glue jobs to access different AWS services, follow these steps:
- Go to the AWS Console and search for the IAM service.
- On the left-hand side panel, click on Roles.
- Click on Create role:
Step 1: Select trusted entity
- Select AWS service as the trusted entity type.
- Select Glue — Allows Glue to call AWS services on your behalf as the use case for the role. Click on the Next: Permissions button.
Step 2: Add permissions
- Select AmazonS3FullAccess and CloudWatchFullAccess (you should have 2 policies selected).
- Note that these policies do not follow the principle of least privilege access. If you are planning to run this in a production environment, it is highly recommended to modify the access privileges accordingly to reduce the risk of unauthorised access to sensitive data and other resources.
Step 3: Name, review, and create
- Enter a role name such as glue-differential-privacy-role. You can leave the other inputs as default. Click on Create role.
Initialise S3 Bucket
To store our data in the cloud, we will need to initialise an S3 bucket. You can do this by following these steps:
- Go to the AWS Console and search for the S3 service.
- Click on the Create bucket button to create a new S3 bucket.
- Choose a unique bucket name, such as dp-experiments-data, and select the region where you want to store your data.
- Once the bucket is created, you can upload your raw data to it. To do this, click on the Upload button and select the files you want to upload.
- We have provided some sample data and sample scripts that you can upload to the S3 bucket.
Create a New Glue Job
With sufficient permissions and data at our disposal, we’re now ready to use Glue to prepare and apply differential privacy techniques to our data.
Let’s Glue it all up:
- Go to the AWS Console and search for the Glue service.
- On the left-hand side panel, click on ETL jobs.
- In the Jobs page, select Spark script editor from the options provided.
- Under the Spark script editor options, choose Upload and edit an existing script.
- You can retrieve the scripts that you have uploaded in S3 or choose the files from your local computer.
- Follow the steps in the following part B and C to create the jobs for Tumult Analytics and PipelineDP respectively.
B. Glue Job for Tumult Analytics
- Create a new Glue Job using this script.
- Click on the Job Details tab and fill in the following information:
Name: differential-privacy-tumult-ana
IAM Role: glue-differential-privacy-role (created in prerequisites)
Type: Spark
Glue Version: Glue 3.0 - Supports spark 3.1, Scala 2, Python 3
Language: Python 3
Worker Type: G.1X (4 vCPU and 16 GB RAM)
Automatically scale the number of workers: Disable (this is for cost management purposes; if you do need more scalability, you can leave this checked).
Job Bookmark: Enable
Flex execution: Enable (to reduce some cost)
Requested number of workers: 10
Job timeout (minutes): 720 minutes (12 hours)
Advanced Properties (default values for un-mentioned)
- Glue data catalog as the Hive metastore: Disable
- Job Parameters:
--additional-python-modules : tmlt.analytics==0.6.1
--column_name : TotalPop
--epsilon : 1
--s3_file : s3://{S3_BUCKET}/real_world_data/acs2017_census_tract_data.csv
--query: MEAN # (can be changed to COUNT, MEAN, SUM, VARIANCE)
--source_id: <ANY ID>
Note: change the S3_BUCKET parameter to your S3 bucket containing the data.
Click Save in the top right corner, then click Run. The Tumult Analytics job takes about 4 minutes to run with the chosen dataset and parameters.
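For context on how these job parameters reach the script: a Glue script typically retrieves them at runtime with `getResolvedOptions` from the `awsglue.utils` module. The snippet below is a dependency-free stand-in that mimics that behaviour for the parameters listed above, so you can see roughly how `--column_name`, `--epsilon`, and friends flow into the job; the actual repository script may differ.

```python
def resolve_options(argv, expected):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    collects '--key value' pairs from argv for the expected keys."""
    args = {}
    for i, token in enumerate(argv):
        if token.startswith("--"):
            key = token[2:]
            if key in expected and i + 1 < len(argv):
                args[key] = argv[i + 1]
    missing = [k for k in expected if k not in args]
    if missing:
        raise ValueError(f"Missing job parameters: {missing}")
    return args


# Example: the parameter list from the Glue job above (bucket name is a placeholder).
argv = ["script.py",
        "--column_name", "TotalPop",
        "--epsilon", "1",
        "--s3_file", "s3://my-bucket/real_world_data/acs2017_census_tract_data.csv",
        "--query", "MEAN",
        "--source_id", "census"]
opts = resolve_options(argv, ["column_name", "epsilon", "s3_file", "query", "source_id"])
print(opts["query"], float(opts["epsilon"]))
```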
C. Glue Job for PipelineDP
- Create another new Glue Job using this script.
- Click on the Job Details tab and fill in the following information:
Name: differential-privacy-pipelinedp
IAM Role: glue-differential-privacy-role (created in prerequisites)
Type: Spark
Glue Version: Glue 3.0 - Supports spark 3.1, Scala 2, Python 3
Language: Python 3
Worker Type: G.1X (4 vCPU and 16 GB RAM)
Automatically scale the number of workers: Disable (this is for cost management purposes; if you do need more scalability, you can leave this checked).
Job Bookmark: Enable
Flex execution: Enable (to reduce some cost)
Requested number of workers: 10
Job timeout (minutes): 720 minutes (12 hours)
Advanced Properties (default values for un-mentioned)
- Glue data catalog as the Hive metastore: Disable
- Job Parameters:
--additional-python-modules : pipeline-dp==0.2.0
--column_name : TotalPop
--epsilon : 1
--s3_file : s3://{S3_BUCKET}/real_world_data/acs2017_census_tract_data.csv
--query: MEAN # (can be changed to COUNT, MEAN, SUM, VARIANCE)
Note: change the S3_BUCKET parameter to your S3 bucket containing the data.
Click Save in the top right corner, then click Run. The PipelineDP job takes about 2 minutes to run with the chosen dataset and parameters.
You can monitor the progress of your job in the Runs tab. Once the status of your run changes from Running to Succeeded, you can view your logs in the Output logs section.
Congratulations, you have successfully run your first scalable differential privacy job in the cloud!
Implementing Differential Privacy with Amazon EMR
For those seeking greater flexibility in setting up their infrastructure, we will also be tackling the same two differential privacy frameworks on AWS EMR. Unlike Glue, EMR requires a deeper level of technical understanding as it involves managing infrastructure but gives you greater control over system-level parameters.
The following diagram provides a system architecture overview of what we will be building:
The architecture employs a multi-AZ public-private VPC model to ensure high availability, with the EMR cluster situated in private subnets that have internet access (via a NAT gateway) for downloading required packages on the fly. The EMR cluster leverages S3 for data storage, including logs, ensuring that all data and logs generated during the differential privacy calculations are securely stored and easily accessible for future analysis.
Note: You can consider further enhancing network security by using VPC endpoints to route requests through PrivateLink to run them solely within the AWS network, rather than through the internet.
A. Getting Started
In order to use EMR, there are some prerequisite steps we need to take, such as creating the base architecture and IAM roles.
Creating the base architecture
The following steps outline the process of creating the necessary infrastructure and network resources to initialise the EMR cluster.
- Navigate to the VPC console and click on the Create VPC button. Provide a name for your VPC and enter a /24 IPv4 CIDR block for your VPC, for example, 10.0.0.0/24.
- Click Create to create your VPC. Note the VPC ID that you have just created.
- Navigate to the Subnets section and create four subnets — two public and two private.
subnet-private-a : 10.0.0.0/27
subnet-private-b : 10.0.0.32/27
subnet-public-a : 10.0.0.64/27
subnet-public-b : 10.0.0.96/27
- Navigate to the Internet Gateways section and create a new Internet Gateway.
- Attach the newly created Internet Gateway to your VPC.
- Navigate to the NAT Gateways section and create a new NAT Gateway.
- Select the public subnet subnet-public-a for your NAT Gateway and allocate an Elastic IP during initialisation.
- Navigate to the Network ACLs section and create a new Network ACL.
- Create inbound rules for ports 443 (HTTPS), 8443 (EMR), and 32700–65535 (Hadoop/Spark applications). Outbound rules can be set to allow all traffic.
- Ensure that the newly created Network ACL is associated with all four subnets.
- Create two route tables to direct traffic accordingly between the subnets
differential-privacy-private-route route table
— Associate with subnet-private-a and subnet-private-b
— Add a route, 0.0.0.0/0 towards the NAT
differential-privacy-public-route route table
— Associate with subnet-public-a and subnet-public-b
— Add a route, 0.0.0.0/0 towards the IGW
- At this point, your newly created VPC resource map should resemble the following:
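As an optional sanity check on the addressing plan above, the standard-library `ipaddress` module can confirm that the four /27 subnets all fit inside the /24 VPC CIDR and do not overlap, before you commit the configuration in the console:

```python
import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.0.0.0/24")
subnets = {
    "subnet-private-a": ipaddress.ip_network("10.0.0.0/27"),
    "subnet-private-b": ipaddress.ip_network("10.0.0.32/27"),
    "subnet-public-a": ipaddress.ip_network("10.0.0.64/27"),
    "subnet-public-b": ipaddress.ip_network("10.0.0.96/27"),
}

# Each subnet must sit inside the VPC CIDR...
for name, net in subnets.items():
    assert net.subnet_of(vpc), f"{name} is outside the VPC"

# ...and no two subnets may overlap.
for (a, na), (b, nb) in combinations(subnets.items(), 2):
    assert not na.overlaps(nb), f"{a} overlaps {b}"

print("Addressing plan OK:", [str(n) for n in subnets.values()])
```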
Creating IAM roles
These steps involve creating IAM roles and policies to grant permissions to the EMR cluster.
1. Navigate to the IAM console and click on Roles in the left-hand menu. Click on the Create role button. This role is going to be used for the EMR service role.
a. Select EMR as the AWS service
b. Select EMR — Allows EMR to call AWS services on your behalf as the use case for the role. Click on the Next: Permissions button.
c. Select AmazonElasticMapReduceRole policy and click on the Next: Tags button.
d. Add any desired tags and click on the Next: Review button.
e. Name the role dp-emr-service-role and click on the Create role button.
2. Now, we are going to create the IAM policy that will be used as the EMR Instance Profile.
a. Click on Policies in the left-hand menu. Click on the Create policy button.
b. Select the JSON tab and paste in the following policy, replacing BUCKET_NAME with the name of your S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_NAME",
                "arn:aws:s3:::BUCKET_NAME/*"
            ]
        }
    ]
}
c. Click on the Review policy button and name the policy dp-emr-instance-policy. Click on the Create policy button.
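If you are scripting this setup, it can help to template and validate the policy before pasting it in; malformed JSON (for instance, a trailing comma) is a common cause of a rejected policy. A small sketch, using a hypothetical bucket name for illustration:

```python
import json

# Hypothetical bucket name for illustration; substitute your own.
bucket_name = "dp-experiments-data"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:Get*", "s3:List*"],
            "Resource": [
                # Bucket-level ARN for List*, object-level ARN for Get*.
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }
    ],
}

# Round-tripping through json guarantees the pasted text is valid JSON.
document = json.dumps(policy, indent=2)
assert json.loads(document)["Version"] == "2012-10-17"
print(document)
```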
3. Create another IAM role that is going to be used as the EMR Instance Profile
a. AWS service: EC2
b. Add policies: dp-emr-instance-policy, AmazonEMRFullAccessPolicy_v2, AmazonSSMManagedInstanceCore, CloudWatchFullAccess, AmazonEC2FullAccess
c. IAM role name: dp-emr-instance-role
4. Go to the AWS S3 console and select the S3 bucket created in Glue prerequisites.
a. Click on the Create folder button and create a new folder called emr.
b. Inside the emr folder, create two new subfolders called cluster-logs and scripts.
c. Download the scripts from the git repository to your local machine.
d. Upload the downloaded scripts to the scripts subfolder in the S3 bucket.
Create an EMR Cluster
Now for the EMR setup specifics:
1. Go to the AWS Console and search for the EMR service.
2. Click on Create cluster, and change the following:
Name and applications
Name: differential-privacy-emr
Amazon EMR release: emr-6.9.0
Application bundle: Spark (Spark 3.3.0, Hadoop 3.3.3, Zeppelin 0.10.1)
Operating System: Amazon Linux release
Automatically apply latest Amazon Linux updates: Enable
Cluster configuration
Instance groups
Primary: m5.4xlarge, 1 node
Core: m5.4xlarge
Task: N/A (remove)
Cluster scaling and provisioning option:
Set cluster size manually
Core, m5.4xlarge, size : 2 (to manage cost)
Networking
VPC: your newly created VPC
Subnets: subnet-private-a
EC2 security groups (firewall)
Primary Node: ElasticMapReduce-Primary-Private (default)
Core and Task nodes: ElasticMapReduce-Core-Private (default)
Service access: Create ElasticMapReduce-ServiceAccess
Steps
Leave as default
Cluster Termination
Terminate cluster after idle time (1 hour)
Bootstrap actions
Add one bootstrap action
Name: Libraries Installation
Script location: s3://{S3_BUCKET}/emr/scripts/emr_bootstrap.sh
Arguments : N/A
Cluster logs
Publish cluster specific logs to Amazon S3: Enable
S3 location: s3://{S3_BUCKET}/emr/cluster-logs/
Software settings
Load JSON from Amazon S3: s3://{S3_BUCKET}/emr/scripts/software_settings.json
Security configuration and EC2 key pair
Leave as default
Identity and Access Management (IAM) roles
Choose an existing service role: dp-emr-service-role
Choose an existing instance profile: dp-emr-instance-role
3. Once you click Create cluster, the entire bootup process will take approximately 20 minutes.
- After the EMR bootup process is completed, as indicated by the Waiting status, we can proceed with the execution of our differential privacy scripts.
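The console walkthrough above can also be captured as code for repeatable setups. The sketch below builds an equivalent `run_job_flow` request for boto3's EMR client, mirroring the settings chosen above. The bucket name and subnet ID are placeholders, and the final API call is commented out since running it would launch billable infrastructure; treat this as a sketch under those assumptions, not a drop-in replacement for the console steps.

```python
# Build the boto3 EMR run_job_flow request mirroring the console settings.
S3_BUCKET = "YOUR_BUCKET_NAME"  # placeholder: your bucket from the prerequisites
SUBNET_ID = "subnet-xxxxxxxx"   # placeholder: your subnet-private-a ID

cluster_request = {
    "Name": "differential-privacy-emr",
    "ReleaseLabel": "emr-6.9.0",
    "Applications": [{"Name": "Spark"}],
    "LogUri": f"s3://{S3_BUCKET}/emr/cluster-logs/",
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.4xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.4xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": SUBNET_ID,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "BootstrapActions": [
        {"Name": "Libraries Installation",
         "ScriptBootstrapAction": {
             "Path": f"s3://{S3_BUCKET}/emr/scripts/emr_bootstrap.sh"}},
    ],
    "AutoTerminationPolicy": {"IdleTimeout": 3600},  # terminate after 1 hour idle
    "ServiceRole": "dp-emr-service-role",
    "JobFlowRole": "dp-emr-instance-role",  # the EC2 instance profile
    "VisibleToAllUsers": True,
}

# Uncomment to actually launch the cluster (incurs cost):
# import boto3
# emr = boto3.client("emr")
# response = emr.run_job_flow(**cluster_request)
# print(response["JobFlowId"])
print(cluster_request["Name"])
```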
B. Executing Differential Privacy Scripts with Tumult Analytics and PipelineDP
1. To execute your scripts on the EMR cluster, you need to use the Steps feature. To add a step, go to the EMR console and select your cluster. Then, navigate to the Steps tab and click on Add step.
2. For Tumult Analytics, fill in the following information:
Type: Spark application
Name: differential-privacy-tumult-ana
Deploy mode: Cluster mode
Application location: s3://{S3_BUCKET}/emr/scripts/run_tmlt_ana_spark.py
Spark-submit options: N/A
Arguments: -c TotalPop -e 1 -i TUMULT_ANALYTICS_ID -f s3://dp-experiments-data/real_world_data/acs2017_census_tract_data.csv -q MEAN
Step Action: Continue
The Python script can be found here. The job will take roughly 4 minutes.
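The argument string above suggests the command-line interface the scripts expose. A hedged reconstruction with `argparse` is shown below; the flag names and behaviour are inferred from the arguments listed, and the actual repository scripts may differ.

```python
import argparse


def build_parser():
    # Presumed interface of the spark-submit scripts: -c column, -e epsilon,
    # -i source id (Tumult Analytics only), -f input file, -q query type.
    parser = argparse.ArgumentParser(description="Differential privacy job")
    parser.add_argument("-c", "--column-name", required=True)
    parser.add_argument("-e", "--epsilon", type=float, required=True)
    parser.add_argument("-i", "--source-id", default=None)
    parser.add_argument("-f", "--s3-file", required=True)
    parser.add_argument("-q", "--query", default="MEAN",
                        choices=["COUNT", "MEAN", "SUM", "VARIANCE"])
    return parser


args = build_parser().parse_args(
    ["-c", "TotalPop", "-e", "1", "-i", "TUMULT_ANALYTICS_ID",
     "-f", "s3://dp-experiments-data/real_world_data/acs2017_census_tract_data.csv",
     "-q", "MEAN"])
print(args.column_name, args.epsilon, args.query)
```

The PipelineDP step in the next section uses the same flags minus `-i`, since it does not take a source ID.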
3. For PipelineDP, fill in the following information:
Type: Spark application
Name: differential-privacy-pipelinedp
Deploy mode: Cluster mode
Application location: s3://{S3_BUCKET}/emr/scripts/run_pipelinedp_spark.py
Spark-submit options: N/A
Arguments: -c TotalPop -e 1 -f s3://dp-experiments-data/real_world_data/acs2017_census_tract_data.csv -q MEAN
Step Action: Continue
The Python script can be found here. The job will take roughly 2 minutes.
4. Once the job has been completed successfully, the status will be updated to Completed.
5. To repeat the process, simply clone the steps and adjust the parameters as needed.
Great job! You now have a running scalable Differential Privacy job on EMR!
Some tips on using Amazon EMR:
- Prior to setting up the EMR infrastructure, keep in mind the cost involved. To get an estimate, you can use the AWS calculator for EMR here. The cost of EMR instances will continue to accrue until you terminate them, so always remember to terminate your instances when you are finished using them. You can easily bring the cluster back up later by using the Clone option on your terminated EMR cluster.
- If you need to use CLI commands on your EMR cluster, you can utilise Session Manager (SSM). To access it, navigate to the AWS Console, search for “Session Manager”, select the correct instance ID of your master node (which can be found in the EMR details under the Instances tab), and then click on Start Session. This will launch an interactive shell session that you can use to run CLI commands on your cluster.
Conclusion
The strong privacy guarantee of differential privacy is critical for protecting the privacy of individuals while still allowing for data analysis. As data privacy concerns continue to grow, the importance of differential privacy will only increase. Implementing differential privacy at scale can be difficult, but powerful frameworks such as Tumult Analytics and PipelineDP are making it increasingly feasible for data practitioners. Ultimately, by utilising differential privacy, we can ensure that individuals’ sensitive data is protected while still allowing data-driven insights to be gained.
We hope that you have enjoyed and found value in our four-part series on differential privacy. We appreciate you following us on this journey and learning about this important topic with us. By now, you should have a solid understanding of the key concepts and tools related to differential privacy, as well as how to implement it in a scalable manner using AWS services like Glue and EMR. We encourage you to continue exploring and experimenting with differential privacy in your own data projects and take on the next leg of data protection!
Differential Privacy Series
GovTech’s DPPCC has published a four-part series to demystify, evaluate, and provide practical guidance on implementing differential privacy tools. These tools include PipelineDP by Google and Openmined, Tumult Analytics by Tumult Labs, OpenDP by the privacy team at Harvard, and Diffprivlib by IBM. Our analysis can be helpful to ensure that the tools can be used effectively in real-world applications of differential privacy.
- Part 1: Sharing Data with Differential Privacy: A Primer — A beginner’s guide to understanding the fundamental concepts of differential privacy with simplified mathematical interpretation.
- Part 2: Practitioners’ Guide to Accessing Emerging Differential Privacy Tools — Explore the emerging differential privacy tools developed by prominent researchers and institutions, with practical guidance on their adoption for real-world use cases.
- Part 3: Evaluating Differential Privacy Tools’ Performance — A comparative analysis of the accuracy and execution time of differential privacy tools in both standalone and distributed environments, with a focus on common analytical queries.
- Part 4: Getting Started with Scalable Differential Privacy Tools on the Cloud (this article).
DPPCC is working towards building a user-friendly web interface to help non-experts better understand and implement differential privacy, and facilitate privacy-centric data sharing.
For questions and collaboration opportunities, please reach out to us at enCRYPT@tech.gov.sg.
Thanks to Ghim Eng Yap (ghimeng@dsaid.gov.sg), Alan Tang (alantang@dsaid.gov.sg), and Anshu Singh (anshu@dsaid.gov.sg) for their valuable inputs.