Getting Started with Scalable Differential Privacy Tools on the Cloud
This article is the final article in a four-part series on Differential Privacy by GovTech’s Data Privacy Protection Capability Centre (DPPCC). Click here to check out the other articles in this series.
Reviewed by: Damien Desfontaines (Tumult Labs, Staff Scientist) and Vadym Doroshenko (Google, Tech Lead of PipelineDP).
Introduction
In our previous articles, we explored how differential privacy is a powerful tool for data analysis that protects the privacy of individuals. We provided guidance on adopting the latest Python tools (frameworks and libraries) to implement differential privacy, and shared our analysis of these tools' performance on various metrics (e.g. accuracy and execution time) in standalone and distributed environments.
Building on this foundation, we now aim to address a critical need in the field: how to set up scalable differential privacy on the cloud. Real-life datasets often exceed typical memory limits due to their size, requiring distributed computing resources for processing. Scalability hence becomes essential to handle arbitrary-sized datasets. While open-source Python libraries like OpenDP and DiffPrivlib offer differential privacy solutions, they may not support distributed computing, limiting their scalability. For this article, we have focused on Tumult Analytics and PipelineDP, which are specifically designed for distributed computing, allowing us to scale resources quickly and efficiently based on data size and complexity.
Since their introduction at the PEPR’22 conference, these open-source Python frameworks, developed by renowned institutions and researchers, have made it possible to overcome the limitations of applying differential privacy at a larger scale. This article provides a practical guide on how potential adopters of differential privacy can implement these frameworks on Amazon Web Services (AWS), with the goal of helping readers get started with using these scalable frameworks in their own data projects.
There are various AWS services that can be used for working with differential privacy, but in this article, we will focus on two key ones: AWS Glue and Amazon Elastic Map Reduce (EMR). AWS Glue is a fully managed service that supports distributed computing with Spark, enabling the preparation and loading of data for analysis, while Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Hadoop and Spark. By discussing both services, we aim to give readers the flexibility to choose the service that best suits their needs.
In the following walkthrough, we will focus on running Python scripts to calculate the population mean for the US Census 2017 Demographic Data using differential privacy, and we will also show the time taken for each process. The diagram below demonstrates what we will be executing in this article.
Note: while the purpose of this example is to demonstrate the use of distributed computing, typically for larger datasets, the current example uses a smaller dataset stored on GitHub for practicality and accessibility. You may wish to substitute the sample dataset used with a larger one to fully appreciate the power of distributed computing.
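Before setting anything up in the cloud, it helps to recall what these frameworks compute under the hood. The sketch below is a minimal, illustrative implementation of a differentially private mean using the Laplace mechanism in pure Python. It is not the actual algorithm used by Tumult Analytics or PipelineDP (both perform additional steps such as contribution bounding), and the toy values and clamping bounds are assumptions for illustration only.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling of the Laplace distribution with the given scale.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def dp_mean(values, low, high, epsilon, rng):
    """Differentially private mean via the Laplace mechanism.

    Values are clamped to [low, high]; the privacy budget is split
    equally between a noisy sum and a noisy count.
    """
    clamped = [min(max(v, low), high) for v in values]
    # Sensitivity of the clamped sum is max(|low|, |high|); of the count, 1.
    sum_scale = max(abs(low), abs(high)) / (epsilon / 2)
    count_scale = 1.0 / (epsilon / 2)
    noisy_sum = sum(clamped) + laplace_noise(sum_scale, rng)
    noisy_count = len(clamped) + laplace_noise(count_scale, rng)
    return noisy_sum / max(noisy_count, 1.0)


rng = random.Random(42)
populations = [4000, 5200, 3100, 6100, 4800]  # toy census-tract populations
print(dp_mean(populations, low=0, high=10000, epsilon=1.0, rng=rng))
```

Note that on a tiny dataset like this the noise dwarfs the signal; differential privacy shines on large datasets, which is exactly why distributed computing matters.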
You may click on the links below to navigate to the relevant sections of this guide:
Implementing Differential Privacy with AWS Glue
A. Getting Started
B. Glue Job for Tumult Analytics
C. Glue Job for PipelineDP
Implementing Differential Privacy with Amazon EMR
A. Getting Started
B. Executing Differential Privacy Scripts with Tumult Analytics and PipelineDP
Note: This guide assumes that you have an existing AWS account. To streamline your understanding and experimentation with differential privacy, feel free to refer to our git repository. Additionally, you will find frequent references to the repository throughout the article.
Implementing Differential Privacy with AWS Glue
AWS Glue is a fully managed, serverless ETL service that makes it easy even for beginners to prepare and load data for analysis. With Glue, we do not need to worry about managing servers or scaling resources and can focus instead on writing and refining our scripts. The serverless architecture of Glue also means that users only pay for the computing resources they use, making it a cost-effective solution when we are doing experiments.
A. Getting Started
Before we can start using Glue, there are some prerequisite steps we need to do, such as creating IAM roles and initialising an S3 bucket to store our data.
Create an IAM Role
To create an IAM role that allows our Glue jobs to access different AWS services, follow these steps:
- Go to the AWS Console and search for the IAM service.
- On the left-hand side panel, click on Roles.
- Click on Create role:
Step 1: Select trusted entity
- Select AWS service as the trusted entity type.
- Select Glue — Allows Glue to call AWS services on your behalf as the use case for the role. Click on the Next: Permissions button.
Step 2: Add permissions
- Select AmazonS3FullAccess and CloudWatchFullAccess (you should have 2 policies selected).
- Note that these policies do not follow the principle of least privilege access. If you are planning to run this in a production environment, it is highly recommended to modify the access privileges accordingly to reduce the risk of unauthorised access to sensitive data and other resources.
Step 3: Name, review, and create
- Enter a role name such as glue-differential-privacy-role. You can leave the other inputs as default. Click on Create role.
Initialise S3 Bucket
To store our data in the cloud, we will need to initialise an S3 bucket. You can do this by following these steps:
- Go to the AWS Console and search for the S3 service.
- Click on the Create bucket button to create a new S3 bucket.
- Choose a unique bucket name, such as dp-experiments-data, and select the region where you want to store your data.
- Once the bucket is created, you can upload your raw data to it. To do this, click on the Upload button and select the files you want to upload.
- We have provided some sample data and sample scripts that you can upload to the S3 bucket.
Create a New Glue Job
With sufficient permissions and data at our disposal, we’re now ready to use Glue to prepare and apply differential privacy techniques to our data.
Let’s Glue it all up:
- Go to the AWS Console and search for the Glue service.
- On the left-hand side panel, click on ETL jobs.
- In the Jobs page, select Spark script editor from the options provided.
- Under the Spark script editor options, choose Upload and edit an existing script.
- You can retrieve the scripts that you have uploaded in S3 or choose the files from your local computer.
- Follow the steps in the following part B and C to create the jobs for Tumult Analytics and PipelineDP respectively.
B. Glue Job for Tumult Analytics
- Create a new Glue Job using this script.
- Click on the Job Details tab and fill in the following information:
Name: differential-privacy-tumult-ana
IAM Role: glue-differential-privacy-role (created in prerequisites)
Type: Spark
Glue Version: Glue 3.0 - Supports spark 3.1, Scala 2, Python 3
Language: Python 3
Worker Type: G.1X (4 vCPU and 16 GB RAM)
Automatically scale the number of workers: Disable (this is for cost management purposes; if you do need more scalability, you can leave this checked).
Job Bookmark: Enable
Flex execution: Enable (to reduce some cost)
Requested number of workers: 10
Job timeout (minutes): 720 minutes (12 hours)
Advanced Properties (default values for un-mentioned)
- Glue data catalog as the Hive metastore: Disable
- Job Parameters:
--additional-python-modules : tmlt.analytics==0.6.1
--column_name : TotalPop
--epsilon : 1
--s3_file : s3://{S3_BUCKET}/real_world_data/acs2017_census_tract_data.csv
--query: MEAN # (can be changed to COUNT, MEAN, SUM, VARIANCE)
--source_id: <ANY ID>
Note: change the S3_BUCKET parameter to your S3 bucket containing the data.
Click Save in the top right corner, then click Run. The Tumult Analytics job takes about 4 minutes to run with the chosen dataset and parameters.
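For context on how these job parameters reach the script: a Glue script typically retrieves them at runtime with `getResolvedOptions` from the `awsglue.utils` module. The snippet below is a dependency-free stand-in that mimics that behaviour for the parameters listed above, so you can see roughly how `--column_name`, `--epsilon`, and friends flow into the job; the actual repository script may differ.

```python
def resolve_options(argv, expected):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    collects '--key value' pairs from argv for the expected keys."""
    args = {}
    for i, token in enumerate(argv):
        if token.startswith("--"):
            key = token[2:]
            if key in expected and i + 1 < len(argv):
                args[key] = argv[i + 1]
    missing = [k for k in expected if k not in args]
    if missing:
        raise ValueError(f"Missing job parameters: {missing}")
    return args


# Example: the parameter list from the Glue job above (bucket name is a placeholder).
argv = ["script.py",
        "--column_name", "TotalPop",
        "--epsilon", "1",
        "--s3_file", "s3://my-bucket/real_world_data/acs2017_census_tract_data.csv",
        "--query", "MEAN",
        "--source_id", "census"]
opts = resolve_options(argv, ["column_name", "epsilon", "s3_file", "query", "source_id"])
print(opts["query"], float(opts["epsilon"]))
```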
C. Glue Job for PipelineDP
- Create another new Glue Job using this script.
- Click on the Job Details tab and fill in the following information:
Name: differential-privacy-pipelinedp
IAM Role: glue-differential-privacy-role (created in prerequisites)
Type: Spark
Glue Version: Glue 3.0 - Supports spark 3.1, Scala 2, Python 3
Language: Python 3
Worker Type: G.1X (4 vCPU and 16 GB RAM)
Automatically scale the number of workers: Disable (this is for cost management purposes; if you do need more scalability, you can leave this checked).
Job Bookmark: Enable
Flex execution: Enable (to reduce some cost)
Requested number of workers: 10
Job timeout (minutes): 720 minutes (12 hours)
Advanced Properties (default values for un-mentioned)
- Glue data catalog as the Hive metastore: Disable
- Job Parameters:
--additional-python-modules : pipeline-dp==0.2.0
--column_name : TotalPop
--epsilon : 1
--s3_file : s3://{S3_BUCKET}/real_world_data/acs2017_census_tract_data.csv
--query: MEAN # (can be changed to COUNT, MEAN, SUM, VARIANCE)
Note: change the S3_BUCKET parameter to your S3 bucket containing the data.
Click Save in the top right corner, then click Run. The PipelineDP job takes about 2 minutes to run with the chosen dataset and parameters.
You can monitor the progress of your job in the Runs tab. Once the status of your run changes from Running to Succeeded, you can view your logs in the Output logs section.
Congratulations, you have successfully run your first scalable differential privacy job in the cloud!
Implementing Differential Privacy with Amazon EMR
For those seeking greater flexibility in setting up their infrastructure, we will also be tackling the same two differential privacy frameworks on AWS EMR. Unlike Glue, EMR requires a deeper level of technical understanding as it involves managing infrastructure but gives you greater control over system-level parameters.
The following diagram provides a system architecture overview of what we will be building:
The architecture employs a multi-AZ public-private VPC model to ensure high availability, with the EMR cluster situated in private subnets that have internet access (via a NAT gateway) for downloading required packages on the fly. The EMR cluster leverages S3 for data storage, including logs, ensuring that all data and logs generated during the differential privacy calculations are securely stored and easily accessible for future analysis.
Note: You can consider further enhancing network security by using VPC endpoints to route requests through PrivateLink to run them solely within the AWS network, rather than through the internet.
A. Getting Started
In order to use EMR, there are some prerequisite steps we need to take, such as creating the base architecture and IAM roles.
Creating the base architecture
The following steps outline the process of creating the necessary infrastructure and network resources to initialise the EMR cluster.
- Navigate to the VPC console and click on the Create VPC button. Provide a name for your VPC and enter a /24 IPv4 CIDR block for your VPC, for example, 10.0.0.0/24.
- Click Create to create your VPC. Note the VPC ID that you have just created.
- Navigate to the Subnets section and create four subnets — two public and two private.
subnet-private-a : 10.0.0.0/27
subnet-private-b : 10.0.0.32/27
subnet-public-a : 10.0.0.64/27
subnet-public-b : 10.0.0.96/27
- Navigate to the Internet Gateways section and create a new Internet Gateway.
- Attach the newly created Internet Gateway to your VPC.
- Navigate to the NAT Gateways section and create a new NAT Gateway.
- Select the public subnet subnet-public-a for your NAT Gateway and allocate an Elastic IP during initialisation.
- Navigate to the Network ACLs section and create a new Network ACL.
- Create inbound rules for ports 443 (HTTPS), 8443 (EMR), and 32700–65535 (Hadoop/Spark applications). Outbound rules can be set to allow all traffic.
- Ensure that the newly created Network ACL is associated with all four subnets.
- Create two route tables to direct traffic accordingly between the subnets
differential-privacy-private-route route table
— Associate with subnet-private-a and subnet-private-b
— Add a route, 0.0.0.0/0 towards the NAT
differential-privacy-public-route route table
— Associate with subnet-public-a and subnet-public-b
— Add a route, 0.0.0.0/0 towards the IGW
- At this point, your newly created VPC resource map should resemble the following:
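As an optional sanity check on the addressing plan above, the standard-library `ipaddress` module can confirm that the four /27 subnets all fit inside the /24 VPC CIDR and do not overlap, before you commit the configuration in the console:

```python
import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.0.0.0/24")
subnets = {
    "subnet-private-a": ipaddress.ip_network("10.0.0.0/27"),
    "subnet-private-b": ipaddress.ip_network("10.0.0.32/27"),
    "subnet-public-a": ipaddress.ip_network("10.0.0.64/27"),
    "subnet-public-b": ipaddress.ip_network("10.0.0.96/27"),
}

# Each subnet must sit inside the VPC CIDR...
for name, net in subnets.items():
    assert net.subnet_of(vpc), f"{name} is outside the VPC"

# ...and no two subnets may overlap.
for (a, na), (b, nb) in combinations(subnets.items(), 2):
    assert not na.overlaps(nb), f"{a} overlaps {b}"

print("Addressing plan OK:", [str(n) for n in subnets.values()])
```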
Creating IAM roles
These steps involve creating IAM roles and policies to grant permissions to the EMR cluster.
1. Navigate to the IAM console and click on Roles in the left-hand menu. Click on the Create role button. This role is going to be used for the EMR service role.
a. Select EMR as the AWS service
b. Select EMR — Allows EMR to call AWS services on your behalf as the use case for the role. Click on the Next: Permissions button.
c. Select AmazonElasticMapReduceRole policy and click on the Next: Tags button.
d. Add any desired tags and click on the Next: Review button.
e. Name the role dp-emr-service-role and click on the Create role button.
2. Now, we are going to create the IAM policy that will be used as the EMR Instance Profile.
a. Click on Policies in the left-hand menu. Click on the Create policy button.
b. Select the JSON tab and paste in the following policy, replacing BUCKET_NAME with the name of your S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_NAME",
                "arn:aws:s3:::BUCKET_NAME/*"
            ]
        }
    ]
}
c. Click on the Review policy button and name the policy dp-emr-instance-policy. Click on the Create policy button.
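If you are scripting this setup, it can help to template and validate the policy before pasting it in; malformed JSON (for instance, a trailing comma) is a common cause of a rejected policy. A small sketch, using a hypothetical bucket name for illustration:

```python
import json

# Hypothetical bucket name for illustration; substitute your own.
bucket_name = "dp-experiments-data"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:Get*", "s3:List*"],
            "Resource": [
                # Bucket-level ARN for List*, object-level ARN for Get*.
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }
    ],
}

# Round-tripping through json guarantees the pasted text is valid JSON.
document = json.dumps(policy, indent=2)
assert json.loads(document)["Version"] == "2012-10-17"
print(document)
```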
3. Create another IAM role that is going to be used as the EMR Instance Profile
a. AWS service: EC2
b. Add policies: dp-emr-instance-policy, AmazonEMRFullAccessPolicy_v2, AmazonSSMManagedInstanceCore, CloudWatchFullAccess, AmazonEC2FullAccess
c. IAM role name: dp-emr-instance-role
4. Go to the AWS S3 console and select the S3 bucket created in Glue prerequisites.
a. Click on the Create folder button and create a new folder called emr.
b. Inside the emr folder, create two new subfolders called cluster-logs and scripts.
c. Download the scripts from the git repository to your local machine.
d. Upload the downloaded scripts to the scripts subfolder in the S3 bucket.
Create an EMR Cluster
Now for the EMR setup specifics:
1. Go to the AWS Console and search for the EMR service.
2. Click on Create cluster, and change the following:
Name and applications
Name: differential-privacy-emr
Amazon EMR release: emr-6.9.0
Application bundle: Spark (Spark 3.3.0, Hadoop 3.3.3, Zeppelin 0.10.1)
Operating System: Amazon Linux release
Automatically apply latest Amazon Linux updates: Enable
Cluster configuration
Instance groups
Primary: m5.4xlarge, 1 node
Core: m5.4xlarge
Task: N/A (remove)
Cluster scaling and provisioning option:
Set cluster size manually
Core, m5.4xlarge, size : 2 (to manage cost)
Networking
VPC: your newly created VPC
Subnets: subnet-private-a
EC2 security groups (firewall)
Primary Node: ElasticMapReduce-Primary-Private (default)
Core and Task nodes: ElasticMapReduce-Core-Private (default)
Service access: Create ElasticMapReduce-ServiceAccess
Steps
Leave as default
Cluster Termination
Terminate cluster after idle time (1 hour)
Bootstrap actions
Add one bootstrap action
Name: Libraries Installation
Script location: s3://{S3_BUCKET}/emr/scripts/emr_bootstrap.sh
Arguments : N/A
Cluster logs
Publish cluster specific logs to Amazon S3: Enable
S3 location: s3://{S3_BUCKET}/emr/cluster-logs/
Software settings
Load JSON from Amazon S3: s3://{S3_BUCKET}/emr/scripts/software_settings.json
Security configuration and EC2 key pair
Leave as default
Identity and Access Management (IAM) roles
Choose an existing service role: dp-emr-service-role
Choose an existing instance profile: dp-emr-instance-role
3. Once you click Create cluster, the entire bootup process will take approximately 20 minutes.
- After the EMR bootup process is completed, as indicated by the Waiting status, we can proceed with the execution of our differential privacy scripts.
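The console walkthrough above can also be captured as code for repeatable setups. The sketch below builds an equivalent `run_job_flow` request for boto3's EMR client, mirroring the settings chosen above. The bucket name and subnet ID are placeholders, and the final API call is commented out since running it would launch billable infrastructure; treat this as a sketch under those assumptions, not a drop-in replacement for the console steps.

```python
# Build the boto3 EMR run_job_flow request mirroring the console settings.
S3_BUCKET = "YOUR_BUCKET_NAME"  # placeholder: your bucket from the prerequisites
SUBNET_ID = "subnet-xxxxxxxx"   # placeholder: your subnet-private-a ID

cluster_request = {
    "Name": "differential-privacy-emr",
    "ReleaseLabel": "emr-6.9.0",
    "Applications": [{"Name": "Spark"}],
    "LogUri": f"s3://{S3_BUCKET}/emr/cluster-logs/",
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.4xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.4xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": SUBNET_ID,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "BootstrapActions": [
        {"Name": "Libraries Installation",
         "ScriptBootstrapAction": {
             "Path": f"s3://{S3_BUCKET}/emr/scripts/emr_bootstrap.sh"}},
    ],
    "AutoTerminationPolicy": {"IdleTimeout": 3600},  # terminate after 1 hour idle
    "ServiceRole": "dp-emr-service-role",
    "JobFlowRole": "dp-emr-instance-role",  # the EC2 instance profile
    "VisibleToAllUsers": True,
}

# Uncomment to actually launch the cluster (incurs cost):
# import boto3
# emr = boto3.client("emr")
# response = emr.run_job_flow(**cluster_request)
# print(response["JobFlowId"])
print(cluster_request["Name"])
```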
B. Executing Differential Privacy Scripts with Tumult Analytics and PipelineDP
1. To execute your scripts on the EMR cluster, you need to use the Steps feature. To add a step, go to the EMR console and select your cluster. Then, navigate to the Steps tab and click on Add step.
2. For Tumult Analytics, fill in the following information:
Type: Spark application
Name: differential-privacy-tumult-ana
Deploy mode: Cluster mode
Application location: s3://{S3_BUCKET}/emr/scripts/run_tmlt_ana_spark.py
Spark-submit options: N/A
Arguments: -c TotalPop -e 1 -i TUMULT_ANALYTICS_ID -f s3://dp-experiments-data/real_world_data/acs2017_census_tract_data.csv -q MEAN
Step Action: Continue
The Python script can be found here. The job will take roughly 4 minutes.
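The argument string above suggests the command-line interface the scripts expose. A hedged reconstruction with `argparse` is shown below; the flag names and behaviour are inferred from the arguments listed, and the actual repository scripts may differ.

```python
import argparse


def build_parser():
    # Presumed interface of the spark-submit scripts: -c column, -e epsilon,
    # -i source id (Tumult Analytics only), -f input file, -q query type.
    parser = argparse.ArgumentParser(description="Differential privacy job")
    parser.add_argument("-c", "--column-name", required=True)
    parser.add_argument("-e", "--epsilon", type=float, required=True)
    parser.add_argument("-i", "--source-id", default=None)
    parser.add_argument("-f", "--s3-file", required=True)
    parser.add_argument("-q", "--query", default="MEAN",
                        choices=["COUNT", "MEAN", "SUM", "VARIANCE"])
    return parser


args = build_parser().parse_args(
    ["-c", "TotalPop", "-e", "1", "-i", "TUMULT_ANALYTICS_ID",
     "-f", "s3://dp-experiments-data/real_world_data/acs2017_census_tract_data.csv",
     "-q", "MEAN"])
print(args.column_name, args.epsilon, args.query)
```

The PipelineDP step in the next section uses the same flags minus `-i`, since it does not take a source ID.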
3. For PipelineDP, fill in the following information:
Type: Spark application
Name: differential-privacy-pipelinedp
Deploy mode: Cluster mode
Application location: s3://{S3_BUCKET}/emr/scripts/run_pipelinedp_spark.py
Spark-submit options: N/A
Arguments: -c TotalPop -e 1 -f s3://dp-experiments-data/real_world_data/acs2017_census_tract_data.csv -q MEAN
Step Action: Continue
The Python script can be found here. The job will take roughly 2 minutes.
4. Once the job has been completed successfully, the status will be updated to Completed.
5. To repeat the process, simply clone the steps and adjust the parameters as needed.
Great job! You now have a running scalable Differential Privacy job on EMR!
Some tips on using Amazon EMR:
- Prior to setting up the EMR infrastructure, keep in mind the cost involved. To get an estimate, you can use the AWS calculator for EMR here. The cost of EMR instances will continue to accrue until you terminate them, so always remember to terminate your instances when you are finished using them. You can easily bring the cluster back up later by using the Clone option on your terminated EMR cluster.
- If you need to use CLI commands on your EMR cluster, you can utilise Session Manager (SSM). To access it, navigate to the AWS Console, search for “Session Manager”, select the correct instance ID of your master node (which can be found in the EMR details under the Instances tab), and then click on Start Session. This will launch an interactive shell session that you can use to run CLI commands on your cluster.
Conclusion
The strong privacy guarantee of differential privacy is critical for protecting the privacy of individuals while still allowing for data analysis. As data privacy concerns continue to grow, the importance of differential privacy will only increase. Implementing differential privacy at scale can be difficult, but powerful frameworks such as Tumult Analytics and PipelineDP are making it increasingly feasible for data practitioners. Ultimately, by utilising differential privacy, we can ensure that individuals’ sensitive data is protected while still allowing data-driven insights to be gained.
We hope that you have enjoyed and found value in our four-part series on differential privacy. We appreciate you following us on this journey and learning about this important topic with us. By now, you should have a solid understanding of the key concepts and tools related to differential privacy, as well as how to implement it in a scalable manner using AWS services like Glue and EMR. We encourage you to continue exploring and experimenting with differential privacy in your own data projects and take on the next leg of data protection!
Differential Privacy Series
GovTech’s DPPCC has published a four-part series to demystify, evaluate, and provide practical guidance on implementing differential privacy tools. These tools include PipelineDP by Google and Openmined, Tumult Analytics by Tumult Labs, OpenDP by the privacy team at Harvard, and Diffprivlib by IBM. Our analysis can be helpful to ensure that the tools can be used effectively in real-world applications of differential privacy.
- Part 1: Sharing Data with Differential Privacy: A Primer — A beginner’s guide to understanding the fundamental concepts of differential privacy with simplified mathematical interpretation.
- Part 2: Practitioners’ Guide to Accessing Emerging Differential Privacy Tools — Explore the emerging differential privacy tools developed by prominent researchers and institutions, with practical guidance on their adoption for real-world use cases.
- Part 3: Evaluating Differential Privacy Tools’ Performance — A comparative analysis of the accuracy and execution time of differential privacy tools in both standalone and distributed environments, with a focus on common analytical queries.
- Part 4: Getting Started with Scalable Differential Privacy Tools on the Cloud (this article).
DPPCC is working towards building a user-friendly web interface to help non-experts better understand and implement differential privacy, and facilitate privacy-centric data sharing.
For questions and collaboration opportunities, please reach out to us at enCRYPT@tech.gov.sg.
Thanks to Ghim Eng Yap (ghimeng@dsaid.gov.sg), Alan Tang (alantang@dsaid.gov.sg), and Anshu Singh (anshu@dsaid.gov.sg) for their valuable inputs.