Ensuring End-user Privacy and Data Security for Machine Learning Using AWS Virtual Private Cloud

Karan -, Enrique Garcia Perez and Tassilo Klein (ML Research Berlin)

The availability of big data, coupled with steadily decreasing costs of computational processing power, is the basis of modern machine learning (ML). This development, along with recent algorithmic improvements, has heralded a new era of artificial intelligence: deep learning.

Although the technology has its roots in the 1960s, it has recently experienced unprecedented breakthroughs in domains ranging from computer vision to natural language processing. Unsurprisingly, deep learning has moved into the center of attention of both academia and industry. While still evolving at large scale, it is becoming a de facto standard commodity for data scientists.

With the success of deep learning comes the hunger for ever larger data sets. However, handling the required big data properly can be tricky and painful. Certain information might be sensitive and cannot be exposed to the outside world, as data privacy must be guaranteed at all times. The most prominent example in this regard is healthcare data, which requires collecting and storing potentially large amounts of personally identifiable information. In general, processing big data demands a very dynamic and scalable infrastructure, because tasks and data volumes fluctuate. Combining these requirements is challenging and calls for safeguarding measures as well as deliberate planning.

Now imagine you are a small academic or enterprise research group and want to operate ML infrastructure while at the same time guaranteeing data privacy. Our team at SAP ML Research has been working on this problem and proposes an Amazon Virtual Private Cloud (VPC) infrastructure to make sure that machine learning solutions go hand in hand with data privacy. This article defines a basic VPC topology suited to ML development and deployment. In short, VPCs in the cloud allow you to define and regulate different access requirements in a very scalable environment. Although the components of this VPC can be configured using the AWS console, doing so manually is neither convenient nor reproducible. Automating this step is therefore key when setting up such a network infrastructure.

Here a powerful open-source automation engine comes into play: Ansible. At the end of this tutorial, all you need to do is run a single command that executes an Ansible playbook (that is, a configuration and deployment script), and you’re all set! In particular, there is no need to recreate this infrastructure once it is set up. However, you may want to attach additional instances to the VPC, e.g. to give new students or employees an individual computing node.

The VPC we created consists of two subnets: public and private.

The public subnet contains components such as the bastion host, a dedicated instance that controls access to the private network by forming a chain of SSH connections, for instance from the bastion to a computation node.

Within the private subnet, there are components that host or operate on potentially confidential data. These might comprise computing instances employed for training models or database servers. As all inbound traffic flows through the bastion host, the traffic can be controlled at a single point. In theory, you might chain several bastion hosts for even higher security. However, for simplicity we used a single instance, which is sufficient. Finally, in order to give the private components access to the internet, for instance to install software packages, one more component is necessary. This is where the NAT gateway comes into play. In our example it is implemented with an elastic public IP address attached to it.

VPC Topology: How the Components Interact

Figure 1 shows the schema of the VPC infrastructure consisting of private and public subnets.

Fig. 1: VPC Infrastructure depiction

Ansible automation

Ansible playbooks use so-called roles and tasks to structure the configuration content into modular components. Our project consists of playbook.yml, which contains the single role of creating the VPC. The main.yml file further subdivides the role into a list of tasks, each of which is, at a basic level, nothing more than a call to an Ansible module. In our case the tasks create the VPC, the public and private subnets, the NAT gateway, the internet gateway and the security groups. Please refer to the files in the GitHub repository for specific task details.
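Because Ansible runs the tasks of a role in the order they are listed, the order has to respect the dependencies between the resources. The following Python sketch is not part of the playbook; it only illustrates one valid creation order. The resource names and the dependency table are our own assumptions, derived from how the components relate to each other (for example, the NAT gateway lives in the public subnet and needs the internet gateway to reach the internet).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency table: each resource maps to the resources
# that must already exist before it can be created.
dependencies = {
    "vpc": set(),
    "public_subnet": {"vpc"},
    "private_subnet": {"vpc"},
    "internet_gateway": {"vpc"},
    "nat_gateway": {"public_subnet", "internet_gateway"},
    "security_groups": {"vpc"},
}

# static_order() yields every resource after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any order printed by this sketch starts with the VPC and places the NAT gateway after both the public subnet and the internet gateway; the relative position of the remaining resources may vary.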

Unlike the files mentioned above, vars.yml requires some minor adaptation from you. It contains the definitions of all necessary variables, such as the subnet address space and access keys.

Fig. 2 provides an outline and explanation of all key variables in the vars.yml file. You can find the variables to be modified in the file snippet below.

NOTE: Please change the variables in the vars.yml file, such as your access key and VPC name, according to your needs.

Ansible playbook snippet (vars.yml) for setting up VPC

# Authorization variables for AWS
aws_access_key: "Provide your access key"
aws_secret_key: "Provide your secret key"
aws_region:     "eu-west-1"
# Information about the VPC
vpc_name:       "ML_Research_VPC"
vpc_cidr_block: "10.0.0.0/16"
# Your own IP address for the security group rule (optional, easy to modify later)
my_ip:          "X.X.X.X"
# Address range of the public subnet
public_subnet_1_cidr:  "10.0.0.0/24"
# Address range of the private subnet
private_subnet_1_cidr: "10.0.1.0/24"
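Before running the playbook, it can help to check that the subnet ranges in vars.yml fit inside the VPC range and do not overlap. The following is a minimal Python sketch using only the standard library; the function name check_subnets is our own and is not part of the playbook.

```python
import ipaddress


def check_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems found in the subnet layout (empty if none)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet has to lie completely inside the VPC address range.
    for name, net in subnets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")
    # No two subnets may share addresses.
    names = list(subnets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


# Values copied from the vars.yml snippet.
problems = check_subnets(
    "10.0.0.0/16",
    {"public_subnet_1_cidr": "10.0.0.0/24", "private_subnet_1_cidr": "10.0.1.0/24"},
)
print(problems or "subnet layout is consistent")
```

With the values from the snippet, the list of problems is empty: both /24 ranges are part of 10.0.0.0/16 and do not share any addresses.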

Once you have downloaded all the playbook files, all you have to do to set up the VPC configuration within AWS is run the main playbook:

ansible-playbook playbook.yml -i inventory -e @vars.yml

Once the VPC configuration has been created successfully, your terminal output should resemble Figure 3:

Fig. 3: Successful Execution of Playbook for VPC Creation

After following the previous steps, you can go to your AWS console and check your newly created VPC configuration (in our case it is ML_Research_VPC):

Now that you have set up the infrastructure, you can begin to launch instances within it. The first instance you might want to launch is the bastion host. As described before, it simply acts as a bridge, allowing only authorized users to access instances within the private subnet.

Configuration of bastion host

Because the bastion host has a single job, namely running a service that is hardened to withstand attacks, a small EC2 instance type (EC2 instances are virtual servers in Amazon Web Services terminology), such as t2.micro, is sufficient. It is important that the instance is associated with the public subnet, as it needs to be accessible from the internet on port 22 for SSH. In our case, all requirements are met by the way the security groups are implemented (by default only SSH is accessible), so authorized users can reach private instances only by passing through the bastion host. The bastion host thus acts as a checkpoint in the public subnet.

Here is a quick interactive demonstration of how to launch a bastion. The allowed IP range in this case is global, but depending on your use case you might want to constrain it for security reasons via the Ansible playbook.
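To see what constraining the allowed IP range means in practice, the following Python sketch compares a global source range with a rule limited to a single address, as the my_ip variable in vars.yml suggests. It only models the CIDR matching and is not how AWS evaluates security groups internally; the function name ssh_allowed and the addresses (taken from the ranges reserved for documentation) are placeholders.

```python
import ipaddress


def ssh_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls into any of the allowed source ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


open_rule = ["0.0.0.0/0"]              # global: any IPv4 address may connect
restricted_rule = ["203.0.113.7/32"]   # placeholder standing in for my_ip

print(ssh_allowed("198.51.100.20", open_rule))        # True
print(ssh_allowed("198.51.100.20", restricted_rule))  # False
print(ssh_allowed("203.0.113.7", restricted_rule))    # True
```

A /32 suffix restricts the rule to exactly one address, so only connections from that address would be let through to port 22 of the bastion.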

Once the bastion host is up and running, you can access your private instances, but only via SSH, as they reside within the private network. To make them accessible via SSH, append the public keys of the authorized users to the .ssh/authorized_keys file on both the bastion and the instance. Please refer to this tutorial for more details.
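Appending keys by hand easily leads to duplicate entries. The following Python sketch shows one way to add a public key to the content of an authorized_keys file only if it is not yet present. The function name add_public_key is our own, and the key strings are shortened placeholders, not real keys.

```python
def add_public_key(authorized_keys_text, public_key):
    """Return the file content with public_key appended, unless it is already present."""
    key = public_key.strip()
    lines = authorized_keys_text.splitlines()
    if key in (line.strip() for line in lines):
        return authorized_keys_text  # nothing to do, the key is already authorized
    lines.append(key)
    return "\n".join(lines) + "\n"


existing = "ssh-rsa BBBB alice@laptop\n"         # placeholder content
new_key = "ssh-ed25519 AAAA bob@workstation"     # placeholder key

updated = add_public_key(existing, new_key)
print(updated, end="")
# Adding the same key a second time leaves the content unchanged.
print(add_public_key(updated, new_key) == updated)
```

On the bastion and on the private instance, you would read the content of .ssh/authorized_keys, pass it through such a function and write the result back.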

Launching an EC2 instance in the private subnet of VPC

Now that we have set up the whole environment, we are almost done. There is just one more step remaining: launching an EC2 instance for machine learning. Let us demonstrate how you can launch a GPU instance in your VPC for developing or running your deep learning solutions. All you have to do is choose which kind of instance you wish to launch depending on your objective, and then attach it to the created VPC.

SSH Configuration to access your instance

Now that your instance is running, the next step is to establish a connection to it. You can access your private instance directly by forwarding the SSH connection via the bastion host. This just requires editing the ~/.ssh/config file on your local machine. This way you avoid the tedious process of manually connecting to your bastion and then to the private instance. All you need to do is adapt the placeholder values: the path to your private key and the IP address of the bastion.

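# Private instances: tunnel the connection through the bastion
Host 10.0.1.*
    IdentityFile /path-to-your-private-key.pem
    User ec2-user
    ProxyCommand ssh ec2-user@bastion -W %h:%p
    # UseKeychain is specific to macOS; remove it on other systems
    UseKeychain yes
# Bastion host in the public subnet
Host bastion
    HostName [ip address of bastion]
    User ec2-user
    IdentityFile /path-to-your-private-key.pem
    ForwardAgent yes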

With this configuration, a single SSH command from your local machine reaches a private instance; the hop via the bastion happens automatically.

After inserting the IP address of your bastion, you can connect to your private instance:

                    ssh [private-ip of instance]
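If you prefer not to edit the SSH configuration, a similar two-hop connection can be requested on the command line with the -J (ProxyJump) option of recent OpenSSH clients. The following Python sketch only assembles such a command. The function name jump_command and the host name bastion.example.org are placeholders, and the sketch assumes that your private key is available through ssh-agent.

```python
import shlex


def jump_command(bastion_host, private_ip, user="ec2-user"):
    """Build an ssh command that reaches private_ip by jumping through the bastion."""
    return ["ssh", "-J", f"{user}@{bastion_host}", f"{user}@{private_ip}"]


cmd = jump_command("bastion.example.org", "10.0.1.99")
print(shlex.join(cmd))
# ssh -J ec2-user@bastion.example.org ec2-user@10.0.1.99
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a terminal.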

However, you still might want to connect to the bastion for administrative tasks, such as adding SSH keys. To connect to the bastion host, all you need to do is run:

                            ssh bastion

Here is how you access the bastion host and the private instance using the commands above (the launched private instance has the address 10.0.1.99):

Important to keep in mind: Generally it is recommended to have two different keys, one for the bastion host and one for your private instance. Never store your private keys on the bastion host, as this compromises the desired privacy.

In conclusion, the described infrastructure provides a secure network in which to develop ML solutions and maintain data. However, to make it more robust and less prone to downtime in case your availability zone fails, the topology can be extended beyond this tutorial. Such a further step might involve designing a VPC that spans multiple availability zones, so that the resources are not all located at one physical location and unnecessary delays or downtime are avoided.
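As a starting point for such an extension, the following Python sketch carves the VPC address range into one public and one private /24 subnet per availability zone. It is a planning aid under our own assumptions, not part of the playbook: the function name plan_subnets and the choice of the zones eu-west-1a and eu-west-1b are illustrative.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, prefix=24):
    """Assign consecutive subnets of the VPC range to each availability zone."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=prefix)
    plan = {}
    for zone in zones:
        # Take the next two free blocks: one public, one private.
        plan[zone] = {"public": str(next(blocks)), "private": str(next(blocks))}
    return plan


plan = plan_subnets("10.0.0.0/16", ["eu-west-1a", "eu-west-1b"])
for zone, subnets in plan.items():
    print(zone, subnets)
# eu-west-1a {'public': '10.0.0.0/24', 'private': '10.0.1.0/24'}
# eu-west-1b {'public': '10.0.2.0/24', 'private': '10.0.3.0/24'}
```

The first zone keeps the ranges used in this tutorial, so an existing single-zone setup would not have to be renumbered.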

Now you can start building your own VPC according to your use case. Please feel free to comment with any questions or ideas!

Access the entire playbook using the following GitHub repository.

Karan, an M.Sc. student at TU Berlin, has been focusing his attention on the ML infrastructure outlined in this post.