During workshops, I often see participants wrestle with software installation before they can get started. This wastes already limited time that could be spent on learning. Wouldn’t it be nice if this could be avoided?
For our PyData workshop with 80 participants, we decided to develop a stack that provisions a dedicated, zero-setup environment for each participant. We opted for a cloud-based environment because this allowed us to design the structure. In this blog post, I share our experiences and provide the source code, which you could use for your own workshop.
This post is structured as follows. First, the challenge is described and the main requirements are motivated. Then, the main design choices and stack are explained. Finally, suggestions for improvement are given.
Schedule Docker containers with ECS and EC2 to provide participants with dedicated workshop environments. The Task Placement Constraints are essential for scheduling the tasks on the right EC2 instance, which leads to a simple architecture. You can use our GitHub repo as a starting point.
Our workshop covers the usage of Apache Airflow for machine learning projects. Airflow is a framework that lets you programmatically author, schedule and monitor workflows. Participants quickly learn how to define workflows and schedule some simple machine learning pipelines. Airflow comes with a webservice and web interface which are both needed during the workshop.
Since programming is part of the workshop, Jupyter is used as a browser-based text editor. Jupyter runs a webservice, which is set up in parallel to the Airflow webservice.
Workshop in the Cloud
We now know what we need to deploy: Apache Airflow and Jupyter. A large-scale cloud deployment demands additional considerations compared to working on your local laptop:
- No matter what participants do, it shouldn’t affect other participants. This means that an isolated environment (i.e. container) is needed for every participant.
- Containers need to automatically restart in case of a crash (fault tolerance).
- Work must be preserved if a container crashes (persistent storage).
- The architecture should work for many participants (scalability).
The workshop leader must be able to reliably deploy, destroy and monitor the AWS environment. For this we use Terraform and the Python boto3 library. Only managed services are used to avoid spending time on software development.
The technical implementation ideally consists of simply starting the needed containers for each participant, without worrying about provisioning servers. The use of managed services such as Fargate or Elastic Beanstalk would fit such an approach. But the cloud-related considerations force us to look for more configurable AWS tools:
- Fault tolerance: ECS for restarting containers
- Persistent storage: EC2 as a container host
- Scalability & isolation: EC2 and ECS for replicating services and tasks
Figure 1 shows the high-level architecture that we ended up with, using the following components and responsibilities:
- ECS Tasks: define containers and IAM roles
- ECS Services: manage ECS Tasks
- EC2 instance: host containers and store participant work
- RDS: database needed for Apache Airflow
- ALB: expose containers to the internet
- S3: keep static files needed in the assignments
A dedicated t3.xlarge instance (ECS optimized) is provided per participant. This large instance ensures that Airflow and Jupyter spawn quickly inside the container. This is important, since ECS will kill the container if the health check fails too often.
Airflow and Jupyter exchange files via a mounted folder on the instance storage. The folder is mounted using a Docker volume. This approach assumes that the EC2 instance is not stopped and started or terminated during the workshop.
The main reason to choose EC2 over Fargate is that the latter (currently) does not offer storage that outlives the container. The workshop needs persistent storage, since a crash of the Jupyter or Airflow application could otherwise cause saved assignments to be lost. Persistent storage options are instance storage, EFS or EBS. We chose instance storage for simplicity, since EFS and EBS require additional configuration.
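As a rough illustration of the shared folder, the sketch below builds a task definition in which both containers mount the same host directory. The image names, memory values, and paths are hypothetical placeholders. The dictionary only mirrors the shape that boto3's `register_task_definition` accepts, and is not necessarily how the repo defines it.

```python
def build_task_definition(participant: int,
                          host_dir: str = "/home/ec2-user/workshop") -> dict:
    """Return task definition arguments with one host folder shared by both containers.

    All names, images, and sizes are hypothetical placeholders.
    """
    volume_name = "workshop-files"

    def mount() -> dict:
        # Each container gets its own mount point entry for the same volume.
        return {"sourceVolume": volume_name, "containerPath": "/workshop"}

    return {
        "family": f"workshop-participant-{participant}",
        "networkMode": "awsvpc",
        # The volume maps to a directory on the EC2 instance storage.
        "volumes": [{"name": volume_name, "host": {"sourcePath": host_dir}}],
        "containerDefinitions": [
            {"name": "jupyter", "image": "example/jupyter:latest",
             "memory": 2048, "mountPoints": [mount()]},
            {"name": "airflow", "image": "example/airflow:latest",
             "memory": 4096, "mountPoints": [mount()]},
        ],
    }


task_definition = build_task_definition(participant=1)
```

Because the files live on the host rather than in the container, a restarted container sees the same folder again, as long as it lands on the same instance.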
Elastic Container Service
ECS is a managed AWS service that allows you to run containers. ECS controls how containers are scheduled, deployed and placed in the network. The Task Definition contains the specifications of the container to be run, such as the Docker image and environment variables. The ECS Service starts the task and exposes the container to the load balancer.
Since we use EC2 instance storage, it is crucial that the containers are started on the right instance. For example, the containers for participant 1 must be started on the EC2 instance for participant 1. This can be done with placement constraints, which tell ECS on which EC2 instance to schedule the task. As a consequence, each participant has a dedicated task, and since there is one task per ECS service, each participant also has their own service.
The ECS Tasks start containers with the awsvpc networking mode, to enable exposing ports to the load balancer. This networking mode requires starting the containers in a private subnet. The current version has no NAT Gateway, so the containers do not have internet access. A load balancer is needed to expose the containers to the participants.
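A placement constraint of this kind could be sketched as follows. It assumes each EC2 instance has been registered with a custom attribute named `participant`; that attribute name, the cluster name, and the service naming scheme are hypothetical. The dictionary follows the shape of boto3's `create_service` arguments, with networking and load balancer settings left out for brevity.

```python
def placement_constraint(participant: int) -> dict:
    """Pin a task to the instance whose custom attribute matches the participant."""
    return {
        "type": "memberOf",
        # Cluster query language expression on a custom instance attribute
        # (the attribute name "participant" is an assumption).
        "expression": f"attribute:participant == {participant}",
    }


def service_arguments(participant: int, cluster: str = "workshop") -> dict:
    """Return create_service arguments for one participant (hypothetical names)."""
    return {
        "cluster": cluster,
        "serviceName": f"participant-{participant}",
        "taskDefinition": f"workshop-participant-{participant}",
        "desiredCount": 1,  # one task per service, as described above
        "placementConstraints": [placement_constraint(participant)],
    }


arguments = service_arguments(participant=1)
```

With valid credentials, these arguments could be passed on as `boto3.client("ecs").create_service(**arguments)` once the network configuration is added.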
Application Load Balancer
The ALB distributes incoming internet traffic to the running webservices, referred to as targets. The ALB provides access via a URL or DNS name. It also monitors the health of the targets. Since each participant has a webservice for both Airflow and Jupyter, the number of targets is twice the number of participants: 160 for our workshop!
This large number of targets challenges the scalability of this approach. AWS resource limits bound the number of targets that you may place in an ALB. The hard limit of 50 listeners per load balancer requires us to use multiple load balancers. We chose to place 20 participants per ALB, which amounts to 40 listeners each, and to expose the webservices under dedicated ports. For example, participant 1 would have ports 8001 (Jupyter) and 9001 (Airflow), participant 2 would have ports 8002 and 9002, etc.
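The port scheme can be written down as a small helper. This is a sketch only: it assumes that port numbers keep counting up across load balancers (so participant 21 gets 8021 and 9021 on the second ALB), which the text above does not specify.

```python
PARTICIPANTS_PER_ALB = 20
JUPYTER_BASE_PORT = 8000
AIRFLOW_BASE_PORT = 9000


def endpoints(participant: int) -> dict:
    """Map a 1-based participant number to a load balancer index and two ports."""
    if participant < 1:
        raise ValueError("participant numbers start at 1")
    return {
        # Zero-based index of the ALB that serves this participant.
        "alb_index": (participant - 1) // PARTICIPANTS_PER_ALB,
        "jupyter_port": JUPYTER_BASE_PORT + participant,
        "airflow_port": AIRFLOW_BASE_PORT + participant,
    }


# Two listeners per participant must stay under the limit of 50 per ALB.
listeners_per_alb = 2 * PARTICIPANTS_PER_ALB
```

For 80 participants this yields four load balancers with 40 listeners each.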
Terraform is an infrastructure as code (IaC) tool that allows you to manage your infrastructure via configuration. Terraform supports AWS, Postgres and many other kinds of resources. All resources in this project are managed using Terraform. Deploying the large number of resources we needed all at once proved nearly impossible. For example, a notorious RPC error kept popping up when refreshing the Terraform state. We worked around these problems by splitting the project into smaller deployments.
Terraform modules are used to separate shared and user resources. The shared module contains the VPC, networking and the shared RDS instance. The user module contains all resources that must be unique to a participant's workshop environment. This module is thus called once for every participant.
First, the shared resources are deployed, and their state is stored remotely. During successive terraform apply calls, the participants' resources are created, 10 at a time. The state for each group of 10 participants is also stored remotely. The participants' resources obtain details about the shared resources via the remote state.
One last challenge needs to be met to make our approach work: the large number of module calls. Calling the user module for each participant is only possible by explicitly writing out the Terraform code, because Terraform does not support looping over modules. Since we want to keep the code flexible, we used Jinja2 templates to generate the Terraform scripts that call the user module.
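Splitting the participants into groups of 10 could look like the sketch below. The function name is hypothetical; each returned group would correspond to one deployment with its own remote state.

```python
def participant_batches(n_participants: int, batch_size: int = 10) -> list:
    """Split participant numbers 1..n into consecutive groups of batch_size."""
    numbers = list(range(1, n_participants + 1))
    return [numbers[i:i + batch_size]
            for i in range(0, len(numbers), batch_size)]


batches = participant_batches(80)
```

For our 80 participants this gives 8 groups, so 8 participant deployments follow the shared one.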
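Generating the module calls with Jinja2 could look roughly like this. The module path, the `participant` variable, and the naming scheme are hypothetical and may differ from the templates in the repo.

```python
from jinja2 import Template

# One module block per participant; names and variables are placeholders.
MODULE_TEMPLATE = Template(
    """{% for p in participants %}
module "user_{{ p }}" {
  source      = "./modules/user"
  participant = {{ p }}
}
{% endfor %}"""
)


def render_user_modules(participants) -> str:
    """Render Terraform code that calls the user module once per participant."""
    return MODULE_TEMPLATE.render(participants=list(participants))


terraform_code = render_user_modules([1, 2, 3])
```

The rendered text can be written to a `.tf` file per group of participants before running terraform apply.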
AWS can provide your workshop with dedicated environments for a large number of participants. ECS and EC2 make it possible to create a scalable setup that can recover when crashes occur. Using the ECS Task Placement Constraints proves essential in creating a simple architecture. Terraform has difficulty managing a large number of resources, but a workaround via Python with Jinja2 proves reliable.
The Terraform project is available on GitHub. The readme provides instructions on how to deploy the stack. If you have any questions about how to use or modify it, please reach out.
- Optimize health checks in the target groups by giving Airflow more time to spawn and migrate the database. This may allow for the use of lighter EC2 instances.
- Check in advance whether your venue's Wi-Fi allows access to servers on ports other than 80 and 443.
- The architecture heavily relies on ALBs to route the users to their Docker containers. A reverse proxy that has service discovery can make this stack simpler.
About the author
Dick Abma is a data engineer at BigData Republic, an expert consultancy firm in data science & engineering. He specializes in business applications for the industry. If you are interested in applying data engineering for your business, feel free to contact us at email@example.com.