From notebook hell to container heaven, Part II. I moved my notebook code to a container. It works on my machine… What about everywhere else?
This article is the second chapter of a three-part tutorial:
- Part I: From notebook hell to container heaven
- Part II: I moved my notebook code to a container. It works on my machine… What about everywhere else?
- Part III: coming soon…
- The GitHub repository can be found here: https://github.com/datamindedbe/webinar-containers
Recap of Part I
In the first part of this tutorial, we looked at the advantages and shortcomings of using Jupyter notebooks for building data and ML pipelines: https://medium.com/@pierre.borckmans/from-notebook-hell-to-container-heaven-6d84ca7c44bd.
We looked at a concrete use case, the Titanic survival prediction: starting from a simple notebook, we explained how to leverage containers to make our code cleaner, more portable and more reusable.
We ended up with a Docker image that can be run on our local machine. We made sure to fix some of the pitfalls of the notebook approach along the way:
- we got rid of hard-coded configuration
- we made sure no secrets were hardcoded in our code
- we made sure our container could run in different environments and under different circumstances.
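As a reminder, that last point means configuration and secrets are injected at run time rather than baked into the image. A minimal sketch of such a local run; the image name, bucket and variable names below are illustrative placeholders, not the exact values from Part I:

```shell
# Sketch of running the container locally with runtime configuration.
# Image name, bucket and variable names are illustrative placeholders.
IMAGE="titanic-predictor:latest"
S3_BUCKET="s3://my-titanic-bucket"

if command -v docker >/dev/null 2>&1; then
  # Secrets are forwarded from the caller's environment (-e without a
  # value), so nothing sensitive is baked into the image.
  docker run --rm \
    -e "INPUT_PATH=${S3_BUCKET}/input/titanic.csv" \
    -e "OUTPUT_PATH=${S3_BUCKET}/output/predictions.csv" \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    "$IMAGE"
fi
echo "Configured image: $IMAGE"
```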
Shipping our container to the cloud
In this second part of our tutorial, we are going to address the obvious next step. How do you get from the famous “it works on my machine” to “it works everywhere”? This is, after all, the promise of containers…
To illustrate that, we are going to show how our container can run in the cloud. We will look at the following items:
- 🚀 Container runtimes and their vast ecosystem
- ☁️ Options for running containers in the cloud
- ⚡ Deploying our use case on AWS Fargate
Container runtimes and their vast ecosystem
To containerize our Titanic prediction tool and embrace the container ecosystem, three main ingredients are needed:
- Build: we need a tool that lets us declare what goes into the container so it can run our application.
- Store: we need a place to host our images so that our runtime can fetch those images and run them as containers. This is typically called a container registry, and there are quite a few of them out there.
- Run: we need a container runtime to run our freshly built images. These come in many different flavours as well.
So far, we have been using Docker on our machine to build our image, host it and run it as a container. All of this happened on our poor laptop 💪.
The nice thing about containers is that there are standards that define what an image should look like, the corresponding binary format to encode it, and how to run it. This means we are not tied to a particular builder, container registry nor to a container runtime. The Open Container Initiative (OCI) defines those standards.
In the wild, there are many options for the 3 steps we mentioned before:
- Build: BuildKit is probably the most well-known, as it is also used by Docker. But there are many contenders in this category (Podman, Kaniko, Buildah, …), some of them sharing components under the hood.
- Store: container registries come in all shapes and flavours. Docker Hub is by far the most popular, but many platforms offer their own (GitHub, AWS, Azure, …) and there are also self-hosted open-source options (Portus, Quay, …).
- Run: when it comes to running a container, there is also plenty of choice: containerd is probably the most widely used, but others like CRI-O are gaining popularity.
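Thanks to the OCI standards, the image we built in Part I is not tied to Docker: any OCI-compliant front-end can run it unchanged. A small sketch; the image name is a placeholder, and the loop simply picks whichever runtime happens to be installed:

```shell
# Placeholder image name; the same OCI image works with every runtime below.
IMAGE="titanic-predictor:latest"

# Pick whichever OCI-compliant front-end is installed and run the image;
# only the command name changes, never the image itself.
for runtime in docker podman nerdctl; do
  if command -v "$runtime" >/dev/null 2>&1; then
    echo "Running $IMAGE with $runtime"
    "$runtime" run --rm "$IMAGE"
    break
  fi
done
```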
Options to run containers in the cloud or on-premises
Now that we understand our options for building, storing and running containers, let’s have a look at our options in the cloud. Building and storing images are quite straightforward, and cloud vendors usually offer only one container registry, so no big question there: AWS Elastic Container Registry (ECR), Azure Container Registry (ACR), Google Container Registry.
When it comes to running our containers, however, it can be a bit overwhelming to look at the landscape; there are many options covering various use cases. Each cloud vendor offers a range of services, and picking the right one depends on your needs:
- Are you willing to pay a bit more for a fully managed experience? (Do you care where your containers run, and do you need direct access to them?)
- Do you care about how much time it takes for the container to start and become available?
- Are you planning to run a few containers or a gazillion of them? (Yes, Kubernetes, I’m looking at you.)
- …
Using AWS ECR, ECS and Fargate to run our Titanic use case
For our use case today, we will focus on:
- AWS ECR to store our image
- AWS ECS (Elastic Container Service) with Fargate to run our container; this combination offers a fully managed experience and is not too painful to set up (although not entirely trivial, as we will see).
Storing our image on AWS ECR
The first step is to create an ECR repository. Once we have it, we can build our image and publish it there.
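The steps look roughly like this from the command line; the account ID, region and repository name are placeholders to substitute with your own values:

```shell
# Placeholders: substitute your own AWS account ID, region and repository name.
AWS_ACCOUNT_ID="123456789012"
AWS_REGION="eu-west-1"
REPO="titanic-predictor"
REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
IMAGE_URI="${REGISTRY}/${REPO}:latest"

# The commands only execute when the AWS CLI and Docker are available.
if command -v aws >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
  # Create the repository, authenticate Docker against ECR, then push.
  aws ecr create-repository --repository-name "$REPO" --region "$AWS_REGION"
  aws ecr get-login-password --region "$AWS_REGION" |
    docker login --username AWS --password-stdin "$REGISTRY"
  docker build -t "$REPO" .
  docker tag "$REPO:latest" "$IMAGE_URI"
  docker push "$IMAGE_URI"
fi
echo "Image published as: $IMAGE_URI"
```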
Creating an ECS cluster, a task definition and running a task on Fargate
Now that our image is available on ECR, it is time to provision ECS to run the container on Fargate. The main concepts we need in ECS are the following:
- Cluster: a cluster is a logical grouping for our different tasks. Nothing special to configure here; we just give it a name.
- Task definition: this is how we define where our container runs and which IAM roles it should use. In our case, we configure it to use Fargate.
- Container definition: this is how we define how to run our container; the CPU/Memory resources, the command to run in our container, the environment variables, the image to use, …
Once we have created our cluster, task definition and container definition, we can trigger a run of our task directly from the console. The Titanic prediction container comes to life on Fargate, does its duties, and then terminates.
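The same one-off run can be triggered from the AWS CLI instead of the console. A sketch, assuming the cluster and task definition already exist; all identifiers below are placeholders:

```shell
# All identifiers are placeholders; the task definition is assumed to be
# registered already with the Fargate launch type.
CLUSTER="titanic-cluster"
TASK_DEF="titanic-task"
SUBNET="subnet-0123abcd"   # needs a route to ECR to pull the image
SG="sg-0123abcd"

if command -v aws >/dev/null 2>&1; then
  # Launch a single Fargate task; it runs to completion and terminates.
  aws ecs run-task \
    --cluster "$CLUSTER" \
    --launch-type FARGATE \
    --task-definition "$TASK_DEF" \
    --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[$SG],assignPublicIp=ENABLED}"
fi
echo "Triggered a one-off run of $TASK_DEF on cluster $CLUSTER"
```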
Checking the results and the logs of our Titanic prediction container
We just successfully ran our container in the cloud 🚀… or did we?
Let’s check the logs of our container and the actual output of the predictor.
- Logs: we configured our container definition to send the logs to AWS CloudWatch; we can therefore find them there, and they look correct indeed! 🎉
- Results: we configured the container to read from and write to a specific AWS S3 bucket; browsing it, we can have a look at the CSV file that was produced. It contains survival predictions, as expected 🎉
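Both checks can also be done from the command line. A sketch, assuming the AWS CLI v2 is configured; the log group and bucket names are placeholders:

```shell
# Placeholders: the log group comes from the container definition, the
# bucket from the container's configuration.
LOG_GROUP="/ecs/titanic-task"
BUCKET="s3://my-titanic-bucket"

if command -v aws >/dev/null 2>&1; then
  # Tail the container logs stored in CloudWatch (AWS CLI v2).
  aws logs tail "$LOG_GROUP" --since 1h
  # Download the CSV predictions the container wrote to S3.
  aws s3 cp "$BUCKET/output/predictions.csv" ./predictions.csv
fi
echo "Logs in $LOG_GROUP, results in $BUCKET/output/predictions.csv"
```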
But… that was not the entire story
Hmm… OK, I took a few shortcuts in the previous section to get to the results faster. In reality, we had to provision a few pieces of infrastructure to achieve what we just saw.
Without going into too much detail, all of this can be created in the AWS Console, clicking your way through the different service wizards.
But we opted for a different approach: infrastructure-as-code. We used Terraform (from HashiCorp) to bring up the infrastructure pieces in a repeatable way.
We defined the various resources needed for our project (IAM roles, S3 bucket, CloudWatch log group, …) in Terraform.
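A sketch of what such Terraform definitions look like; the resource names and values below are illustrative, the real definitions live in the repository:

```hcl
# Illustrative Terraform sketch; names and values are placeholders.

resource "aws_s3_bucket" "titanic" {
  bucket = "my-titanic-bucket"
}

resource "aws_cloudwatch_log_group" "titanic" {
  name              = "/ecs/titanic-task"
  retention_in_days = 14
}

# Execution role that lets ECS tasks pull the image and write logs.
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task_execution" {
  name               = "titanic-task-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}
```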
Wrap-up: What just happened?
To deploy our Titanic survival predictor container to AWS Fargate, we had to:
- 🏗️ Provision the AWS infrastructure (ECR, IAM, S3, ECS, CloudWatch) using Terraform
- 📦 Build and publish our image to ECR
- 🚀 Run our container as a Fargate ECS task
- 🪵 Check the logs in CloudWatch
- 📋 Check the predictor outputs in S3
As mentioned above, you can check out the source code here: https://github.com/datamindedbe/webinar-containers.
Accelerate your journey with Datafy
If you like what you see, you should definitely check out our product Datafy: https://www.datafy.cloud, which makes it super easy to containerise, deploy and orchestrate all your workloads at scale. As mentioned at the start of this tutorial, we believe notebooks have a place in the data product lifecycle, and that’s why, starting in Q4 2021, we offer support for notebooks as well.
🚀 With a single command, Datafy does everything we did in this article!
- 🏗️ ECR repo provisioned automatically
- 📦 Image built and published to ECR
- 🔑 Service accounts provisioned automatically
- ☸️ Container running on managed K8s cluster
- 📋 Logs/Metrics collected in user friendly UI
- 📺 Live logs in the terminal
… and much more!