Do your EDA notebooks use containerized deployment?

Photo by Ian Taylor on Unsplash

Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data with the help of summary statistics and graphical representations. The most common discoveries concern the completeness of the data, its patterns, and its anomalies. EDA is sometimes also used to test hypotheses and assumptions made about the dataset.

Why is it important to organize EDA efforts?

Solving a complex machine-learning problem requires a massive amount of effort in data investigation, and that effort produces a lot of code to summarize and preprocess the data. You probably want this process to be as portable as possible. In other words, it should run as many times as you like, even on different machines, whether on-premises or in the cloud.

Unfortunately, code that works fine on a local machine in a customized virtual environment often fails at runtime elsewhere. A common cause is that different versions of the libraries are installed on the host machine.
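Version drift of this kind can be detected before a notebook is run. The following is a minimal sketch, not a definitive tool: the helper names `installed_version` and `find_mismatches` are hypothetical, and the package versions in the usage note are made up for illustration. It compares the versions you expect against what the current environment actually provides.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Compare expected versions against what the environment provides.

    expected maps a package name to the version string you developed against.
    lookup is any callable returning the version found (or None).
    Returns {package: (expected, found)} for every package that differs.
    """
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

Calling `find_mismatches({"pandas": "1.5.3"})` on the target machine returns an empty dictionary when the environment matches, and otherwise lists each package with the expected and the found version. The `lookup` argument exists only so the comparison can be exercised without touching the real environment.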

To deal with this problem, a combination of Docker, Terraform, and GitHub Actions can be used to complete the CI/CD loop.

Let’s summarize the supporting tools

Docker

Docker is an open-source platform that allows you to develop, build, ship, run, and manage isolated applications. The principle is to package the written code together with all the context needed to run it, namely the dependencies and their versions. When you wrap your application with all its context, you build a Docker image, which can be saved in your local repository, on Docker Hub, or in GCP Artifact Registry.

To get started with Docker, please check this documentation.

GitHub Actions

GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that helps you automate your build, test, and deployment pipeline. Workflows are triggered when an event occurs in your repository, so you can build and test every code change or deploy merged pull requests to production. A workflow contains one or more jobs. Jobs run in parallel by default and can be made to run sequentially by declaring dependencies between them.

To get started with GitHub Actions, please check this documentation.

Terraform

Terraform is an open-source Infrastructure as Code (IaC) tool that lets you define both cloud and on-premises resources in configuration files that you can version, reuse, and share. A consistent workflow, run for example from GitHub Actions, is then used to provision and manage all of your infrastructure throughout its lifecycle, as defined in the human-readable configuration files. Terraform can manage low-level components like containers, compute, storage, and networking resources, as well as high-level details like DNS entries and SaaS features.

To get started with Terraform on GCP, please check this documentation.

Google Cloud Platform (GCP)

We will use GCP as the example platform for deploying containerized notebooks. The main services required for this deployment are Vertex AI managed notebooks or user-managed notebooks, and access to Google Artifact Registry.

Finally, a service account for the GitHub Actions workflow will be required, with appropriate permissions to access all of the GCP services listed above.

Containerize and deploy

The success of a containerized deployment depends on how well the files are organized and how consistently the workflows are written.

First of all, prepare a list of the notebooks that will contain the different EDA modules. This will also help you analyze the dependencies required to run them. Next, add the identified dependencies to a requirements.txt file, which will let us install all the Python modules needed to make the EDA work. The best way to write requirements.txt is to pin each dependency to a specific version, so that the file produces the same environment every time it is used in the Dockerfile.

sample requirements.txt file
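Whether every dependency really is pinned can be checked automatically before the image is built. The sketch below assumes pins are written with `==`, and its simple parsing ignores edge cases such as URL requirements. The function name `unpinned_requirements` and the sample package versions are hypothetical.

```python
def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Hypothetical file content, for illustration only.
SAMPLE = """\
# EDA dependencies
pandas==1.5.3
matplotlib>=3.6
seaborn
"""

if __name__ == "__main__":
    for requirement in unpinned_requirements(SAMPLE):
        print(f"not pinned: {requirement}")
```

A check like this could run as an early step of the CI workflow, so that an unpinned dependency fails the build instead of surfacing later as a runtime error inside the notebook.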

Write the Dockerfile, starting with the base image, which can be a pre-built image or a custom image. Then add the container instructions, for example installing the Python modules and copying the EDA notebooks into the image.

a sample Dockerfile
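The three steps described above (choose a base image, install the Python modules, copy the notebooks) can be sketched as a small template rendered from Python. Everything specific here is an assumption: the base image is passed in by the caller, and the `notebooks` source directory and the `/home/jupyter/eda` target path are hypothetical placeholders to adapt to your own project and base image.

```python
DOCKERFILE_TEMPLATE = """\
FROM {base_image}
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
COPY {notebook_dir}/ {target_dir}/
"""


def render_dockerfile(base_image, notebook_dir="notebooks",
                      target_dir="/home/jupyter/eda"):
    """Fill the template: base image, dependency install, notebook copy."""
    return DOCKERFILE_TEMPLATE.format(
        base_image=base_image,
        notebook_dir=notebook_dir,
        target_dir=target_dir,
    )


if __name__ == "__main__":
    # "example-base:latest" is a placeholder, not a real image name.
    print(render_dockerfile("example-base:latest"))
```

Copying requirements.txt and installing from it before copying the notebooks lets Docker reuse the cached dependency layer when only the notebooks change.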

Create the Terraform modules and environment scripts with the configuration to create either user-managed notebooks or Google-managed notebooks with a custom container image, along with the essential virtual machine and network configuration.

a sample user-managed notebook instance terraform script
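Terraform also reads configuration written in its JSON syntax from files ending in `.tf.json`, so the shape of a user-managed notebook definition can be sketched from Python. This is an illustration under assumptions, not the configuration used in this article: it uses the `google_notebooks_instance` resource from the first reference with its `container_image` block, and the instance name, zone, machine type, and image repository shown are hypothetical values. Network settings are omitted.

```python
import json


def notebook_instance_config(name, zone, machine_type,
                             image_repository, image_tag):
    """Build a Terraform JSON document for one user-managed notebook."""
    return {
        "resource": {
            "google_notebooks_instance": {
                "eda": {  # local Terraform label, hypothetical
                    "name": name,
                    "location": zone,
                    "machine_type": machine_type,
                    # Custom container image pulled from the registry.
                    "container_image": {
                        "repository": image_repository,
                        "tag": image_tag,
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    config = notebook_instance_config(
        name="eda-notebook",  # hypothetical values throughout
        zone="us-central1-a",
        machine_type="n1-standard-4",
        image_repository="us-central1-docker.pkg.dev/my-project/eda/eda-notebooks",
        image_tag="latest",
    )
    print(json.dumps(config, indent=2))
```

Writing the printed JSON to a file such as `main.tf.json` would make it available to `terraform plan`. Check the provider documentation for the arguments your provider version requires before relying on this.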

Write the GitHub Actions workflow to push the container image to Google Artifact Registry. Lastly, add the Terraform workflow that provisions the resources from the configuration file. That file defines the Terraform module for a Google notebooks runtime (Google-managed notebooks) or a Google notebooks instance (user-managed notebooks), using the Docker image from Artifact Registry.

a sample terraform github actions workflow
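For the push step, the image has to be tagged with its full Artifact Registry path, which follows the pattern `LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE:TAG`. The sketch below builds that name and the two Docker commands a workflow step would run. It only constructs the commands and does not execute them. The function names and the project, repository, and tag values are hypothetical.

```python
def artifact_registry_image(location, project_id, repository, image, tag):
    """Return the full Artifact Registry name for a Docker image."""
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


def build_and_push_commands(image_uri, context="."):
    """Return the docker commands that build and then push the image."""
    return [
        ["docker", "build", "-t", image_uri, context],
        ["docker", "push", image_uri],
    ]


if __name__ == "__main__":
    uri = artifact_registry_image(
        location="us-central1",  # hypothetical values throughout
        project_id="my-project",
        repository="eda",
        image="eda-notebooks",
        tag="abc123",
    )
    for command in build_and_push_commands(uri):
        print(" ".join(command))
```

Using the commit SHA as the tag, as assumed here, is one way to let the Terraform configuration reference exactly the image built from a given commit.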

One important thing to note here: Google-managed notebooks come with an automatic idle shutdown option.

A similar effort will be required if the same EDA needs to be deployed as a Streamlit app. Of course, additional GCP services will be required for that.

Enjoy Exploring!

References

  1. https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/notebooks_instance
  2. https://github.com/marketplace/actions/hashicorp-setup-terraform
