Torus: A Toolkit For Docker-First Data Science

Applying DevOps best practices to machine learning projects

TL;DR

At Manifold, we developed internal tools for easily spinning up Docker-based development environments for machine learning projects. We are open-sourcing them as part of an evolving toolkit we are releasing called Torus (scroll to the bottom for a definition of the name). The Torus 1.0 package contains a Python cookiecutter template and a public Docker image. The goal of Torus is to help data science teams adopt Docker and apply Development Operations (DevOps) best practices to streamline machine learning delivery pipelines.

Docker-First Data Science

It’s no secret that interest in Artificial Intelligence (AI), and specifically Machine Learning (ML), is growing exponentially across all industries. As more engineers enter this popular space, there is a noticeable lack of de facto standards and frameworks for how work should be done. A focus on data scientist productivity and on optimizing the ML delivery pipeline is just starting to gain momentum. We need to take what the DevOps movement did for software engineering and apply it to the ML delivery pipeline. This obviously needs a catchy name, so let’s call it…MachOps?

Machop: A bluish-gray Pokémon with large arm muscles. Can be found in the wild building open-source ML productivity tools for data scientists and engineers.

New platforms and libraries seem to be popping up daily, and it definitely still feels very “Wild-West-y”. This makes it difficult for data scientists to share work, and it is confusing for new engineers who are just getting started. The dust hasn’t settled yet and we have no idea what will stick, but one thing we are very confident about is:

Docker will play a major role in the Machine Learning development lifecycle standard.

The benefits of using Docker to streamline the development and deployment of software products are widely accepted. I have yet to meet someone who thinks Docker is a bad thing. However, what we have heard from the data science community (and what we discussed internally at Manifold) is generally:

I know Docker will make this easier, but I don’t have the time or resources to set it up and figure that all out.

To help teams start their Docker journey for ML projects, we are open-sourcing tools that began as an internal project called Torus. We have seen productivity gains across all of our projects that switched to a Docker-first workflow, and we hope this will help other teams as well. The plan is to continuously add to the open-source Torus toolkit as we build, experiment, and find success with new methods and workflows.

This all sounds good, so how do I get started?

This initial release of Torus is focused on providing a hassle-free way to get a fully configured, Dockerized local environment set up so you can hit the ground running. We wanted to build a set of tools that make it dead simple for teams to spin up new ready-to-go development environments and move to a Docker-first workflow. So what’s in the box?

Torus 1.0

Using the project cookiecutter and Docker image together, you can go from cold steel to working in a Jupyter notebook on a new project, with all of the common libraries available to you, in less than five minutes (and without pip installing anything). After instantiating a new project with the cookiecutter template and running a single start command, your local development setup will look like this:

Fully configured out-of-the-box Dockerized local development setup for data science projects.
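
In practice, getting there looks roughly like the following sketch. The template URL and start script name here are assumptions; check the Torus README for the real ones:

```bash
# Install cookiecutter on the host -- the only local Python dependency
pip install cookiecutter

# Scaffold a new project from the Torus template
# (URL is an assumption; use the one in the Torus README)
cookiecutter https://github.com/manifoldai/docker-cookiecutter-data-science

# From inside the new project, run the single start command
# (script name is an assumption; it wraps the Docker setup)
cd my-new-project && ./start.sh
```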

You can use your favorite browser and IDE locally, as you normally would, with the benefit of knowing your runtime environment is 100% consistent across your team and all your environments. Also, if you are working on multiple projects on your machine, you have the peace of mind of knowing each project is running in its own cleanly isolated container. Let’s dive a little deeper into what’s happening here:

  1. The ML base development image is pulled down to your local machine from Docker Hub. It has many of the commonly used data science and ML libraries pre-installed, along with a Jupyter notebook server with useful extensions installed and configured.
  2. A container is launched from the base image and configured to mount your top-level project directory as a shared volume on the container. This lets you use your preferred IDE on your host machine to modify code and see changes reflected immediately in the runtime environment (see the Compose sketch after this list).
  3. Port forwarding is set up so you can use a browser on your host machine to work with the notebook server running inside the container. An appropriate host port to forward is chosen dynamically, so you don’t have to worry about port conflicts (e.g., other notebook servers, databases, or anything else running on your laptop).
  4. The project is scaffolded with its own Dockerfile, so you can install any project-specific packages or libraries and share your environment with the team via source control (a Dockerfile sketch follows further below).
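
Under the hood, each of these behaviors maps onto ordinary Docker Compose configuration. The sketch below is illustrative only; the service name, mount path, and exact contents are assumptions rather than the files the template actually generates:

```yaml
# Illustrative docker-compose.yml for a Torus-style project.
# Service name, mount path, and layout are assumptions, not the
# exact output of the cookiecutter template.
version: "3"
services:
  notebook:
    build: .                # the project Dockerfile extends the ML base image
    volumes:
      - .:/mnt/project      # mount the repo so host edits appear in the container
    ports:
      - "8888"              # container port only: Docker picks a free host port
```

Listing only the container port is what makes the dynamic host port possible: Docker assigns a free one at startup, and `docker-compose port notebook 8888` will tell you which port was chosen.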

Like the original cookiecutter-data-science, we are opinionated in certain areas to provide guardrails for teams that are just getting started with Docker for ML projects. We iterated on a few setups before landing on a project structure and workflow that felt best for us. Try it out! We would love to hear your feedback, or, even better, see some pull requests.
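
To make the Dockerfile point concrete: the generated project only needs to extend the shared base image and layer on its own dependencies, so the full runtime environment lives in source control alongside the code. A minimal sketch, assuming a hypothetical base image name:

```dockerfile
# Illustrative project Dockerfile. The base image name and tag are
# assumptions; use whatever image your generated project references.
FROM manifoldai/docker-ml-dev:latest

# Layer project-specific dependencies on top of the shared base image.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Since every teammate builds from the same Dockerfile, “works on my machine” issues reduce to rebuilding the image.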

What does this actually look like?

Video walkthrough for starting a new project using Torus

So how did we get here?

Today, the work of data scientists has grown well beyond the scope of the research and development sandbox: it is being embedded directly into existing products, or shipped as new standalone products. With these practical applications come all of the requirements associated with building and delivering any software product in a reliable way.

It is a non-trivial task ensuring your work survives the delivery process.

This isn’t a new problem by any means. A community effort to solve it several years back is what we now think of as DevOps. The core idea behind DevOps is removing the “wall” between development and operations to drive increased efficiency and improve product quality. Specifically, the problem was that engineering work from development teams would get “thrown over the wall” to the operations team to serve in production with little to no context. This led to massive amounts of wasted time and to unreliable products, with bugs introduced very late in the delivery cycle. Fingers started pointing, customers waited longer, and it wasn’t fun for anyone. The result was a set of new tools and processes that help teams implement streamlined delivery pipelines, guarantee development/production parity, and effectively remove the wall.

This problem has reared its head again in the machine learning space, and it is only getting worse as demand for AI products continues to grow. How does DevOps change with the rise of data science teams inside engineering organizations? What does dev/prod parity mean for ML projects? How do ML projects hook into the existing deployment pipeline? The pain points we are seeing in the community today feel familiar, but they also have aspects unique to ML development. One way we think of this is as a new “wall” that is killing productivity:

Data science teams that operate in a silo face one of two situations:

  1. They have an internal customer: another engineering team that needs to reliably take their work to production.
  2. They have an external customer, and need to work directly with operations (or become their own operations team) to deliver their product or solution.

Regardless of which category applies, data scientists are becoming more involved in the delivery pipeline, and one thing seems to be true:

The demand for “full-stack” data scientists is increasing.

A Machine Learning Engineer (MLE) at Manifold is someone who sits explicitly at the intersection of data science and software engineering. Just as new tools and frameworks were developed as part of the DevOps movement, a toolkit needs to be built around MLEs so they can succeed in removing the data science delivery “wall” we see today.

Traditional software companies have benefited from an influx of new tools and best practices born of the recent attention on DevOps. Simply put, these solutions answer the question: “How do I take the work I have done on my laptop and serve it up in production in a reliable, repeatable, and efficient way?” At a high level, MLEs face the same set of challenges as any software engineer working on a product development team:

  1. Standardized local development environments
  2. Development vs. production environment parity
  3. Standardized packaging and deployment pipelines

In addition, certain aspects of the ML development workflow present a different set of challenges to engineers and data scientists. MLEs typically face the following as well:

  1. Easily sharing development environments and intermediate results for conducting reproducible experiments
  2. Coordinating isolated project environments running multiple notebook servers
  3. Easily scaling vertically and horizontally to handle large datasets or to leverage additional compute resources (e.g., for deep learning or optimization)

By taking the DevOps mentality and looking through an ML development lens, we can identify several new areas along the delivery path that need improving. There is an opportunity to build new tools and best practices that specifically empower the MLE community to deliver more robust solutions in less time.

Why Docker?

We firmly believe that a Docker-first approach to the entire ML development lifecycle will do two things:

  1. Solve many of today’s local development efficiency issues.
  2. Set your team up to benefit from new container-based tools and frameworks in the future.

Docker images running in containers provide an easy way to guarantee a consistent runtime environment across developer laptops, remote compute clusters, and production environments. The same consistency can be achieved with careful use of virtual environments and disciplined system-level configuration management, but containers still offer a significant advantage in the spin-up/down time of new environments and in developer productivity. By moving to a Docker-first workflow, MLEs gain significant downstream advantages in the development lifecycle: easy vertical and horizontal scaling for running workloads on large datasets, and easier deployment and delivery of models and prediction engines.

We’re just getting started!

There are several tools and features that we are currently working on and will be adding to Torus releases in the near future, for example:

  1. Easy spin up/down of remote compute for ad-hoc vertical scaling to circumvent memory and compute constraints on your laptop.
  2. Easy spin up/down of a remote cluster to leverage distributed processing with Dask for feature engineering and training on an ad-hoc basis.
  3. Integration with popular BYOC (Bring Your Own Container) platforms to easily leverage the features they offer for training and highly scalable deployments (e.g., AWS SageMaker).

Let us know if there are other use cases you would like to see added to the Torus toolkit!

There is a lot of exciting activity in the MLE toolkit space, and it’s easy to forget that before even considering a higher-order platform or framework, you need to do the necessary housekeeping to set your team up for success.

Get your house in order now and you will thank yourself later.

Moving to a Docker-first development workflow will make life easier for everyone involved in the delivery pipeline, including your customers. This is a critical first step that can’t be skipped, or you will be stuck paying off the technical debt once you are too far down the road.

You have to walk before you run…but did you remember your shoes? 🤔


What does Torus mean? (by Sourav Dey)

Why did we name the project Torus? Because a torus is a special manifold. What is a manifold, you ask? A manifold is a topological space that locally resembles Euclidean space near each point. The goal of machine learning is to learn the manifold upon which the data lies; that’s why we named our company “Manifold”. A torus is the manifold defined as the Cartesian product of two circles, S¹ × S¹. It is a compact 2-manifold of genus 1. Sound complicated? It’s not. A torus is more commonly known as a donut 😃

Mmm. Torus.

About Manifold

Manifold is an AI engineering services firm. We accelerate AI and data roadmaps to create business value and positively impact lives. We have experience in a range of areas of AI, including computer vision, natural language, signal processing, AI-at-the-edge, data science, and data engineering.

Check out our blog to see what else the Manifold team is working on!

Questions? Comments? Reach out at ang@manifold.com!