Delivering ML Products Efficiently: The Single-Node Machine Learning Workflow

Sam Cohan
Published in Udemy Tech Blog · Feb 17, 2021

Introduction

Distributed machine learning is all the rage these days, but here is a dirty secret: sometimes you may be better off without it! If you are responsible for delivering ML products for your organization, you should definitely keep single-node machine learning in your toolset. As vertical scaling becomes more and more affordable, a much larger subset of business problems can feasibly be solved with this approach. Done the right way, single-node machine learning can help you deliver products on aggressive timelines without sacrificing model or system quality. In this blog post, we will discuss how data scientist familiarity and affordable vertical scaling combine to make single-node machine learning an effective way to deliver ML products. We will also present the high-level components required to successfully apply this approach in the real world.

The What and Why of Single-Node Machine Learning

As the name implies, single-node machine learning refers to machine learning that uses algorithms designed to scale well on a single machine (a.k.a. vertical scaling). Contrast this with distributed machine learning, which uses algorithms designed to scale without bound across a network of machines (a.k.a. horizontal scaling).

Now, you might think to yourself: if we can use algorithms or tools that scale horizontally without bound, why would we ever want to use ones that can only scale well vertically? It turns out there are some very practical and compelling reasons to consider the latter. To start, most data scientists are well trained in libraries and tools that are designed to be maximally fast and efficient on a single node, so they can usually be much more productive with these tools. These algorithms and tools are also generally more efficient and less complex than their distributed counterparts. Combined with the fact that vertical scaling has become much more affordable, this makes single-node machine learning a heavy hitter in delivering ML products.

Let us dig a bit deeper into the claims made in the above paragraph.

Most data scientists are very comfortable doing research and development on their local machine using common single-node libraries such as scikit-learn, XGBoost, and even deep learning libraries like TensorFlow or PyTorch. Hopefully, this is not a very controversial claim, since these libraries are the de facto industry standard for most data scientists at this point, and Python laptop development is how most data scientists start their journey. An important note here is that the phrase “single-node libraries” should by no means be interpreted to mean “not parallelized”. In fact, most of these libraries implement sophisticated algorithms optimized under the hood in compiled languages like C and C++ to make efficient use of all the compute resources of a single node; they are just not designed to natively scale horizontally across multiple distributed nodes.
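
To make the “parallelized, but single-node” point concrete, here is a minimal sketch (synthetic data and illustrative hyper-parameters, not production code) of how libraries like scikit-learn and XGBoost spread work across every core of one machine:

```python
# Minimal sketch: single-node libraries already parallelize across all cores
# of one machine through their compiled backends. Data and hyper-parameters
# are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# n_jobs=-1 fans the tree building out over every available core.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)

# XGBoost's histogram tree method is likewise multi-threaded in C++.
xgb = XGBClassifier(tree_method="hist", n_jobs=-1)
xgb.fit(X, y)
```

Both models train entirely on one node, yet the heavy lifting happens in multi-threaded compiled code rather than in Python itself.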

Generally speaking, distributed algorithms and systems are inherently complex and hard to master, and the subtleties they introduce make development and debugging difficult. We suspect this may be a slightly more controversial claim because, in recent years, technologies like Spark have become more and more user-friendly, requiring less esoteric tweaking from the end user to work efficiently. However, while we concede that the barrier to entry for distributed systems has been lowered, it is still significantly more difficult to reason about and debug issues when things go wrong. As such, it is probably fair to say that as long as a problem can be solved without resorting to the complexities of distributed systems, that is preferable.

There exists a large subset of business problems that can be solved using single-node machine learning workflows. Of course, there are some web-scale data problems that can only be addressed with distributed algorithms and systems. However, as argued by Xavier Amatriain, “Most practical applications of Big Data can fit into a (multicore) implementation.” While horizontal-scaling technologies like Spark or MapReduce are intended to simplify scaling challenges, the growing on-demand availability of high-capacity, cost-effective cloud computing has made vertical scaling a viable option. In recent years, the compute capacity of single nodes has increased significantly while their cost has dropped dramatically. At the time of writing this article, you could rent a machine with 96 cores and 768 GB of memory from AWS for as low as $8.46 an hour (and perhaps lower if you commit to certain usage and/or make use of spot instances). What is more impressive is that you get billed by the second, so it is extremely cost-efficient to make use of such instances. With this amount of memory and compute power on demand, you can efficiently build models that train on tens or even hundreds of millions of examples.
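
As a rough sanity check on that last claim, a back-of-envelope calculation (the numbers are illustrative, not a benchmark) shows how comfortably a dataset of that size fits on a 768 GB node:

```python
# Back-of-envelope check with illustrative numbers: does a 100-million-row
# tabular dataset fit comfortably on a 768 GB node?
n_rows = 100_000_000
n_features = 100
bytes_per_value = 4  # float32

dataset_gb = n_rows * n_features * bytes_per_value / 1024**3
print(f"raw feature matrix: ~{dataset_gb:.0f} GB")  # ~37 GB

# Even allowing for a few working copies during preprocessing plus the model
# itself, that leaves a very wide margin on a 768 GB instance.
```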

Components

There are three essential components needed for effectively implementing single-node machine learning in practice. In fact, these components are not specific to single-node machine learning; they are applicable to any workflow that is aimed at delivering ML products in a timely manner and with limited resources. You will notice that all the components are about opinionated approaches and standardization. This makes sense because if we want to deliver products fast and with a high-quality bar, we have to efficiently apply best practices regardless of whether the person delivering the product has a strong engineering background. We will discuss each of the three components in the sub-sections below:

Component 1: Standard Cloud-Based Development Environment

As mentioned previously, most data scientists feel comfortable with local laptop development. However, there are many drawbacks to fully embracing this approach, some of which are listed below:

  • The setup process will typically end up being manual and time-consuming.
  • Laptops are not convenient for long-running jobs, as they are frequently turned off.
  • Laptop compute power is not very cost-effective and does not scale.
  • There are security and regulatory risks associated with allowing code and user data to rest on a machine that is not physically secured.

By having a cloud-based development environment that can be instantiated using common infrastructure-as-code approaches (e.g. Terraform or AWS CloudFormation), we are able to efficiently roll out a consistent experience across our entire user base. At Udemy, we use SageMaker managed Jupyter notebooks, which deliver an experience very similar to local laptop development without any of its drawbacks.
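
In our case this definition lives in infrastructure-as-code, but an equivalent boto3 sketch (the instance name, role ARN, repository URL, and lifecycle configuration below are all placeholders) makes the moving parts of a per-scientist notebook instance explicit:

```python
# Illustrative boto3 sketch of a per-data-scientist notebook instance; in
# practice this would be expressed in Terraform or CloudFormation. All names,
# ARNs, and URLs below are placeholders.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_notebook_instance(
    NotebookInstanceName="ds-jane-doe",  # one small instance per data scientist
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerNotebookRole",
    VolumeSizeInGB=50,
    DefaultCodeRepository="https://github.com/example-org/data-pipelines",
    LifecycleConfigName="auto-stop-after-idle",  # shuts the instance down when idle
)
```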

Component 2: Standard Coding Practices and Architecture

As discussed previously, our data scientists come from a broad range of disciplines and therefore have varying levels of experience with writing and maintaining production-level code. However, what they all have in common is a willingness to learn and the desire to improve. To help data scientists become more self-sufficient in delivering projects from research to production, we are making a significant investment in enablement. These efforts can be broadly organized into the following sub-components:

  • Documentation and training on engineering best practices: everything from a comprehensive “pull request guide” to “coding best practices” material distilled from our PR review process.
  • Template projects for common use cases such as regression and classification on tabular data, text processing, and clustering. These projects are structured in a way that encourages modularization and requires the explicit definition of library dependencies for execution.
  • Reusable software components (a.k.a. utilities) for common tasks like caching, formatting, multi-processing, and DataFrame transformations (a hypothetical sketch of such a utility follows this list).
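
As a purely hypothetical illustration of what such a utility might look like (this is not our actual internal library), consider a small disk-caching helper plus a multi-process DataFrame apply:

```python
# Hypothetical utility sketch (not the actual Udemy library): disk caching for
# expensive loads and a multi-process apply over DataFrame chunks.
import multiprocessing as mp

import numpy as np
import pandas as pd
from joblib import Memory

memory = Memory("/tmp/ml_cache", verbose=0)

@memory.cache  # expensive pulls/transforms run once, then hit the disk cache
def load_features(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)

def parallel_apply(df: pd.DataFrame, chunk_func, n_workers: int = mp.cpu_count()):
    """Apply chunk_func to row chunks of df across all cores and re-assemble."""
    chunks = np.array_split(df, n_workers)
    with mp.Pool(n_workers) as pool:
        return pd.concat(pool.map(chunk_func, chunks))
```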

Of course, this is not always a smooth or straightforward process, but we recognize that by making this upfront investment in training and nurturing our data scientists, we not only benefit in the delivery of production projects, but also end up with better tools to power our research and exploration.

Component 3: Standard Execution Environment

One of the most effective ways to gain efficiency in the productionization of ML products is to ensure that the development and production environments mirror each other closely, and that they are independent across projects. The former is important to avoid surprises when moving code from development to production, and the latter is important to keep changes in one project from affecting other projects.

This can be achieved by containerizing the execution environment. Containerization technologies like Docker provide a way to define a controlled environment with deterministic versions of all installed software. At Udemy, we use SageMaker managed instances for running our single-node workflows. The SageMaker Python SDK has high-level abstractions for launching managed instances for various tasks, ranging from generic processing to multi-node hyper-parameter optimization. Under the hood, these abstractions make use of Docker images with baseline installations of common machine learning libraries. You can, of course, provide your own custom Docker definition for these tasks if you wish. For our purposes, we have developed deployment tools that let us launch and customize the base instances on the fly, pip-installing the requirements of the specific project before executing its entry point.
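
A minimal sketch of that pattern looks roughly like the following (the project path, bucket, role ARN, instance type, and framework version are placeholders); because the source directory contains a requirements.txt, the SageMaker framework container pip-installs the project's dependencies before executing the entry point:

```python
# Rough sketch of launching a single large managed instance with the SageMaker
# Python SDK. Paths, ARNs, and versions are placeholders.
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",             # the project's entry point
    source_dir="projects/churn_model",  # contains requirements.txt for pip install
    framework_version="1.2-1",
    instance_type="ml.m5.24xlarge",     # one large node instead of a cluster
    instance_count=1,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)

estimator.fit({"train": "s3://example-bucket/churn/train/"})
```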

Workflow and Architecture Diagrams

By now, hopefully, you have a sense of the motivations behind, and the components of, our single-node workflow. So without further ado, here are the actual steps in the workflow:

1. Log into AWS dashboard and start your Jupyter notebook instance.
* Each data scientist gets their own small notebook instance with a clone of our data-pipelines repo which contains all our machine-learning projects including template projects. Instances are set to shut down after a few hours of inactivity to save costs.

2. Exploratory data analysis: use the various pandas connectors for accessing and transforming samples of the data at rest in S3 (a short sketch follows this step).
* For queries that take a long time or require many intermediate steps that would not fit into memory, make use of the Hive connector to create and schedule Hive ETL jobs that adhere to best practices (the details of this could be a whole different blog post!).
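
A sketch of this exploratory pattern might look like the following (the bucket, partition, and column names are placeholders; reading s3:// paths from pandas requires s3fs and pyarrow):

```python
# Illustrative EDA snippet: pull a sample of the data at rest in S3 straight
# into pandas for interactive exploration. Paths and columns are placeholders.
import pandas as pd

sample = pd.read_parquet(
    "s3://example-bucket/events/dt=2021-02-01/",
    columns=["user_id", "course_id", "event_type", "ts"],
)

print(sample.shape)
print(sample.groupby("event_type").size().sort_values(ascending=False).head(10))
```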

3. Exploratory model analysis: use template project notebooks and utilities as starting point to iterate on various models and approaches:
* Use the notebook as an IDE to experiment.
* Clean up the code and use the %%writefile magic to write modules to file (see the example after this step).
* Debug locally on sampled data (interactive).
* Use deployment scripts to launch a larger remote node from the notebook on the full data (5+ min turn-around).
* Tune performance and inspect results until happy.
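
For the clean-up step, a notebook cell along these lines (the module path and feature logic are illustrative) persists experiment code as an importable module:

```python
%%writefile projects/churn_model/features.py
# Written from the notebook via the %%writefile cell magic; the module path
# and feature logic are illustrative placeholders, not production code.
import pandas as pd

def add_engagement_count(df: pd.DataFrame) -> pd.DataFrame:
    """Add a simple per-user event count as a model feature."""
    out = df.copy()
    out["n_events"] = out.groupby("user_id")["user_id"].transform("count")
    return out
```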

4. Create a pull request to merge changes for production:
* Before raising the PR, do a final cleanup of the code and make sure it adheres to best coding practices with detailed docstrings and tests.
* A more senior engineer will review the project and contribute back to the coding best practices doc if necessary.

5. Schedule code for regular runs in production.

The figures below show the basic architecture of the systems that enable this workflow. Note that the production workflow is completely segregated from the development workflow and can only run code that has been properly vetted and merged into the master branch.

Conclusions

In this article, we discussed the importance of single-node machine learning workflows in delivering ML products quickly and efficiently. We laid out an overview of the main components required to achieve this goal and walked through the workflow steps. Here at Udemy, we believe in using the right tool for the right job, and we have been able to use this workflow to significantly speed up the delivery of many impactful projects. Of course, there are projects whose data and compute requirements are not suitable for this workflow. In future articles, we will present the workflows we have built around Spark to handle use cases that involve streaming analytics and web-scale data.
