TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Notes from Industry

Machine Learning Scaling Options for Every Team

11 min read · Jul 27, 2021


Photo by Kamil Pietrzak on Unsplash

Machine learning has myriad applications, but integrating machine learning into a software application comes with substantial challenges! Machine learning (ML) algorithms are complex, can require large binary parameter files, and are computationally intensive. Additionally, there is often value in improving machine learning approaches as new data and technologies become available, which means that your approach for incorporating ML into your application must be flexible.

Most importantly, the skills needed to build and train great machine learning algorithms are not the same skills needed to build and scale great software (though there is overlap). To get the best results, ML practitioners need the ability to train and iterate on algorithms with a wide variety of tools purpose-built for that task. Software engineers aim to produce applications that are effortlessly deployed, scale effectively, and produce reliable results. These requirements mean that a team applying machine learning at scale must optimize against several competing objectives.

I’ve personally encountered many sides of this problem. When I was a software engineer working on an R&D team, I wanted to make sure that algorithms worked as well in production as they did in notebooks and scripts, and that integrating them with the product was cost effective. I thought a lot about concerns like automated testing, API definition, and scaling. As a data scientist, my focus was on training not just accurate but also generalizable models that would inform our users’ actions and that would be hard for our competitors to replicate. Now, I lead teams that are concerned with all of these goals and see how both sets of concerns can add complexity that might change the return on investment of a feature from “game-changing win” to “not worth it”.

Fortunately, there are several approaches to scalably integrate machine learning algorithms into applications. Deliberate technology choices can make it possible for teams with different skill mixes to build ML applications. This article will introduce a non-exhaustive list of approaches to scalably executing machine learning models.

First though, let’s consider an example service to put the upcoming solutions in context.

Image By Author. Conceptual machine learning service architecture.

The above diagram shows the flow of information between a few components of an application.

  • API endpoints used by other applications to trigger ML processing or retrieve predictions.
  • Application logic which determines how to handle the requests. This could be by retrieving cached predictions or by causing new ones to be generated.
  • Storage which may contain data needed by ML algorithms to make predictions, or which may hold the final calculation results.
  • ML processing which contains machine learning algorithms and some way to execute them. We’ll focus on this component here.

The actual connections between the shown components can vary from this arrangement. For example, an application could rely on preprocessed results and provide no path for an API to interactively trigger model processing.
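To make the flow above concrete, here is a minimal sketch of the application logic layer, assuming a hypothetical FastAPI service, a Redis cache standing in for the storage component, and a placeholder run_model function standing in for the ML processing component; none of these specific choices come from the diagram itself.

```python
# Minimal sketch of the application-logic layer from the diagram above.
# FastAPI, Redis, and run_model() are illustrative assumptions.
import json

import redis
from fastapi import FastAPI

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379)


def run_model(item_id: str) -> dict:
    """Placeholder for the ML processing component."""
    return {"item_id": item_id, "score": 0.5}


@app.get("/predictions/{item_id}")
def get_prediction(item_id: str):
    # Application logic: return a cached prediction if one exists...
    cached = cache.get(f"prediction:{item_id}")
    if cached is not None:
        return json.loads(cached)
    # ...otherwise trigger ML processing and store the result.
    result = run_model(item_id)
    cache.set(f"prediction:{item_id}", json.dumps(result))
    return result
```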

Even if the actual ML inference is lightweight enough to run on the same hardware as the rest of the application without creating bottlenecks, it is still worth considering machine learning’s requirements on their own, since there are some unique options for serving them. Additionally, several of the options outlined below could work for a variety of application scenarios. As we’ll see, there are quite a few ways to incorporate the ML processing component, some of which require minimal effort!

Use Someone Else’s Scaled Model

The easiest way to scale something is to use something that already scales! In this paradigm, all of the work to train, validate, and scale processing of the ML algorithm has been done by someone else and is available as a commodity.

It may be alluring to consider building a fresh, generalizable AI solution, but unless coming up with a better solution is part of your core value proposition, can you really beat the return on investment of buying someone else’s? Additionally, do you have enough time from talented data scientists and skilled engineers to both perform the R&D and build a scaling solution?

In many domains, top performing models are readily available as a commodity service. For example, image and speech recognition services are available from all of the major cloud providers. A solution that is near state of the art and already offered as a scalable service will often cost less than the investment required to build one with your own team. Solutions are not available for all domains, but the biggest cloud providers (AWS, GCP, Azure) collectively provide API-based machine learning for many areas.

To illustrate the value of going with a prebuilt solution, imagine that you want to build a feature for an image-sharing app that determines whether or not an image is of a food item. Amazon Rekognition can classify images with several food-related labels, so you could use the presence of such a label to solve your problem.

Rekognition’s cost at this time is $1.00 per 1000 images labeled. For a million images that’s $1000. Since interaction is through Amazon’s APIs, the ML portion of your application’s scaling needs is fairly light and can likely be handled by whatever means you choose to scale the rest of your application. This means ML-specific development is minimal. How many hours of data scientist time can be paid for with $1000? And how many hours of software engineering time can fit into that budget to scale whatever the data scientist comes up with? Most likely, not very many. A more fruitful use of both skill sets would be to evaluate how the results and methodology of the prebuilt solution fit into your application, then simply integrate calls to the highly scalable, prebuilt API.
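As a rough sketch of that integration, the snippet below uses boto3’s Rekognition client to label an image stored in S3 and checks for a food-related label. The bucket name, key, label set, and confidence threshold are illustrative choices, not a definitive recipe.

```python
# Sketch: ask Amazon Rekognition whether an image looks like food.
# Bucket, key, label set, and the 80% confidence threshold are placeholders.
import boto3

rekognition = boto3.client("rekognition")


def looks_like_food(bucket: str, key: str) -> bool:
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=20,
        MinConfidence=80,
    )
    labels = {label["Name"].lower() for label in response["Labels"]}
    # Rekognition returns labels such as "Food" or "Meal" for food images.
    return bool(labels & {"food", "meal", "dish"})


# Example usage:
# looks_like_food("my-image-bucket", "uploads/lunch.jpg")
```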

Machine Learning As a Service

After a prebuilt solution, the next most automated way to get machine learning into a product is to use “machine learning as a service” (MLaaS) providers, or perhaps more accurately, “machine learning engineering as a service” providers. The goal of this class of solutions is to minimize the heavy lifting of getting a machine learning algorithm from prototype to scalable solution.

These services typically provide a pipeline that helps data scientists take a model from R&D through to production availability. As an example, Databricks provides a managed MLflow environment. This service allows data scientists to train and validate models, choose the best performers, then automate deployment to a processing solution like Apache Spark or as a scalable API, all with minimal software engineering.
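As a small illustration of that workflow, the sketch below uses the open source MLflow APIs to track a run and register a model by name. The experiment and model names are made up, and a managed platform like Databricks would provide the tracking server, registry, and deployment targets behind these calls.

```python
# Sketch: track a training run and register the resulting model with MLflow.
# Experiment and model names are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("demo-classifier-experiments")
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering the model makes it visible to downstream deployment tooling
    # (for example, a managed serving endpoint).
    mlflow.sklearn.log_model(model, "model", registered_model_name="demo-classifier")
```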

The main advantage of this class of solutions is a high degree of automation of machine learning engineering tasks, along with simplified implementation of important data science practices like data versioning and model validation. Compared to using a prebuilt solution, design choices still exist around data format and deployment; but compared to the rest of the solutions in this article, the amount of code and configuration needed to get to a functioning machine learning application is reduced.

The major cloud providers each have their own versions of MLaaS, but there are alternatives in this space as well. Besides Databricks, other paid providers include H2O.ai and DataRobot. It is worth looking around for a solution that works well in your context before implementing the research-to-deployment pipeline yourself.

Batch ML

In some organizations the line between data science, data engineering, and machine learning engineering may be blurred, with the same team serving all three functions. Data engineering teams excel at building systems that process large volumes of data at regular intervals.

Mature, flexible, general purpose tools exist to enable scalable batch processing. If an application involves regularly processing large sets of examples, and event-driven or interactive responses aren’t required from the machine learning processes, then these tools can treat machine learning as just another type of data transformation and scalably generate results.

One example of a batch processing tool that works well for orchestrating machine learning workflows is the open source Apache Airflow. This Python framework treats code as configuration, allowing developers to string together Python operations in a directed acyclic graph (DAG). Each node in the graph is a different processing step, and individual task executions are farmed out to autoscaled processing workers via a queuing system (see Queued Processing below). Integrating ML processing in this scheduled way also allows integration with additional batch processing resources, such as Spark clusters, which may have high spin-up latency.
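A nightly batch scoring workflow in Airflow might look like the minimal sketch below. The task functions and names are hypothetical, and exact operator import paths vary between Airflow versions.

```python
# Sketch of a nightly batch-scoring DAG. Task bodies are placeholders, and
# operator import paths may differ across Airflow versions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_new_records(**context):
    ...  # pull the latest batch of records from storage


def score_records(**context):
    ...  # load the model and generate predictions


def write_predictions(**context):
    ...  # persist predictions for the application to read


with DAG(
    dag_id="nightly_batch_scoring",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_new_records)
    score = PythonOperator(task_id="score", python_callable=score_records)
    write = PythonOperator(task_id="write", python_callable=write_predictions)

    extract >> score >> write  # run the steps in order
```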

A major component of batch processing systems is the ability to schedule regular executions of predefined workflows, often with extra functionality available to deal with the common use case of periodically updated inputs. However, systems like Airflow also offer event driven capabilities with workflows being triggered via a REST API.
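For instance, assuming Airflow 2’s stable REST API is enabled and the hypothetical DAG sketched above exists, an upstream service could trigger a run on demand along these lines; the host, credentials, and payload are placeholders.

```python
# Sketch: trigger a DAG run on demand through Airflow's REST API.
# Host, credentials, and the conf payload are placeholders.
import requests

response = requests.post(
    "https://airflow.example.com/api/v1/dags/nightly_batch_scoring/dagRuns",
    auth=("api_user", "api_password"),
    json={"conf": {"requested_by": "upload-service"}},
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```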

Many batch processing frameworks exist beyond Airflow, with AWS, GCP, and Azure also offering solutions. As always, it’s worth looking for a vendor-managed deployment to minimize operational overhead. For example, Google offers Airflow via Cloud Composer.

Serverless ML

Consider the process of setting up a traditional cloud based web service and scaling it. You’ll need to decide on a method to manage high volumes of requests (like Queued Processing below), write API code to provide an interface to your service, and manage computing instances with the right code and dependencies deployed on them.

The serverless paradigm has emerged to simplify all of these decisions by offering “functions as a service.” In other words, engineers and data scientists only need to concentrate on what is specifically required to run a coded function that handles model training or prediction: designing the API interface, implementing the logic, and packaging the dependencies. Given those basics, a serverless framework provides an API endpoint and scales resources to handle requests to that endpoint. By allowing ML application developers to skip several operationalization steps, deployment can happen more quickly and scaling can be achieved more reliably, without requiring the skills and time to configure a more hands-on solution.

That said, serverless frameworks do currently have drawbacks that are especially relevant to machine learning applications. Since computing resources are being provisioned dynamically on hardware the developer has little control over, there may be lag time in serving a request as new resources spin up. Also, serverless frameworks tend to limit the size of dependencies that can be packaged alongside a function. With machine learning models often relying on numerous dependencies including data and model parameters, this can meaningfully reduce the number of techniques at a data scientist’s disposal. Finally, not all varieties of hardware, programming languages, or dependencies may be supported by the serverless framework. All of these are potentially roadblocks in the diverse and rapidly evolving world of machine learning.

As a concrete example of limitations, AWS Lambda function layers are currently restricted to 250 MB, meaning your total dependencies must fit in that much space. That rules out, for instance, using the latest version of Keras + TensorFlow. It also means that model parameter files, which can easily exceed that size, would have to be lazy-loaded at runtime from other storage, such as S3. However, Lambda and other serverless frameworks allow the use of containerized applications or quota increases, which loosens some of these restrictions.
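One common pattern for working within those limits is to keep the deployment package small and lazy-load model parameters from S3 on the first invocation, roughly as sketched below. The bucket, key, and pickled model format are assumptions for illustration.

```python
# Sketch of an AWS Lambda handler that lazy-loads model parameters from S3
# on cold start. Bucket, key, and the pickled model are illustrative.
import json
import pickle

import boto3

s3 = boto3.client("s3")
_model = None  # cached across warm invocations of the same container


def _load_model():
    global _model
    if _model is None:
        obj = s3.get_object(Bucket="my-model-bucket", Key="models/latest.pkl")
        _model = pickle.loads(obj["Body"].read())
    return _model


def handler(event, context):
    model = _load_model()
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features])[0]
    return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}
```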

Despite limitations, many machine learning tasks work within the bounds of serverless frameworks, so this scaling technology deserves consideration.

Queued Processing

A widely practiced means of scaling is a compute cluster listening for tasks on a queue. In this paradigm, a request for processing is placed on the queue. Worker processes poll the queue for jobs and process them as their resources become available. This allows a pool of custom task processors to crunch your ML processing jobs.

Image By Author

As you can guess, this method allows computation for machine learning applications to scale, with prediction or training performed by a cluster of workers. However, though overall throughput can be high, this method of scaling has a few drawbacks. By default, queue-based systems are asynchronous, and queues are typically a one-way communication mechanism. If synchronous behavior must be emulated, it needs to be implemented with a mechanism such as callbacks.

Designing a queue-based system is also not a “batteries included” exercise. There are a wide variety of task queue implementations available, but the systems to handle requests, maintain the queue, and listen for and process requests on workers must all be designed and implemented. Cluster scaling policies and hardware requirements need to be considered. Doing all of this successfully requires time from seasoned engineers with system configuration knowledge, a skill set far removed from the scientific process of designing machine learning solutions. Additionally, any self-managed scaling solution will inevitably need ongoing maintenance to keep it secure and able to handle changing workloads.

It’s worth considering which components of a queue-based scaling solution are available off the shelf to minimize implementation time. Several task queuing solutions exist; if Python is your cup of tea, Celery is a popular option. It’s also worth reading up on what scaling and load balancing solutions are available from your compute provider. Containerization is a natural fit for implementing and scaling these systems on services like Google Kubernetes Engine.
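If Celery fits your stack, the worker side of such a system can be as small as the sketch below. The broker URL and the placeholder model are assumptions; you would still need to design the queue infrastructure, scaling policies, and the client side around it.

```python
# Sketch of a Celery worker task for ML inference. The broker/backend URLs
# and the placeholder model are stand-ins for your real components.
from celery import Celery

app = Celery(
    "ml_tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


class PlaceholderModel:
    """Stands in for a real trained model loaded once per worker process."""

    def predict(self, rows):
        return [sum(row) for row in rows]


_model = PlaceholderModel()


@app.task
def predict(features):
    # Workers poll the queue and run inference as capacity becomes available.
    return _model.predict([features])[0]


# A client elsewhere enqueues work asynchronously:
#     result = predict.delay([1.2, 3.4, 5.6])
# and can block on the result backend if synchronous behavior must be emulated:
#     prediction = result.get(timeout=30)
```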

Despite the upfront work, the advantage remains that a queue-based system can be configured to meet most processing needs. Many of the approaches listed above are ultimately abstractions over queuing systems. Need a GPU worker node that stores a multi-gigabyte deep learning model in memory and uses the latest libraries in your favorite language? If you’ve got the know-how, you can use a queue-based cluster to manage the load.

ML In The Browser

All of the solutions discussed so far have focused on backend methods to run a machine learning model. In-browser machine learning is also an option. TensorFlow, the Google Brain neural network library, now has a JavaScript version, TensorFlow.js. This means that neural nets can train and predict on a user’s device, and that engineers familiar with front end technologies can take part in building the system.

One obvious advantage to this approach is that backend computation resources are no longer a bottleneck. Additionally, inference can happen without any network round trips. However, that means that the model design must now be mindful of the computational resources on end users’ machines. Also, while it is possible to serialize stored model parameters for use in a user’s browser, in practice this limits the model to a size that does not overtax the user’s network connection and hardware.
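If you go down this path with a Keras model, the parameters can be serialized into a browser-loadable format with the tensorflowjs converter package, roughly as sketched below; the model architecture and output path are placeholders, and the size concerns above still apply.

```python
# Sketch: export a small Keras model to TensorFlow.js format so the browser
# can load it (e.g. with tf.loadLayersModel). Model and paths are placeholders.
import tensorflow as tf
import tensorflowjs as tfjs

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# ...train the model here...

# Writes model.json plus binary weight shards to the output directory.
tfjs.converters.save_keras_model(model, "web_model/")
```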

Finally, as far as I know, the only major JavaScript machine learning library so far is TensorFlow.js. While neural networks are powerful, they still may not be suited to every task ML practitioners can think of, which limits applications. If neural nets aren’t what you need but browser-side ML still seems like a great idea, there are a few math libraries available, like numeric and numbers.js, which could simplify development.

How To Choose

Clearly, there are options for managing your machine learning needs. However, there isn’t a single solution that meets all cases. I’ve attempted to present the choices in roughly increasing order of complexity, starting with integrating an existing scaled solution and ending with implementing custom queue-based scaling. This is the hierarchy I would consider when choosing how to incorporate ML into a product’s backend. Running client-side models is also an interesting newer option.

To choose among the options, consider your use case. Does your application require user interaction with machine learning? If so, batch machine learning might not work, but scaling one of the other options to achieve acceptable response times, or using in-browser machine learning, might be a good fit.

Also consider your timeline and team. The more of the solution that is managed by an outside provider, the lower the effort, and the narrower the skill set and expertise required to start seeing benefits in your application. For example, most data scientists and developers should be able to call an existing REST API to run analytics. On the other hand, setting up a queue-based ML processing workflow requires architecture, system deployment, and software development skills along with data science!

Written by Ryan Feather

Director of Data Science @ GridX | Sustainability Nerd | ex-Shopify
