Azure Batch AI: A Data Scientist’s Quick Start for Productizing Machine Learning
How do you transition from training models locally to training models in production? Most data scientists are good at taking a lot of data, cleaning it up, selecting a machine learning algorithm, tuning the model's parameters, and displaying the predictions. All of this and more can be done easily on a local machine. A common pitfall for data scientists in industry is not training the most accurate model, but scaling a machine learning infrastructure for production. Finding the proper solution is hard and often requires data engineers, software engineers, and DevOps to become involved in infrastructure design. This is, of course, a time-consuming process, since we are orchestrating people rather than machines.
At Seismic, we wanted to start deploying machine learning models at scale, initially without too much dependence on other teams. Azure Batch AI is the quick start we are exploring for training models at scale. Azure Batch AI abstracts away the lifecycle management of compute clusters, meaning that we can spend more time tuning parameters and serving up predictions rather than worrying about how to execute jobs across multiple machines.
Understanding Your Data Pipeline
One of the unique aspects of data science at Seismic is that we are a fully multi-tenant environment, meaning each customer’s data is isolated. This also means that when we want to deploy a new machine learning model, for example, we are actually deploying huge numbers of them in parallel, one for each customer. The underlying infrastructure is also globally distributed for compliance reasons.
Colin Jemmott. Source: Data Science at Seismic
At Seismic, real-time data streams are collected, business objects (e.g. user activity, user information) are created from the data streams, and SQL databases house the proper business objects for our different services. As data scientists, we have access to data at each step in our ingestion pipeline, but we typically use data from the SQL databases for training. As Syed Sadat Nazrul mentions, when working with data from many companies it is wise to build models on a tenant-by-tenant basis in order to avoid accidental transfer learning that could produce information leaks. Furthermore, to ensure that personally identifiable information is protected, any stored intermediates from training (e.g. user-user similarity matrices for user recommendations) are hashed and securely stored. Still, when it comes to actually training models across our many tenants, it does not make sense for data scientists to train models locally for Seismic's growing customer base. We need a cloud solution that runs large parallel jobs (many models across many customers) without requiring data scientists to invest heavily in resource provisioning.
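The tenant-by-tenant pattern above can be sketched in a few lines of Python. This is purely illustrative: `load_tenant_data` and `train_recommender` are hypothetical stand-ins, not Seismic's actual code, and the hashing helper only shows the general idea of anonymizing identifiers before intermediates are stored.

```python
import hashlib

def anonymize_key(tenant_id: str, user_id: str) -> str:
    """Hash identifiers so stored training intermediates
    (e.g. similarity-matrix keys) never contain raw PII."""
    return hashlib.sha256(f"{tenant_id}:{user_id}".encode()).hexdigest()

def train_per_tenant(tenant_ids, load_tenant_data, train_recommender):
    """Train one isolated model per tenant so no data or learned
    signal crosses tenant boundaries."""
    models = {}
    for tenant_id in tenant_ids:
        data = load_tenant_data(tenant_id)      # each tenant's data stays isolated
        models[tenant_id] = train_recommender(data)
    return models
```

Each iteration of that loop is independent, which is exactly what makes it a natural fit for a parallel batch service: one job per tenant.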
Running Jobs with Azure Batch
Azure Batch is a service for running large-scale parallel and high-performance computing batch jobs. Some of the advantages of the service include:
- Azure manages the resources and monitors the jobs, so your models are up and running sooner, without you having to install cluster and job-scheduler software.
- You only pay for what you use. There are no additional charges for using Batch. You only pay for the VMs, storage (file share, blob, etc.), and networking.
- You really only pay for what you use. You can increase the number of nodes per cluster if you want to run several jobs in parallel. In our case, if we want to train a classification model for each of our customers once a day, we can do so in parallel and get the trained models out faster. And when the jobs are completed and no more are queued, Batch can autoscale your cluster down to 0 nodes.
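The scale-to-zero behavior is driven by an autoscale formula you attach to the Batch pool. A minimal sketch along the lines of Microsoft's documented examples follows; the 10-node cap is illustrative, and a real formula would typically average samples over a window rather than read the last sample. It sizes the pool to the number of pending tasks, capped at 10 dedicated nodes, and shrinks to 0 when the queue is empty, waiting for running tasks to finish before deallocating nodes.

```
$tasks = max(0, $PendingTasks.GetSample(1));
$TargetDedicatedNodes = min(10, $tasks);
$NodeDeallocationOption = taskcompletion;
```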
The AI in Batch AI
Azure Batch AI is built on top of Azure Batch, allowing data scientists to use the resource provisioning provided through Batch to train models at scale. Batch AI is particularly designed for training deep learning models at scale, with GPU support and support for deep learning frameworks (e.g. TensorFlow, Keras, and CNTK). But you can also deploy simpler models using a Docker image of your preference. In our case, we pulled a Python 3 Docker image from an image registry, installed the necessary packages for our training jobs, and used the Custom Toolkit settings to complete the container configuration. We created one cluster with several nodes to train our simple recommender in parallel for multiple customers, and autoscaled the cluster down to 0 nodes once the training jobs completed. As of July 2018, Azure Batch AI is in preview, and some issues we came across include:
- Credentials used for authentication are not obvious.
- No recipes using the Custom Toolkit. We used our own Docker image and worked out for ourselves which parameters were necessary for the settings.
- Job error and status logging is not surfaced intuitively.
- More clarification is needed on what the relative mount point is doing.
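For reference, a Custom Toolkit job definition ended up looking roughly like the sketch below. This is a hedged reconstruction, not an official recipe: the field names reflect the preview-era Batch AI job schema as we pieced it together, the image and command line are placeholders, and you should verify the exact schema against the current Batch AI documentation before relying on it.

```json
{
  "properties": {
    "nodeCount": 1,
    "containerSettings": {
      "imageSourceRegistry": { "image": "python:3.6" }
    },
    "customToolkitSettings": {
      "commandLine": "python train.py --tenant-id <tenant>"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external"
  }
}
```

Submitting one such job per tenant, against a single autoscaling cluster, is what let us train the recommender for many customers in parallel.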
For a deep dive into how to set up Batch AI for training within a multi-tenant environment, and some of the caveats we worked around, check out this blog.
Of course, Batch AI is not your only choice; it is not our only choice either.
If you are interested in serving up lightweight applications like this fake news detector, which classifies news URLs as fake or real (try articles from The Onion), you will probably not need cloud services at all. In Hamza Harkous's blog, he provides instructions for training and deploying models for lightweight applications using Flask and a WSGI server.
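To make the lightweight option concrete, here is a standard-library-only sketch of the idea: a WSGI application exposing a classifier at `POST /predict`. Flask gives you the same shape with far less plumbing; the `classify` function is a hypothetical stand-in for a real trained model, not the fake news detector's actual logic.

```python
import json
from wsgiref.simple_server import make_server

def classify(url: str) -> str:
    """Placeholder scorer; a real app would run a trained model here."""
    return "fake" if "onion" in url.lower() else "real"

def app(environ, start_response):
    """Minimal WSGI application: POST a JSON body {"url": ...} to /predict."""
    if environ.get("PATH_INFO") == "/predict" and environ.get("REQUEST_METHOD") == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = json.loads(environ["wsgi.input"].read(size) or b"{}")
        url = body.get("url", "")
        status, out = "200 OK", json.dumps({"url": url, "label": classify(url)}).encode()
    else:
        status, out = "404 Not Found", b"{}"
    start_response(status, [("Content-Type", "application/json")])
    return [out]

# To serve locally: make_server("", 8000, app).serve_forever()
```

Because `app` is a plain WSGI callable, any WSGI server (gunicorn, uWSGI, or the stdlib one above) can host it unchanged.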
If you are exploring other cloud-based services for deploying machine learning models at scale, consider leveraging AWS Lambda for model deployment, S3 for data and model storage, and AWS model frameworks like MXNet. Check out Sunil Mallya's blog on scaling predictions with AWS Lambda and MXNet if you are interested in using AWS as your cloud service.
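The Lambda route boils down to a handler function that Lambda invokes per request. The sketch below shows only that shape; the dummy linear scorer is hypothetical, and in a real deployment the model would be fetched from S3 (via boto3) at cold start and would be, say, an MXNet network.

```python
import json

# Placeholder for a model downloaded from S3 at cold start.
MODEL = {"bias": 0.1, "weight": 0.5}

def predict(features):
    """Dummy linear scorer standing in for real model inference."""
    return MODEL["bias"] + MODEL["weight"] * sum(features)

def lambda_handler(event, context):
    """Entry point AWS Lambda invokes; `event` carries the request payload."""
    features = event.get("features", [])
    return {"statusCode": 200,
            "body": json.dumps({"score": predict(features)})}
```

Loading the model at module level (outside the handler) matters on Lambda: warm invocations then reuse the loaded model instead of paying the load cost per request.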
Currently, we do have some of our production machine learning running in Kubernetes. Josh Lane wrote a fantastic blog about running machine learning models at scale on Azure. If you are considering Kubernetes for your cluster orchestration, I would highly recommend giving the blog a read. Here are some key takeaways:
- Batch AI is designed for running ML jobs realized as Docker container instances and abstracts away the lifecycle management of compute clusters. Kubernetes, by contrast, requires a lot of heavy lifting in terms of resource management and provisioning.
- Currently, Kubernetes on Azure lacks node autoscale support for AKS clusters, so cost-consumption becomes a major concern.
- Achieving a 99.9% SLA on Kubernetes requires more cost and effort (e.g. expensive premium storage), while Batch AI offers this out of the box.
At Seismic, we wanted to deploy machine learning at scale quickly and with minimal reliance on data engineers, software engineers, and DevOps. If you are a data scientist in a similar situation, you might want to investigate Azure Batch AI as your solution. Batch AI's quick start and recipes let you get parallel, at-scale training running fast. Check out how we set up Batch AI for our model training here. Note that Azure Batch AI was in preview at the time this blog was written. If you have any questions or thoughts on this topic, feel free to reach out in the comments below or through Twitter.