Machine Learning Platform to Train and Serve Models Completely Automatically

How Spoke deploys thousands of customized ML models!

Rajhans Samdani
Spoke Engineering
10 min read · Sep 18, 2018


An increasingly large number of products use Machine Learning (ML) to deliver a great user experience. With consumer-tech companies like Google and Netflix providing highly personalized predictions in the form of query suggestions, movie recommendations, and so on, we believe that users will come to expect a perfectly customized predictive experience in their workplace software as well. That being said, let’s consider what happens today in a “standard” ML training and deployment setup:

  1. Model training is done offline and the models are static. You collect all the task data a priori, train a model offline, verify it on validation data, and call the job done.
  2. A constant number of models are trained and served. The number of trained models is roughly equal to the number of distinct ML tasks (e.g. one model for sentiment detection, one model for object recognition, etc.), i.e. O(1) for all practical purposes.
  3. The deployment process is manual. Since the models do not change, you can manually start a server and then serve all the models for eternity.

This “train once and forget” methodology does not work for ML-based products which must keep their models up-to-date on the latest user data and add new ML models to their service whenever a new client signs up.

We have built an ML platform at Spoke that completely breaks free from this paradigm and has the following properties:

  1. Dynamic model training in real time from user interaction data so that model predictions have the latest context.
  2. Model auto-scaling i.e. model training and deployment happens automatically for each new client without any manual input.
  3. Capacity to automatically train and serve thousands of models, with multiple customized models per client.

Our ML platform is built entirely with open source tools and cloud services. The architecture of our platform is broadly applicable to anyone looking to automate their ML stack for serving up to a few hundred thousand clients.

The goal of this introductory blog post is to describe the architecture of Spoke’s ML platform from a 50-foot view. I will describe the various components and how they are connected, but will save the technical details of each component for future posts. This post is divided into two parts:

  1. To set up the background, I will describe an example of how Spoke uses ML to automatically triage workplace requests to the right teams. I will also briefly describe how we collect training data for this problem.
  2. Then I will dive into the heart of this post, describing our training and serving stack consisting of training data storage, model training, and serving. I will discuss what frameworks and libraries we use along with the reasoning behind choosing them.

One of the ways Spoke uses ML is in automatically triaging workplace requests to the right team in a company (see figure below). Assigning requests to teams, or team triaging, is a multi-class classification problem. Our clients use Spoke to support their own arbitrary sets of teams, so team triaging poses a separate multi-class classification problem for each client.

Figure 1: Team triaging choices for two different clients.

Team Classifier Spec

Here is a brief description of the input, output, and training data for the team classifier.

Input: Each incoming request to the team classifier can be represented as x=(text, metadata) where text is the string form of the request, and metadata is the context around the request.

Output: The team classifier for Client1 predicts a label from Client1’s customized set of teams (labels). These labels can be added or removed dynamically as the client creates or deletes teams, hence the team classifier must be trained dynamically, reacting to the latest client events.

Data generation for Team Triaging: We don’t require clients to provide any training examples for the team classifier before using Spoke. Spoke learns directly from client interactions and trains the model. We treat every event where a piece of text is associated with a team as training data. Examples include:

  1. The name, description, and keywords associated with a team are used as positive training examples for that team.
Figure 2: Team description is used as a positive training example for that team.

  2. When Spoke cannot triage a request to the right team or makes a mistake and routes it to the wrong team, the user can manually (re-)assign the request to the right team. This request is then used as a positive training example for the selected team.

Figure 3: When a request is assigned to a team, that request is used as a positive example for that team.

Our system logs all of these events. They are produced as a natural side effect of users interacting with our system and hence do not burden users with providing explicit training data.
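To make this concrete, here is a simplified sketch of how these two kinds of events can be turned into (x, y) training examples (the helper functions and field names are illustrative, not our exact schema):

```python
# Simplified sketch; helper functions and field names are illustrative.
def examples_from_team(team):
    """A team's own name, description, and keywords act as positive examples for it."""
    for text in [team["name"], team["description"]] + team.get("keywords", []):
        yield {"text": text, "metadata": {"source": "team_profile"}}, team["id"]


def example_from_assignment(request, assigned_team_id):
    """A request that a user (re-)assigns becomes a positive example for that team."""
    x = {"text": request["text"], "metadata": request.get("metadata", {})}
    return x, assigned_team_id
```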

We have created an ML platform from open source tools and cloud services that enables dynamic training of customized models. Our platform digests the latest training information and reflects it in our predictions within 15–30 minutes.

Overview

Here is a quick overview of the sequence of steps that take place in our system. Also, refer to Figure 4 below.

  1. A user interacts with Spoke’s ML servers, which generate training examples (x={text, metadata}, y=label).
  2. Spoke’s ML (decision) servers log this event using a logging library that exports the event to Elasticsearch.
  3. Elasticsearch logs and indexes the events.
  4. The ML Master job determines which clients are eligible for new training and queues all such client ids and associated information in the training queue. The jobs are picked up by worker machines that train new models on Google Cloud ML. Google Cloud ML reads training data from Elasticsearch and stores the models in GCS.
  5. TensorFlow Serving automatically picks up the latest version of the model from GCS and serves it within a few seconds.
  6. When a user interacts with Spoke, Spoke’s ML server queries the TensorFlow Serving server and the app database to predict the final team for the incoming query (see the sketch after Figure 4).
Figure 4: Spoke’s ML architecture for Team Triaging.
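To make step 6 concrete, here is a simplified sketch of the prediction call to TensorFlow Serving over gRPC (the host name, per-client model naming, and input signature are illustrative):

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("tf-serving.internal:8500")  # illustrative host
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)


def predict_team(client_id, text):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "client_{}".format(client_id)  # one model per client
    request.model_spec.signature_name = "serving_default"
    request.inputs["text"].CopyFrom(tf.make_tensor_proto([text]))
    # The response holds per-team scores; the final decision also consults the app DB.
    return stub.Predict(request, timeout=5.0)
```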

Now let us get into the details of each of these components.

Logging

We log all of our training events in a central data warehouse. To determine what data storage platform to use, we settled on the following requirements:

  1. Schema-less, so we can throw in more fields to log without having to update the schema every time. Since this is not our application DB, we have a higher tolerance for validation errors and can be looser with our schema.
  2. Indexed on as many fields as possible to enable different kinds of data analyses in the future — analyses that we have not yet foreseen.
  3. Returns training events matching any given criteria within a minute, if not seconds.
  4. Scales up to a few tens of thousands of events per training data generation query.

In hindsight, the most important of these criteria is 2 — you want your training log data store to be as flexible as possible because you cannot think of all the possible data analyses your data scientists will perform. We chose Elasticsearch as our training event warehouse as it meets all of the above criteria. Additionally, Elasticsearch + Kibana provides a fantastic way of browsing the events, which has been a boon for offline debugging and error analysis. For exporting our logs we used a Fluentd plugin in the past, but have now switched to a logger that connects directly to Elasticsearch.

… you want your training log data store to be as flexible as possible because you cannot think of all possible data analyses you will need to perform.
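For illustration, here is a simplified sketch of logging a training event directly to Elasticsearch with the official Python client (the index and field names are illustrative, not our exact schema):

```python
from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # illustrative host


def log_training_event(client_id, text, metadata, label):
    event = {
        "client_id": client_id,
        "text": text,
        "metadata": metadata,  # schema-less: new fields can be added at any time
        "label": label,
        "timestamp": datetime.utcnow().isoformat(),
    }
    # Dynamic mapping indexes every field, which keeps the store flexible for
    # analyses we have not yet foreseen.
    es.index(index="ml-training-events", body=event)
```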

Libraries/services used: Elasticsearch, Fluentd, python elastic logger.

Training

Classifier training is performed on Google Cloud ML and is orchestrated by a queueing architecture as follows.

  1. A dedicated machine (“master”) runs a process (as a daemon) that wakes up every few minutes, determines which client models need training, and enqueues training jobs for these models with the right parameters (a simplified sketch of this flow follows the list).
  2. The jobs are then picked up by “worker” machines that subscribe to the queue. These machines make an API call to Google Cloud ML Engine to run the training job. One alternative to calling Google Cloud ML is to run the training directly on the worker machines, which we are investigating at the moment.
  3. Our queueing infrastructure is implemented using Google Cloud Pub/Sub.
  4. Google Cloud ML Engine simply runs our training code, which uses the elasticsearch-dsl library to fetch the training logs (see the data-read sketch below) and trains a TensorFlow model on them. It then stores a timestamped model inside a fixed folder in Google Cloud Storage.
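Here is a simplified sketch of the master/worker flow described above (the project, topic, and job parameters are illustrative placeholders, not our production configuration):

```python
import json
import time

from google.cloud import pubsub_v1
from googleapiclient import discovery

PROJECT = "my-gcp-project"        # illustrative project / topic / bucket names
TOPIC = "model-training-jobs"


def enqueue_training_jobs(client_ids):
    """Master: publish a training job for each client that needs retraining."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)
    for client_id in client_ids:
        payload = json.dumps({"client_id": client_id}).encode("utf-8")
        publisher.publish(topic_path, data=payload)


def submit_cloud_ml_job(client_id):
    """Worker: hand a dequeued job to Google Cloud ML Engine."""
    ml = discovery.build("ml", "v1")
    job = {
        "jobId": "train_{}_{}".format(client_id, int(time.time())),
        "trainingInput": {
            "packageUris": ["gs://my-ml-bucket/trainer-0.1.tar.gz"],
            "pythonModule": "trainer.task",
            "args": ["--client-id", str(client_id)],
            "region": "us-central1",
        },
    }
    ml.projects().jobs().create(
        parent="projects/{}".format(PROJECT), body=job
    ).execute()
```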

Training algorithm and features: Our model training is a combination of online and batch training techniques using standard algorithms like logistic regression, MIRA, and text CNNs. I’m omitting the details of text processing and feature extraction because that is not the focus of this blog post (see here, here, and here if you’re interested.) I am also not going into the details of the modeling algorithm (perhaps another blog post). For now, let’s assume the modeling algorithm is a given. The important thing to know is that each team classifier has around 10–100k parameters per team label. So, e.g., if a client has 4 teams, then the model for that client will have 40–400k parameters.
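For completeness, here is a simplified sketch of the training job’s data read (step 4 in the list above) using elasticsearch-dsl; the index and field names are illustrative and match the logging sketch earlier:

```python
from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["http://localhost:9200"])  # illustrative host


def fetch_training_examples(client_id):
    # Assumes client_id is indexed as a keyword so the term filter matches exactly.
    search = Search(index="ml-training-events").filter("term", client_id=client_id)
    # scan() streams every matching event via the scroll API, which comfortably
    # covers the tens of thousands of events a training run may need.
    for hit in search.scan():
        yield {"text": hit.text, "metadata": hit.metadata.to_dict()}, hit.label
```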

Libraries/services used: TensorFlow, Google Cloud (Pub/Sub + ML Engine + Storage), elasticsearch-dsl

Serving

We use TensorFlow Serving for loading and serving our models. We chose TensorFlow Serving for four reasons:

  1. It automatically takes care of loading the latest version of a model and unloading old versions once you point it to the folder containing the model (see the config sketch after this list).
  2. It ensures zero downtime when replacing an old version with a new version.
  3. Reverting to an old version of a model is very easy.
  4. It is a well-maintained library backed by a solid engineering team.
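One way to serve one model per client from a single server is to point TensorFlow Serving at a model config file that lists every client’s model; here is a simplified sketch of generating such a file (client names and GCS paths are illustrative):

```python
# Simplified sketch: one TensorFlow Serving instance, one model entry per client.
CONFIG_ENTRY = """
  config {{
    name: "{name}"
    base_path: "{base_path}"
    model_platform: "tensorflow"
  }}"""


def render_model_config(client_ids, gcs_prefix="gs://my-models"):
    entries = "".join(
        CONFIG_ENTRY.format(
            name="client_{}".format(client_id),
            base_path="{}/client_{}".format(gcs_prefix, client_id),
        )
        for client_id in client_ids
    )
    return "model_config_list {" + entries + "\n}\n"


# The rendered file is passed to tensorflow_model_server via --model_config_file;
# the server then watches each base_path and loads new timestamped versions on its own.
print(render_model_config(["acme", "globex"]))
```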

Server size: Each client represents one model in TensorFlow Serving — one server can serve many such models. A quick back-of-the-envelope calculation reveals the size of the server needed to serve our clients. If we assume 100k parameters per team label and an average of 4 teams per client, then an average client model is approximately 400 KB in size. Thus, using a single machine with 10 GB of memory, we can serve approximately 25,000 clients. In practice, we use a machine with around 25% more memory than theoretically needed, as TensorFlow Serving must hold multiple versions of a model in memory while swapping an old version out for a new one.

Server machine deployment: It’s possible that 100k parameters per team label is an underestimate for large deep learning models, but it turns out to be sufficient for us and allows us to use a single machine for serving our models. Having one machine greatly simplifies our architecture, and it is highly likely that for many B2B companies a single beefy serving machine will suffice (however, if needed, the models can be sharded across multiple machines.) An accompanying blog post details our TensorFlow Serving deployment techniques and how we use Kubernetes to handle rolling deploys of TensorFlow Serving and add new ML models for serving new clients.

Using a machine with 10 GB of memory, we can serve approximately 25,000 clients.

Libraries/services used: TensorFlow Serving, Kubernetes.

Monitoring

Maintaining ML systems is perhaps one of the most challenging parts of deploying ML in production, as detailed in this wonderful paper by Google Research. With so many ML models in production, and new ones getting added every day, managing ML models becomes really challenging. We monitor our deployments to catch and alert on a few potential issues in time, e.g.:

  1. Model performance (precision, recall, F1, etc.) silently dropping for some of the clients.
  2. Model performance skewing (for example in favor of recall over precision) for clients. This can happen, e.g., if the classifier calibration is thrown off and the desired thresholds no longer work.
  3. Model getting systematically biased to perform better on certain kinds of questions (IT requests over HR requests; long questions over short questions, etc.)

To catch events of types 1 and 2, we track recall, precision, and F1 for each client, and alert if any of them drops by more than k% relative to its value from the last week, for a suitable value of k (~10). As an aside, measuring the key product metrics is highly recommended for the deployment of any predictive system (check out Martin Zinkevich’s excellent ML rule book for this and more wisdom.) It is an obvious point, but notably, these are product metrics (e.g. how many times users rejected a team suggested by us) and not metrics computed from offline validation or test data (e.g. the current F1 score on held-out data).

…measuring the key product metrics is highly recommended for the deployment of any predictive system…
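Here is a simplified sketch of that week-over-week check for issues 1 and 2 (the metric source and the alerting hook are illustrative):

```python
DROP_THRESHOLD = 0.10  # alert when a metric drops by more than ~10% week-over-week


def check_client_metrics(client_id, this_week, last_week, alert):
    """this_week / last_week: dicts like {"precision": 0.91, "recall": 0.84, "f1": 0.87}."""
    for metric, previous in last_week.items():
        current = this_week.get(metric)
        if current is None or previous <= 0:
            continue
        relative_drop = (previous - current) / previous
        if relative_drop > DROP_THRESHOLD:
            alert("client={} {} dropped {:.0%} week-over-week ({:.2f} -> {:.2f})".format(
                client_id, metric, relative_drop, previous, current))
```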

We use a combination of services like Mixpanel and Grow to track our metrics. We are also considering Prometheus, a time-series database, for metrics tracking.

We currently do not have a solution for alerting on events of type 3, where the ML system is biased against a specific set of labels. In general, detecting label bias is a really hard problem in ML and is related to active research areas like fair and transparent ML.

Libraries/services used: Mixpanel, Grow, Prometheus.

Conclusion

As the use of machine learning becomes more pervasive in the software industry, clients will come to expect a seamless and customized predictive experience. We have made significant engineering innovations in customizing machine learning for our clients and have built what can be considered a solid version one of an ML platform supporting dynamic and customized model training. There is still plenty of work to be done, as the use of ML for delivering the best possible client experience has only just begun.

If you want to check out our system and use Spoke to revolutionize the way requests are handled inside your company, head on over to https://www.askspoke.com/ to sign up for a free 14-day trial.


Rajhans Samdani
Spoke Engineering

ML and NLP researcher. Principal Engineer at Snowflake. Previously Neeva, Google. Tweets at https://twitter.com/rajhans_samdani