Architecting a Scalable Real Time Learning System

Max Pagels
The Hands-on Advisors
9 min read · Sep 5, 2018

Online or incremental machine learning is a subfield of machine learning wherein we focus on incrementally updating a model as soon as we see a new training example. This differs from the more traditional batch learning approach where we train on all of our data in one go.

Online learning has several upsides, the two most important of which relate to space complexity and speed. Training a model online means you don’t need to keep your entire training data in memory, which is great news if you are dealing with massive datasets. The incremental nature also means that your model can quickly react to changes in the distribution of the data coming in, provided the algorithm you use is tweaked properly.

The second point makes online learning especially attractive: implemented correctly, it can do near real-time learning as well as inference. It’s a good choice in situations where you want to react to unforeseen changes as fast as possible, making it a viable option for implementing dynamic pricing systems, recommendation engines, decision engines and much more besides. It’s also an excellent choice for immediate-reward reinforcement learning (e.g. contextual bandits).

One challenge with online learning is that in a real time learning system, scalability can’t be solved the way it would be in a batch learning system. You can’t spin up several instances of a model and load balance between them, since the model can change from second to second. Making a system that can withstand thousands of training examples per second is far from trivial.

At Fourkind, we’ve built online learning systems for a variety of different purposes. During these projects, we’ve learned a lot about the architectural considerations needed to make real time learning work. In this post, we’ll sketch a scalable example architecture designed to handle thousands of data points each second. Let’s start with some key considerations.

Choosing your learning algorithm

One of the most critical choices when making an online learning system is the choice of learning algorithm. Algorithms that use (stochastic) gradient descent to learn optimal parameters are naturally suitable for incremental learning¹. Lots of algorithms support using SGD for optimisation, but for almost all situations, we recommend using a linear learner like linear or logistic regression. There are several reasons for this:

  • Linear learners are fundamentally well understood
  • Efficient implementations exist for just about any commonly used programming language. For more exotic languages, you can roll your own SGD implementation easily; the update rule is straightforward for most common losses (e.g. log loss, Huber & MSE; see the sketch after this list)
  • You can make predictions using only the learned parameters/weights, because the prediction is a simple linear combination (we’ll see why this is important later on)
  • Explaining how a linear learner arrives at a prediction is easy compared to black-box models
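
To make the “straightforward update rule” point concrete, here’s a minimal sketch of a single SGD step for linear regression under squared loss (all names are illustrative, and the learning rate is a placeholder you’d tune):

```python
import numpy as np

def sgd_step(weights, bias, x, y, lr=0.01):
    """One incremental SGD update for linear regression (squared loss).

    weights: np.ndarray of current model weights
    bias: current intercept
    x: np.ndarray feature vector for a single training example
    y: observed target value
    """
    error = (weights @ x + bias) - y   # derivative of 0.5 * error^2 w.r.t. the prediction
    weights -= lr * error * x          # gradient w.r.t. the weights is error * x
    bias -= lr * error                 # gradient w.r.t. the bias is error
    return weights, bias
```

For logistic regression under log loss, the update is nearly identical: wrap the linear combination in a sigmoid and the gradient keeps the same (prediction − target) × feature form.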

The obvious downside of linear learners is the linearity constraint. This can be partially overcome with good feature engineering, the use of higher-order interaction terms, and the hashing trick. It’s not a perfect solution, but in many cases, a linear model that reacts to change quickly trumps a nonlinear model that doesn’t.
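
As a rough illustration of the hashing trick (a sketch; the digest choice and dimensionality are assumptions), features and interaction terms can be mapped straight to indices in a fixed-size weight vector, with no vocabulary to maintain:

```python
import hashlib

def hashed_index(feature, n_dims=2**20):
    """Deterministically map a string feature (e.g. an interaction term
    like 'country=FI&device=mobile') to an index into a weight vector of
    length n_dims. Collisions are possible but tolerable when n_dims is
    large. Python's built-in hash() is randomised per process, so a
    stable digest is used instead."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % n_dims
```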

Managing the training data stream

Once we’ve chosen our learning algorithm, we need to manage how training data is fed into it. The obvious choice is a queue of some sort. SGD assumes that your training data arrives in a random fashion; if that is indeed the case, a simple FIFO queue structure is sufficient. In cases where that assumption does not hold, you can either buffer and shuffle data before feeding it into a FIFO queue, or alternatively implement shuffling via a priority queue. In most cases, we prefer to use the queue primitives (LPUSH, RPOP etc.) of an in-memory store like Redis, instead of a dedicated message broker system. The reason for that is twofold: it’s sufficient for our needs, and it’s multi-purpose since we can use the same system for storing other things (more on those in a bit).
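
As a minimal sketch of those queue interactions (key names and payload format are illustrative), the producer pushes serialized examples with LPUSH and the learner pops them with a blocking RPOP:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def push_example(features, target):
    """Producer side: enqueue one training example (FIFO via LPUSH/RPOP)."""
    r.lpush("training-queue", json.dumps({"x": features, "y": target}))

def pop_example():
    """Learner side: block until an example is available, then return it."""
    _, payload = r.brpop("training-queue")
    example = json.loads(payload)
    return example["x"], example["y"]
```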

Model versioning for predictions

So far, we’ve only considered incremental learning of a model, but we also want to serve up real-time predictions. Here’s where our recommendation to use a linear learner comes into play: in order to make a prediction, we only need the current context and the parameters we’ve learned — nothing else. If we were using a neural network, for example, we’d need to know the parameters, the architecture of the network (including the choice of activation functions), and a bunch of other information besides. For linear learners, we need a lot less. If we have a linear model for predicting height based on weight (our context), we have height = <some learned parameter> * weight + <some learned intercept parameter>. Given a context, those two parameters are all we need for prediction. In practice, linear models may have thousands of parameters, but in the grand scheme of things, that’s not a lot either.

In our example system, as our linear learner learns on examples from the Redis FIFO queue, it periodically saves the model parameters back into Redis. How often it does this is configurable: for some applications, it might even be feasible to do this every time a single training example is learned on. The key thing to note here is that saving (and indeed retrieving) the parameters should be an atomic operation. In Redis, this is guaranteed if you use the SET/GET commands (the Redis core is single-threaded). For more complex interactions, you can use a locking mechanism. The key requirement is that all model parameter changes are applied in one go; if some parameters have been updated but others haven’t, we end up with incorrect predictions.
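
One simple way to get that atomicity (a sketch; the key name and JSON format are assumptions) is to serialize all parameters into a single value, so every update is one SET and every read is one GET:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def save_parameters(weights, bias):
    """Write all parameters as a single value: one SET, so a reader
    always sees a complete, consistent parameter set."""
    r.set("model:params", json.dumps({"w": list(weights), "b": bias}))

def load_parameters():
    """Read the full parameter set back in one atomic GET."""
    params = json.loads(r.get("model:params"))
    return params["w"], params["b"]
```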

Serving your predictions

It’s sensible to separate the logic of training and inference. In our setup, since model parameters are updated atomically in Redis, serving up predictions takes place as follows (a code sketch follows the list):

  • Preprocess the context to make it suitable for inference (we’ll skip the issue of context retrieval for now)
  • Retrieve model parameters atomically from Redis (GET)
  • Calculate the linear combination of model weights and context features
  • Return result
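
Put together, the serving logic might look like the following sketch (it assumes the single-value parameter format from the versioning sketch above, and skips preprocessing):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def predict(context_features):
    """Serve one prediction: one atomic GET for the latest parameters,
    then a simple linear combination."""
    params = json.loads(r.get("model:params"))
    weights, bias = params["w"], params["b"]
    return sum(w * x for w, x in zip(weights, context_features)) + bias
```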

Another key reason for using an in-memory store becomes apparent as we look at the logic here: for every prediction request, we need to retrieve the latest model parameters, since they may change rapidly. A traditional relational database or NoSQL solution is likely too slow for our purposes.

If you wish to serve your predictions via a REST API, you have several options: you can implement the prediction logic and deploy it directly on logical servers, you can use containers, or you can opt for a serverless solution. The key requirement is that your prediction API should scale horizontally: as the load increases, you can spin up more hardware and balance between instances. Ideally, it should also spin down when the load decreases, saving costs. Our recommendation is to use a serverless solution (Lambda on AWS, Azure functions on Azure, etc.) or a container-based approach. Typically, the horizontal scaling is handled for you, and since you aren’t doing anything compute intensive when serving linear predictions, the resource footprint for inference is quite small.
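
As a rough sketch of the serverless option (the endpoint, payload shape and key names are all assumptions), an AWS Lambda handler can wrap the same serving logic:

```python
import json
import redis

# Initialised outside the handler so the connection is reused across invocations
r = redis.Redis(host="your-elasticache-endpoint", port=6379)

def lambda_handler(event, _context):
    """Illustrative Lambda entry point for serving predictions."""
    features = event["features"]                 # preprocessed context features
    params = json.loads(r.get("model:params"))   # one atomic GET
    prediction = sum(w * x for w, x in zip(params["w"], features)) + params["b"]
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```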

Logging your data

If you use online learning to incrementally train on new data as it comes in, it’s hugely important to log that data (doubly so for reinforcement learning purposes). Predictions should also be logged — properly logged replay & prediction data lets you optimise and validate new core learning algorithms offline, and in the case of reinforcement learning, lets you ask counterfactual questions (what would have happened to the cumulative reward had a different policy been used).

Logged data doesn’t need to be persisted immediately, unless your use case requires you to do so. A relatively slow columnar database or NoSQL database should suffice. The important thing is that every single piece of data gets stored somewhere, so you can analyse and debug at your leisure.
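
As a sketch of what each logged row might contain (field names are illustrative), the crucial detail for counterfactual analysis is recording the probability with which the served action was chosen:

```python
import json
import time

def log_event(context, action, action_probability, reward):
    """One replay-log row: enough to evaluate alternative policies
    offline, e.g. with inverse propensity scoring."""
    row = {
        "timestamp": time.time(),
        "context": context,
        "action": action,
        "action_probability": action_probability,  # needed for counterfactuals
        "reward": reward,
    }
    # Write to your columnar/NoSQL store of choice; stdout as a stand-in
    print(json.dumps(row))
```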

Feature storage (optional)

Something we haven’t covered yet is how to handle the context needed for making a prediction. If you are building a recommendation engine and you are recommending different items based on age, that piece of information needs to be retrieved from somewhere. If it’s feasible, the easiest solution is to send it as part of the payload of the prediction request. In cases where it’s not feasible (e.g. in an Internet-facing application where the data is unavailable and/or sensitive), you’ll need a fast feature store. Redis can be used for this purpose, too. In such a scenario, our serving logic becomes (a code sketch follows the list):

  • Retrieve context from Redis (GET)
  • Preprocess the context to make it suitable for inference
  • Retrieve model parameters atomically from Redis (GET)
  • Calculate the linear combination of model weights and context features
  • Return result
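
With a feature store in the loop, the only change to the earlier serving sketch is one extra lookup (user IDs and key names are illustrative):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def predict_for_user(user_id):
    """Fetch the stored context for a user, then predict as before."""
    context = json.loads(r.get(f"features:{user_id}"))   # context lookup
    params = json.loads(r.get("model:params"))           # atomic parameter read
    return sum(w * x for w, x in zip(params["w"], context)) + params["b"]
```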

Miscellaneous state storage (optional)

Beating on the in-memory store drum once more: it’s the perfect place to store things like exploitation/exploration hyperparameters (for reinforcement learning), A/B test segments, or any other type of small additional information you might need when serving requests.

Other notes on scalability & complexity

At a high level, our example architecture comprises three logical components: one for incremental training, one for inference, and one for state management.

The training component — unless we are using a fully distributed system that a) incurs some overhead and b) places restrictions on the algorithms you can use — is only vertically scalable. You’ll want to run it on compute-optimised hardware, ideally with high single-threaded performance. Memory isn’t too much of an issue since we are learning incrementally.

The prediction component, which is essentially stateless, is horizontally scalable. Because of this, you can get away with modest hardware instances and spin up more as needed. In cases where you want to do inference on a batch of examples, somewhat more memory is needed, but compute-optimised hardware typically isn’t.

For the FIFO queue and key-value storage in Redis, a large pool of memory will ensure you have a sufficiently large queue and space for a feature store. In terms of operations, most of the ones you’ll use have constant complexity: queue push/pop, GET & SET are all O(1).

An example architecture

The example architecture presented below is for the AWS stack (ECS for training; ElastiCache for queuing, model parameters and feature storage; Lambda for inference; RDS for logging), but can be easily adapted for use on other cloud vendors’ platforms, or in-house hardware. It is designed specifically for use with linear learning algorithms; non-linear solutions require a different setup due to the relatively high compute costs associated with both training & inference. Some key considerations, e.g. access management and API gateways, are left out for brevity’s sake.

[Figure: High-level example architecture for incremental learning (real-time learning & inference)]

Example scenario: online reinforcement learning-based recommendation engine

Let’s go through an example scenario to understand how our example architecture works in practice. A natural choice for online learning is recommendations: let’s pretend we’ve implemented a reinforcement learning solution that suggests a piece of content to users of a webshop, based on past purchase history. Reinforcement learning aims to optimise against a particular reward function by receiving numeric feedback: in this scenario, the reward could be something as simple as a positive number for a purchase and zero for a session with no purchase. In applications where you don’t get an explicit zero reward, such as in this scenario, you can a) assume a reward of zero after a set time horizon, or b) assume a zero reward immediately and compensate later on if a user makes a purchase. The exact configuration depends on your use case.
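
Option (b) might look something like this in code (a sketch only; the queue and payload details are assumptions carried over from the earlier queue sketch):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(context_features, reward):
    r.lpush("training-queue", json.dumps({"x": context_features, "y": reward}))

def on_session_end(context_features):
    # Option (b): assume a zero reward as soon as the session ends
    enqueue(context_features, 0.0)

def on_purchase(context_features, purchase_value):
    # Compensate later: emit the positive reward once it is actually observed
    enqueue(context_features, purchase_value)
```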

From the end user perspective, when they browse the webshop, the REST API serves up predictions by fetching the model parameters (and, optionally, needed context features) from the in-memory data store and computing the best recommendations using a linear combination of features and parameters. Since lookups are essentially O(1), the API remains performant (assuming proper configuration) and answers quickly, regardless of the number of training examples being streamed to the learner.

As users make purchases (or don’t), data is both logged & pushed to our Redis queue. From Redis, our linear learner consumes and trains on examples as fast as it can, and versions learned model parameters back to Redis (either after each example or after a configurable time horizon has been reached). When the rate at which training examples arrive remains lower than the learner’s ability to train on them, the model parameters in Redis update almost in real time. When the ingestion rate exceeds the learning throughput, the age of the stored parameters grows linearly. Herein lies the key reason our example architecture separates vertically scalable components from horizontally scalable ones: during massive traffic spikes, the underlying model is allowed to lag a bit if needed. The REST API, however, remains performant: end users get their predictions quickly, and the user experience remains consistent. During normal load, both learning & inference happen in close to real time.

Closing thoughts

At Fourkind, we’ve been lucky enough to help bring online systems to production, and have seen first-hand the improvement real time learning can bring to what seems to be an endless number of application areas. Building a scalable online learning system is nontrivial — we hope that this post provides those interested in real time learning some food for thought, and a baseline against which to iterate and improve.

A special thanks to Machine Learning Partner Jarno Kartela & Solution Architecture Partner Lilli Nevanlinna for their general feedback and technical expertise.

[1] Other optimisation routines are also capable of incremental learning, but due to its popularity, we focus on SGD in this article.
