Why and How We Built an Internal Machine Learning Platform

AJ Ribeiro
Bigeye
6 min read · May 13, 2021
Image credit: Joshua Sortino

In the current Bigeye platform, machine learning (ML) drives one of our most powerful features: Autothresholds. Autothresholds lets customers get actionable alerts on hundreds of tables without having to manually set up alerting criteria. The ML removes a huge burden from data engineers and data scientists: they save time when implementing data quality metrics, and they are spared the continuous re-calibration that would otherwise be needed as the data changes and new data is introduced.

We are doubling down on the intelligence in the platform. To pave the way for more ML-driven features, we built a minimal ML platform within Bigeye that ensures we can train our models without slowing down the product.

In this blog, we will discuss why and how we built our ML platform and why the decisions we made along the way will enable us to build even more intelligence into Bigeye in the future.

The Machine Learning Slowdown

Bigeye determines which data quality metrics to monitor and then leverages ML for time-series prediction to automatically detect when those metrics go out of bounds for Autothresholds. We wanted to build a microservice responsible for serving these Autothresholds, but because model training can be slow (on the order of tens of seconds), we knew it wasn’t feasible to train these models in real time behind a traditional blocking API endpoint, for a few reasons:

  1. Models need to be retrained as we collect more data.
  2. It’s not reasonable to tie up threads on the calling service, which runs scheduled metrics, for tens of seconds while waiting for a response.
  3. Customers can run metrics on-demand from the UI. If they end up watching a spinner for 10+ seconds, we’ve created a bad customer experience.

Understanding these issues, we made the decision to pre-train models offline. Planning ahead for the future of Bigeye, we incorporated a few more requirements:

  1. As product adoption continues to grow, we knew we would want a training service that can scale independently. This led us to a separate service for model training rather than keeping it in the service that serves Autothreshold predictions.
  2. Because we will continue building more ML functionality into the platform, we knew we needed something that we can build upon in the future (a platform).
  3. Since Bigeye serves both SaaS and on-prem customers, the solution needed to work in both deployment environments. This contributed to our decision to build something in-house rather than using a third-party offering like AWS SageMaker.

With these requirements in mind, we decided to build ourselves a scalable, cloud-agnostic ML platform within Bigeye.

Building an ML Platform within Bigeye

In any non-trivial computing environment, inter-service and inter-component communication is critical. Defining schemas for interfaces can be extremely useful, and some popular options are Protobuf and Thrift.

At Bigeye, we have chosen Protobuf, which allows us to define object schemas in one place and compile code in several languages (Java, Python, JavaScript) so that the objects are reusable in all of our services. The use of Protobuf plays a key role in the design of all our services, and our ML platform is no exception.

In order to describe the components of our platform, it’s useful to use a simple illustrative example. For this, let’s say that we want to build a model that predicts the weather in San Francisco.

Components and interfaces

In the design of the platform, there are three key components: feature providers, models, and the orchestrator service. Additionally, feature providers and models need to conform to specific interfaces so that new models can be onboarded to the platform easily.

Feature providers

We wanted features to be canonical and reusable across different models and potentially the whole codebase. We came up with the concept of “feature providers,” which can be instantiated and used to retrieve the data needed for training our models.

Recalling our weather model example, the feature inputs to our weather model could be the San Francisco weather data for the last 30 days. In this case, the feature provider would be a bit of code that fetches that data and organizes it into a form to be consumed by the model.

All features, as well as inputs to the feature providers, must be defined in Protobuf, allowing us to easily construct training requests. Defining the features themselves in Protobuf also allows us to consume them in our models (and anywhere else in our code).

Finally, all feature providers must implement a provide() method, which consumes the feature provider inputs and returns the feature as an output. In theory, a feature provider could pull its data from anywhere (a REST API, a database, or even a deconstructed input) as long as it conforms to the interface and returns the correct object.
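To make this concrete, here is a minimal Python sketch of what a feature provider for our weather example could look like. Everything here is hypothetical (the class names, fields, and canned data are illustrative, not Bigeye’s actual code), and plain dataclasses stand in for the Protobuf-generated classes:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

# Plain dataclasses stand in for the Protobuf-generated message classes
# described above; the real platform would use compiled Protobuf objects.
@dataclass
class WeatherFeatureInput:
    city: str
    lookback_days: int

@dataclass
class WeatherFeature:
    daily_high_temps_f: List[float]

class FeatureProvider(ABC):
    """Interface every feature provider conforms to."""

    @abstractmethod
    def provide(self, inputs):
        """Consume the provider inputs and return the feature object."""

class WeatherFeatureProvider(FeatureProvider):
    """Fetches recent weather observations for a city."""

    def provide(self, inputs: WeatherFeatureInput) -> WeatherFeature:
        # In practice this might call a REST API or query a database;
        # here we return canned data to keep the sketch self-contained.
        return WeatherFeature(daily_high_temps_f=[61.0] * inputs.lookback_days)
```

Because the provider hides where the data actually comes from, the same feature can be reused by any model that needs it.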

Models

ML models do the heavy lifting for the intelligence in the Bigeye platform. To let our data science team innovate with as little friction as possible, we have kept the requirements on models as light as we can while still standardizing them enough to support new development.

In our weather model example, the model would be the code that takes the 30 days of data from the feature provider and trains an ML model to predict future weather conditions.

The only requirements on ML models today are that they implement a train() method and that training consumes our Protobuf-defined features. The train() method should store the model state within the class instance itself for future consumption at runtime.
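Continuing the hypothetical sketch above, a model conforming to this contract might look roughly like the following; the averaging “model” is just a stand-in for a real time-series model:

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Interface every model on the platform conforms to."""

    @abstractmethod
    def train(self, feature) -> None:
        """Consume Protobuf-defined features and store trained state on self."""

class WeatherModel(Model):
    """Predicts tomorrow's high temperature from recent observations."""

    def __init__(self):
        self.predicted_high_f = None  # trained state lives on the instance

    def train(self, feature: WeatherFeature) -> None:
        # A deliberately trivial "model": average the recent daily highs.
        temps = feature.daily_high_temps_f
        self.predicted_high_f = sum(temps) / len(temps)
```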

Orchestrator service

The last piece of our platform is the orchestrator service. Our goal was to design it so that its work is formulaic: receive a training request for a model, fetch the features for that model using feature providers, train the model, and then store it for reuse by whichever service needs the model for runtime evaluation.

In order to aid in finding the right models and feature providers, the orchestrator has two registries. The first is a map from model types to the model code and the features required for that model; in the weather prediction example, this maps the weather model to the last 30 days of weather data. The second registry is a map from features to feature providers (e.g. from 30 days of weather data to the code that actually fetches the data). Training requests arrive via a message queue, which allows us to scale the number of orchestrator machines to keep up with business demand.
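A rough sketch of what those two registries could look like, again reusing the hypothetical weather classes from above (the real platform may wire this up quite differently):

```python
# Hypothetical registries; the orchestrator only consults these maps, so
# adding a model never requires changing orchestrator code.

# Model type -> (model class, feature types that model needs).
MODEL_REGISTRY = {
    "weather_model": (WeatherModel, ["weather_last_30_days"]),
}

# Feature type -> feature provider class that knows how to build it.
FEATURE_REGISTRY = {
    "weather_last_30_days": WeatherFeatureProvider,
}
```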

Onboarding a new model

The ML service provides ease of use and flexibility. Onboarding a new model can be done without touching the orchestrator code at all. First, a model needs to be written and entered into the registry. If the features needed for the model already have providers, then nothing else needs to be done and the orchestrator can start processing training requests for the new model. If new features are needed, then new feature providers need to be written and registered. After that, the platform can start processing training requests.
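For example, onboarding a hypothetical new model that reuses the existing weather feature could amount to little more than a class definition and a registry entry (continuing the sketch above):

```python
class RainfallModel(Model):
    """Hypothetical new model that reuses the existing weather feature."""

    def __init__(self):
        self.predicted_rain_probability = None

    def train(self, feature: WeatherFeature) -> None:
        # Placeholder logic; a real model would fit something meaningful here.
        cool_days = sum(1 for t in feature.daily_high_temps_f if t < 60.0)
        self.predicted_rain_probability = cool_days / len(feature.daily_high_temps_f)

# Onboarding = write the model, add a registry entry. The existing weather
# feature provider is reused, so the orchestrator is untouched.
MODEL_REGISTRY["rainfall_model"] = (RainfallModel, ["weather_last_30_days"])
```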

The behavior can be visualized as follows:

  1. A training request comes in, which includes the model type identifier and the inputs for feature providers.
  2. The model registry is used to look up the required features as well as the model class from the model type.
  3. The required features from the model registry are used as inputs to the feature registry.
  4. The feature registry is used to look up the necessary feature inputs as well as provider class(es) from the feature types.
  5. The feature provider inputs are extracted from the training request.
  6. The provider class is used to provide a feature for the model.
  7. The feature is input into the model class which was looked up from the model type in the training request.
  8. The model is trained and cached to be consumed at runtime.
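Put together, the steps above might look roughly like this sketch, reusing the hypothetical classes and registries from earlier; the real orchestrator pulls requests off a message queue and persists trained models, which this sketch glosses over:

```python
from dataclasses import dataclass

@dataclass
class TrainingRequest:
    model_type: str
    feature_inputs: dict  # provider inputs keyed by feature type

TRAINED_MODELS = {}  # stands in for the real model store/cache

def handle_training_request(request: TrainingRequest) -> None:
    # Steps 1-2: look up the model class and its required feature types.
    model_cls, feature_types = MODEL_REGISTRY[request.model_type]

    features = []
    for feature_type in feature_types:
        # Steps 3-4: look up the provider class for each required feature.
        provider_cls = FEATURE_REGISTRY[feature_type]
        # Steps 5-6: extract the provider inputs from the request and provide the feature.
        features.append(provider_cls().provide(request.feature_inputs[feature_type]))

    # Steps 7-8: train the model on its features and cache it for runtime use.
    model = model_cls()
    model.train(*features)  # the hypothetical weather model takes a single feature
    TRAINED_MODELS[request.model_type] = model

# Example: train the weather model on the last 30 days of San Francisco data.
handle_training_request(TrainingRequest(
    model_type="weather_model",
    feature_inputs={"weather_last_30_days": WeatherFeatureInput("San Francisco", 30)},
))
```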

Do These Kinds of Problems Sound Exciting?

We are tackling some of the most difficult problems facing data-driven companies today with a deep team that has experience solving these problems at scale. And our team is growing fast! If these challenges sound exciting, we would love to talk. Find our open position here.
