Delphi: Insider’s Machine Learning Platform

Deniz Parmaksız
Insider Engineering
4 min read · Dec 29, 2023

AI/ML plays a crucial role in product development at Insider. We believe that machine learning can improve the efficiency of our products by decreasing the time it takes to create a digital marketing campaign, guiding customers' parameter choices with data-driven decisions, and targeting the right personas at the right time based on predictions from user data.

As a Software-as-a-Service platform, Insider runs its solutions in a multi-tenant fashion. All customer data, configurations, datasets, models, etc. are fully isolated. Each tenant has a different combination of industry, vertical, country set, user base, and so on; therefore, the same solution must be further tailored for each tenant. Since doing this manually is impossible for 10+ solutions and 1,000+ customers, the machine learning pipeline is fully automated to achieve the best results for every customer.

The Platform

Our AutoML platform is named Delphi, referencing the famous sanctuary in ancient Greece, which was believed to be the center of the world. Likewise, Delphi is the center of our machine-learning solutions. The platform automates the entire machine learning application lifecycle, from processing raw data to monitoring predictions, on a per-tenant, per-solution basis.

The platform consists of Apache Spark applications running on Amazon EMR that perform feature extraction, feature selection, dataset preparation, model training, model evaluation, and prediction generation. These Spark applications run at the tenant level, which results in thousands of jobs, orchestrated with Apache Airflow running on Amazon EKS. On the speed layer, Spark Streaming computes real-time features for instant predictions. The machine learning models are deployed with MLflow to Amazon EKS, which lets them scale quickly with traffic.
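The per-tenant, per-solution fan-out that produces those thousands of jobs can be sketched as follows. This is a minimal stand-in, not Delphi's actual code: the tenant names, solution names, and `SparkJobSpec` fields are hypothetical, standing in for whatever job parameters are handed to EMR.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical tenant and solution identifiers, for illustration only.
TENANTS = ["tenant_a", "tenant_b", "tenant_c"]
SOLUTIONS = ["churn_prediction", "purchase_propensity"]

@dataclass(frozen=True)
class SparkJobSpec:
    """One Spark job to submit to EMR, scoped to a single tenant and solution."""
    tenant: str
    solution: str
    stage: str  # e.g. feature_extraction, model_training, inference

def fan_out_jobs(stage: str) -> list[SparkJobSpec]:
    """Create one job spec per (tenant, solution) pair for a pipeline stage."""
    return [SparkJobSpec(t, s, stage) for t, s in product(TENANTS, SOLUTIONS)]

jobs = fan_out_jobs("model_training")
print(len(jobs))  # 3 tenants x 2 solutions = 6 jobs
```

With 1,000+ tenants and 10+ solutions, the same cross-product yields the thousands of jobs the orchestrator has to manage.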

The infrastructure of our Machine Learning platform Delphi.

Feature Store

The feature store concept works great for standardising the machine learning pipeline and keeping it simple. Having several different feature catalogs enables variety and richness of input signals, while the discoverable feature store structure makes having thousands of features manageable. The requirements, evolution, and scale of our feature stores are discussed in another blog post, Building a Feature Store with Apache Iceberg on AWS.
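The "discoverable" property mentioned above can be illustrated with a minimal registry sketch: features are grouped into named catalogs so that a pipeline (or an engineer) can look up what is available. The class, catalog names, and feature names are assumptions for illustration, not the actual Iceberg-backed store described in the linked post.

```python
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    name: str
    catalog: str      # e.g. "user_behavior", "purchase_history"
    dtype: str
    description: str

class FeatureRegistry:
    """Minimal discoverable registry: features are grouped into catalogs."""
    def __init__(self) -> None:
        self._features: dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        self._features[feature.name] = feature

    def discover(self, catalog: str) -> list[str]:
        """List the feature names available in a given catalog."""
        return sorted(f.name for f in self._features.values() if f.catalog == catalog)

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    "session_count_7d", "user_behavior", "int", "Sessions in the last 7 days"))
registry.register(FeatureDefinition(
    "avg_basket_value", "purchase_history", "float", "Average basket value"))
print(registry.discover("user_behavior"))  # ['session_count_7d']
```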

Feature Selection

One of the most important components of the machine learning pipeline is the feature selection step. This is where high-quality signals are separated from low-quality, noisy data. Running our applications in a multi-tenant fashion makes feature selection critical, because data distributions can differ greatly across tenants' datasets. Therefore, using the same set of features for the same solution across all tenants is not optimal. That is why we run a four-stage feature selection process per tenant for each solution, which we discuss in depth in How Our Auto ML Platform Handles Feature Selection.

Metadata

An Amazon Aurora-backed relational database keeps track of all the processes going on in Delphi. Every machine learning model can be traced back to see which dataset was used to train it, which features were incorporated into that dataset and how they were selected, how old the dataset and the feature selection are, what the model's inference metrics are, and how predictions are distributed across classes. This metadata is very useful for debugging a production model when required, and for building insightful dashboards.
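The lineage query pattern can be sketched with an in-memory SQLite database standing in for Aurora. The table and column names are hypothetical; the point is that a single join walks from a model back to its training dataset and features.

```python
import sqlite3

# In-memory stand-in for the Aurora metadata database; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasets (id INTEGER PRIMARY KEY, s3_path TEXT, created_at TEXT);
    CREATE TABLE dataset_features (dataset_id INTEGER, feature_name TEXT);
    CREATE TABLE models (id INTEGER PRIMARY KEY, dataset_id INTEGER,
                         trained_at TEXT, auc REAL);
""")
conn.execute("INSERT INTO datasets VALUES "
             "(1, 's3://ml-datasets/tenant_a/churn/2023-12-01/', '2023-12-01')")
conn.executemany("INSERT INTO dataset_features VALUES (?, ?)",
                 [(1, "session_count_7d"), (1, "avg_basket_value")])
conn.execute("INSERT INTO models VALUES (10, 1, '2023-12-02', 0.87)")

# Trace model 10 back to its training dataset, feature list, and metric.
row = conn.execute("""
    SELECT d.s3_path, GROUP_CONCAT(f.feature_name), m.auc
    FROM models m
    JOIN datasets d ON d.id = m.dataset_id
    JOIN dataset_features f ON f.dataset_id = d.id
    WHERE m.id = 10
    GROUP BY m.id
""").fetchone()
print(row)
```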

Model Training

The model training infrastructure is split between traditional machine learning and deep learning models. Traditional machine learning models are trained with Apache Spark on Amazon EMR, using the pre-exported training dataset in Amazon S3. The training job finds the most recent dataset via the metadata in the database, trains the model on the fresh data, saves the model metadata to the database, and persists the artifacts in Amazon S3. Deep learning models are PyTorch models trained on Amazon SageMaker, which eliminates the need to manage GPU clusters and lets us run parallel training jobs effortlessly. Their model artifacts are saved in Amazon S3 as well.
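The "find the most recent dataset" step at the start of a training job can be sketched as below. The record shape, bucket names, and artifact-path convention are hypothetical, standing in for whatever the metadata database actually stores.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetRecord:
    """A row from the metadata database describing one exported dataset."""
    s3_path: str
    created_at: date

def latest_dataset(records: list[DatasetRecord]) -> DatasetRecord:
    """Pick the freshest exported training dataset, as a training job would."""
    return max(records, key=lambda r: r.created_at)

records = [
    DatasetRecord("s3://ml-datasets/tenant_a/churn/2023-11-20/", date(2023, 11, 20)),
    DatasetRecord("s3://ml-datasets/tenant_a/churn/2023-12-01/", date(2023, 12, 1)),
]
dataset = latest_dataset(records)

# Hypothetical convention: artifacts mirror the dataset path in a models bucket.
artifact_path = dataset.s3_path.replace("ml-datasets", "ml-models")
print(dataset.s3_path, "->", artifact_path)
```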

Inference

At the end of the day, the machine learning pipeline performs all this pre-work to enable one objective: generating predictions. Inference is divided into batch and real-time. Delphi uses Spark applications for batch inference, and the results are persisted in our Customer Data Platform, making them available across all Insider products. Real-time inference has a more sophisticated architecture: the models are wrapped with MLflow, deployed on Amazon EKS, and serve thousands of predictions every second to other services. The architecture is discussed further in Scaling ML Model Serving on Amazon EKS with Custom Metrics.
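The value of wrapping a model for real-time serving is a uniform predict interface over any artifact. The sketch below is a plain-Python stand-in for that idea, not MLflow's actual pyfunc API; the feature names and scoring function are made up.

```python
class ModelWrapper:
    """Uniform predict interface, similar in spirit to a wrapped serving model."""
    def __init__(self, model, feature_order: list[str]):
        self.model = model
        # Fix the feature order so request payloads can arrive in any key order.
        self.feature_order = feature_order

    def predict(self, payload: dict) -> float:
        row = [payload[name] for name in self.feature_order]
        return self.model(row)

# Hypothetical scoring function standing in for a trained model artifact.
def trained_model(row: list[float]) -> float:
    return 1.0 if sum(row) > 5 else 0.0

wrapper = ModelWrapper(trained_model, ["session_count_7d", "avg_basket_value"])
print(wrapper.predict({"avg_basket_value": 4.0, "session_count_7d": 3}))  # 1.0
```

Behind an HTTP endpoint on EKS, many replicas of such a wrapper can be scaled horizontally as request traffic grows.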

Orchestration

Finally, there are thousands of tasks running every day, many of them depending on each other. Tasks must run on a schedule, retry on temporary failures, and wait for upstream tasks before starting. Apache Airflow meets these requirements, and we have used it to orchestrate our workflows since the first days of Delphi. You can read more about how we keep track of EMR job flows and use DynamoDB to track their state in Orchestration of AWS EMR Clusters Using Airflow.
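The two core requirements named above — retries and dependency ordering — can be sketched in plain Python. This is a toy illustration of what Airflow provides out of the box, not Airflow code; the task names are hypothetical.

```python
import time

def run_with_retries(task, retries: int = 3, delay: float = 0.0):
    """Run a task, retrying on failure — a simplified take on Airflow's retries."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)

# Dependencies as a DAG: each task lists its upstream tasks,
# much like chaining operators with >> in Airflow.
dag = {"extract": [], "train": ["extract"], "predict": ["train"]}

def topological_order(dag: dict) -> list[str]:
    """Resolve a run order so every task starts after its dependencies."""
    order, seen = [], set()
    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for upstream in dag[name]:
            visit(upstream)
        order.append(name)
    for name in dag:
        visit(name)
    return order

print(topological_order(dag))  # ['extract', 'train', 'predict']
```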

Conclusion

Our AI/ML infrastructure has evolved over the years. Our first product features were implemented as individual machine learning pipelines, and as their number grew, maintenance became a burden. Delphi was designed to remove that burden and to deliver the best product experience to every tenant. Thanks to Delphi's centralized, automated machine learning workflows, we have seen significant improvements in the quality, agility, and productivity of our products and ML development teams.


Sr. Machine Learning Engineer at Insider | AWS Ambassador