Scalable and Low Maintenance Model Serving

Code & Wild · Nov 21, 2023

At Bloom & Wild we have a small data science team with a focus on using machine learning (ML) to optimise our customer experience, trading and marketing. We’ve deployed a number of models into production across the business and operate these end-to-end within our team. We’re planning a series of blog posts about the projects we’ve worked on, starting with this one on how we first bootstrapped our model serving capability.

A few years ago, we wanted to deploy our first machine learning model into production in a way that could quickly deliver value and be iterated upon, without significant up-front investment in infrastructure. At the same time, it was important that the model could meet the scalability demands of a highly seasonal business without placing a high maintenance burden on our small team.

The specific machine learning problem we were tackling was to rank the products that we show on our website and apps and to present these products to customers ordered by relevance. Up to this point, the ordering of products had been manually defined by trading teams; however, this was not a scalable practice as the business moved into more markets. By showing more relevant and personalised products to users at the top of our product list (see below), we hoped to improve both the customer experience and our sales revenue.

[Figure: Products listing as shown on Bloom & Wild’s website, a grid of flower images.]

Context

There are many complex frameworks for deploying machine learning models into production, so much so that there is now an entire field termed ‘MLOps’ dedicated to managing these. Most of these frameworks, such as Kubeflow, SageMaker and Vertex AI, assume a full commitment to investing in their ecosystem and managing the respective configuration and services. For a team looking to deliver initial value with machine learning, the up-front time investment and ongoing cost of adopting these frameworks is a challenging proposition. Instead, we took a leaner approach using the serverless offerings of AWS, allowing us to start small and iterate on our model serving infrastructure.

Requirements

We wanted to be able to personalise the product rankings we generate for each customer, and so we needed to be able to invoke our ML model live from our websites and apps. This live usage meant that the API we use to serve our model needed to return the model output to clients with low latency — we agreed on an SLA with our clients of 100ms p95 latency.

As a gifting destination, the traffic we see is highly seasonal, with significant peaks at times like Mother’s Day, when we expect traffic to increase 10x. Our model API needed to be served to all customers and scale with this demand, without creating an ongoing burden on the team to manage any manual scaling. At Bloom & Wild we deal in physical stock that is perishable, like flowers, and so we also needed to incorporate business logic into our product ranking to account for stock dynamics, making managed model hosting offerings like SageMaker an incomplete option.

As an initial experiment, we needed to be able to measure an uplift in revenue from this new approach of using machine learning and so our model deployment also needed to support A/B testing and be as low cost as possible to create a return on investment. Finally, given that the output of this model influences every transaction with our customers and the products that we sell, it needed to be highly reliable and to fail gracefully.

To summarise our requirements:

  • Highly scalable
  • Highly reliable with graceful failure
  • Low maintenance
  • Low initial time investment / iterative
  • Low latency
  • Low cost
  • Supporting additional business logic on top of ML model
  • Supporting A/B testing

Solution Design

We chose to meet the (challenging!) set of requirements by adopting a serverless solution. Rather than manage and scale server instances ourselves, which would have meant paying the minimum cost of a dedicated instance, we adopted AWS tooling such as Lambda, API Gateway and DynamoDB. These formed the first iteration of the model serving API, with additional functionality layered on in further iterations as shown later.

[Figure: Initial model serving infrastructure using AWS Lambda, with data flowing from Snowflake to DynamoDB, AWS Lambda and an API Gateway.]

The benefits of a serverless deployment are that scalability is managed by our cloud provider, making for a low maintenance burden and a low cost that only increases as our trading does. There are trade-offs, however: the AWS Lambda environment that our model is deployed in is ephemeral, so the application we deploy needs to be lightweight and quick to initialise in the case of cold starts.

In order to iterate quickly and release a minimum viable product to production, we identified a suitable heuristic to use in place of a machine learning model — sorting products by the revenue they have recently generated. Deploying this heuristic allowed us to release to production and migrate clients to our new infrastructure while the ML model was still being developed. This heuristic itself gave an uplift over the manual ordering that preceded it and it became the control that we used in future A/B testing.

ETL

We needed to ensure that the model could access up-to-date feature data such as the sales and stock of each product, or the purchasing profile of a given customer. We used DynamoDB as a NoSQL store for these features, effectively caching them near to the Lambda to allow for quick retrieval based on a product or user ID. DynamoDB was chosen due to its scalability, low cost and the fact that it is a serverless option requiring little to no maintenance.

The data transformation to produce the features was done in SQL views defined in dbt, and a regular ETL task would query these views and load the live features into DynamoDB. This ETL was implemented using Meltano, an ETL framework that allows plugins for different data sources and targets to be combined. Some of these were available off-the-shelf (e.g. tap-snowflake for querying Snowflake) but others we had to develop ourselves, such as a target for batch writing to DynamoDB and another for streaming data to parquet files in S3 (we should open source these).
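
We haven’t open sourced the DynamoDB target, but its core is a batch write keyed on the feature’s ID. A minimal sketch of that step using boto3’s batch_writer, with a hypothetical table name and illustrative fields:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("product_features")  # hypothetical table name

def write_feature_rows(rows: list[dict]) -> None:
    """Batch-write feature rows keyed by product_id into DynamoDB."""
    with table.batch_writer(overwrite_by_pkeys=["product_id"]) as batch:
        for row in rows:
            batch.put_item(Item=row)

# Example rows as they might come out of a dbt feature view in Snowflake.
write_feature_rows([
    {"product_id": "SKU-123", "recent_revenue": 1520, "in_stock": True},
    {"product_id": "SKU-456", "recent_revenue": 980, "in_stock": False},
])
```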

Snowflake was the data warehouse that underpinned all of this, with the data ingested from various sources. We used AWS Database Migration Service (DMS) to stream Change Data Capture (CDC) ‘deltas’ (changes in data) from the relevant exposed tables in the checkout system’s Postgres instance to Snowflake via S3. This data was stored in S3 in parquet format, with notification triggers set up for Snowflake to ingest the data as it landed using Snowpipe. This approach keeps the data in Snowflake up-to-date for when a feature transformation view is queried.

Latency and Cold Starts

In addition to wanting to meet our p95 latency SLA, we also needed to be mindful of the cold start time when using AWS Lambda. This is the time taken to ‘warm up’ an instance of our container to serve responses, and it happens when we haven’t received any requests recently or when traffic scales and requires an additional instance. We provision capacity for at least one Lambda instance to be kept warm, and while the cold start of additional Lambda instances would not impact the p95 metric, we did want to avoid the negative customer experience of a response taking several seconds.

We wanted to use Python, as the language most familiar to the data team; however, achieving low-latency responses required a number of compromises. In particular, the Python packages like pandas and scikit-learn that data scientists typically use are slow to import and significantly impact the cold start time of the Lambda. Instead, numpy was used to handle data retrieved from DynamoDB and the XGBoost library was used directly to perform model inference.
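
For illustration (this is not our exact handler), the lean inference path looks roughly like this: the model is loaded once at module scope so only cold starts pay that cost, and inference works directly on numpy arrays via the XGBoost Booster API. The file name and feature shape here are assumptions:

```python
import numpy as np
import xgboost as xgb

# Loaded at module import time (i.e. during the cold start), then reused
# across warm invocations of the Lambda.
booster = xgb.Booster()
booster.load_model("model.json")

def rank_products(feature_rows: list[list[float]]) -> np.ndarray:
    """Score products from a 2D array of feature values and return indices
    ordered by descending score."""
    features = np.asarray(feature_rows, dtype=np.float32)
    scores = booster.predict(xgb.DMatrix(features))
    return np.argsort(-scores)
```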

We used the Application Performance Monitoring (APM) features of Datadog, adding tracing to our Lambda and defining spans for data loading and model inference functions using the ddtrace package, allowing us to optimise sections of our code and monitor latency in production.

[Figure: Datadog trace for an invocation of the model serving API, showing the latencies of spans within a product ranking request.]
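
As an illustration of the tracing setup (the span and function names here are examples rather than our exact code), ddtrace lets us wrap the functions we care about so each invocation produces spans like those in the trace above:

```python
from ddtrace import tracer

@tracer.wrap(name="load_features")
def load_features(product_ids):
    ...  # fetch feature rows from DynamoDB

@tracer.wrap(name="model_inference")
def predict(features):
    ...  # run the XGBoost model
```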

Some other optimisations were later made to reduce latency further. We implemented BatchGetItem requests to DynamoDB via the boto3 Python package, allowing us to retrieve all the required features in one round trip. We also took advantage of the persistence inside warmed-up Lambda instances by wrapping our feature loading functions with a TTL (time-to-live) cache via Python’s cachetools, ensuring the features are not retrieved again until the values are likely to have been updated.
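
A minimal sketch of these two optimisations together, with a hypothetical table name and key schema:

```python
import boto3
from cachetools import TTLCache, cached

dynamodb = boto3.client("dynamodb")

@cached(cache=TTLCache(maxsize=32, ttl=300))  # re-fetch at most every 5 minutes
def load_product_features(product_ids: tuple[str, ...]) -> dict:
    """Fetch all product feature items in a single BatchGetItem round trip
    (BatchGetItem accepts up to 100 keys per request)."""
    response = dynamodb.batch_get_item(
        RequestItems={
            "product_features": {  # hypothetical table name
                "Keys": [{"product_id": {"S": pid}} for pid in product_ids]
            }
        }
    )
    return {
        item["product_id"]["S"]: item
        for item in response["Responses"]["product_features"]
    }
```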

Containerisation

We wanted to maintain end-to-end ownership of machine learning models and their deployment within the same team. As AWS Lambda supports deploying custom containers, we were able to create an environment that data scientists could develop in and test locally on their machines. This, in conjunction with test driven development, gave confidence in allowing the entire team to make small changes and have these released multiple times per day.

An important component for locally testing Lambda images is the AWS Lambda Runtime Interface Emulator (RIE). Rather than configure the RIE for a completely bespoke Docker image, we defined a Dockerfile that built on top of AWS’ own Lambda image, ensuring we inherited their best practices for Lambdas and that our images had the RIE included by default. We also used multi-stage Dockerfiles to reduce the size of our resulting Docker images; however, we found that this had little impact on cold start times in practice, likely thanks to caching between Lambdas and AWS’ Elastic Container Registry (ECR), where these container images are loaded from.

The container entrypoint was a lightweight Python package using the ‘AWS Lambda Powertools’ to validate the input parameters with pydantic, invoke the model, apply the business logic to the output and generate the expected response format. We also used tox in our CI/CD pipeline to ensure that our tests ran against this package as it would be installed in a deployment environment.
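
A simplified sketch of an entrypoint in this style, using Lambda Powertools’ parser utility to validate the incoming event against a pydantic model before the ranking logic runs; the event fields and the ranking stub are illustrative:

```python
from aws_lambda_powertools.utilities.parser import BaseModel, event_parser
from aws_lambda_powertools.utilities.typing import LambdaContext

class RankingRequest(BaseModel):
    user_id: str
    market: str
    product_ids: list[str]

def rank_products_for_user(user_id: str, product_ids: list[str]) -> list[str]:
    # Placeholder for the real feature loading, model inference and business logic.
    return product_ids

@event_parser(model=RankingRequest)
def handler(event: RankingRequest, context: LambdaContext) -> dict:
    ranked = rank_products_for_user(event.user_id, event.product_ids)
    return {"statusCode": 200, "ranked_product_ids": ranked}
```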

A/B Testing

At the time of development, we had not yet adopted a solution at Bloom & Wild for backend-based experiment allocation. A number of ML frameworks support randomly assigning traffic to variants of hosted ML models; however, we wanted a longitudinal experiment, where returning users continue to receive a ranking from the same model variant, so that we could measure the uplift in our metric over the experiment duration.

We didn’t want to introduce a shared state or database to our serverless setup or to have to address the resulting issues of trading off consistency and availability. Instead we implemented deterministic experiment allocation — this is done by hashing a fingerprint for the user and the experiment ID together and using the resulting value to allocate the user to an experiment variant. This approach ensures that whenever a user visits the site within a given experiment, the same value is calculated and they are always allocated to the same variant within an experiment without any lookup.
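
A minimal sketch of this allocation logic (the fingerprint, variant names and 50/50 split are illustrative; the hashing details in production may differ):

```python
import hashlib

def allocate_variant(user_fingerprint: str, experiment_id: str,
                     variants=("control", "ml_model")) -> str:
    """Deterministically allocate a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{user_fingerprint}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in [0, 100)
    # With two variants this is a 50/50 split; unequal splits just change the cutoff.
    return variants[0] if bucket < 50 else variants[1]

# The same user always lands in the same variant for a given experiment:
assert allocate_variant("user-123", "ranking-v2") == allocate_variant("user-123", "ranking-v2")
```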

Observability and graceful failure

While our infrastructure and scalability are managed using AWS’ serverless tooling, we still needed observability and monitoring of data quality and API performance. We took advantage of Great Expectations-style data tests as a way of defining the expected values for our model features and their upstream data sources, and we implemented these tests in dbt using the dbt_expectations package. Testing the model features is important as it helps identify feature drift, where over time feature data can change from what a model was trained on and impair model performance, indicating a need for retraining or generalising the model. We also used Datadog to define alerting around our SLAs, notifying the team if the p95 request latency exceeded our agreed level or if the number of 5xx responses from our API exceeded a threshold.

To avoid our model serving infrastructure becoming a critical component and requiring out of hours support from the data team, we requested that clients implement the circuit breaker pattern when consuming our API. When our API returns an error response, the clients (the B&W website or apps) can fall back to another mechanism for sorting products for the rest of the user’s session, such as by cheapest price.
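
Our real clients are the website and apps rather than Python services, but the pattern itself is small; an illustrative sketch of a per-session circuit breaker with a price-based fallback:

```python
def get_product_order(session: dict, products: list[dict], ranking_api) -> list[dict]:
    """Use the ranking API unless it has already failed this session."""
    if session.get("ranking_api_tripped"):
        return sorted(products, key=lambda p: p["price"])  # fallback: cheapest first
    try:
        return ranking_api.rank(products)
    except Exception:
        session["ranking_api_tripped"] = True  # open the circuit for the rest of the session
        return sorted(products, key=lambda p: p["price"])
```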

Costs

AWS Lambda is generally very cheap, being charged at $0.20 per million requests along with a small charge for duration, which works out at about a dollar per million requests at our short durations. The DynamoDB costs for this setup are similarly negligible at $0.29 per million ‘read request units’, along with a few dollars per month of feature writing costs that do not scale with traffic. This is significantly cheaper than even the smallest instances of a managed model inference endpoint, even before considering the additional infrastructure these would need for storing features and layering on business logic. These costs also scale linearly with traffic, with no management needed for this scaling.
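
To make that concrete with illustrative numbers: assuming a 1 GB Lambda with a billed duration of around 50ms per request, and a duration price of roughly $0.0000167 per GB-second, a million requests cost about $0.20 in request charges plus 1,000,000 × 0.05s × 1GB × $0.0000167 ≈ $0.83 in duration charges, i.e. around a dollar in total.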

Iterating the ML Platform

We were able to use the platform described above to serve our ML model and demonstrate a significant revenue uplift for the business. In further iterations we expanded the platform to ‘close the feedback loop’ by capturing an event stream from the Lambda via AWS Firehose, giving richer training data for further model training. This training was moved to cloud notebooks, and we adopted Feast as a feature store, allowing us to re-use our feature definitions in other ML projects. Similarly, we adopted MLflow as a way of tracking all of the model parameters we tried when training models in notebooks, capturing the metrics and model artefacts in a model registry, with promising models then able to be promoted into an experiment in production.

[Figure: Our model serving infrastructure after several iterations, now with MLflow, Feast and Firehose; data flows from Snowflake via a Feast feature store and model artefacts are managed by MLflow.]
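
An illustrative MLflow tracking snippet for a notebook training run (the parameters, metric and registry model name are examples, and registering the model assumes a tracking server with a model registry configured):

```python
import mlflow
import mlflow.xgboost
import numpy as np
import xgboost as xgb

# Toy data purely to make the example self-contained.
X, y = np.random.rand(200, 4), np.random.randint(0, 2, 200)
params = {"max_depth": 6, "eta": 0.1, "objective": "binary:logistic"}

with mlflow.start_run(run_name="ranking-experiment"):
    mlflow.log_params(params)
    model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=20)
    mlflow.log_metric("ndcg_at_10", 0.42)  # placeholder evaluation metric
    # Logging to a registry requires an MLflow tracking server with one configured.
    mlflow.xgboost.log_model(model, artifact_path="model",
                             registered_model_name="product-ranking")
```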

After seeing value with the model we deployed, we expanded our architecture horizontally across the other brands and markets in the Bloom & Wild group. We did this by exposing separate API Gateways for the branding and localisation appropriate to each market, but were able to serve these via the rest of the existing infrastructure. There are still some areas to improve. For example, AWS Firehose does not give an SLA for latency, with rare cases of it taking several seconds to emit an item to the stream. This could be addressed by adopting an AWS Kinesis stream prior to Firehose, which is more in line with AWS recommended practice; however, this would introduce additional cost and complexity.

While we’re still evolving this setup to try out new features and model ideas, we’re also very happy with it: we were able to deliver value to the business quickly and at very little cost. We’ve kept an open mind about build vs buy and continue to evaluate the machine learning platforms and tooling available, as cost isn’t really the constraint here, but each time we’ve realised that the flexibility of being able to iterate our own (still relatively simple) setup is far greater than with off-the-shelf offerings.
