ML model deployments

Tomer Harpaz
Plarium-engineering
4 min read · Aug 9, 2022


At Plarium, our data engineering and data science teams work closely together to apply ML pipelines in various areas, such as improving marketing effectiveness or enhancing the games’ experience. ML models have become a necessity for any data-driven organization in the past few years, but how do you integrate them into your business?

Without going into the ML models themselves, I’ll explain the flow, architecture, and implementation we used in some of these cases.

Our ecosystem

Some background about our ecosystem: since we are hosted on Google Cloud Platform, our services are deployed on GKE, and we prefer to use containers and Kubernetes wherever possible. Most of the data is loaded into BigQuery (and some is saved as files in buckets). Complex, heavy-duty ETLs and streaming processes are written in Scala using Apache Spark and deployed to Dataproc, while simpler data processes, as well as data scientists’ or analysts’ processes, are written in Jupyter notebooks. These notebooks can be written in either Python or Scala, with or without Spark, and also run over GKE. Tap the link to read more about our Jupyter & Spark environment from our Chief Architect. Any ETL process is scheduled to run using Airflow. Our monitoring is done using Prometheus with Grafana.

Example Use Case

  • An ML model is scheduled to run once every few hours to analyze players’ behavior and output game recommendations.
  • The game adjusts the player experience based on the recommendation.

ML Process

In this use case, the need to process a large number of data points and historical data made us choose offline predictions. So our ML models are Python code that runs in Jupyter notebooks triggered by Airflow. We have a JupyterHub deployed over Google Kubernetes Engine pods. The process reads its input from BigQuery, where all our games’ data resides, and saves the results back to BigQuery.
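For illustration, a notebook’s read-score-write loop looks roughly like the sketch below. It is written in Scala with Spark and the spark-bigquery connector (our notebooks can be either Python or Scala); the table names, the bucket and the placeholder “scoring” step are made up, not our actual pipeline.

```scala
// Minimal sketch of the notebook's I/O pattern with Spark and the spark-bigquery connector.
// Table names, the bucket and the "scoring" step are placeholders, not our real pipeline.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("game-recommendations").getOrCreate()

// Read the players' behavior data from BigQuery.
val events = spark.read
  .format("bigquery")
  .option("table", "my_project.analytics.player_events") // hypothetical table
  .load()

// The actual model scoring happens here (in our case, in Python notebooks).
val recommendations = events.groupBy("player_id").count() // placeholder for real scoring

// Save the recommendations back to BigQuery.
recommendations.write
  .format("bigquery")
  .option("table", "my_project.ml_output.recommendations") // hypothetical table
  .option("temporaryGcsBucket", "my-temp-bucket")          // hypothetical staging bucket
  .mode("overwrite")
  .save()
```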

Loading results to cache

Serving the results to the game app requires very low latency, which led us to cache the results in Redis. Here we used another Jupyter notebook triggered by Airflow. We use Scala code with Spark to read the dataframe from BigQuery and load it into Redis (using the spark-redis package). The Redis table name is based on the specific model and version, and the key is derived from some of the model params or the business use case.

The division into tables allows us to use several versions of the same model for A/B testing and to control a new model’s gradual rollout.
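As a rough sketch, assuming the spark-redis DataFrame API, the loading notebook does something along these lines; the Memorystore address, table and column names below are placeholders:

```scala
// Sketch of loading model output from BigQuery into Redis with spark-redis.
// The Memorystore address, table and column names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-recommendations-to-redis")
  .config("spark.redis.host", "10.0.0.3") // Memorystore IP (placeholder)
  .config("spark.redis.port", "6379")
  .getOrCreate()

// Read the model's output from BigQuery.
val recommendations = spark.read
  .format("bigquery")
  .option("table", "my_project.ml_output.recommendations") // hypothetical table
  .load()

val modelName    = "game_recs" // hypothetical model name
val modelVersion = "v2"

// One Redis "table" per model + version, keyed by a business key such as the player id.
recommendations.write
  .format("org.apache.spark.sql.redis")
  .option("table", s"$modelName:$modelVersion")
  .option("key.column", "player_id")
  .mode("overwrite")
  .save()
```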

Serving the results — Requirements

This is the tricky part. Serving the ML results back to the game app has the following technical and business requirements:

  • Request Authentication — this will also determine the relevant model for the recommendation
  • SSL connection — for obvious security reasons
  • Low latency — the mobile game needs to get the model recommendation, but it can’t afford to wait for it
  • High Throughput — the solution must be scalable, and support a (very) large number of concurrent requests
  • Monitoring that the service is up, and alerting on service failure
  • Business monitoring of request counts, and alerts on a low number of requests or a high number of error responses
  • Reduced response size — for mobile performance considerations; this can be crucial for players on older devices, and network traffic should be minimized as well
  • Model balancing according to model version weights — a new model version is usually tested first on a sample population. Then, using weight configuration, we can adjust the share of the population served by a specific model.
  • High Availability — even though our game studios use robust code that will not crash or get hung in case the service is not available — we still need to make sure it is always available (to the extent of some nines :-) )
  • Multitenancy — the service can be used by several studios and supports different models per studio

Serving the results — The Solution

For this REST service, we used Scala with Akka actors, deployed on GKE pods.
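Just to give a feel for the shape of the service, here is a minimal Akka HTTP skeleton on top of an actor system; the route, port and payload are placeholders rather than the production code:

```scala
// Skeleton of a recommendations REST endpoint with Akka HTTP on top of an actor system.
// The route, port and response payload are placeholders, not the production service.
import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object RecommendationService {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem[Nothing] = ActorSystem(Behaviors.empty, "recs-service")

    val route =
      path("recommendations" / Segment) { playerId =>
        get {
          // In the real service: JWT authentication, model balancing and a Redis lookup
          // happen here (sketched in the following sections).
          complete(s"""{"playerId":"$playerId","recommendation":[]}""")
        }
      }

    Http().newServerAt("0.0.0.0", 8080).bind(route)
  }
}
```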

The GKE deployment answers several of the requirements:

  • It enables us to easily scale up to as many pods as needed
  • It lets us deploy an ingress with an SSL connection and load balancing
  • The service is highly available since it runs on multiple pods

For authentication, we used a JWT token, which also lets us identify the requester and use that to fetch the relevant model products.
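As a sketch, assuming the jwt-scala library and an HMAC-signed token (both assumptions, any JWT library would do), extracting the requester could look like this:

```scala
// Sketch of token validation with the jwt-scala library (a library choice assumed here).
// The secret handling and the claim-to-requester mapping are placeholders.
import pdi.jwt.{Jwt, JwtAlgorithm}

object Auth {
  private val secret = sys.env.getOrElse("JWT_SECRET", "change-me") // placeholder secret handling

  /** Returns the requester (e.g. studio / game) id if the token is valid, None otherwise. */
  def requesterId(token: String): Option[String] =
    Jwt.decode(token, secret, Seq(JwtAlgorithm.HS256)) // Try[JwtClaim]
      .toOption
      .flatMap(_.subject) // map the subject claim to the requester, then to its model products
}
```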

To support monitoring, the service exposes a metrics endpoint for Prometheus, with counters for requests and the different response types; Grafana then queries Prometheus to visualize them.
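For illustration, assuming the Prometheus Java simpleclient (the exact client library, metric and label names here are my assumptions), the counters and the /metrics payload can be wired roughly like this:

```scala
// Sketch of the request/response counters behind the /metrics endpoint, assuming the
// Prometheus Java simpleclient; metric and label names are illustrative.
import io.prometheus.client.{CollectorRegistry, Counter}
import io.prometheus.client.exporter.common.TextFormat
import java.io.StringWriter

object Metrics {
  val requests: Counter = Counter.build()
    .name("recs_requests_total")
    .help("Recommendation requests, labeled by studio and response status")
    .labelNames("studio", "status")
    .register()

  /** Renders all registered metrics in the Prometheus text format for the /metrics route. */
  def scrape(): String = {
    val writer = new StringWriter()
    TextFormat.write004(writer, CollectorRegistry.defaultRegistry.metricFamilySamples())
    writer.toString
  }
}

// Inside a request handler: Metrics.requests.labels("studio-a", "200").inc()
```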

In the Grafana dashboard, we defined alerts that fire either when the number of requests drops below a threshold or when the number of error responses is higher than expected.

We implemented model balancing logic so that requests are distributed between the different model versions according to the configured weights.
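A minimal sketch of such weight-based selection (the version names and weights are made up):

```scala
// Minimal sketch of weight-based model version selection; names and weights are made up.
import scala.util.Random

final case class ModelVersion(name: String, weight: Double)

object ModelBalancer {
  /** Picks a version with probability proportional to its configured weight. */
  def pick(versions: Seq[ModelVersion], rng: Random = new Random): ModelVersion = {
    val roll = rng.nextDouble() * versions.map(_.weight).sum
    var acc  = 0.0
    versions.find { v => acc += v.weight; acc >= roll }.getOrElse(versions.last)
  }
}

// e.g. 90% of players stay on the current model while 10% try the candidate:
// ModelBalancer.pick(Seq(ModelVersion("game_recs:v2", 0.9), ModelVersion("game_recs:v3", 0.1)))
```

In practice you would likely hash the player id rather than roll per request, so the same player always lands on the same model version during an A/B test.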

As already mentioned above, to enable low latency the ML products are stored in Memorystore (GCP’s managed Redis).
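A sketch of that lookup path, assuming the Jedis client and the default spark-redis hash layout of "<table>:<key>" (both assumptions on my part); the Memorystore address is a placeholder:

```scala
// Sketch of the low-latency lookup against Memorystore, assuming the Jedis client and the
// default spark-redis hash layout ("<table>:<key>"); the address is a placeholder.
import redis.clients.jedis.JedisPool
import scala.jdk.CollectionConverters._

object RecommendationStore {
  private val pool = new JedisPool("10.0.0.3", 6379) // Memorystore IP (placeholder)

  /** Looks up the cached recommendation row for a player under a given model table. */
  def lookup(modelTable: String, playerId: String): Option[Map[String, String]] = {
    val jedis = pool.getResource
    try {
      val row = jedis.hgetAll(s"$modelTable:$playerId").asScala.toMap
      if (row.isEmpty) None else Some(row)
    } finally jedis.close() // returns the connection to the pool
  }
}
```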

The actual response can be either JSON (to enable easier analysis of the response) or bit-packed data (using bitwise manipulation) to enable the smallest response size possible.
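To illustrate the bit-packed variant, here is a sketch with made-up field widths; the point is that a few small fields collapse into a couple of bytes instead of a JSON object:

```scala
// Sketch of the bit-packed response variant: a few small fields are packed into one Int
// with bitwise shifts, so the payload is a couple of bytes instead of a JSON object.
// The field widths (10 / 4 / 1 bits) are made up for illustration.
object ResponseCodec {
  /** Packs an offer id (10 bits), a variant (4 bits) and a flag (1 bit) into one Int. */
  def pack(offerId: Int, variant: Int, enabled: Boolean): Int =
    ((offerId & 0x3FF) << 5) | ((variant & 0xF) << 1) | (if (enabled) 1 else 0)

  /** Reverses pack, returning (offerId, variant, enabled). */
  def unpack(packed: Int): (Int, Int, Boolean) =
    ((packed >> 5) & 0x3FF, (packed >> 1) & 0xF, (packed & 1) == 1)
}

// pack(42, 3, enabled = true) uses 15 bits (two bytes on the wire),
// while the equivalent JSON object would take tens of bytes.
```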

Thanks for reading!
