Mathieu Deleu
Published in ADEO Tech Blog
Mar 25, 2024

On an e-commerce website, many customer interactions are handled by AI models: recommendations, search, personalisation, and so on. Each of those models is backed by a data product in production, and data scientists are involved in every stage of its lifecycle, from problem formulation to ongoing maintenance.

In this fast-moving world, we data scientists need to quickly iterate on our algorithms if we want our products to remain relevant, and we need to do it safely.

For that we need autonomy in production: not only developing models, but also deploying, monitoring, and maintaining them in a real-world environment. A true “you build it, you run it” approach, which brings multiple benefits:

  • Faster Iterations and Adaptability

Data scientists can respond more quickly to issues or changes in requirements when they have direct control over the deployment and operational aspects of their models.

  • Reduced Silos and Bottlenecks

Fewer silos and hand-offs in the development process reduce the risk of problems when moving from development to production.

  • Continuous Improvement

Teams are more motivated to implement continuous improvement processes when they are directly responsible for the ongoing success of the product.

This approach aligns development and operational considerations, leading to more reliable data products.

Reliable in this case means quality, traceability, reproducibility, and, of course, that what we are doing meets its business objectives.

Our team at Adeo Search and Publication is responsible for creating and maintaining a popularity score for each offer in our product catalog, for multiple business units. This score, computed once a day in batch mode, is used in various applications, from search to recommendations. We build it, we run it. We serve multiple websites with more than twenty million unique visits each week. It has to be reliable.

Strictly speaking, it’s a regression on tabular data. What’s more, we only execute it in batch mode, with no online inference. The easiest use case for data scientists, right?

Let’s see what it takes to make it reliable and to stay autonomous: the challenges we faced, the technical solutions we chose to address them, and every aspect needed for this data product to be successful.

Code

It’s not just about writing efficient and elegant functions. We have to keep one question in mind: “What could go wrong with our code?” Bugs, deviations from our coding standards, unused code, leaked secrets, all that kind of bad stuff. How can we prevent these mistakes?

  • Version control

We of course use version control to track changes to the code base. It facilitates collaboration and rollbacks, and lets us understand how the code evolved over time.

Each new feature or fix is developed in a feature branch, tested in the local environment, then reviewed in a pull request to the development environment, tested again, and finally deployed to production.

  • Access Controls

Sensitive information such as secrets is stored in HashiCorp Vault, reducing the likelihood of exposure through mishandling or insecure storage practices.

  • Automatic code quality checks

Reviews are necessary but far from sufficient. Unit tests, code linters and formatters help us catch and address issues early in the development process.

We use Black, Pydantic, detect-secrets and sqlfluff with pre-commit, run locally in VS Code (or your favourite editor) as well as in our CI with GitHub Actions, to which we add SonarQube and Checkmarx.

Fig 1: Example of pre-commit sqlfluff
  • Infrastructure as code

Everything is defined as code, for seamless automation and versioning via Terraform and GitHub.

Version control, access control, tests and infrastructure as code all serve our objectives of quality, traceability and reproducibility. Our code is tested, maintained, and leaves very little room for human error.

Data

It’s obvious we need data to feed our algorithms. It is also obvious that we need accurate data to make accurate predictions, not just once, but at any time in production. Now the question is, what kinds of trouble might we run into? Data not loaded at all, suspiciously little data, unannounced changes in the sources… We can also make mistakes when coding business rules.

Data flows from what we ingest to what we output, so let’s walk through our challenges in that order.

  • Data organization: setting up for multiple business units and multiple environments

All our raw data comes from our data warehouse in BigQuery, and that’s also where we export our results and finely crafted features.

We need to run the same code for multiple business units and multiple environments, and DBT is just the right tool for that. Following DBT best practices, we created three model directories: staging, intermediate and marts. Queries may differ slightly from one business unit to another in the staging part, but they are identical thereafter.

All we had to do was create a profile for each business unit.

Fig 2: Example of profile.yml
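
To make this concrete, here is a minimal sketch of how the same DBT project could be run per business unit from Python; the business unit and environment names are hypothetical, and the real pipeline may invoke DBT differently.

```python
import subprocess

# Hypothetical business units; one DBT profile per unit (see Fig 2).
BUSINESS_UNITS = ["bu_france", "bu_italy", "bu_spain"]

def run_dbt_for(business_unit: str, env: str = "dev") -> None:
    """Run the shared DBT project against a given business unit and environment."""
    subprocess.run(
        [
            "dbt", "build",
            "--profile", business_unit,  # one profile per business unit
            "--target", env,             # dev / prod environment
        ],
        check=True,  # fail loudly so the orchestrator marks the run as failed
    )

if __name__ == "__main__":
    for bu in BUSINESS_UNITS:
        run_dbt_for(bu, env="dev")
```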

That way, with the same code, a dedicated data pipeline creates a data mart as input for our algorithm, which then has its own dedicated machine learning pipeline (feature lookup, scaling, prediction).

  • Data quality

Is the data good enough for us? And what is the metric for “good enough”?

Freshness

First of all, we need our inputs to be up to date. For example, we are supposed to have the previous day’s analytics history at our disposal every day. Thanks to DBT freshness blocks, we can define the acceptable amount of time between the most recent record and now for a table to be considered “fresh”.

Fig 3: Example of the freshness block. Filtering is very important here if we don’t want to do a full scan of the table.

What we get is what we expect

Unit testing is precious here, and we use the DBT implementation from Equal Experts. We create mock input data and the expected output our models should produce. The mock data is passed through our queries, and the results are compared with the expected data we defined beforehand. If they match, the test passes; otherwise, it fails.

Fig 4: Example of unit test with DBT : tag, mock inputs, expected output

A sudden drop in volume can also be checked, this time on real data and with a SQL query. For example, we built a test that checks whether the volume of data we receive is suspiciously low compared to the day before.
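
As an illustration, here is a minimal sketch of such a volume check using the BigQuery Python client; the table, column and threshold are hypothetical, and the real test is implemented as a SQL test in DBT.

```python
from google.cloud import bigquery

# Hypothetical table and threshold; event_date is assumed to be the partition column.
TABLE = "my_project.analytics.daily_events"
MIN_RATIO = 0.5  # alert if yesterday's volume is less than half of the day before

def check_volume_drop(client: bigquery.Client) -> None:
    query = f"""
        SELECT
          COUNTIF(event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS yesterday,
          COUNTIF(event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)) AS day_before
        FROM `{TABLE}`
        WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
    """
    row = next(iter(client.query(query).result()))
    if row.day_before and row.yesterday / row.day_before < MIN_RATIO:
        raise ValueError(
            f"Suspicious volume drop: {row.yesterday} rows vs {row.day_before} the day before"
        )

if __name__ == "__main__":
    check_volume_drop(bigquery.Client())
```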

Quality gates at each step

In our ML pipeline we need to check at each step that we don’t create or lose data, and we found SODA pretty useful for this. At each step we can check that there are no duplicates, no missing data, no null columns, and even no null variance within columns. Using SODA tests as validation steps makes the whole pipeline more reliable.

Fig 5: a list of soda yaml files and a call to a function we created leveraging SODA Core library.
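
For illustration, here is a minimal sketch of what such a helper could look like with the Soda Core Python API; the file names and data source name are hypothetical.

```python
from soda.scan import Scan

def run_soda_checks(checks_file: str, data_source: str = "bigquery_prod") -> None:
    """Run a SodaCL checks file as a validation gate and fail the pipeline on errors."""
    scan = Scan()
    scan.set_data_source_name(data_source)                 # data source declared in configuration.yml
    scan.add_configuration_yaml_file("configuration.yml")  # connection details
    scan.add_sodacl_yaml_file(checks_file)                 # e.g. checks/features.yml
    scan.execute()
    # Raise if any check fails, so the step (and the whole run) is marked as failed.
    scan.assert_no_checks_fail()

# Example usage between two pipeline steps:
# run_soda_checks("checks/no_duplicates_after_feature_lookup.yml")
```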

Again, our objectives are met through organization and tests.

Machine learning

We want reliability, in terms of quality, traceability and reproducibility. First of all, quality in machine learning means a production-ready pipeline, without human intervention, leaving less room for error. Our automated tests and validations, as seen above, are implemented at every stage of the pipeline so that any errors or inconsistencies in the data or the model are caught early and corrected. Of course, there is a lot more.

  • Pipeline

We chose ZenML as our ML pipelining framework and we are pretty happy with it. It lets us design robust, standardized pipelines, with the flexibility to run them locally and/or on our preferred orchestrator (such as Vertex AI), ensuring native reproducibility. Furthermore, each run is documented and labeled, and we meticulously track all the metadata associated with it.

Fig 6a: Example of steps defined in a ZenML pipeline
Fig 6b: pipeline executed in Vertex.ai
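
For illustration, here is a minimal sketch of what a ZenML batch-scoring pipeline of this kind could look like; the step names and their contents are placeholders, not our actual code.

```python
import pandas as pd
from zenml import pipeline, step

@step
def load_features() -> pd.DataFrame:
    # In the real pipeline this would read the DBT-built datamart from BigQuery.
    return pd.DataFrame({"offer_id": [1, 2], "clicks": [10, 3], "sales": [2, 1]})

@step
def predict(features: pd.DataFrame) -> pd.DataFrame:
    # Placeholder scoring logic; the real step loads the registered model.
    features["popularity_score"] = features["clicks"] + features["sales"]
    return features

@pipeline
def serving_pipeline():
    features = load_features()
    predict(features)

if __name__ == "__main__":
    # Runs locally by default; with a Vertex AI orchestrator in the active
    # ZenML stack, the same code runs on Vertex AI Pipelines.
    serving_pipeline()
```
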
  • Scalability

Speaking of the Google suite, we previously had problems with pods crashing due to a lack of memory, but these have been resolved by using serverless BigQuery ML. This is particularly useful for serving, since we can train our models with a different engine and serve them with BigQuery.
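
As an illustrative sketch (dataset, model and table names are hypothetical), serving with BigQuery ML boils down to a single serverless query, which can be triggered from Python:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names: a model available in BigQuery ML (trained there or
# imported from another engine) and a daily feature table.
query = """
    CREATE OR REPLACE TABLE `my_project.scores.daily_popularity` AS
    SELECT offer_id, predicted_popularity
    FROM ML.PREDICT(
        MODEL `my_project.models.popularity_regressor`,
        TABLE `my_project.features.daily_offer_features`
    )
"""
client.query(query).result()  # runs serverless, no pod memory to manage
```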

  • ML artifacts

Models

We store our models in MLflow, along with some associated metadata (performance on the test sample, configuration, feature names, table names…).

If something goes wrong with a new model we can manually go back to an old version. Need I tell you about the time a model learned that the more a product is sold, the lower its click-through rate? (a classic case of regression towards the mean) Well, we didn’t keep it.

Fig 7: MLflow models and versions
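
A minimal sketch of the idea with the MLflow API (model name, metrics and versions are hypothetical, and a recent MLflow is assumed for model aliases):

```python
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LinearRegression

MODEL_NAME = "popularity-regressor"  # hypothetical registry name

# At training time: log the fitted model with its metadata and register a new version.
model = LinearRegression().fit([[10, 2], [3, 1]], [0.8, 0.2])  # toy data
with mlflow.start_run():
    mlflow.log_params({"algorithm": "linear_regression", "features": "clicks,sales"})
    mlflow.log_metric("rmse_test", 0.42)
    mlflow.sklearn.log_model(model, "model", registered_model_name=MODEL_NAME)

# If a new version misbehaves, point the serving alias back to a known-good version.
client = MlflowClient()
client.set_registered_model_alias(MODEL_NAME, alias="champion", version=1)
```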

Data

Training and serving data are versioned and kept for a retention period in a BigQuery dataset.

Fig 8: Example of versioned input data
  • ML monitoring

Offline evaluation metrics (NDCG, RMSE, …) are stored and monitored after every training on the test sample and every day on real data.

Outlier detection is set up on our outputs.
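
To make the metrics part concrete, here is a minimal sketch of how such offline metrics could be computed and stored after each run, using scikit-learn and MLflow; the run name and sample values are made up.

```python
import numpy as np
import mlflow
from sklearn.metrics import mean_squared_error, ndcg_score

def log_offline_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """Compute RMSE and NDCG on a sample and store them with the monitoring run."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    # ndcg_score expects 2D arrays of shape (n_queries, n_items).
    ndcg = float(ndcg_score(y_true.reshape(1, -1), y_pred.reshape(1, -1)))
    with mlflow.start_run(run_name="daily-monitoring"):
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("ndcg", ndcg)

# Example usage on the test sample or on yesterday's real data:
log_offline_metrics(np.array([3.0, 2.0, 1.0, 0.0]), np.array([2.8, 2.1, 0.7, 0.3]))
```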

  • Explainability

SHAP values are computed with BigQuery’s ML.EXPLAIN_PREDICT function at each serving run. We use those values at different stages.

We showcase the model’s overall feature importance with a bar plot of the average absolute SHAP value of each feature, in a dedicated PowerBI dashboard our users have access to; it helps build a relationship of trust with them.

We are also able to explain a particular prediction using its individual feature contributions.

We observe the partial dependence plots to understand how a given feature impacts the scores.
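
As a sketch of how the global importance can be derived from those attributions (the model, table and output column layout below are assumptions on our side):

```python
from google.cloud import bigquery

client = bigquery.Client()

# ML.EXPLAIN_PREDICT returns per-row feature attributions; averaging their
# absolute values gives the global feature importance shown on the dashboard.
query = """
    SELECT attr.feature, AVG(ABS(attr.attribution)) AS mean_abs_attribution
    FROM ML.EXPLAIN_PREDICT(
        MODEL `my_project.models.popularity_regressor`,       -- hypothetical names
        TABLE `my_project.features.daily_offer_features`,
        STRUCT(5 AS top_k_features)
    ), UNNEST(top_feature_attributions) AS attr
    GROUP BY attr.feature
    ORDER BY mean_abs_attribution DESC
"""
for row in client.query(query).result():
    print(row.feature, round(row.mean_abs_attribution, 4))
```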

Machine Learning Configuration

Every parameterisable element is kept in a YAML configuration file, from the BigQuery output destination to the input features.

Fig 9: Example of our config.yml file.

One of the main advantages of config files is that they are versioned, and every configuration change reaches production through our CI/CD pipeline.
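
Here is a minimal sketch of how such a configuration could be loaded and validated with Pydantic; the fields shown are hypothetical.

```python
import yaml
from pydantic import BaseModel

class Config(BaseModel):
    """Typed view of config.yml; Pydantic rejects missing or malformed values."""
    business_unit: str
    output_table: str           # e.g. the BigQuery destination for the scores
    features: list[str]         # input feature names used by the model
    training_schedule: str = "weekly"

def load_config(path: str = "config.yml") -> Config:
    with open(path) as f:
        return Config(**yaml.safe_load(f))

# Example usage:
# cfg = load_config()
# print(cfg.output_table, cfg.features)
```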

CI/CD, Pipeline Orchestration and Automation

We firmly believe that every member of the team, whether data scientist, analyst or engineer, should be able to push their own work to production with ease, in compliance with our quality standards. When everything is fine, it’s click and deploy. Let’s see how.

Every increment is pushed live through the process of our CI/CD and we follow the standard procedure:

  1. Build the application
  2. Run the unit tests
  3. Build the application image
  4. Push it to a registry
  5. Deploy
  • Separate components

Each component (data preprocessing pipeline, training pipeline, serving pipeline, performance monitoring pipelines) is autonomous and deployed independently with a dedicated workflow in GitHub Actions.

Fig 10: component organization and workflows
  • Data pipeline components

After all the CI steps we deploy with Turbine, our in-house CD application based on Kubernetes.

The whole workflow looks like this:

Fig 11: details on the data pipeline deployment workflow
  • Machine learning pipeline components

This is the tricky part. Let’s start from the moment a data scientist is happy with an improvement, has pushed it to dev, and the team is happy with how it works.

Once the pull request to prod is validated, it automatically triggers a GitHub Action that deploys the pipeline to the production environment. The pipeline trains the model on production data, runs checks comparing its performance with the model currently being served, and, if all checks pass, automatically deploys the new model.
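
To illustrate the idea of that gate, here is a minimal champion/challenger sketch; the metric, tolerance and values are assumptions, not our actual thresholds.

```python
def should_promote(challenger_rmse: float, champion_rmse: float,
                   tolerance: float = 0.01) -> bool:
    """Promote the newly trained model only if it is at least as good as the
    model currently served, within a small tolerance."""
    return challenger_rmse <= champion_rmse + tolerance

# In the deployment pipeline, the values come from the evaluation step and
# from the metadata of the model currently in production:
if should_promote(challenger_rmse=0.41, champion_rmse=0.43):
    print("Checks passed: deploying the new model")
else:
    raise RuntimeError("New model underperforms the production model, aborting deployment")
```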

ZenML creates everything needed in GCP: a Vertex AI pipeline, a Cloud Scheduler job and a Cloud Function.

We chose a strategy based on frequent retraining, so we started by running our training pipeline every week, while our serving pipeline runs every day.

Fig 12: details on the machine learning pipeline deployment workflow
  • Orchestration

We are still in need of a managed orchestrator, so we started with Kubernetes CronJobs for our data pipelines and Cloud Scheduler for our ML pipelines.

With those workflows, I can assure you it’s very easy, and very safe, for every member of the team to make their code go live.

Observability

Even if we are confident in our deployments, we have to be aware of what is happening, particularly when something goes wrong. We all know it: if something can go wrong, it will.

  • Log monitoring

All the tests described above, and all the unexpected events that could occur on our platform, are monitored with Datadog; the monitors themselves are defined in Terraform.

Fig 13: Example of Datadog log monitoring
  • Incident Response and Handling

When something has gone wrong, we know about it thanks to Slack.

Fig 14: Datadog log monitoring in Slack
  • Finops

Since we are in the cloud, we need to monitor our costs. From time to time we receive emails telling us that we have spent more than expected this month, or suspiciously more in a day than the average. Thanks, Adeo FinOps team!

Costs are monitored through dashboards too.

Integration within its ecosystem

  • Operational data exchange

Our popularity scores are pushed to the website through a Kafka stream. It’s our responsibility to export them from BigQuery to Kafka and to manage the streams in Conductor.
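
As an illustrative sketch of that export (topic, table and message schema are hypothetical), using the confluent-kafka client:

```python
import json

from confluent_kafka import Producer
from google.cloud import bigquery

TOPIC = "popularity-scores"  # hypothetical topic name

bq = bigquery.Client()
producer = Producer({"bootstrap.servers": "kafka:9092"})

# Read the freshly computed scores and publish one message per offer.
rows = bq.query(
    "SELECT offer_id, popularity_score FROM `my_project.scores.daily_popularity`"
).result()

for row in rows:
    producer.produce(
        TOPIC,
        key=str(row.offer_id),
        value=json.dumps({"offer_id": row.offer_id, "score": row.popularity_score}),
    )
producer.flush()  # make sure everything is delivered before the job ends
```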

  • Data Mesh

As a data producer, we need to expose our data to the rest of the Adeo organization. To do this, we publish our results in a public dataset, referenced in the Adeo data catalog, along with some quality metrics. And we do it through Terraform.

A data owner and a data steward on the team are responsible for managing access requests to our data and maintaining a semantic explanation in our data intelligence platform, Collibra.

  • Visibility to end users and consumers

Ultimately it’s not about NDCG or RMSE but clicks and sales. We have designed dashboards with PowerBI so that our internal customers can check on the performance of their business, relative to our scope.

A/B testing and feedback loop

  • A/B testing

A/B tests are performed regularly, so that we are sure any improvement seen offline is real and measurable on business metrics.

It’s the team’s responsibility to set up the experiment, analyze it, and communicate the results.
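
For illustration, here is a minimal sketch of the kind of analysis involved, a two-proportion z-test on conversion; the numbers are made up, and the real analysis depends on the metric and on the experiment design.

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up example: conversions and visits for control (A) and variant (B).
conversions = [1850, 1975]
visits = [100_000, 100_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visits)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests the offline improvement also moves the business metric.
```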

  • Other ways of gathering feedback

We share all the improvements we made along the way in a dedicated meeting once per week with other teams, and we publish a newsletter every month to our internal customers.

Documentation and Knowledge Transfer

Because yes, it’s worth a dedicated tool. With GitBook we can organize the documentation across the project (business goals, team members and their responsibilities), the ML model (step definitions, data definitions, the type of algorithm and why) and the overall software architecture design.

With sufficiently detailed documentation, level-one support can be autonomous, and GitBook AI helps find answers quickly.

It’s easier to onboard new team members of course.

Conclusion

Let’s wrap everything up.

Our data product is designed so that every member of the team can put their increment into production in complete safety, thanks to an appropriate workflow and automatic quality assurance.

In a short space of time, sometimes within a single sprint, we can safely put what we designed on our laptops into production.

We still have some way to go: deploying an online inference version, retraining less frequently, and monitoring our models more closely. But we’re confident that we’ll get there, autonomously and safely.
