From the notebooks to the user: ML Platform at QuintoAndar

A glimpse in the journey of building our machine learning platform to deliver better ML-powered systems.

Lucas Cardozo
Blog Técnico QuintoAndar
7 min read · Feb 14, 2022


Image with a notebook illustration and the title "From the notebooks to the user: ML Platform at QuintoAndar"

Context

Like all the other Ops disciplines we have heard about in the past few years (DevOps, DataOps, GitOps), MLOps is one of the youngest siblings. Besides sharing the suffix, these subjects share another common characteristic: they were all pretty hard to define in the beginning!

When the oldest child of the family, DevOps, was created, people fought all over the internet trying to define it. Is it a role? Is it a set of practices? Should you have a DevOps team in your company? Finally, the community seems to have settled on broadly defining DevOps as a set of processes, practices, and tools that aim to accelerate robust software delivery [1, 2, 3, 4, 5].

By better defining DevOps, we’ve opened the doors for new domains that could benefit from similar ideas. First, DataOps came to accelerate and standardize the delivery of data analytics-related products. Not long after that, the MLOps term was coined. It emerged from looking through the DevOps lens at the challenges faced by data scientists and machine learning engineers when building and delivering ML-powered systems.

By the end of this post, we expect you to grasp why MLOps is relevant and how we use it here at QuintoAndar to accelerate the delivery of ML-based products!

Machine learning lifecycle refresher

One way to make the role of MLOps a little clearer is to understand what problems we are trying to solve with it. When building ML-powered systems, a data scientist (DS) goes through different stages. We will briefly describe them here, just as a refresher! (If you are already familiar with these stages, feel free to skip this section 😄.)

The exploration stage is always the entry point. Here is when the data scientist’s creativity shines, as they talk to business stakeholders and explore different variables, modeling tasks, and algorithms to solve an actual business problem. Things tend to become very messy at this point, so flexibility and room for experimentation are key for success!

With better-defined goals, it is time to train and tune the model. At this stage, the expected output is an artifact of the trained model. As you might already imagine, this step generates the core element of an ML-powered system.

Contrary to what the average DS-related content on the internet shows, having a trained model is far from being the last step. To deliver the value that an ML-powered system has to offer, your trained model must be deployed/distributed and integrated with other services that will benefit from it in production.

Finally, when you think it’s over, you’re reminded that ML systems are much like living organisms. After deploying one (or giving birth to it, to keep the analogy going), you have to keep an eye on it by continuously monitoring its features, predictions, and performance. This way you can make sure the system is healthy or, otherwise, act quickly to avoid disasters!

Confused John Cena in the fight ring
Confused data scientist when they find out they need to monitor the model after deploying it.

MLOps at QuintoAndar

Hopefully, at this point, it’s a little bit clearer that the role of an MLOps team or platform is to make all steps of the modeling lifecycle as efficient and robust as possible by providing the tools and standardized processes needed to get there. Examples of such tools are:

  • feature stores, to make sure data is available at training and prediction time;
  • experiment tracking services, for keeping the full history of experiments (see the sketch after this list);
  • monitoring services that keep track of all deployed models’ performance.
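To make the experiment tracking example a bit more concrete, here is a minimal sketch of what logging a training run could look like. MLflow is used purely as an illustration (the post does not say which tracking backend QuintoAndar uses), and the run name, parameters, and metric are made up.

```python
# Minimal experiment tracking sketch (MLflow is an illustrative assumption,
# not necessarily the tool behind QuintoAndar's platform).
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def train_and_log(X_train, y_train, X_val, y_val, n_estimators=200):
    # Each run records its parameters, metrics, and model artifact,
    # so past experiments can be reproduced and compared later.
    with mlflow.start_run(run_name="price-model-baseline"):  # hypothetical run name
        mlflow.log_param("n_estimators", n_estimators)

        model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
        model.fit(X_train, y_train)

        mae = mean_absolute_error(y_val, model.predict(X_val))
        mlflow.log_metric("val_mae", mae)

        mlflow.sklearn.log_model(model, "model")
    return model
```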

By taking part of the responsibility off the data scientists’ shoulders, we let them focus on what they do best: solving business problems!

Now, it is finally time to share how this is actually done in practice here at QuintoAndar. In broader terms, the role of the MLOps team at QuintoAndar is to spot common improvement points and develop solutions that are general enough to be used across all teams working with ML-powered systems. Finally, these solutions end up becoming part of our Machine Learning Platform. We’ve talked about it before. The platform has evolved since then and will continue to, hopefully for the better! 😃

Machine Learning Platform Overview

QuintoAndar's machine learning platform overview. Composed of three layers, each with its own set of services and tools.
General overview of the layers and components of our platform.

We like to think of our ML Platform in terms of the layers that form it. First, the data layer is where all the components responsible for either computing or serving computed features live. This layer includes our feature pipeline orchestrator, responsible for executing feature pipeline Spark jobs (powered by butterfree) and catching data quality issues, and our online feature store, a Cassandra cluster that serves thousands of read requests per second for business-critical models.
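To give a flavor of the kind of job the orchestrator runs, below is a simplified PySpark sketch that computes aggregated features for a listing. It is only illustrative: in practice these pipelines are built on butterfree’s higher-level abstractions, and the table and column names here are hypothetical.

```python
# Simplified sketch of a feature pipeline Spark job. Table and column names are
# hypothetical; in practice butterfree handles sources, transformations and sinks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("listing-features").getOrCreate()

# Hypothetical raw events table; real sources and schemas are internal.
visits = spark.table("events.listing_visits")

listing_features = (
    visits.groupBy("listing_id")
    .agg(
        F.count("*").alias("visits_last_30d"),
        F.avg("visit_duration_seconds").alias("avg_visit_duration"),
    )
    .withColumn("timestamp", F.current_timestamp())
)

# The same feature set would then be written to the historical store (for
# training) and to the online store (for low-latency reads at prediction time).
listing_features.write.mode("overwrite").saveAsTable("feature_store.listing_features")
```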

Within the modeling layer live all the components created to streamline and standardize the experimentation and training phases. Here we have a custom modeling library in Python, a thin wrapper on top of scikit-learn, where components for performing common modeling tasks are implemented and shared among all squads. Additionally, our training pipeline orchestrator defines standards and abstracts several steps of training and hyperparameter tuning for all models.
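As an illustration of what such a thin wrapper can look like, here is a hedged sketch of a shared helper that gives every squad the same preprocessing conventions around any scikit-learn estimator. The function name and its defaults are assumptions for illustration, not QuintoAndar’s actual internal library.

```python
# Illustrative sketch of a shared modeling component; the real internal library
# is not public, so names and defaults here are assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_tabular_pipeline(estimator, numeric_cols, categorical_cols):
    """Wrap any scikit-learn estimator with standardized preprocessing."""
    preprocessing = ColumnTransformer(
        [
            ("numeric", Pipeline([
                ("impute", SimpleImputer(strategy="median")),
                ("scale", StandardScaler()),
            ]), numeric_cols),
            ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    return Pipeline([("preprocessing", preprocessing), ("model", estimator)])
```

A squad would then plug its own estimator into a helper like this and get consistent preprocessing and naming for free, instead of reimplementing the same steps in every project.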

The serving layer, as the name suggests, has components responsible for serving/deploying the trained model artifacts. Here, most of the standards and tools are currently defined at a lower level by QuintoAndar’s SRE team, being shared among all engineering teams across the organization.
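The post does not detail the serving stack, but to give a rough idea of what a model-serving component can look like, here is a minimal, hypothetical HTTP endpoint that loads a trained artifact and returns predictions. FastAPI, the artifact path, and the request schema are illustrative assumptions, not the SRE-defined standard mentioned above.

```python
# Minimal, hypothetical prediction service. FastAPI and the artifact path are
# illustrative assumptions; the actual serving stack follows SRE-defined standards.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("artifacts/model.joblib")  # trained pipeline from the modeling layer


class PredictionRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    # Run the loaded pipeline on a single feature vector and return a scalar.
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```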

Last but not least, we are introducing the monitoring layer. As mentioned earlier, monitoring is key to closing the modeling cycle and reacting to changes in data before it’s too late. Tools for monitoring both feature data quality and predictions are currently being integrated with the rest of the platform.
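One common building block for this kind of monitoring is a drift metric that compares the live distribution of a feature against its training-time distribution. The Population Stability Index below is a generic example of such a check, shown for illustration rather than as the metric QuintoAndar’s platform actually uses.

```python
# Generic feature-drift check using the Population Stability Index (PSI);
# an example of monitoring logic, not QuintoAndar's actual implementation.
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """Compare a live feature sample against its training-time distribution."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Equal-width bins derived from the training-time (expected) distribution.
    cuts = np.linspace(expected.min(), expected.max(), bins + 1)

    # Clip live values into the training range so out-of-range values land
    # in the first/last bin instead of being dropped.
    actual = np.clip(actual, cuts[0], cuts[-1])

    expected_pct = np.histogram(expected, cuts)[0] / len(expected)
    actual_pct = np.histogram(actual, cuts)[0] / len(actual)

    # Avoid division by zero / log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Rule of thumb: a PSI above ~0.2 usually warrants a closer look at the feature.
```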

Platform in action: a small glimpse

Here at QuintoAndar, we have several models that are key for our core business. For instance, our search engine is powered by a personalization step composed of several models that aim to return the best possible listings for each user. To rank the search results, each of these models must retrieve feature values with low latency from our online feature store. On top of that, we need to handle hundreds of similar operations per second simultaneously (high throughput), which brings scalability challenges for our platform's infrastructure.
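For illustration, this is roughly what a low-latency feature lookup against a Cassandra-backed online store can look like with the Python driver. The contact point, keyspace, table, and column names below are hypothetical, and a production client would add connection pooling, retries, and timeouts on top of this.

```python
# Hypothetical low-latency feature lookup from the online feature store
# (contact point, keyspace, table and column names are illustrative).
from cassandra.cluster import Cluster

cluster = Cluster(["feature-store.internal"])
session = cluster.connect("feature_store")

# Prepared statements are parsed once and reused, which keeps per-request
# latency low under high read throughput.
lookup = session.prepare(
    "SELECT visits_last_30d, avg_visit_duration "
    "FROM listing_features WHERE listing_id = ?"
)


def get_listing_features(listing_id: str):
    row = session.execute(lookup, [listing_id]).one()
    return dict(row._asdict()) if row else None
```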

This is just one of the several other use cases inside QuintoAndar in which the ML platform became a critical component for improving the experience of our customers!

The MLOps Team

Unlike with other kinds of platforms, prior knowledge of the machine learning lifecycle is key to building a Machine Learning Platform that solves the problems it sets out to solve. This, however, is not a sufficient condition.

Building such systems comes with its own set of challenges that require strong engineering and systems architecture skills to solve. With that in mind, our team is composed of software engineers, data scientists, and machine learning engineers who work together, each with their own expertise, to deliver the best possible solutions. As we thrive on diversity, different backgrounds, skills, and opinions are encouraged within our team!

SWEs, MLEs, and data scientists joining forces to build the best ML platform!

Final thoughts

As you might imagine at this point, building a robust platform is no weekend-long side project. As a small platform team in a fast-growing company, we need to stay humble and always work side by side with other teams. This way, we can smoothly introduce features without interfering too much with the roadmap or the development process of the teams that benefit from our solutions. By clearly defining boundaries and responsibilities, we avoid stepping on each other’s toes. 😃

According to 2021’s Gartner Hype Cycle for Data Science and Machine Learning, MLOps is still close to its peak of inflated expectations. A natural side effect of this hype is an ever-growing ecosystem of third-party solutions for the problems we encounter while trying to robustly productionize models.

Two kids with a paddle in a kayak that is sinking in the river.
Trying to leave the build-or-buy hole when picking/developing MLOps solutions

With that said, the MLOps ecosystem is still far from mature. There are no clear winners, either in terms of third-party solutions or of general architectures, to solve most of these problems. Critical thinking and evidence-based decision-making are powerful tools to guide us towards what works and what doesn’t for our use cases. By keeping our build-or-buy-o-meter updated, we try to make the best decisions while still controlling for future technical debt.

Finally, here at QuintoAndar, we’ve learned that the best strategy to navigate this uncertainty is to focus on the next smallest step that maximizes the positive impact on our products and stakeholders. Sometimes small changes in the process may have a huge impact on the perceived value of the platform from those who benefit from it!

I cannot forget to mention that none of this would be possible without the help of our amazing team! Thanks, Mayara Moromisato, Ralph Rassweiler, Guilherme Bonaldo, Pedro Andrade, André Silveira, and Igor Gushiken, for the insights and support. 😃

If you are curious to know more about what we are up to in the MLOps team, stay tuned for future blog posts where we will get more into the details of the engineering challenges we face here! 🤓 Also, if you want to be part of an innovative team and contribute to top-notch projects like this one, check out our open roles!

Did you like the content of this article? Do you want to work at QuintoAndar in the MLOps team? Click here to check the details and apply for the position.
