Machine Learning Pipeline at QuintoAndar

André Barbosa
Blog Técnico QuintoAndar
7 min read · Aug 26, 2019
[xkcd comic: a stick figure stands on top of a pile of linear algebra. "You pour the data in on one side and answers come out; stir the pile if they're wrong."]
xkcd is the best way to describe our daily activities

Data Science (or DS) is a hot topic where a considerable number of “best practices” are still being defined. There is no standard for what an ideal machine learning development process should look like. In this post, we’re sharing how my team and I defined and set up a robust Machine Learning pipeline to scale things up here at QuintoAndar! 🚀

What is a Data Product?

At our company, a Data Product is a product empowered by machine learning 😄

Let’s give a simple example:

The data is kinda fake, which is why the range of values is so wide 😅

Our rent calculator is a simple example of an interesting data product that we have been developing, and I am going to use it as a running example in this blog post 😃

The idea is simple: our company wants to make renting easy and cool, so how could we help owners rent out their houses more easily? How could we help tenants find an awesome place to live at a fair price?

How about developing something (a product) that recommends how much a property should be rented for? This is the idea behind our rent calculator! 🎉

Machine Learning is NOT the first option

If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.

Following Rule #1 of Google’s Rules of Machine Learning, there is no need to rush to state-of-the-art models when you are dealing with real-world problems. Better than that, a simple heuristic can give you a lot of insight and help you validate your main hypothesis. For example, you can take the average rent of houses in the same region as the house of interest.
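To make that concrete, here is a minimal sketch of such a heuristic in Python, assuming a pandas DataFrame of contracts with illustrative region_id and rent columns (not our real schema):

import pandas as pd

# Illustrative data: in practice this would come from the contracts database.
contracts = pd.DataFrame({
    "region_id": [1, 1, 1, 2, 2],
    "rent": [1500.0, 1700.0, 1600.0, 2400.0, 2600.0],
})

def heuristic_rent(region_id: int) -> float:
    """Baseline recommendation: the average rent of houses in the same region."""
    same_region = contracts[contracts["region_id"] == region_id]
    return same_region["rent"].mean()

print(heuristic_rent(1))  # 1600.0

No model, no training: just an average that already gives owners a reasonable starting point.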

Once this is done, your team has delivered some value to your company by making your clients happy. Awesome!

Someday you may want to improve your metrics, and for that you will need to go deeper. As a “second iteration cycle”, maybe Machine Learning can be a useful approach 😄

We need to iterate quickly through our models

Given the agile culture we follow here, we need to respond to business changes and act quickly. In the context of machine learning, quick iterations are the key to success.

The best way to achieve this is by learning from data, with baby steps. In other words, create a model that uses simple features you can generate (by feature engineering or by querying a database), deploy it, measure it, and improve it.
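To make the “baby steps” idea concrete, a first iteration could be as simple as a linear regression on one or two features that are easy to query. The snippet below is only a sketch with made-up numbers and column names, not our production model:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical training set built from two simple features.
data = pd.DataFrame({
    "area": [45, 60, 80, 100, 120, 150],
    "region_avg_m2_rent": [30, 28, 32, 27, 25, 26],
    "rent": [1400, 1650, 2500, 2700, 3000, 3900],
})

X = data[["area", "region_avg_m2_rent"]]
y = data["rent"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

Deploy something like this, measure the error against the heuristic, and only then decide whether a fancier model is worth the extra complexity.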

Collect the low-hanging fruit, one step at a time, until your team’s model achieves better performance, creating a better experience for your users.

Improving a deployed model is not an easy task. It is done by software engineers and data scientists who team up to achieve new results. These cycles must be prioritized and will always grow in complexity, from putting a simple regression model into production to building a complex ensemble or a novel deep learning architecture.

It is also necessary to take into consideration how data should be ingested and served, and how it should be presented to the user (e.g., a single value vs. an interval). The market can also change during this process; business needs can shift, or the data sources can alter their statistical behavior in a short period.

As stated, we need to be prepared to respond to those changes. So how can we keep track of all of these challenges, experiments, hypotheses, and models, and ensure that we are moving forward in a fast and reproducible way? (You know, “it works on my machine” is not acceptable anymore.)

Introducing our Machine Learning Pipeline

Inspired by Airbnb’s BigHead, Uber’s Michelangelo, and Facebook’s FBLearner Flow, our team started to think about how we could address the generic problems that usually show up when we try to build machine learning products. The diagram below shows what we came up with:

QuintoAndar's ML pipeline and its main sections: Feature Store, Model Management, and Model Deployment & Monitoring

At the top of the image above (blue rectangle) is what we defined as our data pipeline. Data should not be used as it is in the production database; it must be cleaned, aggregated, and transformed into something meaningful to a machine learning model.

Sometimes, creating a feature is as easy as:

SELECT area 
FROM house
WHERE id = 'someid'

or as complicated as:

WITH base AS (
    SELECT neigh.type,
           Avg(c.rent / Cast(neigh.area AS DOUBLE)) AS region_avg_m2_rent
    FROM contract AS c
    JOIN house AS neigh
        ON neigh.id = c.id_house
    JOIN house AS h
        ON h.id_region = neigh.id_region
    WHERE h.id = 'someid'
    GROUP BY neigh.type
)
SELECT some_10_line_pivot_function(*)
FROM base

Multiple machine learning models can share the same feature, so other data scientists/engineers may need to do the same transformation to use it in their solutions. Another common issue is that different people can work on similar ideas using slightly different data, leading to redundant features across models that sometimes make maintenance intractable. Therefore, we needed a way to “centralize” things and make some definitions a bit more canonical.

The Feature Store

Exposing features to machine learning models, data scientists, analysts, and a whole diversity of services is the role QuintoAndar’s Feature Store plays. It can be described as a place where anyone can access features that were computed before, making results easier to reuse. If a feature is the result of some aggregation, that operation needs to be done only once. People generate these features through the Feature Factory and send them to our Feature Store.

The concept of a Feature Store was first described in 2017 by Uber. It is a centralized and shared data layer that contains precomputed model-ready features to be used during training or serving.
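This post does not go into the Feature Store’s actual API, but conceptually the read path looks like the sketch below: training and serving fetch the same precomputed values by entity key, so the aggregation is done once and there is no skew between the two. The FeatureStoreClient class and feature names here are hypothetical:

from typing import Dict, List

class FeatureStoreClient:
    """Hypothetical client: reads precomputed, model-ready features by entity key."""

    def __init__(self, table: Dict[str, Dict[str, float]]):
        self._table = table  # backed by a shared, centralized data layer in the real system

    def get_features(self, house_id: str, names: List[str]) -> Dict[str, float]:
        row = self._table[house_id]
        return {name: row[name] for name in names}

# The expensive aggregation (e.g. region_avg_m2_rent) was computed once by the Feature Factory.
store = FeatureStoreClient({"someid": {"area": 70.0, "region_avg_m2_rent": 29.5}})

# Both training-set generation and online serving read exactly the same values.
features = store.get_features("someid", ["area", "region_avg_m2_rent"])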

Model management

The red rectangle is what we call model management, which was designed to improve governance and the quality of models being deployed and used in production. We spent some time developing an ML model template, which takes care of boilerplate code and helps data scientists get straight into model development. We were strongly inspired by the cookiecutter data science project, but we made some modifications so it fits our business needs.

Since last quarter, we have been developing an internal library that wraps a lot of common DS code (pandas, scikit-learn, deep learning libs) and generates a machine learning pipeline, from feature engineering to model deployment and serving. This guarantees reproducibility, and we can also bring experiments to production in a less painful way.
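We can’t share the internal library itself, but the idea it wraps is close to a plain scikit-learn Pipeline: feature engineering and the estimator live in a single serializable object, so the exact artifact that was trained and validated is the one that gets served. A minimal sketch, with illustrative names:

import joblib  # used below to persist the trained artifact
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Feature transformations and the model are bundled together, so the whole
# chain is reproduced at serving time exactly as it was during training.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# pipeline.fit(X_train, y_train)
# joblib.dump(pipeline, "rent_model-v1.joblib")   # the artifact that gets deployed
# pipeline = joblib.load("rent_model-v1.joblib")  # loaded as-is by the serving API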

Once the model is ready, we run it through the Model Training Service, a framework also developed by us that automates the full machine learning pipeline as an Airflow DAG (sketched right after the list below) with the following steps:

  • Retrieves production-level data
  • Performs automatic hyperparameter tuning through Bayesian Optimization (provided by Amazon SageMaker)
  • Trains an optimized model
  • Performs model validation (overfitting checks, among other things)
  • Deploys it into production by pointing our serving API to the newly released model
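A minimal sketch of what such an Airflow DAG could look like (the task names and placeholder callables are illustrative and written against the Airflow 1.10 API of the time, not the actual Model Training Service code):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder callables: in the real service each step is provided by the framework.
def fetch_production_data(**_): ...
def tune_hyperparameters(**_): ...  # Bayesian Optimization via Amazon SageMaker
def train_model(**_): ...
def validate_model(**_): ...        # overfitting checks, among other validations
def deploy_model(**_): ...          # point the serving API to the new model

with DAG("rent_model_training", start_date=datetime(2019, 8, 1),
         schedule_interval="@weekly", catchup=False) as dag:
    steps = [
        PythonOperator(task_id=fn.__name__, python_callable=fn)
        for fn in [fetch_production_data, tune_hyperparameters,
                   train_model, validate_model, deploy_model]
    ]
    for upstream, downstream in zip(steps, steps[1:]):
        upstream >> downstream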

Model Deploy

The last part of our architecture is defined as model deployment & monitoring. Like most of our services, our models are deployed to a Kubernetes cluster to give near-real-time predictions. Models should produce logs of their predictions, which are collected by our Kibana server. We also plan to dump those logs into our data lake, which will enable performance-metric monitoring with Metabase.
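The serving code itself is out of scope for this post, but the logging idea is simple: every prediction is emitted as a structured (JSON) log line so that Kibana can index it, and the same records can later be dumped into the data lake for Metabase dashboards. A hedged sketch, with illustrative field names:

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rent_model")

def predict_and_log(house_id: str, features: dict, model) -> float:
    """Serve one prediction and emit a structured log line for Kibana / the data lake."""
    # Assumes a scikit-learn-style model and a consistent feature ordering.
    prediction = float(model.predict([list(features.values())])[0])
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": "rent_calculator",
        "house_id": house_id,
        "features": features,
        "prediction": prediction,
    }))
    return prediction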

With all of this set up, as a data scientist I can say that we get to focus on the coolest part of the job: machine learning models 😎

So the hard “setup” work pays off!

This is where I need to focus from now on

Our work is not done yet, and we are constantly improving our process. One thing we are planning to release in the near future is an experiment journal tool (like Sacred), so we can easily keep track of every hypothesis being tested by different teams in a single place.

The Machine Learning Pipeline at QuintoAndar was an effort by several Data Engineers, Data Scientists, and Software Engineers 🎉

If you like the way we deal with Machine Learning problems, join us!
