The Flyte journey at Wallapop

Martí Jordà Roca
Published in Inside_Wallapop
Apr 22, 2024 · 9 min read

At Wallapop, we have built a Machine Learning (ML) Platform to make life easier for our Data Science team. One of the core technical elements that influences our ability to achieve this objective is the ML pipeline orchestrator. We have adopted Flyte, an LF AI & Data Foundation open-source project used in production by companies like LinkedIn, Spotify, and Warner Bros. Discovery.

In this post, you’ll learn about our journey with Flyte, why and how we have implemented it, and what it means for our users.

Wallapop

Wallapop is a second-hand marketplace founded in Barcelona in 2013 that currently operates in three countries: Spain, Italy, and Portugal. On Wallapop, it’s possible to buy and sell products from all consumer goods categories easily, quickly, and securely, and the platform is also a reference point in categories like automotive, where it is a leader among private individuals in Spain.

Wallapop aims to create a unique inventory ecosystem of reused products that facilitates a more humane and sustainable consumption model. Its mission is to empower people to embrace a more conscious and human way of consumption, by connecting sellers and buyers of second-hand products looking for great pricing and convenience.

On average, Wallapop receives about 19 million active visitors per month from Spain, Italy, and Portugal, who create around 100 million listings per year.

As you can imagine, Wallapop generates a lot of data, and ML lets us leverage it.

History of ML at Wallapop

Wallapop started its ML journey in 2016. Back then, Data Scientists developed their models locally, in notebooks on their laptops. The few models that reached production, such as scam detection and catalog moderation, were also deployed and managed mostly manually.

By 2020 a few models had adopted Luigi and Jenkins as part of the ML pipeline architecture, but we soon realized that, to scale and deliver solid ML solutions, we needed much more: a proper ML Platform. We started this process in 2022.

By that time the ML Engineering team had been formed, and the first thing we did was add Continuous Integration, Continuous Delivery, and Continuous Training (CI/CD/CT) processes, using AWS SageMaker Pipelines as our ML pipeline orchestrator.

The ML Platform had to be available quickly to deliver a solution to the Data Scientists. With SageMaker, we didn’t need to manage an infrastructure layer like Kubernetes; we could just use SageMaker’s SDK and build a good project structure. We had a few models in production with this architecture, and while it allowed us to quickly build a first iteration of the platform, we found some challenges in using SageMaker Pipelines:

  • No local execution: we wanted faster model iterations but we couldn’t easily run the full pipelines locally, only some tasks. Instead, we had to upload the pipeline every time and wait, wasting a lot of time and resources.
  • Hard to debug: all the logs were only available on CloudWatch and, for our Data Science team, it was challenging to read and make sense of them.
  • Steep learning curve: the SDK can be too verbose, the documentation is not as detailed as one would expect, and communication between tasks can be complicated. The SDK also has some quirks that proved cumbersome for our Data Scientists. For example, each task needs to know the exact S3 path where its data is saved; there are specific input/output classes that don’t always behave the same depending on the task type; and there are extra considerations for how files are mounted into tasks.
  • Manual environment isolation: we had to imperatively define which pipelines were meant to run in production or development environments by using tagging and specific nomenclature.
  • Hard to track runs and their data: we had to dig deeper than expected to track versions and the data used in each pipeline execution.

In the end, we were not iterating ML projects as fast as we wanted, so a redesign of the ML pipeline stack was necessary.

Over a few months, we researched the market looking for a solution, built some POCs, and finally decided that Flyte was our tool of choice.

Why Flyte?

Some of the improvements in our ML development process after adopting Flyte include:

  • Much faster model iterations: tasks and workflows are pure Python. We can write plain Python functions and execute them locally, without uploading anything to the cloud. Previously, we had to wait on average 10 minutes for a new pipeline execution to start just to validate changes. Now, we can do it instantly.
  • Debugging works like any other Python code. Our Data Science team is proficient in Python, so learning how to write Tasks and Workflows in Flyte was fast and easy.
  • Easy to manage environments: due to certain features available in Flyte, like projects and domains, handling environment isolation is simple and native.
  • Execution metadata is saved automatically to S3 or S3-compatible blob storage, enabling us to easily track executions and their data.

Within 4 months, we implemented a production-ready Flyte platform from scratch, which became our new ML pipelines tool. Also in 2023, we added other components such as DataHub for model and data lineage, DBT for data modeling, and Monte Carlo Data for data quality and monitoring. We still use SageMaker for experimentation and development via cloud notebooks (SageMaker Studio), as well as for deploying real-time model endpoints (SageMaker Endpoints).

How is Flyte implemented at Wallapop?

We run Flyte on Amazon EKS. Besides the regular components of a Flyte deployment, including blob storage and a relational database, we added some particular elements:

  • Notifications with AWS Lambda: it is configured to enrich the baseline workflow notifications that Flyte provides, adding metadata, ownership information, and message formatting. Once a workflow execution finishes, the system sends a Slack notification reporting the status, be it Success/Failure, enabling the Data Science team to be more autonomous in troubleshooting possible issues.
  • Karpenter for cluster auto-scaling: it works well with Flyte in situations where, for example, a Task requests specific resources (e.g., 2 CPUs and 6 GB of memory). If no existing node has enough resources to meet that request, Karpenter creates a new node instance with the best price/performance ratio, so we don’t have to worry about which instance type to use, making cluster auto-scaling easier and simpler.

Project structure

In the quest to enable our Data Scientists to be as autonomous as possible, we saw the need to build a project template with the following components:

  • mlplatform-project-template: a GitHub repository that uses the cookiecutter Python library to create new projects from a template with simple commands. This baseline project is designed to follow a set of software engineering best practices while staying simple enough for Data Scientists to use. It includes all the infrastructure details, like AWS roles, CI/CD configuration, unit tests, linting, etc. With this repo, Data Scientists only need to create a new project from the template and focus on developing their pipeline’s Python logic; they don’t need to worry about CI/CD/CT integrations, Dockerfiles, or any other infrastructure component. It follows the folder structure summarized in the diagram:
  • mlplatform-poc-california: it’s an example implementation that follows the ML project template. Using this repo, Data Scientists may learn by themselves how to, for example, query Amazon Redshift to retrieve data for a particular step in the pipeline, or how to deploy an ML model with Sagemaker.

Conclusions and future work

The new Machine Learning Platform has significantly enhanced our Data Science development experience, enabling faster iterations and more efficient model development. This has been particularly evident in three key areas:

1. Catalog Quality: Our catalog is the heart of Wallapop, and maintaining its quality is both a priority and a challenge. Each item listed is unique, making standardization and categorization complex tasks. However, the agility provided by our new ML Platform has allowed us to develop sophisticated models that effectively tackle these challenges. These models ensure that our catalog remains comprehensive, accurate, and user-friendly, thereby enhancing the overall user experience.

2. Upload Experience: We strive to make the process of uploading items as seamless as possible for our users. The new ML Platform has enabled us to create models that simplify this process, reducing friction and making it easier for users to list their items. These improvements have led to an increase in user engagement and satisfaction.

3. Churn Models: Understanding and addressing user churn is crucial for the growth and sustainability of our platform. With the new ML Platform, we have been able to develop models that help us identify users who may have stopped using our services. These models provide insights into the reasons behind user churn, enabling us to devise effective strategies for improving user retention and engagement.

While choosing the best-fit ML pipeline tool for our company was challenging, and we spent quite a bit of time searching the vast MLOps ecosystem, we are very happy we chose Flyte: we were able to productionize a complex ML model in 2 months, with a newcomer to Flyte managing it with ease.

With the new ML Platform, we are not just reacting to changes but proactively shaping our user experience. We are excited about the possibilities this platform opens up for us and look forward to sharing more updates in the future.

This is the first time Wallapop’s ML team has collaborated closely with an open-source community, and the experience has been exciting.

Indeed, this post is based on a recent Flyte Community Meeting we took part in; you can watch the complete presentation here.

At Wallapop, we’re constantly on the lookout for exceptional talent to join our team. If you’re passionate, driven, and believe you could be a great fit, we’d love to hear from you! Feel free to explore our current job openings and find your perfect role.

Questions from the community

We presented recently at the Flyte Community Meeting and some questions sparked interesting discussions.

Q1: How did you configure Datahub for lineage? What sorts of information are you persisting?

DataHub is also quite new at Wallapop: it was deployed into production just last year, led by the Data Engineering team with the participation of the ML team.

We use DBT to create the dataset for the ML model, which is then automatically tracked in DataHub. Then, we just need to link our model to the DBT table using a custom YAML configuration file in our project repository.

The YAML can be expanded with whatever information we find necessary. For the moment it contains only the basics for lineage: model and endpoint name, training pipeline name along with the supporting infrastructure (SageMaker or Flyte), and of course the data source name and location.
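As a rough sketch of that file’s shape (all field and model names here are hypothetical; our internal schema may differ), it covers exactly those basics:

```yaml
# Hypothetical lineage config; actual field names in our template may differ.
model:
  name: price-recommender
  endpoint: price-recommender-prod
training:
  pipeline: price_recommender_training
  infra: flyte        # or: sagemaker
data:
  source: dbt.analytics.price_recommender_dataset
  location: s3://example-bucket/ml/price-recommender/
```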

Q2: How much does your Data Science team have to know about Flyte to be effective?

They need to know a little about Flyte’s core concepts, including what a Task, Workflow, LaunchPlan, and Project are — especially because they write the Task logic.

The one thing we found harder for them to understand is that the code defined in the workflow function is not pure Python but a subset of it, a DSL. For example, they can’t just write an if statement or a for loop in the workflow definition; they have to use special conditional constructs.

Besides that, the learning process was relatively fast.

Q3: In your architecture diagram, you talked about sending metrics to Prometheus and then having a Grafana stack to visualize them. What sorts of metrics do you care about? Is it resource consumption, or is it more SLAs on workflow completion?

Both are important for us. The metrics we track include the number of pipeline executions, domains (production/development), iteration time, and failed executions.

Those are some of our KPIs and we use them to assess if the platform is working correctly or not.

When a project has defined SLAs, we use Grafana alerts to visualize them in the dashboard and notify us if the agreement is broken.

To join an upcoming Flyte Community Meeting and learn from ML/Data practitioners, add the event to your calendar.
