How a POC became a production-ready Hudi data lakehouse through close team collaboration

leboncoin tech Blog · Feb 12, 2024

By Xiaoxiao Rey, Data Engineer, and Hussein Awala, Senior Data Engineer

Around 8 million unique visitors use leboncoin each day, making it one of the most visited French websites; in 2022 it handled over 100 billion HTTP calls per month and had 700 applications up and running.

The Data Platform team, in charge of building and maintaining platform infrastructure as well as developing internal APIs, is responsible for archiving leboncoin’s production data, a huge volume of Kafka events, into a very large data lake accessible to all teams.

The solution worked for a while, but then European General Data Protection Regulation (GDPR) compliance became an issue. The law stipulates that the data of users with closed accounts must be deleted after 3 years, and that of inactive users after 5 years. Because data written to the lake is immutable, the team couldn’t easily erase the data of users who requested that their accounts be deleted.

So they decided to build a proof of concept (POC) for a data lakehouse using Apache Hudi to test whether this suited their needs better.

This article explains how they were able to turn a POC into a production-ready data lakehouse that is now used by 5 teams at leboncoin and Adevinta (the group that owns the company), thanks to close collaboration between the Data Platform team and the Customer Relationship Management (CRM) Feature team.

Building a POC for the Hudi data lakehouse: A one-year project for the Data Platform team

The right tool for the job

To comply with GDPR, in 2022 the Data Platform team decided to migrate the legacy data lake to a new design based on an open table format, an architecture known as a lakehouse. Three options that allow data snapshots to be taken and deleted as necessary were available to them: Delta Lake, Apache Iceberg, and Apache Hudi. The team went with Hudi after several benchmarks and tests.

Faster processing

This migration resulted in faster and cheaper ETL (extract, transform, load) pipelines, as Hudi automatically produces appropriately sized files, solving the small-file problem often encountered in data lakes. And thanks to transactional queries, records in tables could now be updated or deleted. Several new features also became available, such as indexes on tables and the ability to query old snapshots of tables, also referred to as time travel.
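
To make these capabilities concrete, here is a minimal PySpark sketch of a Hudi upsert and a time-travel read. It is illustrative only: the table name, column names, and S3 paths are assumptions rather than leboncoin’s actual pipeline code, and it assumes the Hudi Spark bundle is on the classpath.

```python
# Minimal sketch of a Hudi upsert and time-travel read (illustrative names and paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# A hypothetical batch of updated records keyed by user_id.
updates = spark.createDataFrame(
    [("u42", "2024-01-15T10:00:00Z", "2024-01-15", "closed_account")],
    ["user_id", "event_ts", "event_date", "status"],
)

hudi_options = {
    "hoodie.table.name": "user_events",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "event_ts",
    # "upsert" updates existing keys and inserts new ones; "delete" removes keys.
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert: records with an existing key are updated, the rest are inserted.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lakehouse/user_events"))

# Time travel: read the table as it was at an earlier commit instant.
old_snapshot = (
    spark.read.format("hudi")
    .option("as.of.instant", "20240101000000")
    .load("s3://example-bucket/lakehouse/user_events")
)
```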

Extending data lakehouse use

Because of the value the data lakehouse brought, the Data Platform team quickly began considering using it for more than just archived data. Indeed, there were other use cases to consider. Tables had been created in the data warehouse (Amazon Redshift) precisely because data there could be deleted and updated, which was not possible in the traditional data lake but is possible in a data lakehouse. The data warehouse also provided low latency, while a data lakehouse could achieve better performance by parallelizing queries, with no limit on cluster size.

Outcomes

Lakehouse implementation schema

The lakehouse is organized into three layers:

- datalake-archive, where data arriving from all microservices is stored, partitioned by Kafka date and hour and written as Apache Parquet (see the sketch after this list);

- datalake-ident, where sensitive data is deleted in compliance with GDPR, partitioned by real event date and hour;

- datalake-pseudo, the same as datalake-ident but with personal and confidential columns pseudonymized, also partitioned by real event date and hour.
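
To make the archive layout concrete, here is a minimal sketch, assuming hypothetical bucket, table, and column names, of how incoming events could be written as date/hour-partitioned Parquet with PySpark.

```python
# Illustrative write for the archive layer: Parquet partitioned by Kafka date and hour.
# Bucket name and column names (kafka_date, kafka_hour) are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("archive-layer-sketch").getOrCreate()

events = spark.createDataFrame(
    [("evt-1", "2024-02-01", "13", '{"user_id": "u42"}')],
    ["event_id", "kafka_date", "kafka_hour", "payload"],
)

(events.write
    .partitionBy("kafka_date", "kafka_hour")
    .mode("append")
    .parquet("s3://example-bucket/datalake-archive/user_events"))
# Resulting layout: .../user_events/kafka_date=2024-02-01/kafka_hour=13/part-*.parquet
```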

New lakehouse

Implementing the Hudi data lakehouse in production: A project in close collaboration with the CRM team

Phase 1: Considering context

The CRM team was considering using a data lakehouse at the time for two reasons:

1/ They were in the process of migrating from Adobe Campaign version 7 to version 8.

Since they needed to build new data pipelines to feed this new Adobe instance, it was the right time to rethink their data architecture and model: sourcing data directly from the data lake rather than from the data warehouse, and creating their own data lakehouse in which to pre-calculate the tables needed by CRM data pipelines (which previously lived in the data warehouse).

The data mesh approach was used as inspiration to consolidate CRM data in one place and eliminate unnecessary dependencies on other teams.

2/ There was already an ongoing effort to remove the dependency on the Redshift data warehouse maintained by the Business Intelligence (BI) team, who pre-calculated a lot of tables upstream.

This caused many problems for the CRM team, who had to wait for BI processing to be completed before they could start their own processing. In addition, the BI team complained that the high amount of processing on Redshift was taking up too much time and resources, since Redshift was not designed for frequent processing.

Phase 2: Conducting workshops with data leaders and architects

It was still unclear which technology would be used to solve the CRM team’s problems. So they organized workshops with data leaders and architects of their tribe to see what was available on the market and what other companies were using. That’s how they came up with the data lakehouse solution, which allowed them to consolidate all their data in one place and manage processing without having to rely on other teams.

Phase 3: Discovering the Hudi data lakehouse POC

It all happened at the Data Guild meeting that takes place every two weeks with the aim of sharing knowledge. The CRM team learned that the Data Platform team had already been working on a data lakehouse using Hudi. Joining seemed like a good fit for the CRM team, since they couldn’t implement a new technology from scratch with only 3 data engineers on staff, so they asked to join the project.

But the story didn’t start out as smoothly as we would have liked! At first, the Data Platform team showed the CRM team how to use Hudi and told them they could now create their own tables. But it turned out that some functionalities the CRM team needed were not yet implemented. When the CRM team came back with these requests, the Data Platform team refused (CRM was asking for quite a lot), arguing that the CRM use case was not on their roadmap and that the Hudi data lakehouse project was supposed to remain a POC.

Phase 4: Building a close relationship with the Data Platform team

There was no way for the CRM team to go back to being dependent on the BI team, and the BI team didn’t want them to process data in the data warehouse. So it was necessary to move forward with the data lakehouse: It was the only option for them.

After many discussions between the CRM and Data Platform teams, it was agreed that Data Platform would help CRM implement new Hudi features that had not been implemented initially. The init feature, for example, which allowed them to create empty tables, was necessary for self joins and backfills. The Data Platform team would also help them debug issues, such as table processing times jumping from minutes to an hour with no obvious explanation, and select the right index type for better performance.
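
The article doesn’t show the init feature’s interface, but creating an empty Hudi table can be sketched with Hudi’s Spark SQL DDL; the database, schema, and location below are hypothetical, not the CRM team’s actual tables.

```python
# Hypothetical example of creating an empty Hudi table with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-init-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS crm.campaign_contacts (
        contact_id  STRING,
        campaign_id STRING,
        updated_at  TIMESTAMP
    )
    USING hudi
    TBLPROPERTIES (
        type = 'cow',               -- copy-on-write table
        primaryKey = 'contact_id',
        preCombineField = 'updated_at'
    )
    LOCATION 's3://example-bucket/lakehouse/crm/campaign_contacts'
""")
```

An empty table created this way can then be referenced in self joins or backfilled before any upstream data exists.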

Phase 5: Working together to support multiple tables

At this point in the project, each data lakehouse table had only one data source table, with no transformations or aggregations allowed.

After several months of collaboration with the CRM team, which had a use case the Data Platform team could build on, an extension of the data lakehouse and an Airflow plugin were created. The new product takes a SQL query and a small YAML file describing the table configuration, and automatically creates both the table and an Airflow DAG (directed acyclic graph) containing the job scheduled to insert data into the table.
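
The plugin’s exact interface isn’t described in the article, so the following is only a hedged sketch of the idea: a small YAML description of the table plus a reference to a SQL query, from which an Airflow DAG is generated. All keys, names, and the upsert callable are hypothetical.

```python
# Hypothetical sketch of generating an Airflow DAG from a YAML table description.
# The YAML keys, DAG structure, and upsert function are illustrative only.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator

TABLE_CONFIG = yaml.safe_load("""
table_name: crm_classified_ads
record_key: ad_id
precombine_field: updated_at
schedule: "@hourly"
sql_file: sql/crm_classified_ads.sql
""")

def upsert_into_hudi(**context):
    # In a real implementation, a Spark job would run the SQL query and
    # upsert the result into the Hudi table described by TABLE_CONFIG.
    ...

with DAG(
    dag_id=f"hudi_{TABLE_CONFIG['table_name']}",
    start_date=datetime(2024, 1, 1),
    schedule=TABLE_CONFIG["schedule"],  # "schedule" argument, Airflow 2.4+
    catchup=False,
) as dag:
    PythonOperator(task_id="upsert", python_callable=upsert_into_hudi)
```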

Let’s talk results!

16 tables in production

So far, a total of 16 CRM tables (out of 400) are in production in the Hudi data lakehouse, and they can be updated or deleted just like in the data warehouse. Among them is the classified ads table, which contains 41 million active rows with one month of historical data. Between 10k and 130k rows are updated each hour, which takes approximately 5 minutes. Hudi is also used to add, update, and delete data in some dashboard campaign tables.
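
As a hedged illustration of how such deletions can work in Hudi (this is not leboncoin’s code; names and paths are assumptions), a DataFrame containing only the records to remove is written with the delete operation:

```python
# Hypothetical deletion of records by key from a Hudi table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-delete-sketch").getOrCreate()

# Records to remove, identified by key and partition (names are assumptions).
to_delete = spark.createDataFrame(
    [("u42", "2024-01-15T10:00:00Z", "2024-01-15")],
    ["user_id", "event_ts", "event_date"],
)

(to_delete.write.format("hudi")
    .option("hoodie.table.name", "user_events")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://example-bucket/lakehouse/user_events"))
```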

5 different user teams

Currently, more than 5 teams use the Hudi lakehouse at leboncoin and Adevinta. The Data Platform team members themselves prefer to use it to create tables because of the Airflow plugin (previously they had to use a customized Spark job and Python scripts to create the Airflow DAG).

What’s next?

The Data Platform team is still working on the project to make the data lakehouse evolve by:

- Adding new features, such as clustering and record-level indexes, to improve read and write performance on tables.

- Implementing incremental queries (merge on read) to update the tables more frequently: for instance, an update every 2 or 5 minutes to replace the hourly update that’s currently in place (a minimal sketch of an incremental read follows this list).

- Supporting dbt, the standard data transformation tool.

- Increasing the number of teams using the Hudi data lakehouse.

- In the long term, replacing the entire data warehouse with the data lakehouse.
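
As referenced in the list above, here is a minimal sketch of what a Hudi incremental read looks like; the table path and begin instant are assumptions.

```python
# Hedged sketch of a Hudi incremental query: read only records written after a
# given commit instant, instead of reprocessing the whole table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240201000000")
    .load("s3://example-bucket/lakehouse/user_events")
)
# Downstream jobs can then merge only these new records, enabling updates
# every few minutes rather than every hour.
```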

Main takeaways

These are the 4 main points you should note from our experience:

- Starting as a POC doesn’t mean a project can’t be turned into something you can use in production.

- Guild meetings are crucial to sharing information among people and benefiting from each other’s efforts: You will often be less efficient if you work on your own.

- Collaborating on common projects with other teams is vital in technical jobs: You must be able to communicate and work with others in order to achieve great things.

- You don’t need to wait for a feature to be supported in an open-source project: if you find a bug, fix it, and if you need a new feature, take the initiative and implement it.
