From a hack to a data mesh approach: The 18-year evolution of data engineering at leboncoin

leboncoin tech Blog
Jan 11, 2024

By Simon Maurin, Lead Architect

Since 2006, leboncoin has been providing a classified ads service in France, growing to its current size hosting more than 60 million ads in 70+ categories — ranging from real estate to video games — and serving 30 million active users every month.

The company has grown exponentially since its inception. As a result, architecture, organization, and engineering practices have had to evolve significantly over the years.

Data engineering is one of the many fields that have been greatly impacted during this time. In our journey from a simple shell script to data mesh principles, we went through various stages involving business intelligence, data platform, event-bus backbone, and machine learning (ML). Each step has brought us progressively closer to established software engineering best practices.

This article provides a detailed account of those diverse steps undertaken to construct our current data engineering stack and ultimately culminating in the adoption of data mesh principles in our organization.

[2006–2012] One shell script to rule them all

In the diagram, purple denotes information system applications, while blue signifies ‘data’ applications

In the early days of the company, we adopted what we considered “a pragmatic approach to information access.” Data were fetched directly from the production database and parsed from application logs using a shell script. After KPI calculation, an email was sent to the Marketing team so they could copy and paste the data into a massive 72-tab Excel document. On a daily basis.

While this strategy was quick and effective at first, it proved unsustainable in the long term. Besides producing data that wasn’t reliable enough, the script often slowed down production, and the Marketing department found the Excel spreadsheet very difficult to use.

Over the years, the shell script became a monster no developer wanted to deal with, preventing evolution.

[2012–2015] Business intelligence to the rescue

In 2012, we decided it was time to unplug the monster and build a clean data stack using business intelligence (BI) engineering.

We set up an ETL (extract, transform, load) application to fetch source data, or at least useful subsets of it, and store it in a cache. To optimize for sequential queries required for analytical work, data contained in the cache are converted to a “star schema” and stored in a columnar format. Once the data are transformed, they are loaded into a data warehouse (DWH) supporting a web application that enables the Marketing department to access pivot tables, dashboards, and simulations.
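To give a concrete (if simplified) picture of that transformation, here is a minimal sketch in pandas, with made-up columns, splitting raw ad records into a fact table and dimension tables. This is only the shape of a star schema, not the actual code or tooling we used at the time:

```python
# Minimal star-schema sketch (illustrative only, not our actual BI code):
# raw ad records are split into one fact table and two dimension tables.
import pandas as pd

raw = pd.DataFrame([
    {"ad_id": 1, "category": "real estate", "region": "Bretagne", "price": 150000, "day": "2013-05-01"},
    {"ad_id": 2, "category": "video games", "region": "Bretagne", "price": 25,     "day": "2013-05-01"},
])

# Dimension tables: one row per distinct value, with a surrogate key.
dim_category = raw[["category"]].drop_duplicates().reset_index(drop=True)
dim_category["category_key"] = dim_category.index
dim_region = raw[["region"]].drop_duplicates().reset_index(drop=True)
dim_region["region_key"] = dim_region.index

# Fact table: measures plus foreign keys to the dimensions.
fact_ads = (
    raw.merge(dim_category, on="category")
       .merge(dim_region, on="region")
       [["ad_id", "day", "price", "category_key", "region_key"]]
)

# Stored in a columnar format, an analytical scan (e.g. average price per
# category per day) only needs to read the columns it involves.
print(fact_ads.groupby(["day", "category_key"])["price"].mean())
```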

This new solution was well received by Marketing. It made it easier for them to analyze data and come up with new products. It also permitted, for the first time, cross-referencing data from different systems (e.g., ERP), thus meeting the needs of other internal customers (notably Support and Sales).

But on our end, frustration eventually set in because the BI system was extremely unstable and very hard to extend with new features.

During that period, 1 million classified ads were created each day using leboncoin applications, resulting in 10 million events per day and billions of events in historical depth. Reprocessing historical data was time-consuming and buggy almost every time.

When we were asked to add web navigation session data and their 100 million daily events, we realized our existing BI stack had major scaling issues.

Various approaches were taken to solve the problem: We boosted our machines, optimized our database resources and configuration, changed the way we wrote to disk, used extensive multithreading… But each time we overcame an obstacle, another would present itself soon afterward.

We had to face it: We had created a second monster and it had to be unplugged in its turn…

Tech management was not ecstatic about it and the pressure “to get something that simply works” increased accordingly.

[2015–2017] The data platform foundation

We decided to change our strategy. As vertical scaling was not enough, we decided to distribute things.

“Big data” was one of the hottest tech topics in 2015. Thanks to this, we found new tools and techniques enabling us to scale horizontally. More importantly, it allowed us to consider the use of data, beyond business analysis, as an integral building block of our product.

Our objective was no longer to scale up the BI stack, but to build a data platform that both analytical solutions and data-driven products could leverage.

Strategy

Strengthened by this vision and our history of pain, we came up with a three-fold strategy:

  • To overcome our scale issues by:

— Switching to a horizontal scaling strategy, distributing data processing and storage.

— Adopting a cloud computing solution to benefit from resource elasticity.

  • To adopt a platform-based approach by:

— Having ready-to-use decoupled components for data storage, inventory, orchestration, ETL…

— Storing raw data, rather than subsets, in a data lake that would become the source of all data pipelines.

  • To enhance the system’s resilience by:

— Using an “off-the-shelf” data pipeline orchestrator.
Our choice fell on Airflow, which had just been open sourced. It provides a graphical interface, APIs, and the ability to test the code and run multiple flows simultaneously.

— Investing in observability.
We leveraged Datadog to monitor the functional state of our data stack (data availability, production-environment delays, etc.), as well as our infrastructure metrics (I/O, network flows, number of machines available, etc.).

— Embracing eventual consistency.
As Tyler Treat says: “If you’re distributed, forget about ordering and start thinking about commutativity. Forget about guaranteed delivery and start thinking about idempotence.”

We stopped trying to ensure a coherent and synchronized view of all datasets. We accepted the eventual consistency of our multiple data sources and started to design our data pipelines to be commutative and idempotent, i.e. replayable multiple times, in any order, without any side effects. This drastically improved the system’s resilience and its ability to “heal itself.” It also had a very positive impact on the parallelizability of our data jobs, which, by taking advantage of our new infrastructure elasticity, gave us the ability to reprocess our full history on demand.
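As a hedged illustration of what commutativity and idempotence mean in practice (a sketch with made-up paths and fields, not our production code): each run rebuilds a whole partition deterministically from raw inputs and overwrites it, so replaying a day twice, or replaying days in any order, converges to the same state.

```python
# Idempotent, commutative daily job (illustrative sketch, hypothetical paths).
# Re-running any day, in any order, always produces the same partition.
import json
import shutil
from pathlib import Path

LAKE = Path("/tmp/datalake/raw/ads")           # raw, immutable inputs
WAREHOUSE = Path("/tmp/warehouse/daily_kpis")  # derived, rebuildable outputs

def process_day(day: str) -> None:
    """Rebuild the KPI partition for `day` from scratch and overwrite it."""
    raw_file = LAKE / f"day={day}" / "events.json"
    events = [json.loads(line) for line in raw_file.read_text().splitlines()]

    # Deduplicate on the event UID so replayed inputs don't double-count.
    unique = {e["event_uid"]: e for e in events}.values()
    kpis = {"day": day, "ads_created": sum(1 for e in unique if e["type"] == "ad_created")}

    # Overwrite the whole partition: remove then rewrite deterministically.
    out_dir = WAREHOUSE / f"day={day}"
    shutil.rmtree(out_dir, ignore_errors=True)
    out_dir.mkdir(parents=True)
    (out_dir / "kpis.json").write_text(json.dumps(kpis))

# Replaying history in any order is safe because partitions are independent:
# for day in ["2016-01-02", "2016-01-01", "2016-01-02"]:
#     process_day(day)
```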

An inventory was also added to keep track of data availability and other “metadata.”

And now let’s wrap up all of this by introducing you to…

The data platform architecture

After two years of work, we came up with this result:

Let’s start with data ingestion. Most of our data still comes from application databases, but we started to get data from our third-party analytics tool through a dedicated API. In both cases, data are extracted (but not transformed) by “micro-batch” extractors. Then they are loaded into the data lake, hosted on Amazon S3, ready to be accessed for any usage.
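Conceptually, such a micro-batch extractor is very small: it pulls whatever is new from the source and writes it, untransformed, to a date-partitioned location in S3. A minimal sketch using boto3, with a hypothetical bucket and key layout:

```python
# Minimal micro-batch extractor sketch (illustrative; names are made up).
# Extract a batch of raw rows and land them, untransformed, in S3.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-datalake"  # hypothetical bucket name

def load_batch(rows: list[dict], domain: str, table: str) -> str:
    """Write one micro-batch of raw rows to a date-partitioned S3 key."""
    now = datetime.now(timezone.utc)
    key = (
        f"{domain}/{table}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"batch-{now:%H%M%S}.json"
    )
    body = "\n".join(json.dumps(r, default=str) for r in rows)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key  # the inventory can then be updated with this location
```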

The inventory allows us to track any entry flow and triggers data pipelines whenever data are ready.

Data pipelines are orchestrated with Airflow, which launches SQL scripts as well as Spark jobs (Spark, a unified analytics engine for large-scale data processing, also allows us to distribute our processed data to an Elasticsearch cluster). Everything is monitored by Datadog to ensure that nothing goes wrong.
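To make the orchestration concrete, here is a minimal Airflow 2.x-style DAG sketch; the task names and callables are illustrative, not our real pipelines:

```python
# Minimal Airflow DAG sketch (illustrative; task names and callables are made up).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_sql_transforms(**context):
    ...  # e.g. run SQL scripts against the warehouse

def run_spark_job(**context):
    ...  # e.g. submit a Spark job that feeds the Elasticsearch cluster

with DAG(
    dag_id="ads_daily_pipeline",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # history can be replayed, since tasks are idempotent
) as dag:
    transform = PythonOperator(task_id="sql_transforms", python_callable=run_sql_transforms)
    spark = PythonOperator(task_id="spark_job", python_callable=run_spark_job)

    transform >> spark
```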

For our BI solution, we also switched to a distributed database with columnar storage (Redshift), which had a dramatic impact on performance.

Using Tableau, the whole company is able to visualize refined data in dashboard forms. Data scientists and data analysts are able to query Redshift and S3 using Jupyter Notebook to obtain a finer grain of data.

And as expected, we’ve gone beyond the scope of business data analysis by creating our first data-driven products: a customer knowledge tool to support Salespeople and a CRM solution to target users with email campaigns based on their behavior.

And that’s how we eventually kept our jobs!

Sounds nice, right? Unfortunately, we hadn't foreseen that the forthcoming reorganization of the technology teams would upset both our design and our way of working.

[2017–2018] Feature teams, microservices, and event streams

In 2017, the Tech team was reorganized into interdisciplinary feature teams, following Spotify’s model. The reorganization resulted in a change in the backend architecture, which had to move from one monolith to microservices.

Migrating from monolith to microservices

Rather than a single application containing all the business logic, we created multiple business domain-specific applications, each with their own storage system. A bus handles asynchronous communication between microservices and allows them to react to events emitted by each business domain.
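For illustration, emitting a domain event to the bus from a microservice looks roughly like this (a sketch using kafka-python; the topic name, fields, and JSON serialization are hypothetical, and our public streams are actually Avro-encoded, as described below):

```python
# Illustrative sketch: a microservice emitting a domain event to the bus.
# (kafka-python; topic name, fields, and JSON serialization are hypothetical.)
import json
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)

event = {
    "event_uid": str(uuid.uuid4()),                         # for deduplication
    "occurred_at": datetime.now(timezone.utc).isoformat(),  # business timestamp
    "type": "ad_published",
    "ad_id": 123456,
    "category_id": 9,  # only the ID of the other domain's entity
}

# Keying by ad_id keeps all events of one ad in the same partition (ordering).
producer.send("ads.public.ad-lifecycle", key=str(event["ad_id"]), value=event)
producer.flush()
```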

From a data engineering perspective, this new architecture amplified an existing problem: the coupling between the data platform and the leboncoin application database.

Indeed, backend engineers couldn’t modify business models in the database without risking breaking the associated pipelines and data-driven products, often without even being aware of these repercussions. This situation led to mounting frustration between backend and data engineers.

As the new microservice architecture would increase the number of databases by one order of magnitude, it was obvious that this integration pattern couldn’t scale.

It was time for a contractualized communication interface!

The event bus, envisioned as a distributed log, emerged as a promising solution.

An intense discussion started between data and backend engineers to define the rights and duties of all parties.

Boy scout rules for data streams

Some event-stream boy scout rules emerged from this discussion:

  • Your domain, your data streams

Traditional data fetching by the data engineering team has been replaced by a decentralized approach. Each feature team now assumes ownership of the code responsible for generating their domain events and managing the evolution of their schemas. This shift in responsibility fosters a sense of autonomy and accountability within each team.

  • Treat event streams as APIs and schemas as contracts

The significance of event-stream data consistency, life-cycle management, and incident resolution is emphasized. Events are treated as integral components, akin to other APIs, with schemas representing binding contracts.

  • Choose your data-stream scope wisely

This introduces the distinction between public and private data streams, granting domain owners the discretion to determine accessibility. Public streams, open to any team, come with stringent guarantees, encouraging domain owners to thoughtfully consider their exposure.

  • Apply the following rules for all public event streams:

— Ordering is guaranteed within each partition.

— The Avro serialization format is required (to enforce schemas).

— Transitive backward compatibility is mandatory, which means the latest schema must always be able to read data produced with any previous schema, so historical data can always be reprocessed.

— A common event skeleton is enforced, with a UID for deduplication and a standardized way to provide business timestamps (see the sketch after this list).

— The content of events is “maximal yet minimal”: all useful domain business data are included from the beginning (even if there is no consumer yet), whereas internal information like database sequences is not. When events link to business entities from other domains, only their IDs are listed. This is definitely a trade-off, as it generates more coupling by default while improving the reusability of these events.

— Documentation must be provided for all fields and schemas.

  • Normalize entity names and types

This facilitates a more seamless integration of data across different domains.

  • Follow Kafka configuration standards based on your data stream type

To ensure that the configuration aligns with the specific requirements and characteristics of each data stream, optimizing performance and reliability.
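To make the common skeleton and the schema-as-contract idea more tangible, here is a hedged sketch of what such an Avro schema could look like (field names are illustrative, not our actual schemas), validated with fastavro:

```python
# Illustrative Avro event skeleton (field names are made up, not our schemas).
from fastavro import parse_schema
from fastavro.validation import validate

ad_published_schema = {
    "type": "record",
    "name": "AdPublished",
    "namespace": "ads.public",
    "doc": "Emitted when an ad becomes visible on the site.",
    "fields": [
        # Common skeleton shared by all public events:
        {"name": "event_uid", "type": "string", "doc": "UUID used for deduplication"},
        {"name": "occurred_at", "type": "long", "doc": "Business timestamp, epoch millis"},
        # Domain payload: useful business data only, no internal sequences;
        # entities from other domains are referenced by ID only.
        {"name": "ad_id", "type": "long", "doc": "Identifier of the ad"},
        {"name": "category_id", "type": "long", "doc": "ID of the category (Category domain)"},
        # Adding an optional field with a default keeps backward compatibility:
        {"name": "price", "type": ["null", "long"], "default": None, "doc": "Price in cents, if any"},
    ],
}

schema = parse_schema(ad_published_schema)
record = {"event_uid": "8f1c...", "occurred_at": 1700000000000, "ad_id": 1, "category_id": 9, "price": None}
assert validate(record, schema)
```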

Introducing microservices into the data architecture

Below is the revised data architecture:

The microservices write to the event bus, while an archiving application stores the public topics in the data lake and then updates the inventory. It’s worth noting that the data lake was now organized by domain (like the feature teams and their code bases).

Here, we’ve indeed achieved an explicit interface and enforced guarantees. Plus, we now have the same pattern for service-to-service communication and service-to-data-platform communication. Both the backend and data engineers are happy.

The company’s board of directors, however, was not so fond of our “pipe stories” and urged us to channel our investments into ML initiatives…

[2018–2019] The recommendation engine

At that time we had already developed a few tools based on ML. But they relied on offline processing, focused mainly on user segmentation, and were held together with duct tape. Our new aim was to use these techniques for the first time within the leboncoin product itself. This meant building a fully fledged service that adheres to our standards of industrialization and quality.

We chose to build a recommendation engine for classified ads, grounded in the concept of similarity between user navigations — a collaborative filtering technique.
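To give an intuition of the collaborative filtering idea (a toy sketch, not our actual model or data): ads that tend to be viewed within the same navigation sessions get a high similarity score, and the ads most similar to the one currently being viewed are recommended.

```python
# Toy item-based collaborative filtering sketch (not our production model).
import numpy as np

# Rows = user navigation sessions, columns = ads; 1 means "ad viewed in session".
views = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Cosine similarity between ads, based on co-occurrence in sessions.
norms = np.linalg.norm(views, axis=0, keepdims=True)
similarity = (views.T @ views) / (norms.T @ norms + 1e-9)
np.fill_diagonal(similarity, 0.0)  # don't recommend the ad itself

def recommend(ad_index: int, k: int = 2) -> list[int]:
    """Return the indices of the k ads most similar to `ad_index`."""
    return list(np.argsort(similarity[ad_index])[::-1][:k])

print(recommend(0))  # ads most often co-viewed with ad 0
```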

First attempt to industrialize ML

Let’s take a look at how our data team worked at the time.

The process commenced with an iterative experimental approach:

1/ Data engineers explored, extracted, analyzed, and cleansed offline data to offer ready-to-use datasets to data scientists.

2/ Data scientists, armed with these datasets, designed recommendation models and trained them in their individual environments, using tools such as notebooks and Python libraries like Gensim or TensorFlow.

Upon identifying a promising model candidate, the industrialization process kicked in:

3/ ML engineers wrote an ML pipeline that automated steps from data extraction to model validation. This pipeline triggered a CI/CD process, which generated the model in our various environments.

4/ The model was served through a backend service running in our Kubernetes cluster, mirroring the deployment process of other leboncoin application services. In real time, the model’s behavior was monitored, ensuring its responsiveness to dynamic data.

Let’s now see how this impacted our data platform architecture.

Integrating ML into the data architecture

ML pipelines were managed using Airflow. Model serving was handled by Kubernetes. Real-time navigation data were accessed through the event bus and cached in Redis.
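As a hedged sketch of the real-time side (using redis-py; the key layout and event fields are hypothetical), a small consumer reads navigation events from the bus and keeps each user’s most recent ad views in Redis, where the serving layer can fetch them at request time:

```python
# Illustrative sketch: caching recent navigation events in Redis (redis-py).
# Key layout and event fields are hypothetical.
import redis

r = redis.Redis(host="localhost", port=6379)

def on_navigation_event(event: dict) -> None:
    """Called for each navigation event consumed from the event bus."""
    key = f"nav:{event['user_id']}"
    r.lpush(key, event["ad_id"])  # most recent view first
    r.ltrim(key, 0, 49)           # keep only the last 50 views
    r.expire(key, 24 * 3600)      # navigations older than a day are dropped

def recent_views(user_id: str) -> list[str]:
    """Fetched by the recommendation service at request time."""
    return [ad_id.decode() for ad_id in r.lrange(f"nav:{user_id}", 0, -1)]
```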

The recommendation engine was quickly adopted by our users and has been complementing our search experience ever since.

It looks like we’re buckled in and ready to go now, doesn’t it? Yes… but no. As the Tech and Product teams’ head count doubled between 2018 and 2019, we had to rethink how we were organized.

[From 2019] Here comes the data mesh approach!

The blue line represents the evolution of the leboncoin Group head count, measured on the left, while the green blocks represent the evolution of the Tech and Product teams’ head count, measured on the right

As already mentioned, and as you can see in the graph, the Tech and Product teams doubled in size between 2018 and 2019. Our feature teams’ organization allowed us to scale quite efficiently thanks to work parallelization across domains.

However, data people still worked in a central data team separate from the feature teams, and in no time data engineers were outnumbered by other software engineers. This created a bottleneck that slowed down the creation of new data-driven products.

We understood at this point that it wasn’t enough to have a few specialized teams able to build these data-driven products. If we wanted them to have an actual impact, we had to spread those skills across the entire company.

Introducing data mesh

It was with this in mind that we started looking at the concept of data mesh. It had recently been introduced by Zhamak Dehghani, then a consultant at Thoughtworks, in an article entitled How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.

A diagram of data mesh architecture from a 30,000 foot view, Figure 12 from How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

The essence of the data mesh philosophy lies in the decentralization of data engineering expertise and code. Dehghani advocates for the integration of these capabilities into feature teams, thereby expanding their responsibilities beyond conventional realms like services, applications, APIs, and event flows. Now, feature teams are entrusted with the ownership of pipelines and the development of “data products.” Crucially, in the spirit of collaboration inherent in the data mesh model, these data products become modular building blocks usable by all other teams.

The shift marks a departure from the unscalable practice of feature teams tossing data engineering needs over the proverbial fence, replacing it with a holistic, end-to-end domain-based approach. This approach not only streamlines the data engineering process but also optimizes the reuse of data across various teams, fostering a more interconnected and efficient ecosystem.

Those ideas of disseminating data engineering skills in a decentralized, interoperable, and self-service-oriented environment seemed to us a sound strategy to empower teams in constructing their own data-driven products.

We set about implementing it.

New roles

We made three changes in our people organization.

Firstly, the central data engineers became a “platform team” (in the Team Topologies sense). They began to focus on creating self-serve data infrastructure rather than building data-driven services for other teams.

Secondly, we started putting data engineers in feature teams. Since we had about 40 feature teams at this time, we couldn’t put one data engineer in every team, but we could test this approach in those that needed it most.

Thirdly, we offered to train software developers in data engineering.

New tools

Regarding the technical side of things, data mesh implementation relies on two pillars: Standards to ensure interoperability and a platform to facilitate self-service of data infrastructure elements like Kafka flows, S3 buckets, Spark clusters, and orchestration tools.

We were quite optimistic about our ability to embrace the data mesh interoperability standards, as we had already adopted most of them when we switched to an event-based architecture (as seen above). In particular, the platform team’s organization, code base, and data lake were already domain oriented. Our boy scout rules just needed to be extended with the ability to use the data lake as an API, the same way the event bus is, and with security and privacy adapted accordingly.

The data platform was a more complex issue. It had to be transformed from an internal tool of the former central team to a self-service tool for all software developers. To achieve this, the transformation focused on three main initiatives:

  • Data discovery

We created a data-discovery tool that aggregates data from all sources (event flows, data lake tables, BI dashboards, etc.). Using this tool, software developers, product owners, and data scientists can analyze their data and build models based on it.

  • Self-serve data infrastructure

We started by building a tool to automate data-stream infrastructure provisioning.

From the user perspective, the developer starts by filling in a templated configuration file describing the metadata of their data stream (illustrated below): a typology (job queue, event flow, buffer, etc.), a T-shirt size, a scope (public or private), an encoding strategy and schema, privacy information, an owning domain and team, and a list of producers and consumers.

They then push it to a Git repository. This triggers a CI/CD process that will validate the configuration to make sure it respects all the standards and is releasable.

Once done, the CD provisions Kafka topics in all environments, posts new Avro schemas to the schema registry if necessary, creates application secrets in Vault, and provisions the S3 bucket as well as the Athena schemas (Athena being a tool that lets us run SQL queries over the data lake); finally, it triggers the archiver to archive the topic in S3.
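Put together, the declared metadata and the kind of check the CI runs might look roughly like the following. This is a hypothetical example written as Python for readability; the real configuration format and field names may differ.

```python
# Hypothetical data-stream declaration and CI-style validation sketch.
# The real configuration format, fields, and rules may differ.
stream_config = {
    "name": "ad-lifecycle",
    "domain": "ads",
    "owner_team": "team-ads",
    "typology": "event-flow",   # job-queue | event-flow | buffer | ...
    "tshirt_size": "M",
    "scope": "public",          # public streams must meet all the standards
    "encoding": "avro",
    "schema": "ads.public.AdPublished",
    "privacy": {"contains_personal_data": False},
    "producers": ["ads-service"],
    "consumers": ["search-indexer", "data-platform-archiver"],
}

REQUIRED = {"name", "domain", "owner_team", "typology", "scope", "encoding", "schema"}

def validate_stream_config(cfg: dict) -> list[str]:
    """Return a list of violations; an empty list means the config is releasable."""
    errors = [f"missing field: {f}" for f in REQUIRED - cfg.keys()]
    if cfg.get("scope") == "public" and cfg.get("encoding") != "avro":
        errors.append("public streams must be Avro-encoded")
    return errors

assert validate_stream_config(stream_config) == []
```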

This kind of self-serve data infrastructure has proved to be very beneficial to the entire Tech team. Developers describe their business needs at a high level of abstraction, avoiding dealing with the physical aspects of infrastructure releases. Additionally, it allows defined standards to be enforced by code.

We’ve applied this philosophy to other kinds of infrastructures, such as a lakehouse, Spark cluster, and notebooks.

  • MLOps tooling

Our way of building ML-based products has evolved since the early days of our recommendation engine.

By embracing the MLOps philosophy, we now integrate data engineers with data scientists, aiming to construct ML pipelines from the experimentation phase. These pipelines undergo automatic deployment through CI/CD, aligning the processes during experimentation closely with those during recurring deployments. This approach ensures that individuals are exposed to production constraints early in the experimentation phase.

In support of these new practices, we’ve introduced a golden path, offering ready-to-use pipelines “off the shelf.” This approach has proven instrumental in curbing structural discrepancies across our diverse ML projects. In tandem with our data mesh standards, it has drastically facilitated collaboration among data scientists dispersed across multiple teams. Moreover, these tools have spawned an active inner-source community, which deepens collaboration even further.

Additionally, we’ve incorporated features into the CI/CD pipeline to monitor datasets, model versioning, and statistics. The statistics of a model can therefore be tracked from the experimentation phase onward.

Finally, to prevent features from being duplicated for every new need, and to standardize their definitions and usage, we established a feature store. This centralized solution allows features to be served both synchronously and in batches.
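A hedged sketch of the dual-serving idea behind a feature store (interfaces, storage, and feature names are illustrative, not our actual implementation): the same feature definition is looked up synchronously at inference time and joined in batch at training time.

```python
# Illustrative feature-store sketch: one definition, two serving modes.
# Interfaces, storage choices, and feature names are hypothetical.
import pandas as pd

class FeatureStore:
    def __init__(self, offline_table: pd.DataFrame):
        # Offline store: full history, used for batch training joins.
        self.offline = offline_table
        # Online store: latest value per entity, used at inference time.
        self.online = (
            offline_table.sort_values("as_of")
            .groupby("user_id").last()["nb_ads_viewed_7d"].to_dict()
        )

    def get_online(self, user_id: str) -> float:
        """Synchronous lookup at inference time."""
        return self.online.get(user_id, 0.0)

    def get_training_set(self, labels: pd.DataFrame) -> pd.DataFrame:
        """Batch join of features onto labelled examples for training."""
        return labels.merge(self.offline, on="user_id", how="left")

features = pd.DataFrame([
    {"user_id": "u1", "as_of": "2022-01-01", "nb_ads_viewed_7d": 12},
    {"user_id": "u1", "as_of": "2022-01-02", "nb_ads_viewed_7d": 15},
])
store = FeatureStore(features)
print(store.get_online("u1"))  # -> 15
```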

We’re all software engineers

This process is still in progress, and the level of data engineering maturity varies significantly across our teams. However, we have transitioned from having a single specialized team to now having approximately 20% of our teams proficient in building projects involving data engineering and/or ML. It has had a very positive impact on parts of our user experience, such as the search engine.

Our commitment to the data platform strategy continues, concurrently allowing some autonomy for our teams. Notably, experienced teams have successfully introduced new engineering practices and tools to the ecosystem.

Yet the most challenging aspect of this data mesh transformation lies not in building tools but in scaling data engineering skills across a population of hundreds of developers. While local hiring and training initiatives can contribute, they prove too costly to be a global solution.

Experience has shown that pairing up data scientists with data/backend/ML engineers works pretty well and allows them to learn from one another. Backend engineers often help data engineers hone their skills and adopt best practices related to CI/CD, while data engineers contribute their expertise in SQL and data engine internals, for example. Dehghani refers to this phenomenon, quite poetically, as “cross-pollination.” The challenge here is to bring these roles together in the same team on a permanent basis.

To catalyze this cross-pollination, we are leveraging our tech guilds by promoting the exchange of best practices and experiences between our various developer communities.

Ultimately, it’s a slow but naturally emerging evolution.

We’ve finally realized that, even if we don’t have the same starting point, at the end of the day, we’re all software engineers.

Thanks to Stéphanie Baltus, Nicolas Goll-Perrier, Sabrine Djedidi, Walid Lezzar, Sam Bessalah, Hao Ren, Marien Pinot, Céline To, Bastien Louvel, Raphaël Auvert, Thomas Bouhon, Xavier Krantz, Clément Delpech, Mélanie Armand Madec, Pierre Peltier, Clément Demonchy, Olivier Buchy, Damien Mourot, Pauline Nicolas, Jérémie Smadja, Mehdi Dogguy, Frédéric Cons, Stéphanie Leplus, Julien Conan, Julien Jouhault, Guillaume Grillat, Anne-Laure Civeyrac
