leboncoin tech - Medium

Beyond the hype: how a custom multimodal transformer beat our fine-tuned LLM

leboncoin tech — Wed, 29 Apr 2026 05:52:22 GMT

By Louis-Victor PASQUIER, Senior ML Engineer in the Cognition team @ leboncoin

In our previous post, we shared how just one hour of fine-tuning a Large Language Model (LLM) effortlessly outperformed three weeks of complex RAG engineering. This was a big step for the team.

The fine-tuned LLM was smart and capable. It proved that sometimes, letting a model learn dependencies directly from data is far superior to hardcoding them. But at the end of that post, we left you with a cliffhanger.

Our previous experience with LLMs showed me that they are very impressive tools, capable of zero-shot prediction and reasoning out of the box, and the fine-tuned LLM we built was even more capable. However, as I looked at the production constraints of operating at leboncoin’s scale (processing millions of user-uploaded ads a week), efficiency, latency, and business-rule integration started to weigh heavily on my mind.

After almost 7 years wrangling data at leboncoin as a Machine Learning Engineer, I had a nagging intuition: Do we really need a massive, generative model trained on the entire internet just to predict a simple, discriminative attribute like a sofa’s color? So, while our fine-tuned LLM was performing brilliantly, I was simultaneously challenging it against a completely different contender in the background: an in-house, lightweight, custom multimodal transformer.

As we pitted the two models against each other on our evaluation datasets, an interesting truth emerged: our significantly smaller custom transformer didn’t just match the heavy-hitting LLM; it outperformed it while solving crucial operational hurdles along the way.

Our journey led us to a tailored, pragmatic approach for production and to a simple realization: predicting a basic attribute, like a sofa’s color, doesn’t require a model trained on the entire internet. This is the story of that decision.

The problem: when text isn’t enough

To briefly recap, the core mission of the Cognition team is to make ad creation as seamless as possible for our users. One of the levers is the automation of the attribute filling task: when a user uploads an item, we want to auto-fill as many ad attributes (brand, color, memory capacity, furniture type) as we can.

Historically, our team relied on fast, character n-gram text classifiers. They are brilliant for text-heavy categories but hit a hard ceiling when a user uploads a photo of a blue couch with a sparse title like “Couch for sale.” We needed a way to “see” the image to disambiguate the listing and accurately predict the missing ad attributes.

The fine-tuned Vision-Language Model (VLM) solved this, but it introduced new MLOps challenges: massive parameter counts, heavy GPU reliance, and, above all, high inference latency inherent to how generative AI performs predictions.

The solution: visual embeddings as the missing puzzle piece

I realized that the secret sauce wasn’t necessarily the LLM’s vast reasoning capability but rather its ability to process visual context alongside text. We asked ourselves: “What if we could extract that visual context much more cheaply?”

At leboncoin, we already have rich, domain-specific visual models. Our visual search team had previously fine-tuned a robust ConvNeXt architecture — specifically on leboncoin images and labels — to provide visual search capabilities.

My intuition was simple: these existing ConvNeXt image embeddings likely already hold the exact dense, disambiguating information we need. Instead of asking a massive LLM to look at an image and generate thoughts about it, we could just grab the mathematical representation (the embedding) from our existing ConvNeXt and fuse it mathematically with the ad’s text.

Deep dive: a tailored, lightweight transformer

Instead of training a monolithic 4B or 8B parameter LLM, we built a modular, lightweight, late-fusion transformer architecture designed specifically for our domain. It treats both text and images as equal tokens in a sequence.

Here is how it works under the hood:

The Visual Backbone (Encoding): We use the leboncoin-fine-tuned ConvNeXt model purely as an embedding provider. We don’t retrain it or touch its weights; we simply serve it through KServe on a dedicated endpoint. This gives us pre-computed 700-dimensional dense vectors (allowing up to 10 images per ad).
The Text Representation (Encoding): Instead of using a heavy, pre-trained text encoder like BERT or RoBERTa (and its subword tokenizer), we vectorize the ad’s subject using a character-level n-gram count vectorizer (2- to 5-grams) capped at a maximum vocabulary size of 10,000. This gives us a lightweight, sparse representation of the text that is very robust to typos. In our experience, for very short inputs like ad subjects, this is more than enough.
The Fusion Layer (Transformer): Both the text n-gram vector and the visual embeddings are projected into a shared embedding dimension. We then prepend a learnable CLS (Classification) token to the sequence.
Modality Embeddings: To help the transformer distinguish between text and images, we add learnable “modality embeddings” to the projected vectors — functioning much like positional encodings, but for data types.
Attention & Classification: The combined sequence is passed through Multi-Head Attention blocks. The self-attention mechanism learns how to weigh visual signals against text signals. (For example, if the text says “Couch for sale” but the image embedding strongly indicates “Black leather,” the transformer fuses these two modalities.) The output state corresponding to the CLS token is then routed to independent dense classification heads, one per ad attribute, to output the correct labels.
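To make the steps above concrete, here is a minimal pure-Python sketch of the sequence assembly and of the attention step for the CLS token. It is an illustration under stated assumptions, not our production code: the dimensions are toy values, the weights are random instead of trained, and only a single attention head is computed for the CLS query (no residual connections, layer normalization, feed-forward blocks, or classification heads). Every name in it is hypothetical.

```python
import math
import random

rng = random.Random(0)

# Toy sizes so the sketch stays readable; all three are assumptions.
D_MODEL = 8     # shared embedding dimension
TEXT_DIM = 16   # stand-in for the n-gram vocabulary size
IMAGE_DIM = 12  # stand-in for the ConvNeXt embedding size


def rand_vector(size):
    return [rng.gauss(0.0, 0.1) for _ in range(size)]


def rand_matrix(rows, cols):
    return [rand_vector(cols) for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Parameters that would be learned; randomly initialised here.
W_TEXT = rand_matrix(D_MODEL, TEXT_DIM)
W_IMAGE = rand_matrix(D_MODEL, IMAGE_DIM)
CLS_TOKEN = rand_vector(D_MODEL)
MODALITY = {"text": rand_vector(D_MODEL), "image": rand_vector(D_MODEL)}
W_Q, W_K, W_V = (rand_matrix(D_MODEL, D_MODEL) for _ in range(3))


def build_sequence(text_vec, image_vecs):
    """Project each input, add its modality embedding, prepend CLS."""
    def embed(weights, vec, kind):
        return [p + m for p, m in zip(matvec(weights, vec), MODALITY[kind])]

    tokens = [CLS_TOKEN, embed(W_TEXT, text_vec, "text")]
    tokens += [embed(W_IMAGE, img, "image") for img in image_vecs]
    return tokens


def cls_attention(tokens):
    """Scaled dot-product attention for the CLS query only."""
    query = matvec(W_Q, tokens[0])
    keys = [matvec(W_K, t) for t in tokens]
    values = [matvec(W_V, t) for t in tokens]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(D_MODEL)
              for key in keys]
    weights = softmax(scores)
    state = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(D_MODEL)]
    return state, weights


# One ad with a text vector and two image embeddings (random toy inputs).
text_vec = [rng.random() for _ in range(TEXT_DIM)]
image_vecs = [[rng.random() for _ in range(IMAGE_DIM)] for _ in range(2)]
tokens = build_sequence(text_vec, image_vecs)
cls_state, attention = cls_attention(tokens)
```

Here `attention` holds one weight per token (CLS, text, image 1, image 2), which is where the weighing of text against images happens. In a full model, `cls_state` would then feed one dense head per attribute.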

The results:

As a reminder, here are the metrics definitions:

exact match: every single ad attribute has been predicted correctly
exact match precision: number of exact matches over the total number of samples
overall precision: number of correct attribute predictions over total number of attributes to predict (considered independently)
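As a quick illustration of these two definitions, here is a small sketch on toy data. The ads and labels are made up for the example and the function names are hypothetical.

```python
def exact_match_precision(predictions, labels):
    """Share of ads for which every attribute was predicted correctly."""
    exact = sum(1 for pred, gold in zip(predictions, labels) if pred == gold)
    return exact / len(labels)


def overall_precision(predictions, labels):
    """Share of individual attributes predicted correctly, across all ads."""
    correct = total = 0
    for pred, gold in zip(predictions, labels):
        for attribute, value in gold.items():
            total += 1
            correct += pred.get(attribute) == value
    return correct / total


# Two toy ads, two attributes each; the second ad has one wrong attribute.
labels = [
    {"color": "black", "material": "leather"},
    {"color": "blue", "material": "fabric"},
]
predictions = [
    {"color": "black", "material": "leather"},
    {"color": "blue", "material": "leather"},
]

print(exact_match_precision(predictions, labels))  # 0.5
print(overall_precision(predictions, labels))      # 0.75
```

A single wrong attribute costs the whole ad in exact match precision but only one attribute in overall precision, which is why the first metric is always the stricter of the two.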

Here are some detailed examples per attribute for 2 categories (phone and furniture)

It is small: massive parameter shrinkage

The efficiency gains of this architecture are staggering. By decoupling the visual extraction from the reasoning and dropping the generative baggage of an LLM, we achieved a 20x to 40x reduction in total parameters (even when accounting for the ConvNeXt backbone).

Visual ConvNeXt backbone sits at 200M parameters.
The fuser holds 1.6M parameters (which is less than 1% of the total weight given how big the visual encoder is)
Then come the classification heads, which depend on the leboncoin category we’re predicting for and always add fewer than 100k parameters.

The visual backbone clearly accounts for almost the whole model size, which means there is still room for improvement if we manage to achieve the same result with a lighter visual backbone in the future.
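The arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch: it takes the 100k upper bound for the heads and compares against the 4B and 8B LLM sizes mentioned earlier.

```python
BACKBONE = 200_000_000  # ConvNeXt visual backbone
FUSER = 1_600_000       # fusion transformer
HEADS_MAX = 100_000     # upper bound quoted for the classification heads

total = BACKBONE + FUSER + HEADS_MAX
fuser_share = FUSER / total

for llm_params in (4_000_000_000, 8_000_000_000):
    ratio = llm_params / total
    print(f"{llm_params // 10**9}B LLM: {ratio:.1f}x more parameters")
print(f"fuser share of the total: {fuser_share:.2%}")
```

The ratios come out just under 20x and 40x, and the fuser accounts for a bit less than 1% of the total.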

While the ConvNeXt backbone is still served on the same GPU instances as the previous LLM (meaning no direct infrastructure gain there), the efficiency gain came entirely from swapping generative autoregressive generation for straightforward classification layers. This reduction in size and complexity translates into a massive boost in throughput, requiring fewer machines to handle our traffic.

Furthermore, because the ConvNeXt embeddings are archived as images are uploaded to our system, training the model is incredibly cheap. We only need to train the fuser, which takes approximately 15 minutes on a cheap 4-CPU machine using a dataset of half a million samples. This MLOps win allows us to retrain models frequently to capture leboncoin’s latest attributes catalog (which evolves constantly).

It is fast: sub-200ms latency

Deploying the model became trivial. The custom transformer is so lightweight that it runs entirely on CPU.

Transformer inference time: Sub-10ms.
End-to-End Latency: Factoring in network calls and visual embedding extraction, the roundtrip from the client’s mobile app, to our servers, and back clocks in at under 150ms (p95).

This allows for a real-time, frictionless user experience. By comparison, previous experiments showed that our fine-tuned LLM, even when served on a GPU, could take several seconds to perform a single prediction.

It is convenient: calibrated probabilities

Beyond speed and size, our tailored transformer gave us a critical feature that generative LLMs notoriously struggle with: calibrated probabilities.

LLMs are inherently confident guessers. Because they generate text tokens, it is difficult to extract reliable, well-calibrated confidence scores for a specific classification. For our product integration, business rules are non-negotiable. If the model isn’t at least 70% confident about the color of a phone, we would rather leave the field blank and let the user fill it in manually. Predicting the wrong color and forcing the seller to correct a pre-filled mistake creates a frustrating UX.

Because our custom model is a discriminative classifier at its core, it outputs true probability distributions, allowing us to build safe, reliable UX guardrails.
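A guardrail of this kind can be sketched in a few lines. The snippet below is a simplified illustration, not our production rule engine: it assumes the softmax output of a classification head is used as the confidence score, and the function names and logits are hypothetical.

```python
import math

CONFIDENCE_THRESHOLD = 0.70  # the 70% business rule mentioned above


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefill(labels, logits, threshold=CONFIDENCE_THRESHOLD):
    """Return the predicted label, or None to leave the field blank."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None


colors = ["black", "blue", "grey"]
confident = prefill(colors, [4.0, 1.0, 0.5])  # one class clearly dominates
uncertain = prefill(colors, [1.2, 1.0, 0.9])  # scores are nearly uniform

print(confident)  # black
print(uncertain)  # None
```

Returning `None` is what lets the product leave the field empty for the seller instead of pre-filling a likely mistake.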

It is happening: disambiguation

Let’s say our user sells a “veste en jean levis”:

“levis” explicitly indicates the brand
“veste” gives out the “manteau” category (label standing for both coats and jackets in our domain)
“en jean” indicates a denim material

Since denim is more often blue, a text-based model will do its best and guess “blue” as the color. Here are the actual probabilities with text as the sole input:

But what if the denim is actually black, as in those two photos of the actual jacket:

Training data for a denim jacket. Is it black or blue?

Here is how the probability shifts as the model considers the two visual tokens in its context:

As you can see, providing visual context to our classifier helps a lot in disambiguating “visual” attributes like color.

Conclusion: Pragmatism Over Hype

This project was a great reminder for me that pragmatism often beats hype. Don’t get me wrong, Large Language Models and VLMs are incredible tools. If you are doing zero-shot learning, lack domain-specific data, or don’t have the time and ML engineering skills to build a custom architecture, they are practically magic.

However, at leboncoin, we have a massive, continuous stream of data, mostly labeled by our own users every time they successfully post an ad. We have the MLOps maturity, the resources, and the domain-specific models already in place.

After all, leboncoin is the champion of the second-hand market, so it’s only fitting that our machine learning strategy embraces a bit of recycling, too! By repurposing the visual embeddings initially built by our visual-search team, we essentially gave them a highly successful “second life” in the ad creation flow. And this upcycling won’t stop here: we plan to reuse these incredibly versatile embeddings to step up the game for other features within our ad insertion funnel and beyond.

This doesn’t mean we are closing the door on LLMs, far from it. Our journey taught us that while LLMs might be overkill for discriminative classification at scale, they remain unrivaled at what they are designed to do: generation.

We aren’t throwing away our fine-tuning expertise or our hard-won prompt engineering lessons. Instead, we are pivoting them toward features where they provide unique value. For instance, we are currently experimenting with fine-tuned LLMs to tackle writer’s block. By generating high-quality, personalized ad descriptions based on the exact attributes our transformer just predicted, we can help sellers move from a blank screen to a polished listing in seconds.

Stay tuned!

Sometimes, the best AI strategy isn’t adopting the biggest model; it’s using or building the most suitable one.

Beyond the hype: how a custom multimodal transformer beat our fine-tuned LLM was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

OpenTelemetry’s Gauge vs Histogram : Latency tracking done right

leboncoin tech — Mon, 20 Apr 2026 08:02:16 GMT

Paul Drappier is a backend engineer and now Principal Architect at leboncoin, a marketplace that simplifies access to second-hand goods.
This story is rooted in a broader collective effort. Alexis Detrez contributed significantly to the migration, and the work was carried by leboncoin’s 120-person backend guild, which is responsible for monitoring our microservices.

If you’re a backend, SRE, or observability engineer who has switched from Gauge-based latency metrics to OpenTelemetry Histograms, there’s a good chance you’ve stared at suspiciously flat percentile curves and wondered if something was broken. We did — and it took us some digging to understand why.

This article tells the story of how a subtle combination of metric type semantics and default bucket configuration led to a near-total loss of latency granularity across our services, how we identified the root cause, and what we changed to fix it.

We’re sharing this story because it’s a common trap in OpenTelemetry migrations, and making these mistakes visible is how we collectively improve the state of the art.

***

At leboncoin, our backend guild recently undertook a major initiative: migrating to OpenTelemetry to modernize our observability stack. With this migration, we expected better observability. Instead, we got… flat lines.

The problem: loss of granularity in latency metrics

Historically, we used Gauge-type metrics to measure latency. During our migration, we followed OpenTelemetry’s recommendations and best practices for instrumentation and switched to Histograms (which we mapped to distributions to get global percentiles), a more statistically sound representation of response time dispersion.

The reasoning was correct:

A Gauge stores only the last value measured before being sent, which can mask significant variations within an aggregation period.
A Distribution groups values into latency buckets, offering a more representative view of response time dispersion.

So we made the switch. And immediately noticed something was wrong.

Before migration (using Gauges): Our latency curves (mean, P99, P95, P75) displayed natural fluctuations, accurately reflecting workload variations.

After migration (using Histograms): The latency curves became unexpectedly flat:

Percentiles (P99, P95, P75, etc.) appeared almost constant.
Normal latency variations were no longer visible.
Our once-dynamic curves had turned into flat lines.

Comparison of latency percentiles (P50, P75, P90): The historical Gauge metric (left) shows distinct curves for fine-grained analysis. The initial OpenTelemetry Distribution metric (right) with default settings caused the percentiles to overlap into a flat line, illustrating a critical loss of analytical granularity.

This behavior immediately alerted us, as the loss of data granularity made our monitoring and performance analysis significantly less effective.

Understanding the root cause

Investigating further, we identified two main reasons for this issue:

1. Gauges vs. distributions: why gauges hide the truth about tail latency

The fundamental issue with using Gauges for latency is that they capture only a single snapshot, the last measurement before aggregation. In a typical 10-second window where a service handles hundreds or thousands of requests, this means:

Most requests complete normally (e.g., 5–10ms)
Occasional spikes occur due to GC pauses, network issues, database slow queries, or resource contention (e.g., 50–150ms)
The Gauge captures whichever request happened last, statistically likely to be a “normal” one

Example scenario:

1000 requests in 10 seconds
950 requests at ~5ms
50 requests at ~50ms (spikes)
Gauge reports: ~5ms (95% probability of capturing a normal request)
Reality: P99 is ~50ms (and P95 sits right at the edge of the spikes), but it’s invisible
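The scenario above is easy to reproduce. The sketch below is a toy simulation, not our instrumentation: it models a last-value gauge as a uniformly random pick among the window’s requests, uses a nearest-rank percentile, and fixes the random seed so the run is repeatable.

```python
import math
import random

# One 10-second window from the scenario: 950 normal requests, 50 spikes.
latencies_ms = [5.0] * 950 + [50.0] * 50


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


rng = random.Random(42)
windows = 2_000
# A last-value gauge reports whichever request happened to finish last.
gauge_reports = [rng.choice(latencies_ms) for _ in range(windows)]
share_normal = gauge_reports.count(5.0) / windows

p99 = nearest_rank_percentile(latencies_ms, 99)

print(f"gauge shows a normal request in {share_normal:.0%} of windows")
print(f"P99 computed over all requests: {p99}ms")
```

Across many windows the gauge lands on a 5ms request about 95% of the time, while the P99 computed over the full set of requests is 50ms.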

This is why tail latency (P95, P99) was systematically underreported with Gauges. The spikes existed, users experienced them, but our monitoring didn’t capture them. What looked like a regression was actually the truth surfacing for the first time. By the end of this article, Distributions will make those spikes show up clearly in our percentile curves.

Distributions provide a more accurate picture by grouping latency values into predefined buckets, which brings us to the second problem.

2. The recommended OpenTelemetry bucket configuration was too coarse

OpenTelemetry’s default latency buckets were defined as follows:

[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ]

NB: The gradations in seconds follow OpenTelemetry’s recommendations: “When instruments are measuring durations, seconds (i.e. s) should be used”.

With only 15 buckets, this default configuration created several critical problems for our monitoring needs:

Challenge 1: Scale of our infrastructure

At leboncoin, we monitor over 1000 microservices with vastly different performance characteristics:

Some services respond in milliseconds (lightweight APIs, caching layers)
Others take seconds (heavy computation, large data transfers, external API aggregation)

Challenge 2: Loss of precision in critical ranges

The default configuration had significant gaps that made our monitoring ineffective.

Between 250ms and 500ms: A 250ms gap where all values were treated identically

For services in this range, we lost the ability to detect performance degradations
A service degrading from 260ms to 480ms would appear flat in our graphs

Between 2.5s and 5s: A 2.5-second gap representing a 2x difference in performance

For slower services (content-heavy responses, complex aggregations), this was problematic
Doubling response time is a critical degradation, yet it was invisible in our percentiles

Fast services (< 100ms): Only 6 buckets for the entire 0–100ms range

Most of our high-performance services fell into this category
Insufficient granularity to detect subtle but important performance shifts

These broad intervals caused most of our latency values to fall within the same buckets, leading to excessive aggregation and the loss of meaningful variations in percentiles. The result was the “flat line effect” we observed: our monitoring couldn’t distinguish between different performance states.
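The flat-line effect is easy to see by checking which bucket a value lands in. The sketch below uses the default boundaries listed earlier and assumes upper-inclusive buckets; `FINER_BOUNDS` is an illustrative, hypothetical set of tighter boundaries for the 250ms to 500ms range.

```python
from bisect import bisect_left

DEFAULT_BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5,
                  0.75, 1, 2.5, 5, 7.5, 10]
FINER_BOUNDS = [0.25, 0.275, 0.3, 0.325, 0.35, 0.4, 0.45, 0.5]


def bucket_of(latency_s, bounds):
    """Return the (lower, upper] bucket that a latency falls into."""
    i = bisect_left(bounds, latency_s)
    lower = bounds[i - 1] if i > 0 else 0.0
    upper = bounds[i] if i < len(bounds) else float("inf")
    return lower, upper


# A service degrading from 260ms to 480ms.
print(bucket_of(0.260, DEFAULT_BOUNDS))  # (0.25, 0.5)
print(bucket_of(0.480, DEFAULT_BOUNDS))  # (0.25, 0.5)
print(bucket_of(0.260, FINER_BOUNDS))    # (0.25, 0.275)
print(bucket_of(0.480, FINER_BOUNDS))    # (0.45, 0.5)
```

With the default boundaries both measurements are indistinguishable, so any percentile derived from the histogram stays put. With tighter boundaries the degradation moves the value across several buckets and becomes visible.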

Of course, this default bucket configuration is just a recommendation rather than a compulsory setup — as you will read in the next section, we broke free from it.

Our solution: designing a custom bucket configuration

To address the problem, we took the following steps:

Educate our teams on why Gauges were not an accurate latency representation and why our new metrics initially seemed “wrong.”

Customize our bucket configuration to include roughly 100 finely-tuned buckets, enabling much better granularity, a finer distribution of latency values, and a return to meaningful and dynamic percentile curves.

After analyzing our service landscape, we designed a bucket configuration optimized for our diverse workload:

[
// Maximum precision between 0s and 100ms (1ms spacing up to 10ms, then 5ms up to 50ms, then 10ms up to 100ms)
0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010,
0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045, 0.050,
0.060, 0.070, 0.080, 0.090, 0.100,
// Increased precision between 100ms and 5s (25ms spacing up to 350ms, then 50ms up to 1s, then 100ms up to 3s, then 250ms up to 5s)
0.125, 0.150, 0.175, 0.200, 0.225, 0.250, 0.275, 0.300, 0.325, 0.350,
0.400, 0.450, 0.500, 0.550, 0.600, 0.650, 0.700, 0.750, 0.800, 0.850, 0.900, 0.950, 1.000,
1.100, 1.200, 1.300, 1.400, 1.500, 1.600, 1.700, 1.800, 1.900, 2.000, 2.100, 2.200, 2.300, 2.400, 2.500, 2.600, 2.700, 2.800, 2.900, 3.000,
3.250, 3.500, 3.750, 4.000, 4.250, 4.500, 4.750, 5.000,
// Reduced precision between 5s and 60s (1s spacing up to 30s, then 5s up to 60s)
6.000, 7.000, 8.000, 9.000, 10.000, 11.000, 12.000, 13.000, 14.000, 15.000, 16.000, 17.000, 18.000, 19.000, 20.000, 21.000, 22.000, 23.000, 24.000, 25.000, 26.000, 27.000, 28.000, 29.000, 30.000,
35.000, 40.000, 45.000, 50.000, 55.000, 60.000
]
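Since the spacing is piecewise regular, a list like this does not have to be maintained by hand. Here is one possible way to generate the same boundaries; the helper name is hypothetical, and working in integer milliseconds is simply a choice to avoid floating-point drift.

```python
def ms_range(start_ms, stop_ms, step_ms):
    """Boundaries in seconds, from start to stop inclusive, stepping in ms."""
    return [ms / 1000 for ms in range(start_ms, stop_ms + 1, step_ms)]


BOUNDARIES = (
    ms_range(1, 10, 1)                   # 1ms spacing up to 10ms
    + ms_range(15, 50, 5)                # 5ms spacing up to 50ms
    + ms_range(60, 100, 10)              # 10ms spacing up to 100ms
    + ms_range(125, 350, 25)             # 25ms spacing up to 350ms
    + ms_range(400, 1_000, 50)           # 50ms spacing up to 1s
    + ms_range(1_100, 3_000, 100)        # 100ms spacing up to 3s
    + ms_range(3_250, 5_000, 250)        # 250ms spacing up to 5s
    + ms_range(6_000, 30_000, 1_000)     # 1s spacing up to 30s
    + ms_range(35_000, 60_000, 5_000)    # 5s spacing up to 60s
)

# Histogram boundaries must be strictly increasing.
assert all(a < b for a, b in zip(BOUNDARIES, BOUNDARIES[1:]))
```

Generating the list also makes it cheap to experiment with a different spacing for one range without retyping the others.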

This configuration provides roughly 100 buckets (105 boundaries) strategically distributed based on service behavior patterns:

0–100ms (23 buckets): Maximum granularity for high-performance services
  1ms precision up to 10ms (critical for fast APIs)
  5ms intervals from 10ms to 50ms
  10ms intervals from 50ms to 100ms

100ms-5s (51 buckets): Dense coverage for typical services
  25ms precision from 100ms to 350ms (catches most API latency)
  50ms precision from 350ms to 1s
  100ms precision from 1s to 3s
  250ms precision from 3s to 5s

5s-60s (31 buckets): Coarser but adequate for slow services
  1s intervals up to 30s
  5s intervals from 30s to 60s (sufficient for timeout detection)

Roughly 100 buckets turned out to be the minimum we needed to get a meaningful improvement in accuracy. We chose not to go further at this stage, both because of the additional memory and CPU overhead and because the impact on payload size and agent-side cost was still not fully clear.

In this chart, each point is a latency value sent via a bucket, and each line is the median of a bucket configuration.

—

The bucket configuration change is visible on Wednesday 15th in the chart above. Before this date, the smallest bucket boundary was 5ms, so any latency below that landed in the 5ms bucket.

By changing our bucket configuration, we immediately recovered that granularity.

—

On this graph, we can see the P50, P75, and P90 percentiles. The before/after effect of the bucket distribution change is clearly visible.

With this adjustment, we successfully restored a detailed and accurate view of our service performance.

This gave teams actionable signals. At the level of an individual service, finer granularity made it possible to diagnose intermittent spikes, sometimes down to a rare SQL query running much slower than the rest.

—

Key takeaways

Be aware that Gauges can hide significant latency variations. They capture only the last value in an aggregation window, systematically masking the tail latency that users actually experience.

OpenTelemetry’s default bucket configuration may not fit your specific workload; customization is key. 15 buckets across a 0–10s range is rarely sufficient for a production system with diverse service profiles.

If you see flattened percentile curves after migration, reconsider your bucket granularity before drawing any conclusions. Don’t mistake quieter graphs for better performance.

Comparing your sources of truth carefully helps. The moment we overlaid our Distribution-based curves against tracing data, the discrepancy was immediately apparent. Cross-validation is your best debugging tool.

By sharing our experience, we hope to help others avoid similar pitfalls and ensure a smoother transition to OpenTelemetry. If you’re planning an OpenTelemetry migration, pay close attention to how your metric types behave and what your bucket boundaries actually cover. Small configuration details can have major impacts on observability!

This article is part 2 of a two-part series on OpenTelemetry metrics at leboncoin. Read Part 1 for Story 1: Avoid a massive metric loss — delta vs cumulative metrics.

OpenTelemetry’s Gauge vs Histogram : Latency tracking done right was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Confidence by Spotify: bringing A/B testing into the product conversation at leboncoin

leboncoin tech — Tue, 24 Mar 2026 10:22:18 GMT

Confidence by Spotify: how A/B testing enriches product conversations at leboncoin

Patrice Chaperon is Director of the Platform & Insight tribe at leboncoin, which includes the Data, Infrastructure and DevX teams and represents more than one hundred people.
As part of his data strategy, he steered the deployment and adoption of Confidence by leboncoin teams.

This article also highlights the people who made this happen: Alexis Mauzé, Léa Rouillard, Qian Jin, Joséphine Lhermite, Olivier Dumenil, Frédéric Jouannaud, and the Spotify Confidence team.

At leboncoin, experimentation isn’t a side practice. It’s how we build and improve our platform. Across web, mobile apps, and backend services, teams regularly run A/B tests to validate hypotheses, improve user journeys, and guide product decisions with data.

In 2024, we ran about 160 experiments totaling over 35 billion user impressions. Behind each of those tests was a conviction that learning beats guessing and that every experiment, whether it ships or not, makes us sharper. Yet with more than 700 people across 70 feature teams contributing to the same codebase, we were looking to expand the practice.

So when we decided to swap our internal legacy tool for an external, beta-stage platform, it wasn’t a small call. This is the story of how we adopted Confidence by Spotify and how that decision reshaped not only our tooling but also our experimentation culture.

A strategic shift to scale experimentation

For years, we had been running experiments through an in-house-built tool. It got the job done: the software let us implement feature flags and randomly split visitors across all our platforms (web, mobile, and backend). But, as our experimentation culture grew more ambitious, its limits became harder to ignore.

Experimentation was effective, yet concentrated in the hands of a small group of experts. Running a test often required heavy involvement from data teams, manual processes, and bespoke analysis. Engineers would ship the flag, then analysts handled everything else: from sizing and defining metrics to pulling exposure data and computing statistical significance by hand.

That process worked, but it didn’t scale. It didn’t empower product teams to move fast. And it created a bottleneck we could no longer afford. So we decided to look for an external solution to accelerate, scale, and align with market standards.

The ambition was clear:

empower product teams to run experiments more autonomously,
standardize experimentation practices across P&T teams,
and reduce the friction and hidden costs associated with running each new test.

The bet: choosing Confidence by Spotify

When evaluating available solutions, we explored both in-house and third-party options. The criteria were demanding:

Full-stack coverage: web, mobile, and backend feature flags — not just one layer
Low maintenance overhead
Seamless integration with our existing data infrastructure and workflows
Flexible segmentation and progressive rollout capabilities
Data stays in our own data platform
Robust, automated statistical analysis
Scalable to our volume (millions of users) without breaking the budget

We evaluated the usual suspects: well-known products like Optimizely and LaunchDarkly, and open-source options such as GrowthBook and Flagsmith. We considered rebuilding in-house. None of these options hit the sweet spot. Then we discovered Confidence, an experimentation platform developed internally by Spotify’s team that opened as a SaaS product in 2025. And it seemed to check every box in our requirements grid:

native support for feature flags and A/B tests across web, mobile, and backend,
warehouse-native analytics, with no data stored externally,
strong statistical tooling that “forces teams to set hypothesis and be outcome-driven”,
a pricing model that fits our budget,
and a product philosophy centered on learning rate rather than success rate.

When we first saw it, two things stood out: it was built by teams deeply experienced in large-scale experimentation, and it was still in external beta. Becoming one of their first external design partners was a big bet. But that early-adopter position turned into a win-win.

Because we were a large-scale early adopter, Spotify was extremely open and responsive to our feedback. In return, our adoption as one of Confidence’s first significant external design partners helped push the platform from beta to general release. That position came with real risk, but also with something rare: a real partnership with the team building the tool.
We’ll come back to that.

Screenshot of our criteria grid to assess the best experimentation solution

The integration: what “full-stack” actually looks like

Once the ink was dry on our decision, the real work began: integrating Confidence into our tech stack. This was no small feat. We planned to run a thorough proof-of-concept project to validate the integration step by step.

The first challenge was embedding Confidence’s SDK everywhere: our web frontend, iOS and Android apps, and backend services. To avoid locking themselves into a single provider, teams implemented a compatibility layer based on OpenFeature, a vendor-agnostic, community-driven API for feature flagging. On mobile especially, this replaced a heterogeneous mix of previous solutions.
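The idea behind such a layer can be sketched as follows. This is a hand-rolled illustration of the pattern, not the OpenFeature SDK’s API and not our actual code: every class, method, and flag name here is hypothetical.

```python
from typing import Any, Dict, Optional, Protocol


class FlagProvider(Protocol):
    """What every vendor adapter must implement (hypothetical interface)."""

    def resolve_boolean(self, key: str, default: bool,
                        context: Dict[str, Any]) -> bool: ...


class StaticProvider:
    """Stand-in vendor adapter backed by a plain dictionary."""

    def __init__(self, flags: Dict[str, bool]) -> None:
        self._flags = dict(flags)

    def resolve_boolean(self, key, default, context):
        return self._flags.get(key, default)


class FlagClient:
    """Application code talks to this class, never to a vendor SDK."""

    def __init__(self, provider: FlagProvider) -> None:
        self._provider = provider

    def set_provider(self, provider: FlagProvider) -> None:
        self._provider = provider

    def is_enabled(self, key: str, default: bool = False,
                   context: Optional[Dict[str, Any]] = None) -> bool:
        try:
            return self._provider.resolve_boolean(key, default, context or {})
        except Exception:
            return default  # a provider failure falls back to the default


client = FlagClient(StaticProvider({"new-search-filters": True}))
print(client.is_enabled("new-search-filters"))  # True
print(client.is_enabled("unknown-flag"))        # False
```

Swapping vendors then means writing one new adapter and calling `set_provider`, while feature code keeps calling `is_enabled` unchanged.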

And from a technical standpoint, this is one of Confidence’s biggest strengths: true cross-platform experimentation. Features can be developed once at the backend level and evaluated consistently across web and mobile surfaces.

“Before, experimentation was mostly backend-driven. Now we’re running real cross-platform experiments.”
— Joséphine L’Hermite, Search Product Manager

On mobile, the adoption was equally impactful.

“We were already used to feature flags with Firebase. Confidence adds product-oriented feature flags on top of that. It enables impact-driven experimentations. And we’re not even at the full potential yet.”
— Qian Jin, Android developer, AdLife squad

From a developer experience standpoint, the shift was tangible.

“The code integration is really fast to set up. It doesn’t put a heavy burden on developers.”
— Léa Rouillard, Web Guild Leader

One reason Confidence adoption went smoothly on mobile is that the groundwork was already there. iOS and Android teams had long been in the habit of flagging every feature before merging, making the shift to Confidence a natural extension of an existing workflow rather than a disruptive change.

“Everything was already flagged. Plugging Confidence into that setup was straightforward. We deploy multiple times a day, QA can test several features on the same build, and activation is controlled safely.”
— Frédéric Jouannaud and Olivier Dumenil, Mobile Guild Leaders

The privacy challenge

One of the most critical aspects of the adoption was privacy. As we operate in France, we had to ensure that running experiments with Confidence respected users’ privacy choices as defined by:

the CNIL (France’s data protection authority) legal framework,
and the European GDPR (General Data Protection Regulation).

For us, it meant two things: data had to remain in-house, and experimentation had to respect user consent.

Keeping all data in-house

Confidence’s warehouse-native mode was a non-negotiable for us. Instead of storing event data on its own servers, Confidence queries our Redshift warehouse directly for experiment metrics. We built dedicated fact tables to aggregate experiment events and user attributes, and scheduled ETL jobs to keep them up to date. The result: Confidence does the analysis, but all the data stays ours.

Managing consent

The core issue: A/B testing requires assigning users to cohorts. The CNIL treats it as an analytics/tracking activity, which requires explicit consent. Progressive rollouts, on the other hand, can qualify as functional (essential for service stability) and may not require consent at all.

Confidence, like most experimentation platforms, relies on a persistent identifier to assign users to a variant. In our case, we store it as a first-party cookie, a reliable approach for web (though not a cross-platform solution by nature). So, how do you run compliant experiments while maintaining the ability to do gradual rollouts for all users, regardless of consent?

Our solution was a dual-identifier system:

A visitor_id set as a first-party cookie on the first visit, used for rollouts and essential features.
An experiment_id issued only after a user consents through our Consent Management Platform, used for A/B tests.

Users without consent are automatically held in a neutral variant for any experiment, so that no user is included in an experiment without proper consent. Once consent is granted, they’re seamlessly assigned an experiment_id and included going forward.
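
The gating logic above can be sketched in a few lines. This is a minimal illustration, not our production code and not the Confidence SDK: the `User` dataclass, the hash-based `bucket` helper, the function names, and the `"neutral"` variant label are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    visitor_id: str                      # first-party cookie, set on first visit
    experiment_id: Optional[str] = None  # issued only after consent (hypothetical field)


def bucket(identifier: str, salt: str, buckets: int = 100) -> int:
    """Deterministically map an identifier to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{identifier}".encode()).hexdigest()
    return int(digest, 16) % buckets


def rollout_enabled(user: User, flag: str, percentage: int) -> bool:
    """Progressive rollout: keyed on visitor_id, so it can cover every user."""
    return bucket(user.visitor_id, flag) < percentage


def experiment_variant(user: User, experiment: str, variants: list[str]) -> str:
    """A/B test: keyed on experiment_id, so users without consent stay neutral."""
    if user.experiment_id is None:
        return "neutral"
    return variants[bucket(user.experiment_id, experiment, len(variants))]
```

The point of the split is that the two code paths never share an identifier: a rollout can reach 100% of traffic through `visitor_id`, while an experiment can only ever see users who hold an `experiment_id`.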

“Without consent, we use rollout IDs stored in a cookie with a limited lifespan, validated by the privacy team.”
— Léa Rouillard, Web Guild Leader

This architecture, co-designed with our engineering leads and validated with Spotify’s input, meant we could keep running progressive rollouts safely across 100% of users while keeping A/B experiments fully consent-gated. Compliance wasn’t bolted on after the fact. It was baked into the foundation.

Screenshot of tech architecture design leboncoin x confidence

A new experimentation baseline at leboncoin

Perhaps the most visible impact of Confidence has been at the organizational and mindset level. But it starts with a deceptively simple observation: the tool works for everyone, regardless of their background.

An interface that bridges expertise levels

One of Confidence’s core strengths is its UI, both universal and adapted to different areas of expertise. And it’s harder to get right than it sounds.

Indeed, aligning siloed teams around a shared experimentation workflow, with no dedicated organizational home, requires a surface that speaks to product managers, engineers, and statisticians at the same time. Confidence pulls that off.

A product manager can open the platform, read the hypothesis their squad formulated, check the health metrics, and get a clear recommendation (ship or don’t ship) without needing to interpret a single p-value. The Spotlight feature surfaces automated recommendations based on health checks and guardrail metrics, so the decision leaves no room for interpretation.

Screenshot of health checks monitoring in Confidence

Screenshot of Spotlight’s recommendation in Confidence: Don’t ship

At the same time, the depth is there for those who need it. Data teams can inspect the full statistical machinery behind every result: confidence intervals, sample size calculations, metric sensitivities. The simplicity on the surface doesn’t come at the cost of rigor underneath.

“Confidence gives you the guarantee of running statistically sound experiments without being a statistics expert. It creates a common foundation for a product expert trying to validate a hypothesis, while offering the depth a statistician needs to verify that the answer is accurate and precise.”
— Alexis Mauzé, Technical Program Manager

This matters because it changes the dynamic between product and data teams. Previously, product teams would build their own analysis, often with in-house tooling, and data analysts would step in afterward to verify (and sometimes correct) the statistical validity of the results. Now, statistical rigor is built into the platform itself. Once a metric or fact table is properly configured, it benefits every subsequent experiment that uses it. The setup cost is front-loaded; the productivity gains compound.

Building trust from within

A tool that bridges expertise levels only works if people trust what it tells them. And that trust has to be earned.

Confidence earns it on two fronts. First, reliability: the platform does what it says. Flags behave as configured, allocations are consistent, results are reproducible. For teams that had previously worked with more fragile in-house setups, that predictability is a foundation.

Second, statistical credibility: the statistical engine and approach behind Confidence’s calculations are backed by peer-reviewed research. That’s not a marketing claim; it’s what allows data teams to sign off on results without rebuilding the analysis from scratch, and what lets product managers act on recommendations with confidence rather than second-guessing the numbers. It’s what allows the whole organization to trust the numbers.

When a platform is both reliable and statistically sound, something shifts. Teams stop debating the tool and start focusing on what matters: the methodology, the hypothesis, and the learning. And usage automatically increases.

From a handful of experts to hundreds of users

Before, experimentation tools were used by about ten people on a regular basis. Today, more than 300 collaborators actively use Confidence across product, engineering, and data roles.

For product managers, the shift is toward genuine autonomy. Setting up a test no longer requires engineering involvement for the analytics layer. The dashboard is readable without a SQL background.

“It’s extremely simple to see whether a feature flag is on or off, to test yourself in a variant, and to understand results. It really pushed experimentation culture further for everyone.”
— Joséphine L’Hermite, Search product manager

Confidence’s dashboard has become a shared discussion space inside each feature team, a common language that bridges silos between product, engineering, and data.

Screenshot of Confidence UI with number of users and sankey diagram of experimentation learnings

Optimizing for learning, not just success

A key mindset shift accompanied the tooling change: focusing on learning rate rather than success rate. Spotify built Confidence around a philosophy that resonates deeply with where we want to go: the learning rate matters more than the win rate (read about it on Spotify’s engineering blog). Most experiments don’t ship to production. That’s not failure. That’s a signal. A well-designed experiment that tells you “this doesn’t work for our users” is worth as much as one that launches; sometimes more.

This outcome-driven approach changes how teams approach testing. An experiment isn’t a bet you win or lose. It’s a question you answer.

Confidence enforces that discipline structurally. Before running a test, teams must define their hypothesis, their primary metric, and their guardrails. The platform won’t let you shortcut the methodology.

And that discipline compounds over time. Product teams are maturing their experimentation practices, improving statistical literacy, and embracing experimentation as a learning tool rather than a delivery constraint.

“It really helps evangelize data-driven thinking. Everyone gets better at asking why, not just what.”
— Joséphine L’Hermite, Search product manager

Developing an experimentation culture

Joséphine’s team is one of the most experiment-intensive at leboncoin, running A/B tests across search features at high frequency. For them, Confidence resolved a long-standing tension: the tool now supports cross-platform experimentation natively, where previously they were limited to backend-driven piloting.

For engineers, Confidence’s flag management has become part of the standard build process, in large part because the groundwork was already there. Mobile teams had long been in the habit of flagging every feature before merging. Confidence extended that habit into a full experimentation workflow: progressive rollouts allow teams to validate not just that a feature works, but that it delivers real value. If conversion is up, we roll the feature out to everyone. If not, we can easily roll it back. Confidence didn’t slow down delivery; it made it safer.

For data analysts, the impact is structural. Previously, every experiment carried a non-trivial setup cost: pulling exposure data, computing metrics, and verifying significance manually through complex Python notebooks. That friction acted as a brake on experimentation volume. With Confidence, statistical analysis runs automatically from warehouse queries on top of metrics that are shared and reusable across teams. The entry cost exists, but it’s paid once.

The next phase is already in motion: increasing adoption across remaining squads, improving experimentation methodology, and building the tooling and practices that will let leboncoin raise the pace of learning across the entire product organization.

The partnership that made it work

One of the most frequently cited reasons this project succeeded is the working relationship with Spotify’s Confidence team. They didn’t just hand us a platform and a knowledge base. They treated our implementation as a joint engineering challenge, and that’s a different category of vendor relationship. A shared Slack channel was (and still is) active throughout. Spotify engineers came on-site to help with onboarding. Monthly workshops, frequent syncs, dedicated training materials, and near-real-time responsiveness became the norm.

When our engineers hit edge cases, particularly around the dual-ID consent logic, Spotify’s team didn’t just answer questions. They shared how Spotify handles similar problems internally, giving us battle-tested playbooks rather than generic documentation. And when it was relevant, they delivered key features in the hour following a conversation.

“leboncoin didn’t just adopt Confidence, they embraced our learning-rate philosophy from the start. A lot of what they ran into while scaling experimentation we’d lived at Spotify, but there were real differences too, as their data stack was mature and opinionated, Redshift instead of our BigQuery, and working within that gave us the opportunity to pressure-test every assumption together. That kind of transparency is rare in a vendor relationship. It’s more like how good engineering teams work internally. The result is a platform that’s better for everyone who comes after them, and a partnership we have really enjoyed.”
— Donna Hanafi, Senior Product Manager for Confidence, by Spotify

This dynamic between a large external client and a beta-stage product team turned out to be a genuine win-win. We brought real production constraints that helped Confidence mature — and that maturation was visible. Several features were shipped in beta specifically to address our use cases, and our production environment continues to influence the product roadmap today. They brought deep experimentation expertise that accelerated our setup.

The reactivity was unlike anything we’d experienced with a software vendor before: when something was relevant to our needs, improvements landed incredibly fast.

A platform, a culture, a new baseline

Today, Confidence by Spotify is not just another tool in our stack. It has become the standard experimentation platform across product, tech, and data teams — enabling faster iteration, stronger cross-functional collaboration, and better privacy guarantees along the way. And most of all, it helped spread a shared experimentation culture across the company.

Adoption is growing, and the trajectory is clear. New squads are onboarding regularly. New use cases are emerging in places that had never been part of an experimentation workflow before.

Confidence has become a daily touchpoint. It is not just a place for data analysts to compute KPIs, nor a place for devs to set up feature flags, or even for PMs to organize an experiment. It has become a shared surface where PMs, engineers, and data people look at the same numbers and make decisions together, breaking down silos that used to slow everything down.

Not every team has fully adopted it yet. But for the first time, we have the infrastructure, the tooling, and the culture that make continuous experimentation possible at scale. The journey is far from over. If anything, it’s just entering its most interesting phase.

Editor’s note: Go beyond this piece with this recording of the Paris meetup with leboncoin and Spotify’s Confidence team 👇

https://medium.com/media/0cfbfdbf152b716daed578274409c46d/href

Confidence by Spotify: bringing A/B testing into the product conversation at leboncoin was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

How 1 hour of fine-tuning beat 3 weeks of RAG engineering

leboncoin tech — Thu, 12 Mar 2026 15:30:53 GMT

We fine-tuned our first LLM at leboncoin, and here’s what we learned

Meiyi PAN, Senior Machine Learning Engineer @leboncoin

The color problem

At leboncoin, France’s largest classified ads platform, we help millions of users sell their stuff every day. Ad posting is at the heart of our marketplace: it’s the critical moment where supply enters the platform. When someone lists an iPhone for sale, we ask them to fill in attributes: brand, model, storage, and color. These attributes power search filters and help buyers find exactly what they’re looking for.

The challenge? Filling in all these fields takes time. This is where the Cognition team comes in. We’re a product and ML team focused on making ad creation faster and smoother by building ML and GenAI-powered services that reduce seller effort and improve ad quality.

To address this challenge, we first developed a machine learning model back in 2020: a text classifier based on character n-grams (2–5 grams) with a simple neural network architecture:

Title Text → CountVectorizer (char n-grams) → Dense(32) → Dropout → Dense (per attribute)

It’s elegant in its simplicity: transform the title into a sparse bag-of-ngrams vector, embed it into a 32-dimensional space, then classify each attribute independently. The model is fast, cheap to serve, trained on millions of ads, and retrained daily.

For text-heavy attributes, it works remarkably well. If someone writes “iPhone 13 Pro 256GB”, the character n-grams capture “iPho”, “Phon”, “hone”, “13 P”, “Pro”, “256G”, “GB”, enough signal to reliably extract brand (Apple), model (iPhone 13 Pro), and storage (256GB).
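
To make the feature extraction concrete, here is a small pure-Python sketch of character n-grams over the 2 to 5 range. It only illustrates the idea: the `char_ngrams` helper is a name invented for this example, and unlike scikit-learn's `CountVectorizer` (which lowercases by default) it keeps the original case so the output matches the n-grams quoted above.

```python
def char_ngrams(text: str, n_min: int = 2, n_max: int = 5) -> list[str]:
    """Return every character n-gram of length n_min..n_max, shortest first."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


title = "iPhone 13 Pro 256GB"
grams = set(char_ngrams(title))

# Every n-gram quoted in the paragraph above is part of the feature set.
for expected in ["iPho", "Phon", "hone", "13 P", "Pro", "256G", "GB"]:
    assert expected in grams
```

Because the n-grams overlap, a typo or an unusual spacing in the title only disturbs a few of them, which is part of why this kind of model holds up well on messy user text.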

But color was a different story.

Our production model’s accuracy on color attributes could have been better. The title often doesn’t mention the color at all (“Selling my iPhone, perfect condition!”) or uses ambiguous terms (“Pacific Blue” vs “Sierra Blue”, both are blue, but which one?).

The information was right there in the images. A human could look at the photo and instantly see the phone is blue. But our text-based model couldn’t see images.

That’s when we decided to experiment with Vision-Language Models.

The hypothesis was simple: if we give the model both the title and the images, we should get better predictions, especially for visual attributes like color.

Why Qwen3-VL-8B?

We selected Qwen3-VL-8B as our base model, recognizing it as the current state-of-the-art open-source VLM. The decision to use this model was driven by three main factors:

French language support: critical for processing our ads.
Resource efficiency: it fits on a single A10G GPU with 4-bit quantization, an affordable and widely available instance on AWS.
Native support for structured extraction: it works with vLLM’s guided JSON decoding out of the box, essential for our use case.

We also benchmarked against Claude Haiku 4.5. Our base Qwen experiments matched or exceeded the proprietary API, confirming that domain adaptation is more crucial than raw model capability for this specific task.

The real-world complexity: multi-category, multi-task prediction

Before diving into our experiments, let me show you what “ad parameter prediction” actually means at scale. This isn’t a toy problem.

We support 16 categories, each with different attributes and label spaces:

HIGH COMPLEXITY (brand-heavy categories)
+------------+------------+-------------------------------------------+
| Category   | Attributes | Complexity Notes                          |
+------------+------------+-------------------------------------------+
| Clothing   | 6          | 1,500+ brands alone (!)                   |
| Shoes      | 5          | 700+ brands, size dependencies            |
| Phones     | 4          | 700+ models, brand -> model dependencies  |
| Accessories| 5          | 200+ brands                               |
+------------+------------+-------------------------------------------+

MEDIUM COMPLEXITY
+-----------------+------------+----------------------------------+
| Category        | Attributes | Complexity Notes                 |
+-----------------+------------+----------------------------------+
| Baby Equipment  | 3          | ~200 brands                      |
| Watches         | 4          | 100+ luxury & mass-market brands |
| Home Appliances | 3          | ~75 brands, product types        |
+-----------------+------------+----------------------------------+

LOWER COMPLEXITY (still multi-label)
+-----------------------------+------------+---------------------------+
| Category                    | Attributes | Notes                     |
+-----------------------------+------------+---------------------------+
| Toys                        | multiple   | fewer brands              |
| Bikes                       | multiple   | attribute dependencies    |
| Furniture                   | multiple   |                           |
| Decoration                  | multiple   |                           |
| Video Games                 | multiple   |                           |
| Consoles                    | multiple   |                           |
+-----------------------------+------------+---------------------------+

Total: 200+ unique attributes across 16 categories, thousands of possible labels

Some categories are “easy” (bikes: 20 labels across 3 attributes). Others are nightmares (clothing: 1,500+ brands alone, before counting colors, sizes, types…).

And then there are dependencies. In phones, the valid models depend on the brand:

- If phone_brand = "Apple" then phone_model is one of {iPhone 13, iPhone 13 Pro, iPhone 14, …} (42 models)
- If phone_brand = "Samsung" then phone_model is one of {Galaxy S23, Galaxy S24, …} (80+ models)
- Total: 24 brands × varying models = 700+ valid combinations, but on average only ~30 models are valid for any given brand
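
The dependency rule can be expressed as a simple consistency check. The sketch below is illustrative only: `VALID_MODELS` is a tiny, hand-written excerpt rather than the real mapping, and `is_consistent` is a hypothetical helper name.

```python
# Hypothetical, truncated mapping; the real one covers 24 brands and 700+ models.
VALID_MODELS = {
    "Apple": {"iPhone 13", "iPhone 13 Pro", "iPhone 14"},
    "Samsung": {"Galaxy S23", "Galaxy S24"},
}


def is_consistent(prediction: dict[str, str]) -> bool:
    """True if the predicted model belongs to the predicted brand."""
    brand = prediction.get("phone_brand")
    model = prediction.get("phone_model")
    # Unknown or missing brands have no valid models, so they fail the check.
    return model in VALID_MODELS.get(brand, set())
```

For example, `{"phone_brand": "Apple", "phone_model": "iPhone 13 Pro"}` passes, while an Apple brand paired with a Galaxy model does not.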

This is the machine learning reality we’re tackling: a multi-category, multi-task, multi-label classification problem with intricate hierarchical dependencies and massive label spaces.

Exact Match: How we measure success (in %)

Throughout this article, we use Exact Match (all attributes correct) because a partial prediction still requires manual correction. Per-attribute accuracy can be misleading: on clothing, 59% accuracy translates to only 6.5% exact match, meaning 93.5% of predictions still need user intervention. Exact match is a better metric for user experience.
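
The gap between the two metrics is easy to reproduce on toy data. The sketch below is a minimal illustration with made-up predictions; the function names and the two-ad dataset are assumptions for the example, not our evaluation code.

```python
def per_attribute_accuracy(preds: list[dict], golds: list[dict]) -> float:
    """Share of individual attribute values predicted correctly."""
    correct = total = 0
    for pred, gold in zip(preds, golds):
        for attr, value in gold.items():
            total += 1
            correct += int(pred.get(attr) == value)
    return correct / total


def exact_match(preds: list[dict], golds: list[dict]) -> float:
    """Share of ads where every attribute is predicted correctly."""
    hits = sum(
        all(pred.get(attr) == value for attr, value in gold.items())
        for pred, gold in zip(preds, golds)
    )
    return hits / len(golds)


# Toy data: each ad has one attribute right and one wrong.
golds = [{"brand": "Nike", "color": "Noir"}, {"brand": "Zara", "color": "Bleu"}]
preds = [{"brand": "Nike", "color": "Blanc"}, {"brand": "Kiabi", "color": "Bleu"}]

print(per_attribute_accuracy(preds, golds))  # 0.5
print(exact_match(preds, golds))             # 0.0
```

As a rough intuition, under the simplifying assumption that errors are independent, six attributes at 59% each would all be correct only about 4% of the time (0.59^6 ≈ 0.042), the same order of magnitude as the 6.5% we observed.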

Over the next 6 weeks, we tried three different approaches with Qwen. Here’s what worked, what didn’t, and why we wish we’d tried fine-tuning first.

V0 — The naive approach: just put everything in the prompt

What we tried

Our first attempt was the most obvious one. We had a powerful Vision-Language Model (Qwen3-VL-8B). We had labeled data with all valid attribute values. Why not just tell the model all the options and let it pick?

We built a naive prompting system:

1. Take the ad title and images

2. Inject all valid values for each attribute into the prompt

3. Ask the model to extract the correct ones

The prompt looked something like this:

You are an assistant that extracts product attributes from ads.

Given this ad:
Title: "iPhone 13 Pro 256GB excellent état"
Images: [image1.jpg, image2.jpg]

Extract the following attributes. Use ONLY values from the provided lists.

phone_brand (valid values): Apple, Samsung, Xiaomi, Huawei, OnePlus, Google,
Sony, LG, Nokia, Motorola, OPPO, Vivo, Realme, Honor, Asus, ZTE, Alcatel,
Blackberry, HTC, Lenovo, Meizu, Nothing, Fairphone, … [24 brands]
phone_model (valid values): iPhone 13, iPhone 13 Mini, iPhone 13 Pro,
iPhone 13 Pro Max, iPhone 14, iPhone 14 Plus, iPhone 14 Pro, iPhone 14 Pro Max,
iPhone 15, iPhone 15 Plus, iPhone 15 Pro, iPhone 15 Pro Max, Galaxy S23,
Galaxy S23+, Galaxy S23 Ultra, Galaxy S24, … [742 models!]

phone_memory (valid values): 16GB, 32GB, 64GB, 128GB, 256GB, 512GB, 1TB [8 values]

phone_color (valid values): Noir, Blanc, Bleu, Rouge, Vert, … [16 colors]

Return a JSON object with the extracted values.

For phones, that’s 790 labels in the prompt. Manageable, maybe.

But for clothing? The prompt would need to list 1,500+ clothing brands. Impossible.

What happened

Everything went wrong.

For “simpler” categories like phones (Category 17), the results were mediocre: 29.5% exact match (meaning all attributes correct). That’s worse than our production model (37%).

But for complex categories, the results were catastrophic:

+-------------------+-------+--------+-----------+-----------------------------+
| Category          | Attr  | Labels | V0 Prompt | What went wrong             |
+-------------------+-------+--------+-----------+-----------------------------+
| Clothing          | 6     | 1600+  | 0.0%      | 1500+ brands alone          |
| Decoration        | 4     | 75     | 0.0%      | Model confused by options   |
| Toys              | 3     | 36     | 5.0%      | Small but still failed      |
| Children Furniture| 3     | 237    | 5.6%      | 200 baby equipment brands   |
+-------------------+-------+--------+-----------+-----------------------------+

Zero percent. On clothing, the model couldn’t get a single prediction fully correct out of 200 samples.

Why it failed

1. Prompt length explosion

For clothing, we had 6 attributes to predict:

- brand: 1,500+ labels 
- category: 16 labels
- color: 21 labels 
- type of cloth: 4 labels 
- clothing size: 59 labels

Just listing the brand names would consume ~15,000 tokens. Even though Qwen3-VL-8B supports a 128K context window, we found that the model struggled to pick the right option from 1,500+ similar brand names. The large number of choices led to confusion and hallucination.

2. Similar options confused the model

When you give the model 500 phone models, it starts making mistakes on similar names:

- "iPhone 13 Pro" vs "iPhone 13 Pro Max" 
- "Galaxy S23" vs "Galaxy S23+" vs "Galaxy S23 Ultra" 
- "Xiaomi 13" vs "Xiaomi 13 Pro" vs "Xiaomi 13 Ultra"

The model would pick a plausible-looking option that wasn’t quite right.

3. No understanding of dependencies

Here’s a subtle but critical problem: attributes have dependencies.

If phone_brand = "Apple", then phone_model must be one of 42 Apple models (iPhone 13, iPhone 13 Pro, iPhone 14…). 
If phone_brand = "Samsung", then phone_model must be one of 80+ Samsung models (Galaxy S23, Galaxy S24…).

There are 24 brands, each with their own valid model list, totaling 700+ phone models.

But our simple prompt just listed all 700+ models together. The model sometimes predicted brand = “Apple” and model = “Galaxy S23”, a pure hallucination. Or it would pick a model that doesn’t exist for that brand.

Clothing was even worse. The clothing_type (Haut, Bas, Ensemble, Autre) determines the valid clothing_st (size/style) values. Furthermore, the large number of brand labels necessitated a “fuzzy match” strategy, which conducts regex-like word searches in the title to find the closest brand match, as we cannot enumerate all valid combinations.
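
A fuzzy brand match of this kind can be sketched with the standard library alone. This is an assumption-heavy illustration rather than our implementation: the `BRANDS` list is a short hypothetical excerpt, `match_brand` is an invented name, and using `difflib` with a 0.8 cutoff as the typo-tolerant fallback is a choice made for the example.

```python
import re
from difflib import get_close_matches
from typing import Optional

# Hypothetical short list; the real label space has 1,500+ brands.
BRANDS = ["nike", "zara", "kiabi", "levis", "adidas", "lacoste"]


def match_brand(title: str, brands: list[str] = BRANDS, cutoff: float = 0.8) -> Optional[str]:
    """Return the brand found in the title, or the closest one, or None."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    # 1. Prefer an exact whole-word hit.
    for word in words:
        if word in brands:
            return word
    # 2. Otherwise tolerate small typos, using a similarity cutoff.
    for word in words:
        close = get_close_matches(word, brands, n=1, cutoff=cutoff)
        if close:
            return close[0]
    return None
```

With this sketch, "Veste Lacoste taille M" resolves to `lacoste` and a misspelled "Addidas" still resolves to `adidas`. Multi-word brands are not handled here, which is one of the reasons this kind of matching gets complicated quickly.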

What we learned

Lesson 1: Dumping all labels into the prompt doesn’t scale. The model gets overwhelmed.

Lesson 2: Without explicit dependency handling, the model makes inconsistent predictions.

Lesson 3: Prompt engineering alone can’t solve complex structured extraction problems.

We needed a smarter approach.

V1 - The complex RAG: cascade 2-step prediction with dependency handling

What we tried

V0 failed because we threw everything at the model at once. The obvious solution: break the problem into smaller pieces.

We designed a cascade 2-step prediction system with explicit dependency handling:

Step 1: Predict “parent” attributes first (like brand)

Step 2: Use the parent predictions to filter valid “child” options (like model), then predict those

This required building several components:

1. Dependency configuration

We mapped out all the parent-child relationships between attributes.

// Phone category: 24 brands → 742 models
{
  "dependencies": [{
    "master": "phone_brand",
    "dependent": "phone_model",
    "mapping": {
      "apple": ["iphone11", "iphone11pro", "iphone11promax", "iphone12",
                "iphone12mini", "iphone12pro", "iphone13", "iphone13mini",
                "iphone13pro", "iphone13promax", "iphone14", "iphone14plus",
                "iphone14pro", "iphone14promax", "iphone15", "iphone15plus",
                "iphone15pro", "iphone15promax", "iphone16", "iphone16pro",
                ...],  // 42 Apple models
      "samsung": ["galaxys23", "galaxys23plus", "galaxys23ultra",
                  "galaxys24", "galaxys24plus", "galaxys24ultra",
                  "galaxyzflip5", "galaxyzfold5", ...],  // 80+ Samsung models
      "xiaomi": ["xiaomi13", "xiaomi13pro", "xiaomi14", ...],
      // ... 24 brands total, 742 models total
    }
  }]
}

// Clothing category: even more complex
{
  "dependencies": [
    {
      "dependent": "clothing_brand",
      "strategy": "fuzzy_match",  // Can't enumerate 1,503 brands!
      "short_list": ["nike", "zara", "kiabi", "levis", "adidas", "hm",
                     "decathlon", "shein", "ralphlauren", "lacoste", ...]
      // Use fuzzy matching with a "common brands" short list
    },
    {
      "master": "clothing_type",
      "dependent": "clothing_st",
      "mapping": { /* 4 types → 59 sizes */ }
    }
  ]
}

More than just configuration, it was domain knowledge encoded as data structures. And it had to be maintained for 16 categories, some with multiple dependency chains.

2. Two-step inference pipeline

# Simplified cascade prediction flow

async def predict_with_cascade(ad_title: str, images: list[str], category: str):
    # Step 1: Get parent attributes
    parent_attrs = get_parent_attributes(category)  # e.g., ["phone_brand"]

    parent_prompt = build_prompt(
        ad_title, images,
        attributes=parent_attrs,
        valid_values=get_all_values(parent_attrs)  # Smaller list
    )

    parent_predictions = await model.predict(parent_prompt)
    # Result: {"phone_brand": "Apple"}

    # Step 2: Get child attributes with filtered options
    child_attrs = get_child_attributes(category)  # e.g., ["phone_model"]

    # Key insight: filter valid models based on predicted brand
    filtered_values = filter_by_parent(
        child_attrs,
        parent_predictions  # Only show Apple models now
    )

    child_prompt = build_prompt(
        ad_title, images,
        attributes=child_attrs,
        valid_values=filtered_values,  # Only the 42 Apple models, not all 742
        context=parent_predictions  # Tell model the brand is Apple
    )

    child_predictions = await model.predict(child_prompt)
    # Result: {"phone_model": "iPhone 13 Pro"}

    # Merge results
    return {**parent_predictions, **child_predictions}

3. Dynamic schema generation

Finally, we built a schema generator that created JSON schemas with oneOf branches for dependent fields:

def build_schema_with_dependencies(category: str, parent_values: dict):
    """Generate JSON schema that only allows valid combinations."""
    schema = {"type": "object", "properties": {}}

    for attr, parent in get_dependencies(category):
        parent_value = parent_values.get(parent)
        if parent_value:
            # Only include child values valid for this parent
            valid_children = dependency_config[attr][parent_value]
            schema["properties"][attr] = {
                "type": "string",
                "enum": valid_children
            }

    return schema

The architecture

Here’s what the V1 system looked like:

This was a significant engineering effort:
— ~300 lines for the dependency configuration system
— ~400 lines for the cascade predictor
— ~250 lines for the dynamic schema builder
— ~150 lines for value mapping
— Plus testing, edge case handling, and documentation

About 3 weeks of work.

What happened

The good news: V1 worked much better than V0.

+-------------+-----------+----------------+-------------+
| Category    | V0 Prompt | V1 Complex RAG | Improvement |
+-------------+-----------+----------------+-------------+
| Phones      | 29.5%     | 38.5%          | +9.0%       |
| Clothing    | 0.0%      | 10.5%          | +10.5%      |
| Decoration  | 0.0%      | 13.5%          | +13.5%      |
| Toys        | 5.0%      | 29.0%          | +24.0%      |
| Video Games | 67.0%     | 86.4%          | +19.4%      |
+-------------+-----------+----------------+-------------+

We went from 0% to actually working on the hard categories. The cascade approach fixed the dependency problem. No more “Apple brand, Galaxy model” nonsense.

But there were costs:

1. Double the latency

Two inference calls meant double the time:

- Step 1: ~300ms

- Step 2: ~300ms

- Total: ~600ms + overhead

For a real-time API, this was painful.

2. Maintenance nightmare

Every time a new phone model launched, we had to update the dependency config. Samsung releases a new Galaxy? Update the config. Apple announces iPhone 16? Update the config.

And the config had to be perfect. If we forgot to add a model to the Samsung list, the system would never predict it, even if the VLM clearly saw “Galaxy S24” in the title.

3. Edge cases everywhere

What happens if Step 1 predicts the wrong brand?
Now Step 2 is working with the wrong filtered list. The error cascades.

What about brands with only one model? Do they still need two steps?

What about circular dependencies? (Yes, some categories had them.)

Each edge case required special handling, more code, and more bugs.

4. Still not beating production

Here’s the frustrating part: even with all this complexity, V1 only matched our production model on average. We beat it on some categories (video games: 86.4% vs 82%), but lost on others (watches: 17.5% vs 25.5%).

All that engineering, and we were roughly breaking even.

What we learned

Lesson 4: Cascade prediction fixes dependency issues, but at the cost of latency and complexity.

Lesson 5: Configuration-based approaches create a maintenance burden. Every new product needs manual updates.

Lesson 6: 3 weeks of engineering to match (not beat) the existing solution is not a good ROI.

We needed something simpler that could learn dependencies automatically.

V2 — The simple solution: what if we just fine-tuned?

The insight

After weeks of building the cascade system, we had a realization: the model doesn’t need us to tell it that iPhones are made by Apple. It can learn that from examples.

Think about it. In our training data, every time the model sees “iPhone 13 Pro”, the label says brand: Apple, model: iPhone 13 Pro. Every time it sees “Galaxy S24”, the label says brand: Samsung, model: Galaxy S24.

Show the model thousands of these examples, and it will learn the pattern. No configuration file needed. No dependency resolver. No two-step cascade.

The relationship between brands and models isn’t something we need to engineer. It’s something the model can learn.

What we tried

So we decided to build a fine-tuning pipeline using LoRA and Unsloth:

1. Dataset preparation

We sampled real ads from our database:
— 16 categories
— ~200 examples per category
— Each example: title, images, ground-truth attributes

We formatted them as chat-style conversations in French:

{
  "messages": [
    {
      "role": "system",
      "content": "Tu extrais les attributs produits à partir du titre et de l'image. Réponds en JSON valide uniquement."
    },
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "image_base64_data"},
        {"type": "text", "text": "Catégorie: Téléphones\nTitre: \"iPhone 13 Pro 256GB Bleu Pacifique excellent état\"\n\nExtrais: phone_brand, phone_model, phone_memory, phone_color"}
      ]
    },
    {
      "role": "assistant",
      "content": "{\"phone_brand\": \"Apple\", \"phone_model\": \"iPhone 13 Pro\", \"phone_memory\": \"256GB\", \"phone_color\": \"Bleu\"}"
    }
  ]
}
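A minimal sketch of how one labeled ad could be turned into the chat format above. The helper name `build_chat_example` and its arguments are illustrative assumptions, not the actual pipeline code; the prompt strings are copied from the example.

```python
import json

SYSTEM_PROMPT = (
    "Tu extrais les attributs produits à partir du titre et de l'image. "
    "Réponds en JSON valide uniquement."
)


def build_chat_example(category, title, image_b64, attributes):
    """Turn one labeled ad into a chat-style training example (hypothetical helper)."""
    attribute_names = ", ".join(attributes)  # dict keys, in insertion order
    user_text = (
        f'Catégorie: {category}\nTitre: "{title}"\n\n'
        f"Extrais: {attribute_names}"
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_b64},
                    {"type": "text", "text": user_text},
                ],
            },
            # The training target is the ground-truth attributes serialized as JSON.
            {"role": "assistant", "content": json.dumps(attributes, ensure_ascii=False)},
        ]
    }


example = build_chat_example(
    category="Téléphones",
    title="iPhone 13 Pro 256GB Bleu Pacifique excellent état",
    image_b64="image_base64_data",
    attributes={
        "phone_brand": "Apple",
        "phone_model": "iPhone 13 Pro",
        "phone_memory": "256GB",
        "phone_color": "Bleu",
    },
)
print(example["messages"][2]["content"])
```

Serializing the label with `json.dumps` rather than hand-writing the string keeps the assistant turn valid JSON for every example, which matters because the model is asked to answer in JSON only.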

2. LoRA fine-tuning

We used Unsloth for efficient fine-tuning:

from unsloth import FastVisionModel  # import Unsloth first so its patches apply
from transformers import EarlyStoppingCallback
from trl import SFTTrainer, SFTConfig


model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    load_in_4bit=True,  # Fits on single A10G GPU
)

model = FastVisionModel.get_peft_model(
    model,
    r=16,              # LoRA rank
    lora_alpha=32,     # Scaling factor
    target_modules=[   # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=SFTConfig(
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        warmup_steps=50,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        load_best_model_at_end=True,
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

Training time: ~1 hour on a single A10G GPU.
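As a rough sanity check on these hyperparameters, the training budget can be worked out by hand. The sketch below assumes about 3,200 training examples (16 categories × ~200 examples, with no validation hold-out) and a single GPU; both are assumptions for illustration, not figures from the training logs.

```python
import math

# Assumption: 16 categories x ~200 examples, nothing held out for validation.
num_examples = 16 * 200
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_epochs = 3
lora_r, lora_alpha = 16, 32

# Gradients are accumulated over 8 micro-batches of 2 before each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_alpha / lora_r

print(effective_batch_size, steps_per_epoch, total_steps, lora_scale)
```

Under these assumptions, `eval_steps=100` gives roughly six evaluation points over the whole run, and the 50 warmup steps cover a bit under a tenth of training.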

That’s it. No dependency configuration. No cascade logic. No value mappers.

3. Inference with Guided Decoding

For serving, we used vLLM with JSON schema constraints:

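import asyncio

from openai import AsyncOpenAI

# The client requires an API key to be set; a vLLM server started without
# --api-key accepts any placeholder value.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def extract_attributes(system_prompt, user_content_with_images, json_schema_for_category):
    response = await client.chat.completions.create(
        model="fine-tuned-qwen",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content_with_images},
        ],
        extra_body={
            "guided_json": json_schema_for_category  # Enum constraints
        },
    )
    return response.choices[0].message.content

# Usage: raw_json = asyncio.run(extract_attributes(prompt, content, schema))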

The guided decoding ensures valid JSON output with values from our enum lists, giving us a 99%+ valid response rate.
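To make the enum constraints concrete, here is a minimal sketch of what a per-category schema could look like and how a response can be checked against it. The helper names and the two-value label lists are hypothetical; the real lists contain hundreds of values per attribute.

```python
import json


def build_category_schema(allowed_values):
    """Build a JSON schema whose properties only accept the listed values."""
    return {
        "type": "object",
        "properties": {
            name: {"type": "string", "enum": sorted(values)}
            for name, values in allowed_values.items()
        },
        "required": sorted(allowed_values),
        "additionalProperties": False,
    }


def is_valid_response(raw, schema):
    """Minimal check: parseable JSON, exact keys, every value inside its enum."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(schema["properties"]):
        return False
    return all(data[key] in spec["enum"] for key, spec in schema["properties"].items())


# Hypothetical, heavily truncated label lists for the Phones category.
phone_labels = {
    "phone_brand": ["Apple", "Samsung"],
    "phone_model": ["iPhone 13 Pro", "Galaxy S24"],
}
json_schema_for_category = build_category_schema(phone_labels)

ok = is_valid_response(
    '{"phone_brand": "Apple", "phone_model": "iPhone 13 Pro"}', json_schema_for_category
)
bad = is_valid_response(
    '{"phone_brand": "Appel", "phone_model": "iPhone 13 Pro"}', json_schema_for_category
)
print(ok, bad)
```

A check like `is_valid_response` is one way to measure a valid response rate offline; with guided decoding, the server applies the same constraints during generation.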

The architecture

Components: 1 (the model)
Inference calls: 1
Config files: 0
Maintenance: Retrain when needed

What happened

The results surprised us:

+-------------+-------------+----------------+-------+
| Category    | Complex RAG | Fine-tuned V2  | Delta |
+-------------+-------------+----------------+-------+
| Phones      | 38.5%       | 49.5%          | +11.0 |
| Appliances  | 62.0%       | 72.0%          | +10.0 |
| Clothing    | 10.5%       | 16.5%          | +6.0  |
| Toys        | 29.0%       | 40.0%          | +11.0 |
| Bikes       | 14.0%       | 26.5%          | +12.5 |
| Tablets     | 50.5%       | 57.5%          | +7.0  |
| Consoles    | 86.4%       | 87.5%          | +1.1  |
+-------------+-------------+----------------+-------+

Fine-tuning won in 10 out of 16 categories, and the wins were significant: often an improvement of 10 percentage points or more.

Results Table

+--------------------------+-------+-------+-------+--------+----------------+
| Category                 | Prod  | Haiku | V0    | V1 RAG | V2 FT (Δ vs V1)|
+--------------------------+-------+-------+-------+--------+----------------+
| Phones                   | 37.0% | 27.0% | 29.5% | 38.5%  | 49.5% (+11.0%) |
| Home Appliances          | 67.0% | 29.0% | 50.0% | 62.0%  | 72.0% (+10.0%) |
| Clothing                 | 6.5%  | 6.0%  | 0.0%  | 10.5%  | 16.5% (+6.0%)  |
| Baby Equipment           | 29.5% | 16.0% | 39.5% | 42.5%  | 44.5% (+2.0%)  |
| Decoration               | 18.0% | 14.0% | 0.0%  | 13.5%  | 17.0% (+3.5%)  |
| Toys                     | 46.5% | 18.0% | 5.0%  | 29.0%  | 40.0% (+11.0%) |
| Watches                  | 25.5% | 7.0%  | 19.0% | 17.5%  | 17.5% (0.0%)   |
| Video Games              | 82.0% | 83.0% | 67.0% | 86.4%  | 80.0% (-6.4%)  |
| Bikes                    | 32.5% | 13.0% | 13.5% | 14.0%  | 26.5% (+12.5%) |
| Childrens Room Furniture | 27.3% | 19.0% | 5.6%  | 22.4%  | 26.6% (+4.2%)  |
| Tablets                  | 40.5% | 51.0% | 12.0% | 50.5%  | 57.5% (+7.0%)  |
| Consoles                 | 80.5% | 85.4% | 79.0% | 86.4%  | 87.5% (+1.1%)  |
+--------------------------+-------+-------+-------+--------+----------------+
| AVERAGE                  | 34.9% | 24.3% | 22.4% | 34.6%  | 37.7% (+3.1%)  |
+--------------------------+-------+-------+-------+--------+----------------+

Legend:
V0 = Naive prompting (all labels in prompt), 1 week engineering
V1 = Complex prompting (cascade 2-step), 3 weeks engineering
V2 = Fine-tuned (1 hour training), 1 week engineering
(+X.X%) = Delta vs V1 Complex RAG, in percentage points
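For readers who want to check the deltas, the short script below recomputes them from the V1 RAG and V2 FT columns of the table. It only covers the 12 categories listed there, and the deltas are differences in percentage points.

```python
# (V1 RAG, V2 fine-tuned) accuracy in %, copied from the results table.
results = {
    "Phones": (38.5, 49.5),
    "Home Appliances": (62.0, 72.0),
    "Clothing": (10.5, 16.5),
    "Baby Equipment": (42.5, 44.5),
    "Decoration": (13.5, 17.0),
    "Toys": (29.0, 40.0),
    "Watches": (17.5, 17.5),
    "Video Games": (86.4, 80.0),
    "Bikes": (14.0, 26.5),
    "Childrens Room Furniture": (22.4, 26.6),
    "Tablets": (50.5, 57.5),
    "Consoles": (86.4, 87.5),
}

# Round to one decimal to avoid floating-point noise such as 4.200000000000003.
deltas = {cat: round(v2 - v1, 1) for cat, (v1, v2) in results.items()}
wins = [cat for cat, d in deltas.items() if d > 0]
ties = [cat for cat, d in deltas.items() if d == 0]
losses = [cat for cat, d in deltas.items() if d < 0]

print(f"V2 wins: {len(wins)}, ties: {len(ties)}, losses: {len(losses)}")
for cat, d in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{cat:<26}{d:+.1f}")
```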

Why fine-tuning won

1. Dependencies are learned, not configured

The model saw thousands of examples where “iPhone” appeared with brand: Apple. It learned the pattern. We didn’t need to tell it.

When V2 sees a new ad with “iPhone 15 Pro Max”, it predicts the brand: Apple automatically, even if we never explicitly listed that model in any configuration.

2. Single inference call

V1 needed two calls (parent → child). V2 needs one, which makes it ~40% faster.

3. No configuration drift

When Samsung releases a new phone, V1 needs a config update. V2 just needs to see examples of that phone in the next training batch, which happens naturally as users list new products.

4. The model learns visual patterns

This is the subtle one. V1’s cascade system was still fundamentally text-based with images as context. V2 was trained end-to-end on title+image → attributes.

The fine-tuned model learned to actually look at the images. For color attributes (our original motivation!), this made a huge difference.

What we learned

Lesson 7: Fine-tuning can learn dependencies implicitly. No need to engineer them.

Lesson 8: 1 hour of training can replace 3 weeks of architecture.

Lesson 9: Simple solutions often outperform clever engineering.

The engineering effort comparison

Let’s be honest about what each approach cost:

+------------------+----------+---------------+--------+-----------+
| Approach         | Eng Time | Code          | Maint  | Accuracy  |
+------------------+----------+---------------+--------+-----------+
| V0 Naive Prompt  | ~1 wk    | ~200 lines    | Low    | 22.4%     |
| V1 Complex RAG   | ~3 wks   | ~1100 + 500c  | High   | 34.6%     |
| V2 Fine-tuned    | ~1 wk    | ~1300 lines   | Low    | 37.7%     |
+------------------+----------+---------------+--------+-----------+

V2 required a similar amount of code to V1 (dataset prep, training loop, inference), but:
— No dependency configuration files
— No cascade orchestration logic
— No value mapping layers
— No edge case handling for two-step inference

And critically: no ongoing configuration maintenance.

When each approach makes sense

Fine-tuning isn’t always the answer. Here’s our framework:

Use naive prompt (V0) when:

You’re prototyping and need something quick
The label space is small (<50 options per attribute)
You have no training data

Use complex RAG (V1) when:

You need explainability (“model chose X because of retrieved example Y”)
Training data is impossible to get
Labels change daily and retraining isn’t feasible
You can’t afford GPU training

Use fine-tuning (V2) when:

You have labeled training data (we had 6 months of labeled ads)
Accuracy is the priority
You want simpler deployment and maintenance
You can retrain periodically (monthly, quarterly)
Dependencies exist between attributes

For our use case, structured attribute extraction with stable (but large) label sets and plenty of training data, fine-tuning was the clear winner.

Why the production model still wins on some categories

Looking at our results, something interesting stands out: the production n-gram model still beats our fine-tuned VLM on certain categories:

+-------------+--------+--------+--------+--------+
| Category    | Labels | Prod   | V2 FT  | Winner |
+-------------+--------+--------+--------+--------+
| Video Games | 88     | 82.0   | 80.0   | Prod   |
| Watches     | 184    | 25.5   | 17.5   | Prod   |
| Toys        | 36     | 46.5   | 40.0   | Prod   |
| Furniture   | 80     | 23.0   | 19.5   | Prod   |
+-------------+--------+--------+--------+--------+

Why does a simple character n-gram classifier outperform an 8B parameter VLM on these categories?

The n-gram model’s secret weapon: pattern matching on structured text

The production model leverages CountVectorizer with character n-grams (2–5) for pattern matching on structured text, proving highly effective in categories with predictable title formats.

For example, in the Video Games Category (88 labels), structured titles like “FIFA 24 PS5 neuf sous blister” allow the model to capture n-grams like “PS5”, “FIFA”, “24”, and “neuf”, accurately identifying the console brand, model, and game type. The model’s millions of learned patterns instantly link terms like “PS5” to Sony, efficiently handling the limited set of brands and models.

Similarly, in the Watches Category (184 labels), brand-heavy titles such as “Montre Rolex Submariner homme automatique” yield brand-indicative n-grams (“Rolex”, “Subm”), enabling immediate recognition of the brand and type among the 138 commonly mentioned watch brands, often without needing an image.
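To illustrate why character n-grams work so well on structured titles, here is a dependency-free sketch of the feature extraction, roughly what `CountVectorizer(analyzer="char", ngram_range=(2, 5))` computes. The nearest-neighbour lookup and the two training titles are toy assumptions standing in for the real trained classifier.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count every character n-gram of length n_min..n_max in a lowercased title."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def overlap(a, b):
    """Number of n-gram occurrences shared by two count vectors."""
    return sum((a & b).values())


# Hypothetical labeled titles standing in for the training set.
train = [
    ("FIFA 24 PS5 neuf sous blister", "Sony"),
    ("Zelda Breath of the Wild Switch", "Nintendo"),
]
query = char_ngrams("Jeu PS5 Spider-Man 2 comme neuf")

# Nearest neighbour by n-gram overlap: a toy stand-in for a trained classifier.
best_title, best_label = max(train, key=lambda ex: overlap(query, char_ngrams(ex[0])))
print(best_label)
print("ps5" in query, "neuf" in query)
```

Because tokens such as "ps5" survive as n-grams regardless of where they appear in the title, a new listing shares many features with past listings of the same product family without any image being involved.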

Where VLM Wins: Visual Information and Massive Label Spaces

The fine-tuned VLM excels where the n-gram model struggles:

Phones: V2 wins 49.5% vs 37.0% (790 labels). Visual differentiation handles colors, model variants (like Pro/Pro Max camera layouts), and overcomes confusion from 742 similar character patterns in titles.
Home Appliances: V2 wins 72.0% vs 67.0% (129 labels). Brand logos (Dyson, Samsung, LG) and product type are often clearer visually than in descriptions.
Clothing: V2 wins 16.5% vs 6.5% (1,600+ labels). N-grams fail here. Color and style (casual, formal) are nearly always visual, and logos identify many of the 1,500+ brands not in titles.
Bikes: V2 actually loses here, 26.5% vs 32.5% (20 labels), although visual classification can still determine bike type (VTT, route, ville) via frame geometry, and sometimes size from the frame.

The Insight: Text Density + Label Space Size

We can roughly predict which model wins based on two factors:

Choosing the right model based on Text density and Label space

Examples:
“FIFA 24 PS5 neuf” → N-gram: All info in text, small label space
“Montre Rolex Submariner” → N-gram: Brand in title, moderate labels
“iPhone bon état” + [image] → VLM: Color/model in image, 742 models to choose from
“Robe été” + [image] → VLM: Brand/color in image, 1,500+ brands!

The Hybrid Opportunity

This analysis suggests a potential hybrid approach:
— Use the fast, cheap n-gram model for high-text-density categories (video games, watches, consoles)
— Use the VLM for low-text-density categories where visual information matters (clothing, phones, appliances)

A category-aware router could get the best of both worlds: n-gram speed where text is sufficient and VLM accuracy where images matter.
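A minimal sketch of such a router, assuming each model is exposed as a callable taking a title and an image. The category split is taken from the two bullets above; the function names and the fallback to the VLM for unlisted categories are assumptions, not a description of a deployed system.

```python
from typing import Callable, Dict

Predictor = Callable[[str, bytes], Dict[str, str]]

# High-text-density categories from the list above; everything else goes to the VLM.
NGRAM_CATEGORIES = {"video games", "watches", "consoles"}


def make_router(ngram_model: Predictor, vlm_model: Predictor):
    """Return a function that dispatches each ad to one model based on its category."""

    def route(category: str, title: str, image: bytes) -> Dict[str, str]:
        if category.strip().lower() in NGRAM_CATEGORIES:
            return ngram_model(title, image)  # text carries enough signal
        return vlm_model(title, image)  # image matters, or category unknown

    return route


# Hypothetical stand-ins for the two real models.
def fake_ngram(title: str, image: bytes) -> Dict[str, str]:
    return {"served_by": "ngram"}


def fake_vlm(title: str, image: bytes) -> Dict[str, str]:
    return {"served_by": "vlm"}


route = make_router(fake_ngram, fake_vlm)
print(route("Watches", "Montre Rolex Submariner", b""))
print(route("Clothing", "Robe été", b""))
```

Keeping the split in a plain set makes it easy to move a category from one model to the other as new evaluation results come in.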

What’s Next

This is not yet the final production model. We’ve been challenging the fine-tuned VLM against an in-house multimodal transformer architecture that fuses text n-grams with visual embeddings, and it’s actually doing a surprisingly good job. The battle between “simple fine-tuned VLM” and “custom lightweight fusion model” is still ongoing.

Stay tuned for the next blog post, where we’ll dive into that comparison and reveal which approach made it to production.

Key Takeaways

Start simple, but be ready to try fine-tuning. Our V0 was the right first experiment. But we should have tried fine-tuning before building V1’s complexity.
Dependencies can be learned, not engineered. 1 hour of training taught the model what 500 lines of config tried to encode.
Complex architectures have hidden costs. V1 worked, but the maintenance burden was unsustainable.
Training data > clever prompts. The model learned more from 10K examples than from our carefully crafted prompts.
Simple solutions often win. V2’s architecture fits in one box. V1’s had seven components. V2 won.

The best code is often the code you don’t write. Sometimes the smartest engineering decision is to let the model learn instead of trying to teach it through architecture!

How 1 hour of fine-tuning beat 3 weeks of RAG engineering was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Fraud as a misuse case : when the user journey is used against the product

leboncoin tech — Wed, 28 Jan 2026 10:48:35 GMT

Banner featuring Aymeric Milcent and Maxence Audibert

This article was contributed by four team members: Maxence Audibert, Aymeric Milcent, Steeven Cabouret and Mickaël Pois-Castiglioni. Each author shares a situated perspective, making the collective work visible.

Fraud is not a product flaw

Fraud in online marketplaces is most often described using technical language: attack vectors, compromised accounts, anomalous behaviour, payment abuse. This vocabulary is necessary. It reflects the work of security, data and payment teams who deal with fraud every day.

But from a product perspective, it only tells part of the story. Fraud is not just a technical incident that happens at the edges of a system. It is a misuse case: a hostile but structurally valid use of the same product journey. The same screens. The same flows. The same moments of decision.

For users, fraud does not start when money is lost or when an account is blocked. It starts earlier, at a much quieter moment. A hesitation. A sense of urgency. A message that feels slightly off, without being obviously wrong.

This article explores fraud from that angle: not as a security problem to be eliminated, but as a product challenge to be designed for. It focuses on the legitimate user journey, and on how product decisions shape trust by making risk intelligible without breaking usability.

From a product perspective, the user journey can be seen as a progressive funnel. At each stage, different teams and mechanisms reduce uncertainty for the next one: identity checks at onboarding, credibility signals in listings, behavioural cues in messaging, and stronger controls at payment and post-transaction. The goal is not to eliminate risk entirely, but to make sure that as users move forward, fewer ambiguous situations remain.

A second part will mirror this analysis from the opposite perspective, exploring fraud as a hostile user journey and what it implies for product discovery and ownership.

Part 1 — Fraud as a misuse case on the legitimate user journey

Ambiguity is inevitable at scale

When users arrive on a peer-to-peer marketplace, they interact with strangers by design. There is no shared history, no physical context, no social cues. Everything that would normally help establish trust offline must be reconstructed using partial and often ambiguous signals.

This ambiguity is not a flaw of the platform. It is the direct consequence of openness and scale. Any system that allows millions of people to interact freely will inevitably attract a wide spectrum of behaviours, from entirely legitimate to openly malicious.

“When you reach a certain scale, fraud is normal. It’s not a signal of failure. It’s a signal of usage.” — explains Aymeric Milcent

In short, it is the statistical shadow cast by success.

But users do not experience fraud statistically; they experience it emotionally. What they feel is a background need to stay alert, a low-level tension that accompanies the journey from the first interaction.

For product teams, the challenge is not to remove uncertainty, but to prevent it from becoming overwhelming.

Account creation: where journeys start to diverge

Account creation is often treated as a formality. From a product perspective, it is the first behavioural fork between legitimate and hostile journeys. A legitimate user expects onboarding to feel simple, coherent and legitimate. Too little friction suggests carelessness. Too much friction creates suspicion and fatigue. Behind the scenes, however, this step already contains strong signals. Timing patterns, device consistency, repetition, automation and identity reuse often reveal hostile intent before any explicit rule is broken.

The paradox is well known. The more effective detection becomes, the less visible it should be. A strong anti-fraud system does not announce itself. It quietly shapes the journey so that legitimate users move forward without noticing anything unusual.

Listings: credibility is shaped before it is perceived

Creating a listing is the first moment where credibility takes shape. What looks like a simple form is, in reality, a powerful design surface. Structured templates, mandatory fields, category selection and price guidance do more than help users describe an item. They define what “normal” looks like and reduce the number of ways a listing can appear almost legitimate while remaining misleading.

From a product perspective, this is not about judging intent. It is about shaping the space in which intent can express itself. Only once this work is done does browsing begin.

When users browse listings, they do not audit them. They infer credibility from fragments: photos, tone, price and overall coherence. The platform’s role is to make credibility readable, not to ask users to investigate.

Good UX does not only protect users at browsing time. It sharpens the definition of normal behaviour upstream. Fraud depends on blending in. Product decides what the crowd looks like.

Messaging: where fraud becomes emotional

Messaging is the most familiar part of the marketplace journey. It looks like any other chat interface and borrows the codes of everyday conversation. That familiarity is precisely what makes it such a sensitive surface. This is where fraud stops being technical and becomes emotional.

In most cases, nothing in the message itself looks suspicious. Grammar is correct. Tone is polite, sometimes even reassuring. What changes is not the wording, but the context: a push to move faster, a suggestion to step outside the platform, a request framed as convenience.

Product design intervenes quietly. Messages that appear in the flow do not accuse or alarm. They guide users, reminding them where protection applies and why staying within the product matters.

This balance is fragile. Too many warnings exhaust trust. Too few leave users alone with doubt. Product teams continuously iterate on copy, timing and placement to stay on the right side of that line.

Protection here is not a hard stop. It is product literacy. These interventions are symmetrical. They appear whether the other party is malicious or simply unaware of best practices. The platform does not claim fraud is happening. It states where safety exists.

When users are left in uncertainty, fraud has more room to operate. But when platforms sound overly alarmist or accusatory, trust quickly erodes. Product design must find a middle ground: guiding users without making them feel blamed or unsafe.

Payment: where trade-offs become unavoidable

Payment is the emotional peak of the journey. From the user’s perspective, payment must feel predictable and intelligible. From the product side, it is a dense risk surface combining behavioural history, device signals and identity patterns.

A static approach does not work in this context. Fraud tactics evolve too quickly. What is needed instead is dynamic friction, adjusted in real time based on risk.

Despite the high volume of traffic and fraud attempts, which is a natural consequence of being one of the most visited platforms in France, only a very small fraction of payment flows actually result in confirmed fraud.

What matters for users, however, is not the real fraud rate but the perception of risk. User trust is shaped less by internal metrics than by what circulates outside the product: online discussions, social media, personal stories, and word of mouth. These narratives can amplify fear, even when they do not reflect the broader reality of the platform.

Screen capture of the app promoting secured transaction process

From a product perspective, this means trust must be designed, not just measured. Clear flows, predictable steps, and visible protection mechanisms help payments feel reliable, even in an environment where users know that risk exists.

There is also, from a banking perspective, a structural benefit to low fraud levels: they lead to smoother authorisation flows, fewer payment rejections, and better acceptance rates from banking partners. Reducing fraud then does not only protect users: it improves conversion, reliability, and the overall quality of the payment experience.

After the transaction: learning does not stop at conversion

The user journey does not end once the transaction is completed.
Some forms of misuse only become visible after delivery, support interactions, or resolution processes.

For users, this stage is about reassurance and clarity. They need to understand what happens next, what protections still apply, and how issues will be handled. Predictability at this moment plays a key role in maintaining trust.

For product teams, post-transaction signals are a source of learning.
They help refine upstream decisions, improve detection, and adjust where friction is added or removed. Fraud prevention is not a one-time effort. It is an ongoing process of observation, iteration, and improvement.

From uncertainty to clarity

A marketplace does not earn trust by promising that nothing will ever go wrong. At scale, such a promise would be neither credible nor useful. Trust is earned by ensuring users are never left alone with uncertainty. By making risk understandable. By being present at the moments that matter, especially when things do not go as planned.

Users do not finish the journey believing fraud is impossible. They finish knowing that the product anticipates risk, responds coherently, and remains accountable throughout the experience. That is what trust looks like at scale.

Part 2 / Fraud as a misuse case : when the user journey is used against the product

In the first part of this article, we explored fraud from the perspective of the legitimate user. We followed the journey as it is meant to be experienced and observed how product decisions shape trust by making risk intelligible rather than invisible.

That perspective, however, only tells half of the story.

To fully understand fraud as a misuse case, product teams must also examine the same journey from the fraudster’s side. Not from the point of view of hesitation or doubt, but from the point of view of intent. A hostile intent, that is, but one that still relies on the same screens, the same flows and the same rules. The question is then no longer how a feature builds trust, but how that same feature could be used to abuse the system.

As we explained, fraud is not random. It is deliberate, structured and adaptive. This is precisely why it belongs in product engineering, from discovery to monitoring.

Understanding the fraudster’s logic

A fraudster does not interact with a marketplace the way a legitimate user does. Where a buyer looks for an item at a good price and a seller looks for visibility, the fraudster looks for weaknesses. Their objective is not simply to exploit the system once. Very often, it is to find a repeatable pattern, industrialise it, and scale it in a way that remains difficult to detect.

“Fraud is a business in its own right. It is structured: there are teams, processes, and managers. We have seen entire communities where hundreds of people work full-time on scams. It is a parallel economy, without scruples.” — Aymeric Milcent

This industrial logic has several consequences.

First, fraud is automated whenever possible. Scripts replace manual actions, bots simulate human behaviour, and generative AI produces listings, images and messages at scale. Second, fraud is adaptive. What works today will be tested until it no longer does. When a protection becomes effective, tactics shift elsewhere in the journey. Third, fraud relies on imitation. The fraudster does not try to look suspicious. Quite the opposite. They aim to look like a “good” user: realistic prices, complete profiles, friendly tone, plausible timing.

Money is often the end goal, but it is not always obtained in the same way. On a marketplace like leboncoin, several distinct fraud logics coexist, each with different incentives. Some forms of fraud aim to steal directly: compromised user accounts, fraudulent use of stolen cards, or purchases made with stolen or synthetic identities. Others aim to avoid paying. This includes unpaid deferred payments, or professionals disguising themselves as private individuals to bypass fees or rules (a pattern that is particularly relevant in hybrid C2C/B2C marketplaces).

The most visible category is about being paid through deception: fake disputes, false claims, scams involving fake sellers or fake buyers, or sending counterfeit or incorrect items.

Finally, there is fraud aimed at laundering money. This often involves collusion between buyer and seller, or micro-transactions designed to pass under detection thresholds, using a perfectly legal platform as a conduit.

From a product perspective, these are not edge cases. They are misuse cases product teams must actively reason about, each with different signals, trade-offs and impacts.

What the fraudster attempts at each step of the journey

Because fraud is intentional, it mirrors the legitimate user journey almost perfectly. Each step of the product becomes an opportunity to probe the system, test its limits, and identify where scale is possible.

Account creation: identity as an attack surface

At sign-up, fraudsters rarely experiment manually for long. They look for volume and repeatability. Scripts automate account creation, synthetic identities mix real and fabricated data, and compromised accounts are reused whenever possible. Strong identity checks play an important role, but they are not sufficient on their own.

As Steeven Cabouret points out, “strong KYC and 3DS are necessary, but they don’t solve everything. Fraudsters use real identities or impersonate legitimate actors. Identity alone doesn’t tell the full story.”

From a product standpoint, this reinforces a key principle: identity cannot be treated as a one-off verification. It must be interpreted through behaviour over time. Account creation is not a single gate, but the beginning of a behavioural observation window that continues throughout the journey.

Listing creation: scaling credibility

When posting listings, fraudsters focus on scale. Cloned ads, reused images and automated descriptions generated by AI allow hundreds of listings to be deployed quickly, often with minor variations designed to avoid simple duplication checks. In the past, volume and noise were often enough. Today, the challenge for fraudsters is different: they must appear consistent across the entire journey.

Their listings, prices, images, messaging style and account behaviour all need to tell the same story.

This is where product constraints matter. Templates, mandatory fields, category rules and brand handcheck programs narrow the space in which imitation can succeed. They force credibility to be constructed within tighter boundaries. Each blocked or removed listing forces adaptation.

“Every removed listing is an opportunity to learn. The system improves because misuse leaves traces.” — Mickaël Pois-Castiglioni

From a product perspective, this turns fraud into a form of involuntary product testing. Misuse reveals where constraints are effective and where they still leave room for ambiguity.

Messaging: emotional manipulation at scale

Messaging is where emotional manipulation becomes central. Fraudsters rely on persuasion techniques designed to influence users’ behaviour rather than exploit technical flaws. Fraudsters test tone, timing and emotional triggers. They impersonate support agents, manufacture urgency, offer reassurance at strategic moments, and continuously adapt their language. With generative AI, linguistic mistakes are no longer a reliable signal.

As Maxence Audibert says, “fraud today is less about spelling mistakes and more about urgency, stress, or a tone that feels reassuring”.

Privacy constraints limit how much message content can be inspected. As a result, detection shifts toward behavioural signals: pacing, repetition, coordination across accounts, and attempts to move users outside the platform’s protected space. From a product perspective, the challenge is not to accuse, but to guide. Interventions must exist without breaking trust, making users aware of protection boundaries without creating fear or suspicion.

Payment: testing the limits of friction

At payment time, fraudsters explore the system’s tolerance. Carding, 3DS circumvention, collusion between buyer and seller, or repeated low-value transactions designed to stay under thresholds are all common tactics. Static rules quickly fail in this context.

“The challenge is adding friction for fraudsters while limiting it for legitimate users.” — Mickaël Pois-Castiglioni

This is not a security tuning exercise. It is a prioritisation problem, continuously revisited as behaviours evolve and as trade-offs between protection and usability shift.

Post-transaction: abusing resolution mechanisms

After the transaction, fraudsters often turn their attention to resolution systems. False claims, fake non-receipt or attempts to manipulate customer support become more visible at this stage. In some cases, fraudsters even contact support directly, attempting to reverse decisions or exploit ambiguity in the process.

From a product perspective, this phase is critical. It reveals intent that earlier stages could not confirm and feeds learning loops back into the system, informing future decisions upstream.

The product response: a living system

What emerges is a living system, composed of multiple layers working together: behavioural scoring, human moderation, adaptive rules, weak signals, and continuous learning. AI plays a central role but is never alone. Models learn from past misuse, but human expertise remains essential to handle edge cases and confirm legitimate behaviour when automated systems raise doubts. As Mickaël Pois-Castiglioni notes, “it’s AI versus AI. Fraudsters use it too. But the human eye is still irreplaceable.”

In practice, this often concerns situations where legitimate users trigger fraud-like signals. For example, a user who changes their email address several times in a short period may be temporarily restricted because this pattern resembles account takeover behaviour. In such cases, contacting customer support allows the situation to be clarified and the account restored.

Ethics also matter. Detection must respect privacy, avoid discrimination, and protect legitimate users from exclusion. Designing protection without excluding atypical but legitimate behaviours is a product responsibility, not a legal afterthought.

This reinforces a core principle of misuse case product discovery: fraud must be observed before it can be eliminated. Overreacting too early can damage trust just as much as underreacting.

From misuse to product improvement

Looking at fraud through the fraudster’s journey changes the framing. Fraud is no longer a bug to fix (as we concluded in the first part of this article), but it’s also not an external threat to contain. It is feedback. A hostile signal that exposes where the product can be misunderstood, misused or exploited. Reducing fraud, in this sense, is inseparable from improving the product itself. Better flows, clearer rules, more readable credibility signals and more adaptive friction all benefit legitimate users as much as they constrain misuse.

“Reducing fraud is not a sprint, it’s a marathon. It’s a question of learning speed, we must stay one step ahead.” — Maxence Audibert

Product teams do not fight fraud: they design against it. And by treating fraud as a misuse case that product teams fully own, they end up building better products for everyone else.

Fraud as a misuse case : when the user journey is used against the product was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deep dive in Open Telemetry Metrics with two stories at leboncoin tech 1/2

leboncoin tech — Tue, 23 Dec 2025 08:56:15 GMT

Deep dive with OpenTelemetry: cumulative vs delta / gauge vs buckets

Story 1: Avoid a massive metric loss (delta vs cumulative metrics)

Paul Drappier and Alexis Destrez are backend guild leaders at leboncoin, a marketplace that simplifies access to second-hand goods.

If you’re a backend, SRE, or observability engineer migrating a production system to OpenTelemetry metrics, there’s a very high chance you’ll run into the Cumulative vs Delta question.

We did — and we got it wrong at first.

This article tells the story of how a subtle aggregation choice led to silent metric loss at scale, how we identified it, and what we changed to fix it.

No end users were impacted by this issue. The problem only affected internal observability signals, but it could have led us to make wrong operational or product decisions if left unresolved.

We’re sharing this story because this is a common trap in OpenTelemetry migrations — and because making these mistakes visible is how we collectively improve the state of the art.

* * *

At leboncoin, our backend guild recently undertook a major initiative: migrating to OpenTelemetry to modernize our observability stack.

In our previous article, we detailed our plan, which can be briefly summarized as follows:

introducing a middleware generating OpenTelemetry metrics behind a feature flag,
keeping the legacy StatsD middleware for a dual-run phase,
changing the flag behavior to turn it into a switch, allowing either OpenTelemetry or legacy metrics,
removing the legacy middleware once the rollout was complete.

Everything appeared to be running smoothly… until we started enabling OpenTelemetry at scale across dozens of services handled by nearly 120 engineers. That’s when we discovered a serious issue: up to 20% of metrics were disappearing during short time windows. A level of loss that could jeopardize dashboards, alerting, and overall system reliability and performance. This article walks you through how we identified, reproduced, debugged, and ultimately resolved this issue, and the lessons learned along the way.

The Initial Setup: A Controlled Dual-Run

To ensure a smooth migration, we first introduced a feature flag allowing teams who opted in for early testing to enable OpenTelemetry metric emission on their services. During this initial period, services with the feature flag enabled operated in dual mode, sending metrics to both StatsD and OpenTelemetry.

Our early tests showed promising results: OpenTelemetry metrics, APM-based (tracing) metrics, and StatsD metrics displayed similar shapes, trends, and magnitudes on all services running in dual mode.

OTEL hit vs Traces and Statsd Hit

Since keeping both telemetry systems (StatsD and OpenTelemetry) for all our services would have significantly increased observability costs, we decided to scale the migration across all backend services without dual-run, relying on the alignment we had validated.

The First Signs of Trouble: Missing Metrics in Production

Within hours of officially activating OpenTelemetry for high-traffic services, several teams reported an alarming issue: some HTTP 400 requests visible in tracing did not appear in OpenTelemetry metrics. After a bit of digging, we realized that the problem was broader: all metrics sent by the app were affected.

A deeper analysis revealed:

Traffic trends were consistent between StatsD and OpenTelemetry.
But when zooming in on short, fixed time windows (e.g., one hour), discrepancies of 5–6% appeared.
Worse, metric loss spiked up to 20% over 5–10 minute periods.
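
The kind of comparison described above can be sketched in a few lines of Python. The per-minute counts and the helper name `loss_by_window` are illustrative assumptions, not our production data or tooling; the point is only that a loss that looks modest over an hour can be much larger over a short window.

```python
from typing import List

def loss_by_window(reference: List[int], candidate: List[int], window: int) -> List[float]:
    """Relative loss of `candidate` versus `reference`, per window of `window` samples."""
    losses = []
    for start in range(0, len(reference), window):
        ref = sum(reference[start:start + window])
        cand = sum(candidate[start:start + window])
        losses.append(0.0 if ref == 0 else (ref - cand) / ref)
    return losses

# Hypothetical per-minute request counts over one hour.
statsd = [100] * 60
otel = [100] * 60
# Simulate 15 minutes during which 20% of the points go missing.
for minute in range(20, 35):
    otel[minute] = 80

hourly = loss_by_window(statsd, otel, window=60)
ten_min = loss_by_window(statsd, otel, window=10)
print(f"hourly loss: {hourly[0]:.1%}")
print(f"worst 10-minute loss: {max(ten_min):.1%}")
```

Averaging over the full hour dilutes the burst, which is why trend-level dashboards can look healthy while short windows do not.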

Recognizing the severity of the issue, we paused the migration and asked teams to revert to StatsD while we investigated the problem.

First Investigations: Reproducing the Bug

We started a war room and attempted to reproduce the problem in a controlled local environment. Running a single pod sending metrics to a local Datadog agent showed no data loss.

However, when we introduced a second pod of the same application, we observed:

The new pod’s metrics did not appear in Datadog.
Metrics from the first pod also stopped appearing.

At this stage, we suspected a local configuration issue rather than a replication of the production bug, leading to some early frustration.

Breakthrough: Understanding Metric Aggregation Temporality

A key insight came when one of our engineers introduced us to the concept of Aggregation Temporality in OpenTelemetry, which defines how metrics accumulate and are exported over time. There are two modes:

Cumulative: values accumulate from the start of the process.
Delta: each export reports only the change since the previous transmission.
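
To make the two modes concrete, here is a minimal Python sketch of what each one exports for the same counter. The function names are ours and purely illustrative; they are not part of any OpenTelemetry SDK.

```python
from typing import List

def to_cumulative(increments: List[int]) -> List[int]:
    """Cumulative temporality: each export carries the total since process start."""
    total, exported = 0, []
    for inc in increments:
        total += inc
        exported.append(total)
    return exported

def to_delta(increments: List[int]) -> List[int]:
    """Delta temporality: each export carries only the change since the last export."""
    return list(increments)

def cumulative_to_delta(points: List[int]) -> List[int]:
    """What a receiver must do to recover per-interval values from ONE cumulative stream."""
    deltas, previous = [], 0
    for value in points:
        deltas.append(value - previous)
        previous = value
    return deltas

# Requests handled during four successive export intervals.
requests_per_interval = [5, 3, 0, 7]
cumulative = to_cumulative(requests_per_interval)  # [5, 8, 8, 15]
delta = to_delta(requests_per_interval)            # [5, 3, 0, 7]
print(cumulative, delta)
```

The conversion in `cumulative_to_delta` only works if the receiver sees a single, uninterrupted stream, which is the assumption that matters in the rest of this story.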

Upon further investigation, we discovered that:

The OpenTelemetry library we were using emitted values in Cumulative mode.
Our historical StatsD + Datadog Agent setup was fully in Delta mode.
The Datadog Agent requires a clear source identifier (e.g., a pod_name tag) to distinguish different data streams and correctly ingest cumulative metrics.

Without this identification, when multiple pods sent cumulative metrics to the same agent, data continuity was lost, causing metric drops.
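
The following Python sketch is a toy model of that failure, not a description of the Datadog Agent's actual algorithm. It assumes a receiver that rebuilds per-interval values by diffing successive cumulative points sharing the same key, and that treats a lower value as a counter reset contributing nothing. The metric name, the pod names, and the values are made up.

```python
from typing import Dict, List, Tuple

def ingest(points: List[Tuple[str, int]]) -> int:
    """Sum the per-interval deltas derived from cumulative points.

    `points` is a list of (stream_key, cumulative_value) in arrival order.
    Toy rule (an assumption): a value lower than the last one seen for the
    same key is treated as a counter reset and contributes nothing.
    """
    last_seen: Dict[str, int] = {}
    total = 0
    for key, value in points:
        previous = last_seen.get(key, 0)
        if value >= previous:
            total += value - previous
        last_seen[key] = value
    return total

# Cumulative request counts exported by two pods over three intervals.
pod_a = [10, 20, 30]
pod_b = [8, 19, 31]

# Without a source identifier, both pods land on the same stream key.
merged = ingest([p for a, b in zip(pod_a, pod_b)
                 for p in (("http.requests", a), ("http.requests", b))])

# With a pod_name-like tag, each pod keeps its own stream.
tagged = ingest([p for a, b in zip(pod_a, pod_b)
                 for p in (("http.requests|pod-a", a), ("http.requests|pod-b", b))])

print(f"true total: {pod_a[-1] + pod_b[-1]}, merged: {merged}, tagged: {tagged}")
```

In this toy model the size and even the direction of the error depend on how the two streams interleave; the point is only that the receiver can no longer reconstruct correct values once the streams are indistinguishable.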

Fixing the Problem: Aligning Temporality and Control

To resolve the issue, we made two key changes:

Switching OpenTelemetry metric temporality to Delta, to align with our historical StatsD behavior.
Experimenting with an OpenTelemetry Collector to give us better control over metric aggregation and forwarding.

After applying these changes, the metrics between StatsD and OpenTelemetry lined up, and the loss disappeared.

Key Takeaways

Understand aggregation temporality: mismatches between Cumulative and Delta modes can cause silent metric loss.
Leverage tracing to cross-validate metrics: discrepancies become much easier to detect.
Always test multi-pod scenarios early: Some issues only appear at scale.
Compare your sources of truth carefully: comparing two “bar/bar” sources is inherently difficult; comparing a cumulative reference against another signal is much easier and can quickly reveal inconsistencies.
Deepen your understanding of the SDK and export pipeline: knowing exactly how metrics are aggregated and sent to the collector is crucial when debugging distributed metric systems.
Don’t rely solely on long-term dual running: months of dual emission didn’t reveal the problem at the trend level, even though short-window discrepancies were significant once we zoomed in.
Scrutinizing the data can uncover hidden issues: by digging deeper, we realized that we were already losing some StatsD metrics, long before introducing OpenTelemetry.

Migrating observability systems at scale is always challenging, but we hope our experience will help other teams avoid similar pitfalls.

If you’re planning an OpenTelemetry migration, pay close attention to how your data flows and how your backend services interact. Small configuration details can have major impacts!

***

In the next story, we’ll dive into the HTTP latency tracking issues we encountered during the migration and how we resolved them.

Deep dive in Open Telemetry Metrics with two stories at leboncoin tech 1/2 was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Enhanced accessibility gestures for custom components on Android

leboncoin tech — Fri, 24 Oct 2025 11:56:30 GMT

Let’s explore the various ways to make a custom component accessible.

By Nicolas Mayer Weiler, Android Engineer at leboncoin

When developing an Android app, you will likely want to make it accessible in order to reach a wide audience, including people with disabilities. A very good set of tools is available to improve the experience of these users: Talkback for blindness and other visual impairments and Switch Access for motor impairments, to name only the best known. It is up to us developers to take advantage of these tools and open our apps to every user.

In practice, what makes an app accessible to people with impairments? In Europe, the official reference framework for accessibility on mobile devices is the RAAM, which defines 107 criteria.

In this article, we will focus on gestures and look at how to provide efficient navigation to people using Talkback. Keep in mind that any enhancement made for this tool also benefits, to some extent, the other accessibility services.

Let’s dive into this challenge with the Stepper component:

Stepper component

In the context of a travelling app, this component allows the user to set the number of travellers by decreasing or increasing it using the two respective buttons. There is more than one solution to make this component accessible. We will explore three of them while evaluating the following points:

Development complexity
Update value gesture complexity
Navigation gesture complexity

Basic Stepper

Video showing a Basic Stepper read by Talkback (with subtitles)

The first solution is the simplest one. The following scheme explains the semantics of the components in terms of gestures:

Basic Stepper semantics scheme

As you can see, the semantics naturally split the component into three items: the label and each of the two buttons. The user navigates between these items by swiping left or right and updates the value by double-tapping while focusing on one of the two buttons.

Here is how we can achieve this:

https://medium.com/media/69dfcb77ed09420522ea4d39a5c4f77e/href

Line 18, the semantics for the Text showing the label are overridden.

Text(
    modifier = Modifier.semantics {
        contentDescription = stepperContentDescription
        liveRegion = LiveRegionMode.Polite
    },
    text = stringResource(id = R.string.stepper_title),
)

By setting a custom contentDescription, we make sure both the label and the current value are read along with the component title. We also set the liveRegion to LiveRegionMode.Polite so that this contentDescription is read again each time the value is updated, and the user is always aware of the current state of the Stepper.

The contentDescription is defined in line 7 and points to a plurals string resource.

val stepperContentDescription = pluralStringResource(
    id = R.plurals.stepper_value_content_description,
    count = currentValue,
    currentValue,
)


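<plurals name="stepper_value_content_description">
    <item quantity="one">%1$d traveller</item>
    <item quantity="other">%1$d travellers</item>
</plurals>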

Line 41, we clear the semantics of the Text showing the value so that it is skipped, since this value has already been read with the label.

Text(
    modifier = Modifier.clearAndSetSemantics {  },
    text = currentValue.toString(),
)

And we don’t forget to set a contentDescription to the icons contained within the two buttons, lines 45 and 55.

Icon(
    painter = painterResource(id = R.drawable.ic_increase),
    contentDescription = stringResource(id = R.string.stepper_action_increase_content_description),
)

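<string name="stepper_action_decrease_content_description">Remove one traveller</string>
<string name="stepper_action_increase_content_description">Add one traveller</string>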

This solution works fine and simply uses the basic concepts explained in the Android accessibility documentation. However, the user has to swipe many times to reach the different focusable parts of this component. Imagine what happens when there are many Steppers on the same screen. Instead of just asking for the number of travellers, for instance, the app could ask for a breakdown: number of adults, children, babies, pets… If the user just wants to set the number of pets, they would need to swipe many times before reaching the correct Stepper. What can we do to make navigation smoother?

Custom actions Stepper

Video showing a Custom actions Stepper read by Talkback (with subtitles)

With this solution, we provide a list of actions reachable through the Talkback menu. The following scheme shows the semantics of gestures for this component:

Custom actions Stepper semantics scheme

Although interacting with this Stepper seems more complex, the advantage is that the semantics are fully merged, which makes it easy to navigate to the component after this Stepper. This is useful when the user doesn’t want to change its value.

We can achieve this Stepper like this:

https://medium.com/media/7dd80eba4466fe44a2ece75860ed19b1/href

This time, we apply the semantics to the full Row, line 17.

Row(
    modifier = modifier.clearAndSetSemantics {
        contentDescription = stepperContentDescription
        liveRegion = LiveRegionMode.Polite
        customActions = stepperCustomActions
    },
    horizontalArrangement = Arrangement.spacedBy(16.dp),
    verticalAlignment = Alignment.CenterVertically,
)

We use clearAndSetSemantics in order to replace the semantics of the whole content with this new semantics. Once again we provide the contentDescription and the liveRegion defined earlier. We also add what is key to this concept: the customActions, defined line 12 and pointing to a function line 61.

val stepperCustomActions = defineStepperCustomActions(
    currentValue = currentValue,
    onValueChange = onValueChange,
)

@Composable
private fun defineStepperCustomActions(
    currentValue: Int,
    onValueChange: (Int) -> Unit,
): List<CustomAccessibilityAction> =
    listOf(
        CustomAccessibilityAction(
            label = stringResource(id = R.string.stepper_action_decrease_content_description),
            action = {
                onValueChange(currentValue - 1)
                true
            },
        ),
        CustomAccessibilityAction(
            label = stringResource(id = R.string.stepper_action_increase_content_description),
            action = {
                onValueChange(currentValue + 1)
                true
            },
        ),
    )

The contentDescriptions related to the decrease and increase actions are now used in the customActions and should then be removed from the icons, lines 47 and 54.

Icon(
    painter = painterResource(id = R.drawable.ic_increase),
    contentDescription = null,
)

This solution, using a concept described in the documentation, has the advantage of providing smoother navigation from one component to another. However, many interactions are still necessary each time the user needs to update the Stepper value. Besides, this requires a tap with three fingers, which can be considered a complex gesture. In addition to being annoying to the user, complex gestures prevent compliance with criterion 11.10 of the RAAM:

On each screen, are the features that can be activated using a complex gesture able to be activated using a simple gesture (excluding special cases)?

Tapping with three fingers does have an alternative: swiping down then right with one finger. However, this alternative is not explicitly mentioned by Talkback and is not necessarily considered a simple gesture either, which makes it a problem.

So, is there any solution allowing us to make a smooth navigation while avoiding the problem of complex gestures?

Progress Stepper

Video showing a Progress Stepper read by Talkback (with subtitles)

This time, the Stepper is considered a Slider by the accessibility service, so it offers the following semantics:

Progress Stepper semantics scheme

Now, both interacting with the component and navigating to the next one are easy. The user swipes up or down to adjust the value, swipes left or right to navigate to another component.

Here is how we can achieve this Stepper:

https://medium.com/media/edef9eec775c0255365aacad48301c6f/href

Once again, the semantics are applied to the full Row, line 15.

Row(
    modifier = modifier.semantics(mergeDescendants = true) {
        stateDescription = stepperStateDescription
        setProgress { newValue ->
            onValueChange(newValue.roundToInt())
            true
        }
    }.progressSemantics(
        value = currentValue.toFloat(),
        valueRange = 0f..MAX_VALUE.toFloat(),
        steps = MAX_VALUE,
    ),
    horizontalArrangement = Arrangement.spacedBy(16.dp),
    verticalAlignment = Alignment.CenterVertically,
)

We set mergeDescendants to true so the semantics are all merged, and provide a stateDescription so the user is always aware of the current value. This stateDescription contains exactly the same text as the contentDescription provided in the previous Stepper examples. In addition, we use setProgress to dispatch the new value to the Slider semantics each time it is updated, and we add progressSemantics to receive this value. A range and a number of steps are required.

Then, we need to clear the semantics from the label and the Stepper buttons, lines 30 and 34.

Text(
    modifier = Modifier.clearAndSetSemantics {  },
    text = stringResource(id = R.string.stepper_title),
)
Stepper(
    modifier = Modifier.clearAndSetSemantics {  },
    currentValue = currentValue,
    onValueChange = onValueChange,
)

And that’s it! This Progress Stepper is trickier to build than the others, but it allows for smooth navigation while resolving the problem of complex gestures. The main drawback is that the gesture itself, swiping up or down, may conflict with heading semantics on the screen, which let the user quickly navigate from one heading to another using the very same gesture. In that case, when focused on the Stepper, the component takes priority over the headings.

Conclusion

To conclude, here is a table comparing the different versions of the Stepper and the criteria which would help you to choose which version suits you best.

Stepper benchmark

In my case, the Progress Stepper was the most relevant choice: it provides the best user experience, my company is deeply committed to providing an accessible app to its millions of users, and it is also required to comply with the RAAM.

Ultimately, the decision depends on your specific context. Do you need to meet legal requirements? Who makes up your audience? How much time can you allocate to accessibility development? The choice is up to you.

You will find here the different links related to this article:

Thank you for reading this article and stay committed to making your app accessible for everyone!

Enhanced accessibility gestures for custom components on Android was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Once Upon a Chat Bot: The Ada Story at leboncoin

leboncoin tech — Sun, 28 Sep 2025 17:24:22 GMT

By Reka Halmai and Julien Foenet, Machine Learning Engineers at leboncoin

Imagine having a digital colleague who can answer any question, unearth buried documents, and demystify your company’s jargon — all in seconds. That was the dream at leboncoin when we set out to create our own internal chatbot, Ada.

This article tells the story of Ada: how the idea took shape, the technical and organizational challenges we faced, and why we ultimately decided to bring the project to a close. In many ways, Ada’s journey reflects the early adventures (and misadventures) of enterprises experimenting with large language models.

Ada’s icon at leboncoin

1. The Rise of Ada

Why Build an Internal Chatbot?

In the GenAI boom of 2023, excitement around large language models (LLMs) was everywhere. But for data scientists (DS) and machine learning engineers (MLE) at leboncoin, the hype came with a dose of caution.

As the corporate world rapidly embraced public LLMs, several high-profile incidents revealed how easily sensitive data could slip into — or be exposed through — these platforms. Samsung, for example, made headlines in spring 2023 after engineers inadvertently pasted source code and internal meeting notes into ChatGPT, prompting a company-wide ban on public chatbot use.

That episode — and others like it — highlighted a critical lesson: without proper controls, inputting confidential material into public GenAI services risks irreversible leakage.

This concern was front and center for us. We wanted to explore LLM capabilities, but we needed to do it within a safe, controlled infrastructure that protected both our data and our teams. And that’s how our journey with Ada began.

A few MLEs kicked off work at the end of 2023 on what would become Ada, our in-house chatbot assistant. Why a chatbot? Because internal chatbots turned out to be an ideal playground:

To test boundaries: Chatbots are a hands-on way to uncover the strengths and weaknesses of LLMs in real-world usage — while staying within a closed environment.
To upskill teams: For DS and MLEs new to LLMs, building an internal chatbot was the perfect entry point. It was practical, fast-moving, and deeply educational.

Ada became our safe launchpad into GenAI — a way to learn fast, without compromising on trust or security.

Hosted or Self-Hosted? The Model Dilemma

But building Ada quickly brought us to a critical question: How would we run the underlying LLM? Should we host a model ourselves, or rely on an external provider?

This wasn’t just a technical choice — it was deeply tied to the very reason we started Ada in the first place: data control. If Ada was meant to be a secure, internal assistant, we needed to be confident in how and where the model was running.

Early Experiment: We tried to host Meta’s Llama 2 model on AWS. It seemed promising — total data control, potentially lower long-term costs, and the geeky satisfaction of running a giant model on our own hardware.
What Actually Happened: Hosting Llama 2 brought a world of pain. Deployment was complex, performance lagged (especially for chat), and scaling was tricky. At the same time, we compared Llama 2’s answers to Anthropic’s Claude 2, accessed via AWS Bedrock.
The Results: Claude 2 consistently outperformed our in-house Llama 2 deployment — especially in terms of conversational quality, nuance, and reliability. While hosting Llama 2 gave us more control over data, it also came with a significant infrastructure cost, requiring large cloud instances to stay performant. Even with optimizations — like turning off compute during off-hours — the overall cost was higher than using a managed solution like Bedrock. The superior performance and more sustainable cost model made the shift in direction clear.

Security & Privacy: A European Concern

For a French company like leboncoin, data residency isn’t just a checkbox — it’s essential. Our goal for Ada was ambitious: to create an assistant for all employees, able to answer questions that might include sensitive details, client information, or confidential projects.

But that meant we had to be absolutely sure about where our data was processed and stored. In late 2023, there were major concerns about sending any company data to US-based servers, especially when it could contain personal or sensitive information.

So, our legal and security teams conducted a thorough review of AWS Bedrock and its available models. Claude was selected primarily for its performance, but the choice was also made possible thanks to specific guarantees from AWS — particularly that our data would remain within the AWS ecosystem and would not be used for model retraining. These conditions ensured the deployment met our legal and privacy requirements for a pilot.

Ada Is Born

Ada, a secure Claude-powered assistant built on AWS Bedrock, brought a ChatGPT-like experience to leboncoin — with one key difference: everything stayed private. No conversations were stored — not even by the development team — which ensured strong privacy but also introduced complexity, especially when investigating user-reported issues. It quickly became a trusted digital coworker, answering internal questions while protecting sensitive data. The project also fueled learning in the ML community and inspired new GenAI initiatives through shared insights.

2. Ada Expands Her Mission

The shift towards specialization

As Ada’s usefulness became more apparent, so did her potential.

What started as a general-purpose assistant quickly caught the eye of leboncoin’s leadership. They saw in Ada not just a chatbot, but an opportunity: a safe, internal playground to explore the real-world capabilities of GenAI. At the same time, more and more employees began adopting Ada in their day-to-day work, and expectations for new capabilities quickly arrived.

By mid 2024, it became clear that to remain relevant and useful, Ada needed to grow beyond answering generic questions. The logical next step? Connect her to our internal knowledge base and specialize her intelligence. That’s how the idea of creating domain-specific assistants — powered by Retrieval-Augmented Generation (RAG) — came to life.

To support Ada’s growing role, a dedicated team was spun up, composed of three MLEs and two software engineers (plus a PO and a manager). That’s not a lot, but still enough to achieve great things. This shift marked a new phase: Ada would no longer be a generic assistant, but a gateway into our internal knowledge.

Assistant, Meet Your Sources

To make Ada truly useful across the organization, we created a dedicated assistant for each internal data source — each with its own design and purpose. While all were built on RAG principles, their implementations varied to suit the structure and usage patterns of their respective domains. Each assistant targeted a specific audience and addressed real, recurring information needs.

Schema of the different RAG systems

Below are a few examples:

Classic RAG with Rephraser/Retrieval/Reranker: A classic RAG setup that indexed documents from our internal sources Confluence & Lumapps. Documents were chunked, embedded and stored in a Postgres vectorDB, enabling semantic retrieval based on similarity. A reranker was added to enhance the retrieval phase, and a query rephraser proved to be valuable in the Lumapps case. This system was chosen for the design of our Confluence assistant (product & tech docs) as well as our Customer Relations assistant (moderation content from Lumapps).
Lexical retrieval using OpenSearch engine: A different approach was applied to build our Backstage assistant, by leveraging the OpenSearch search engine & index. Since Backstage documents (tech docs) were already indexed with OpenSearch, instead of using embeddings, we retrieved content similar to the user input query by running a “search” API call. We also used a reranker, as well as a keyword rephraser to help this lexical retrieval.
Augmenting the context with the whole document base: In some cases, we skipped retrieval entirely and loaded the full document base into the model’s context. This is only feasible when the corpus is small enough to fit within the large context windows of modern LLMs. We used this approach for the Policy assistant (static files uploaded directly) and the Org Chart assistant (one small OpenSearch index loaded entirely), ensuring complete coverage without the retrieval step.
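
To give a feel for the first pattern (rephrase, retrieve, rerank), here is a deliberately tiny Python sketch. Everything in it is a stand-in chosen for illustration: the "embedding" is a bag-of-words count rather than a learned model, the rephraser only normalizes whitespace instead of calling an LLM, the reranker is a simple term-coverage score, and the documents are invented. It is not the code behind our assistants.

```python
import math
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def rephrase(query: str) -> str:
    """Placeholder for the LLM rephraser: here it only normalizes whitespace."""
    return " ".join(query.split())

def chunk(document: str, size: int = 12) -> List[str]:
    """Split a document into chunks of at most `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts instead of a learned model."""
    return Counter(tokenize(text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def rerank(query: str, candidates: List[str]) -> List[str]:
    """Stand-in reranker: prefer candidates covering more distinct query terms."""
    terms = set(tokenize(query))
    return sorted(candidates, key=lambda c: len(terms & set(tokenize(c))), reverse=True)

# Invented documents, for illustration only.
documents = [
    "To request access to the staging cluster, open a ticket with the platform team. Access is reviewed weekly.",
    "The moderation guidelines describe how to handle reported ads. Escalate fraud cases to the trust team.",
]
chunks = [c for d in documents for c in chunk(d)]
question = rephrase("How do I  request access to staging?")
ranked = rerank(question, retrieve(question, chunks))
print(ranked[0])
```

The top-ranked chunks would then be placed in the prompt sent to the model, alongside the user's question.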

Our assistants rely on Anthropic Claude models via AWS Bedrock. Some of these architectural choices will be explored in more detail in the following sections, with the lessons we learned along the way.

Expanding Access Through Slack

As Ada’s adoption grew, so did the need to make her more accessible. Initially available only through a standalone web interface, she required users to step out of their usual workflows to ask a question. To remove that friction, we brought Ada directly into Slack, by integrating the multiple assistants into specific Slack Apps.

We also developed custom Slack features such as thread and channel summarization. These tools allowed employees to select a range of messages with options like time period and report type (concise or extended summaries) and receive the results as ephemeral messages — they quickly became a very popular feature.

3. Lessons Learned

a. You Don’t Always Need a RAG System: The Backstage Assistant Example

One of the biggest lessons from building Ada and related assistants at leboncoin is that a RAG pipeline isn’t always the right answer. Our experience with the Backstage and Organizational Chart assistants is a perfect example of this.

Why Not RAG? The OpenSearch Case

Backstage is leboncoin’s internal developer portal based on the open-source framework from Spotify — a hub for technical documentation, onboarding guides, API specs, and more. When we wanted to build an assistant for Backstage, we had two choices for surfacing relevant information:

Classic RAG: This would involve retrieving all relevant content (like tech docs and onboarding materials), embedding it, storing it in a vector database, and using semantic search to query it. While powerful, this approach would require building and maintaining an ingestion pipeline, managing a vector database, and covering additional costs related to embeddings.
Leverage OpenSearch: Since Backstage already integrates with OpenSearch — an open-source engine optimized for keyword search and analytics — we chose to leverage that. It offered a much simpler alternative to classic RAG with less overhead and no need for new infrastructure, which reduced deployment time of the MVP to as little as a sprint.

What is OpenSearch?
OpenSearch works by indexing documents and enabling fast keyword-based searches. It’s optimized for log and document search scenarios, using inverted indices, tokenization, and stop-word removal (in our case, mainly in English). This makes it efficient, but also limited compared to semantic search or multi-language needs.
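
As a rough mental model of those three ingredients (tokenization, stop-word removal, inverted index), here is a toy Python version. It is an illustration of the general technique only, far simpler than OpenSearch's real analyzers and scoring, and the stop-word list and documents are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# A tiny, made-up English stop-word list.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "how", "do", "i", "in", "for"}

def analyze(text: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, drop English stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: token -> set of ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Rank documents by the number of distinct query terms they contain."""
    scores: Dict[str, int] = defaultdict(int)
    for token in set(analyze(query)):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "deploy": "How to deploy a service to production",
    "oncall": "On-call rotation and incident process",
}
index = build_index(docs)
print(search(index, "How do I deploy a service?"))  # matches on 'deploy' and 'service'
print(search(index, "incident rotation"))
```

Matching is purely lexical: a query only finds a document if they share tokens after analysis, which is what makes this approach fast and also what limits it.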

The Multi-Language Challenge

Most of our documentation is in English, but queries from users come in both French and English. OpenSearch’s preprocessing is optimized for English (stop-word removal, stemming, etc.), so French queries often produced poor matches out-of-the-box. Simply using keyword search meant we missed relevant documents, especially for non-English queries.

We decided to see how far we could push the OpenSearch-only approach. Instead of immediately switching to a full RAG pipeline, we built a query rephraser. The idea is simple: before hitting OpenSearch, we run the user’s question through a lightweight English keyword rephraser, making the query more compatible with OpenSearch’s index.

This significantly improved our retrieval performance, raising context relevance from 0.63 to 0.73 and returning the correct source link in 70% of cases. With our final setup (OpenSearch + Rephraser + an additional Reranker), we obtained more accurate and consistent results on multilingual datasets, on par with traditional RAG pipelines.
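
The effect of rephrasing before a lexical search can be illustrated with a toy Python example. The real rephraser is a model call; here it is replaced by a hypothetical hard-coded French-to-English keyword map, and the documents and query are invented, so this only shows the shape of the idea.

```python
import re
from typing import Dict, List

# Invented English-only documentation.
DOCS: Dict[str, str] = {
    "deploy": "How to deploy a service to production",
    "oncall": "On-call rotation and incident process",
}

# Hypothetical stand-in for the rephraser: French keyword -> English keyword.
KEYWORD_MAP = {"astreinte": "on-call", "déployer": "deploy", "incident": "incident"}

def tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())

def keyword_search(query: str, docs: Dict[str, str]) -> List[str]:
    """Return ids of documents sharing at least one token with the query."""
    q = set(tokens(query))
    scores = {doc_id: len(q & set(tokens(text))) for doc_id, text in docs.items()}
    return sorted((d for d, s in scores.items() if s > 0), key=lambda d: (-scores[d], d))

def rephrase_to_english_keywords(query: str) -> str:
    """Keep only the terms we can map to English keywords."""
    return " ".join(KEYWORD_MAP[t] for t in tokens(query) if t in KEYWORD_MAP)

french_query = "Qui est d'astreinte cette semaine ?"
before = keyword_search(french_query, DOCS)
after = keyword_search(rephrase_to_english_keywords(french_query), DOCS)
print(before, after)
```

The raw French query shares no token with the English documents, while the rephrased keywords do, which is the gap the rephraser closes without changing the index.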

Beyond Documents: Organizational Knowledge as a Graph

Our Backstage portal isn’t just for technical documentation. It also represents the organization of leboncoin itself — teams, squads, crews, and leadership — structured as a graph. This added a new challenge.

Why keyword search fails: OpenSearch can’t natively answer questions like “Which teams belong to this crew?” or “Who does this MLE report to?” because it only sees each node (document) and its direct parent, not the full hierarchy.
Our solution: For organizational questions, we loaded and cached the entire org graph into the assistant’s context. This way, Ada could answer complex queries about team structure, reporting lines, and more — without needing extra lookups or worrying about language differences (since Claude, our LLM, is bilingual).
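
A small Python sketch shows why the whole graph is needed. Each node below only knows its direct parent, as in the index; answering "which teams belong to this crew?" requires walking the full structure. The org names are invented, and `as_context` is just one hypothetical way to serialize the graph as text for the model's context.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical org nodes: name -> (kind, direct parent). All names are made up.
ORG: Dict[str, Tuple[str, Optional[str]]] = {
    "Tech": ("department", None),
    "Crew Search": ("crew", "Tech"),
    "Team Ranking": ("team", "Crew Search"),
    "Team Indexing": ("team", "Crew Search"),
    "Crew Payments": ("crew", "Tech"),
    "Team Checkout": ("team", "Crew Payments"),
}

def children_of(org: Dict[str, Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Invert the parent pointers into a parent -> children map."""
    children: Dict[str, List[str]] = defaultdict(list)
    for name, (_, parent) in org.items():
        if parent is not None:
            children[parent].append(name)
    return children

def descendants(org: Dict[str, Tuple[str, Optional[str]]], root: str, kind: str) -> List[str]:
    """All nodes of the given kind anywhere below `root`."""
    children = children_of(org)
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if org[child][0] == kind:
                found.append(child)
            stack.append(child)
    return sorted(found)

def as_context(org: Dict[str, Tuple[str, Optional[str]]]) -> str:
    """Serialize the whole graph as plain text to place in the model's context."""
    return "\n".join(
        f"{name} ({kind}) belongs to {parent or 'no parent'}"
        for name, (kind, parent) in org.items()
    )

print(descendants(ORG, "Crew Search", "team"))
print(descendants(ORG, "Tech", "team"))
print(as_context(ORG))
```

With the full serialized graph in its context, the model can do this kind of traversal implicitly, whatever language the question is asked in.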

Sometimes, you can get great results by thoughtfully combining traditional search tools with a few clever tweaks: preprocessing queries, reranking results, and structuring data for context-aware answers. In our case, this approach was faster to build, more resource-efficient, and flexible enough to support multi-language queries.

b. Evaluation Is Not Optional: How Custom Metrics and Datasets Made Ada Smarter

In the world of GenAI, evaluation is often overlooked, but it’s what separates a cool demo from a reliable product. We quickly realized that robust evaluation wasn’t just a nice-to-have; it was essential to iterate, catch edge cases, and build trust in Ada’s answers. Our team dedicated multiple sprints to building a robust evaluation framework.

At first, we relied on off-the-shelf datasets (one per data source, and built mainly with question/answers synthetically created from the source) and popular metrics like cosine similarity, answer relevance, and context relevance. But they didn’t reflect our real-world challenges.

Datasets tailored to our needs

Some use-cases required their own datasets. For instance, the Confluence assistant struggled with table-based queries because chunking removed headers and lost the structure of the tables, something we only caught after building a dataset dedicated to those cases. Similarly, our Backstage dataset needed to reflect multilingual queries, especially in French, so we created two separate datasets, one for English queries and one for French.
We also started to gather more user feedback to populate the datasets, which made them much more realistic and reliable.

Rethinking the metrics

Some metrics, like cosine similarity (between the LLM generated response and the expected response) or answer relevance (an LLM-as-a-judge metric comparing the answer to the question), looked good on paper but failed in practice. They didn’t catch hallucinations or weak retrieval. In the end, we focused mainly on two meaningful signals:

Correctness: a metric using an LLM as a judge to evaluate how closely the generated answer matches the ground truth. This metric is our main metric, reflecting how well the whole system is working. It plays a central role in our experiments and has consistently proven to be a reliable indicator of when the system is performing well or poorly.
Correct Links Pulled: a custom metric to track whether the right sources were retrieved and used (it counts the links present in the answer against the expected links). This metric gives us a precise measure of the retrieval step of our RAG systems. In practice, we found it more reliable for judging the relevance of the context than some LLM-as-a-judge metrics like context relevance or groundedness.
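As a rough illustration of the second metric, here is one possible Python reading of it. The exact formula used for Ada is not given in this post, so the scoring below (the share of expected links that appear in the answer) is an assumption, as are the function name and the example URLs.

```python
import re
from typing import Iterable

# Simple URL matcher; good enough for links embedded in plain-text answers.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def correct_links_pulled(answer: str, expected_links: Iterable[str]) -> float:
    """Share of expected links that appear in the answer (assumed definition)."""
    expected = {link.rstrip("/.,") for link in expected_links}
    if not expected:
        return 1.0  # nothing was expected, so nothing can be missing
    found = {url.rstrip("/.,") for url in URL_PATTERN.findall(answer)}
    return len(expected & found) / len(expected)


answer = "See https://docs.example.com/deploy and https://docs.example.com/oncall."
score = correct_links_pulled(
    answer,
    ["https://docs.example.com/deploy", "https://docs.example.com/rollback"],
)
```

Because the check is a plain set comparison, it is deterministic and cheap, which is one plausible reason such a metric can be more stable than a judge model.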

Tooling matters

We evaluated often and iterated quickly thanks to Langsmith. Its asynchronous evaluation was a real game changer (reducing the time for evaluating a dataset of 120 examples from 30min to 3min), letting us run batched, reproducible experiments efficiently despite the high latency of LLMs. We also used Airflow to schedule a weekly evaluation baseline. In the end, our evaluation process was fast, smooth, and precise.
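The speed-up from asynchronous evaluation comes from overlapping many slow LLM calls instead of waiting for each in turn. The sketch below shows the general pattern with Python's asyncio. It does not use the Langsmith API: `evaluate_example` is a hypothetical stand-in whose sleep simulates the latency of a model call plus a judge call, and the concurrency limit is an arbitrary choice.

```python
import asyncio
import time
from typing import Dict, List


async def evaluate_example(example: Dict) -> Dict:
    """Hypothetical stand-in for one generation call plus one judge call."""
    await asyncio.sleep(0.05)  # simulated network latency
    return {"id": example["id"], "correct": example["answer"] == example["expected"]}


async def evaluate_dataset(examples: List[Dict], max_concurrency: int = 10) -> List[Dict]:
    """Evaluate all examples concurrently, capped to avoid hitting rate limits."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(example: Dict) -> Dict:
        async with semaphore:
            return await evaluate_example(example)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(e) for e in examples))


def run_demo():
    examples = [{"id": i, "answer": i % 2, "expected": 0} for i in range(20)]
    start = time.perf_counter()
    results = asyncio.run(evaluate_dataset(examples))
    return results, time.perf_counter() - start


results, elapsed = run_demo()
accuracy = sum(r["correct"] for r in results) / len(results)
```

Run sequentially, the 20 simulated calls would take at least one second; with ten in flight at a time they finish in a fraction of that.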

Beyond the First Match: Improving RAG with Rerankers and Rephrasers

When building RAG systems, two things matter most: retrieving the right chunks and asking the right question.

Rerankers

Regardless of the data source — whether a vector database like the one used in the Confluence assistant, or a keyword-based engine like OpenSearch in the Backstage assistant — we found that retrieval alone was not sufficient. First-pass methods such as cosine similarity or keyword matching were fast, but often failed to surface the most relevant content.

To improve both the quality of retrieved context and the accuracy of generated answers, we introduced a second step: reranking. This transformed our system into a two-stage retrieval pipeline. The first stage performed a broad selection of candidate chunks, while the second stage used a reranker to reorder those chunks based on their actual relevance to the user’s query.

For this reranking step, we chose Cohere’s Rerank 3.5 model, which leverages a technique called cross-encoding. In this approach, the model evaluates each document (or chunk) directly in relation to the user’s query, generating a relevance score that reflects their semantic alignment.
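The two-stage structure can be sketched in Python as follows. Both scoring functions are toy placeholders introduced for illustration: `keyword_overlap` stands in for the fast first pass, and `toy_pair_score` stands in for a cross-encoder such as a hosted reranker. No real reranker API is called here.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def two_stage_retrieve(query: str, chunks: List[str], first_pass: Scorer,
                       rerank: Scorer, candidates: int = 20, top_k: int = 5) -> List[str]:
    """Broad, cheap selection first; precise reordering of the shortlist second."""
    shortlist = sorted(chunks, key=lambda c: first_pass(query, c), reverse=True)[:candidates]
    reranked = sorted(shortlist, key=lambda c: rerank(query, c), reverse=True)
    return reranked[:top_k]


def keyword_overlap(query: str, chunk: str) -> float:
    """Toy first pass: number of shared words, ignoring order."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def toy_pair_score(query: str, chunk: str) -> float:
    """Toy placeholder for a cross-encoder: looks at query and chunk together
    and rewards query word pairs that appear in the same order in the chunk."""
    words = query.lower().split()
    text = " ".join(chunk.lower().split())
    return float(sum(1 for pair in zip(words, words[1:]) if " ".join(pair) in text))


chunks = [
    "user password reset tokens expire after one hour",
    "to reset user password open the admin console",
    "weekly on-call rotation",
]
best = two_stage_retrieve("reset user password", chunks, keyword_overlap,
                          toy_pair_score, candidates=2, top_k=1)
```

The first two chunks tie in the first pass because they share the same three words with the query; only the second stage, which reads the query and chunk jointly, puts the more relevant one on top.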

The implementation process was straightforward and completed within a single sprint, including the evaluation of retrieved and re-ranked document volumes. While the impact on system latency was minimal, the improvements in answer quality and reliability were substantial.

That said, rerankers come with trade-offs. They add some extra cost, and unlike our main LLM (Claude Sonnet, with a 200K token window), the reranker model we used — Cohere Rerank 3.5 — has a 4,096-token limit. This restricts the size of the user query accepted by the system, and thus impacts the assistant usage. In short: better relevance, but with constraints.

Rephrasers

In some cases, query formulation itself was the problem. Our moderation assistant, for example, struggled with overly generic queries like “Is an ad about selling a shotgun allowed on the website?” These pulled irrelevant chunks due to noisy keywords like “ad” or “website.” A simple rule-based rephraser that stripped generic terms fixed this and boosted correctness by 10%, while a more complex LLM-based rephraser achieved only a 3–4% improvement.
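A rule-based rephraser of this kind can be very small. The Python sketch below is an assumption about its shape rather than the production code: only “ad” and “website” are named above as noisy keywords, so the rest of the term list is hypothetical.

```python
import re

# Hypothetical list: the post only names "ad" and "website" as noisy terms.
GENERIC_TERMS = {"ad", "ads", "website", "site", "allowed", "about",
                 "is", "an", "a", "the", "on"}


def strip_generic_terms(query: str) -> str:
    """Remove generic terms so that only the discriminating keywords remain."""
    tokens = re.findall(r"[\w'-]+", query.lower())
    kept = [t for t in tokens if t not in GENERIC_TERMS]
    # Fall back to the original query rather than searching with nothing.
    return " ".join(kept) if kept else query


rephrased = strip_generic_terms("Is an ad about selling a shotgun allowed on the website?")
```

Here the rephrased query keeps only “selling shotgun”, so retrieval is driven by the one term that actually identifies the moderation rule.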

As mentioned before, another case was our Backstage assistant: we added a rephraser that auto-translated queries before running the search, which significantly improved retrieval quality.

That said, rephrasers must be used with care. When retrieval already performs well, adding a rephraser can hurt. We observed a drop in performance when adding a prompt-based rephraser to our RAG Confluence assistant (retrieval from the raw user query was already working well in this case).
The lesson? Rephrasers are a precision tool, not a blanket solution.

4. The Sunset of Ada

Artificial intelligence is a field that’s reinventing itself at breakneck speed. New models and platforms appear almost monthly, each pushing the boundaries of what’s possible. The enterprise AI market is estimated at €97 billion in 2025 and is expected to grow by more than 18% per year to reach €229 billion by 2030 (source: Mordor Intelligence).

At leboncoin, the message from leadership was clear: every team should be able to have its own assistant or chatbot. A few MLEs couldn’t keep pace with demand. That’s when alternative onboarding tools came under serious consideration.

Experimenting with Onyx and Alternatives

To deliver faster, our team tested a few platforms that allow for data source connections and custom assistant creations via a UI. Early in 2025, we began testing Onyx, an open-source enterprise AI assistant and search platform. Designed to connect with tools like Slack, GitHub, Confluence, and various document repositories, Onyx offers robust cross-platform data retrieval. We selected Onyx for initial testing with a group of pilot users to evaluate its capabilities in real-world scenarios.

A few months later, we deployed Onyx, ran the pilot with test users, and evaluated its features. While it showed promise, our decision to self-host to assure data privacy came with significant infrastructure demands — especially due to the use of the Vespa database, which introduced complexity and affected the solution’s stability in production. Customization was also limited, and in the end, Onyx wasn’t the right fit for leboncoin at this stage.

ChatGPT Enterprise Becomes the Strategic Default

Meanwhile, OpenAI launched data‑residency in Europe for ChatGPT Enterprise and the API — ensuring data remains fully within EU boundaries, meeting GDPR, SOC 2, and CSA‑STAR requirements. This change made ChatGPT a viable enterprise solution for European‑based workplaces.

As our team developed strong expertise in generative AI over the course of Ada’s development, we reached a point where it made sense to redirect those skills toward more user-facing use cases. Maintaining an in-house assistant like Ada came with a significant overhead — both technical and operational. Thus, at the end of the first quarter of 2025 the decision was made: Ada would be gradually phased out, and a new internal assistant would center on OpenAI’s platform, allowing the team to focus on higher-impact projects.

Today, we’re working on transferring Ada’s “secret sauce” to ChatGPT and the broader AI stack:

Several features will be ported via Model Context Protocol (MCP) connectors, allowing integration between ChatGPT and internal APIs.
Others will be reimagined as Custom GPTs built atop ChatGPT’s action‑enabled frameworks.
And for workflows and automation, we’re exploring n8n, an open-source automation engine, to orchestrate triggers across internal systems without spinning engineer-heavy infrastructure.

Conclusion: Ada’s Legacy

As we say goodbye to Ada, we’re not just turning off a tool — we’re closing a chapter that taught us a tremendous amount about Generative AI in the real world.

This project was a high‑velocity learning accelerator. It gave us a fresh exposure to LLM infrastructure, practical prompt engineering, LLM experimentation, RAG and indexing pipelines, security governance and legal compliance within EU law, and the possibilities and the limitations of Generative AI.

Ada’s journey wasn’t long (about a year and a half), but it taught us, above all, that deploying LLM-backed apps is far from plug-and-play. It also reminds us that in AI, sometimes the journey is just as valuable as the destination. Here’s to the next adventure.

Once Upon a Chat Bot: The Ada Story at leboncoin was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

How leboncoin paved the way for the deployment of large-scale accessibility

leboncoin tech — Tue, 24 Jun 2025 15:32:36 GMT

By Laura Gautheron, Staff UX designer at leboncoin

For the past year, leboncoin has been focusing more intensely on accessibility. In my capacity as Design Ops, I’ve been in charge of leading this project, and I’d like to share my feedback on what we’ve achieved, the hits and misses, and what we’re planning for the future.

This article retraces the steps of this project:

from defining the status and goals,
creating the right setup,
training and supporting the teams,
launching it all to start improving our platforms,
and finally, assessing our work and planning the next steps.

👉 The first challenge: defining our goal

A snapshot of September 2024

In September 2024, when we decided to tackle accessibility, our north star was a bit blurry, although we had an idea of how far we needed to go thanks to an audit carried out two years ago by Access42. But since then, a few changes in our strategy and organization have happened, which have made it hard to action it.

To start afresh and build a proper backlog of upgrades, we decided to audit our web and mobile platforms. We asked Ethic First, a company specialising in accessibility, to audit our apps, and an internal member, Romain Saillant, to audit our website.

A legal framework causing indecisiveness

It didn’t help that no one agreed on the interpretation of the law governing the regulations. To get out of this fog, our legal department, various authorities and other industry players discussed the matter for several weeks.

We finally decided to err on the side of caution, aiming to make all our user flows compliant by June 2025 — a very ambitious goal. Our next step was setting goals the teams could achieve.

Focusing our goals on user flows rather than on the grade

Since the law required “accessible service”, I decided to stress the whole user journey, and not just the quick wins to increase our conformity rating. We would focus on making the audited pages compliant, putting the priority on the main flow (connection, ad creation, search and purchase of a product).

In order to have a visible impact on the main grade and to reassure both the teams and management, I’d advise prioritizing a few quick wins, while still keeping in mind the whole user journey.

For internal management purposes, it seems better to use the average rate rather than the overall rate. The average rate gives good visibility of the progress made on each page.
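The difference between the two rates is easier to see with numbers. The Python sketch below assumes RGAA-style definitions, which this article does not spell out: the average rate is the mean of each page's compliance rate, while the overall rate counts a criterion as compliant only if it passes on every audited page. The page names and criterion ids are hypothetical.

```python
from typing import Dict

# "C" = compliant, "NC" = non-compliant, "NA" = not applicable
Audit = Dict[str, Dict[str, str]]


def page_rate(results: Dict[str, str]) -> float:
    applicable = [r for r in results.values() if r != "NA"]
    return sum(r == "C" for r in applicable) / len(applicable)


def average_rate(audit: Audit) -> float:
    """Mean of the per-page rates (assumed definition)."""
    return sum(page_rate(results) for results in audit.values()) / len(audit)


def overall_rate(audit: Audit) -> float:
    """A criterion counts only if it fails on no audited page (assumed definition)."""
    criteria = {c for results in audit.values() for c in results}
    applicable = [c for c in criteria
                  if any(r.get(c, "NA") != "NA" for r in audit.values())]
    passing = [c for c in applicable
               if all(r.get(c, "NA") != "NC" for r in audit.values())]
    return len(passing) / len(applicable)


before: Audit = {
    "home":   {"1.1": "C", "3.2": "C",  "7.1": "NC"},
    "search": {"1.1": "C", "3.2": "NC", "7.1": "NC"},
}
# Criterion 7.1 is fixed on the home page only.
after: Audit = {
    "home":   {"1.1": "C", "3.2": "C",  "7.1": "C"},
    "search": {"1.1": "C", "3.2": "NC", "7.1": "NC"},
}
```

Under these assumptions, fixing one criterion on one page moves the average rate from one half to two thirds, while the overall rate stays at one third because the criterion still fails elsewhere. That is why the average rate is the better progress indicator for teams working page by page.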

👉 Creating the right set-up

Finding sponsors

The set-up was a crucial step in this project.

As Design Ops, I have no view of budgets or of the factors influencing the roadmaps. I asked for the sponsorship of a Product Director, Ludovic Hazart (who will take over as accessibility lead going forward), and a Technical Director, Arnaud Dars, to support me in overseeing the project.

We met twice a month for a management meeting, where they would help unlock budgets, send the right information to their teams and, most importantly, prioritize accessibility within the roadmaps.

If accessibility isn’t explicitly included in the roadmap, it remains confined to isolated design or development initiatives, without being implemented across the entire user journey.

To succeed, it’s crucial to include sponsors with real influence over product prioritization and budget control.

Training the teams

Quite early on, we realised the teams didn’t have the skills needed to complete these changes. To empower them in this process, and to reach the level of improvement we were aiming at, massive training was required for everyone.

So we organised training at scale for nearly 200 collaborators across all professions: UX, Web Developers, iOS and Android developers, Quality Analysts and Product Managers, with the help of our partners Ethic First and Access42.

The training courses were spread over the end of 2024 to the beginning of 2025 and were held by Ethic First and Access42, our expert partners in accessibility. When we were first introduced to the topic, people’s knowledge was very uneven. We therefore had to think in terms of support for teams embarking on the project.

Supporting teams with ambassadors and tools

Accessibility is a multidisciplinary issue, so I chose to recruit passionate ambassadors in each discipline: iOS, Android and web Developers, QA specialists, Product Managers and UX designers, each bringing their own unique perspective to enrich our approach.

Their role is to be the point of reference on the topic for all members of their profession, and to share our progress as well as accessibility best practices.

The next step? Turn their energy into an organized, structured movement. We methodically built the foundations of a sustainable accessibility culture: documentation, clear processes, and above all, a collaborative mindset.

Here’s the set-up in a nutshell:

- an audit to build up a backlog of corrective measures,

- a steering committee to lead and promote the issue,

- a reference team to support various professions,

- extensive training to develop in-house skills and shift towards autonomy.

👉 Launching the project

Raising awareness, creating enthusiasm

Manuel Perreira, Accessibility Manager at the Valentin Haüy Association, had already spoken at an event organized by leboncoin’s accessibility ambassadors (a11y) last November. Together with the audit carried out two years ago, it had already made the teams aware of the subject.

Conference at leboncoin with a live demo of accessibility issues by Manuel Perreira

To kick off the new quarter, we invited all the teams to a presentation of the audit results with the help of our internal communications department. The goal was to motivate them, as they would have to deal with teething problems during the first quarter.

Presentation of the audit results for our mobile apps by Ethic First

A trial and error phase

The beginning of the process required a lot of resilience and a transformation of our practices. For newly formed teams, the novelty of the subject generated a lot of back-and-forth, questioning, and backtracking. However, thanks to the ambassador group, the tools we developed, and our partner’s support, we were able to make progress on the corrective measures.

The big challenge was capitalizing on the teams’ trial and error to avoid repetitive questions by sharing effective solutions as soon as they were found. To this end, we set up a continuously updated documentation system with patterns and solutions that teams could use. We also had a Slack channel to discuss our findings. In addition, we changed our “definition of done” at the design or development stage by creating new processes and providing tools for the various disciplines.

Here’s how I achieved that as a designer:

I set up an “A11Y Checklist” in one of our Figma libraries. We fine-tuned it during a training session with Access42. The checklist links to our documentation and provides explanations of the concepts.
I’ve also organized workshops with the designers for them to get more familiar with the available plugins and define a common template for writing our specifications.
As it’s difficult to adapt to these new methods immediately, especially when the roles of designers and developers overlap, we are setting up a systematic review process for screens before sending them to development. This process will give everyone time to adapt to this new stage of design.

An overview of the Designer Checklist component, available in our in-house Figma library

Example of accessibility specifications, using Figma annotations and plugins such as Focus Order

The same kind of tools and workshops have been set up in other disciplines like QA and development.

Scaling the process with our design system

Since 2022, leboncoin has invested significantly in creating and implementing Spark, our design system.

From the outset, it was conceived with accessibility in mind, and without it, we could not have taken on this project. Although this design system wasn’t perfect, the upgrades made to its components gave us a boost and improved the user journeys’ accessibility.

Focusing on this topic also had the benefit of accelerating adoption of the design system. A good example of this is what we call the “scrolling list”, which has accessible navigation patterns and can be seen on the homepage. If you want to learn more, have a look at our demo and documentation.

Example of accessibility documentation in our design system

Example of Figma specifications for navigation behaviors

👉 Our goals for 2025: Compliance and autonomy

Drawing up a multi-year accessibility plan enabled us to identify the challenges for this year:

continue to improve our compliance score
work on the usability of our platform for disabled users
become autonomous in our self-assessment.

Assessment

After almost six months of development work, it’s time for an early assessment.

While teams and stakeholders are eager to see how things are progressing, it is difficult to provide an accurate evaluation without conducting a counter-audit.

Here’s what we can already observe:

Across the twenty or so teams who have tackled the subject, 200 Jira tasks have been completed and 80 are still to be done or in progress.
Our internal auditors have counter-audited the most important journeys and have gone back and forth with the teams involved.
Based on information provided by the teams, I have estimated our progress and noted improvement in areas where the highest investment has been made, for example in our web ad submission system.
An audit will be carried out during the year to assess progress.
The difficulty lies in demonstrating significant ROI, as management is aware of the project costs, but the overall score is difficult to increase.

Our users

Rather than taking a real user-centric approach, this project was driven by risk reduction and the implementation of accessibility principles, resembling a technical overhaul.

In 2025, we plan to implement tools that will enable us to collect and analyse quantitative data by tracking the behaviour of users navigating our platform with a screen reader. This will provide insight into the number of users involved, help us identify friction points in the user journey, and reveal their habits.

With the help of the Valentin Haüy Association, we will conduct a first qualitative user test session in September to evaluate our processes with real users, once the teams have further increased compliance. This will come at just the right time to help revitalise the project and boost team commitment. Moving forward, we are considering ways to include people with disabilities in our user tests, either via dedicated sessions or by including one representative per test.

Moving towards audit autonomy

Currently, we lack the in-house expertise to accurately assess our progress. Three people have been trained in auditing, but they do not have enough dedicated time to work on it.

This year, we intend to increase the number of internal auditors. At the same time, we are exploring the potential of AI to automate some diagnostics, both at the wireframe stage and in production.

👉 Looking back

If I had to do it all over again, these are the things I would do differently:

Audit: the Ethic First sample differed from the web sample, which was based on the 2022 audit. In retrospect, I would use the same sample wherever possible to simplify the progress analysis phase.
Progress visibility: from the outset, I would establish ways to give management visibility into the project’s progress, highlighting the average rate per page and using it as an internal communication tool.
Autonomy: I would start training volunteers in auditing right from the beginning.

… and the things I would do the same way:

Involving sponsors from the product and tech management,
Setting up massive training,
Creating a group of motivated multi-disciplinary ambassadors to rely on.

This article reflects the current status of the project, which is still ongoing, including the upcoming publication of our 2025–2027 multi-year plan and the subsequent update of our compliance rating, which does not yet reflect the efforts made by the teams mentioned above.

How leboncoin paved the way for the deployment of large-scale accessibility was originally published in leboncoin tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Inside Our Journey to OpenTelemetry: Observability at Scale (1/3)

leboncoin tech — Sun, 25 May 2025 15:18:08 GMT

by Paul Drappier and Alexis Destrez, Lead Backend Engineers at leboncoin

Paul Drappier and Alexis Destrez are backend guild leaders at leboncoin, a marketplace that simplifies access to second-hand goods.

As backend community leaders at leboncoin, both are responsible for establishing technical governance, maintaining high quality standards, and ensuring the coherent and modern development of the backend community.

When we joined leboncoin in 2017, the back-end guild had 35 members. Seven years and a migration to Go later, it has grown to 120 engineers across more than 60 teams, managing the deployment of more than 900 applications.

In this article, we’ll explain why and how we implemented observability using OpenTelemetry at leboncoin.

Why we Decided to Migrate to OpenTelemetry

As part of our ongoing effort to enhance observability across our diverse technology stack, we undertook a significant migration: transitioning to OpenTelemetry. Previously, our Go applications sent metrics and tracing data to Datadog Agents, which then forwarded it to our Datadog instance. This setup served us well for a time.

However, with a company-wide initiative to rationalize our marketplaces and establish a unified platform, we needed a standardized approach to observability that could accommodate multiple languages, including Java and Kotlin, alongside our existing Go applications. Our goal was to achieve greater flexibility, efficiency, and scalability in our observability practices.

The following examples illustrate the aggregation complexity before the migration. Each application prefixed the metrics it generated with its name; to aggregate all HTTP server responses, we had to manually sum metrics from all applications. As the number of applications grew (to nearly a thousand), maintaining such an aggregated view became neither feasible nor maintainable.

After the migration, such an aggregated view of the traffic of all applications can be made using only one metric.
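The before/after contrast can be sketched in a few lines of Python. The metric names, application names, and values below are hypothetical and only illustrate the naming scheme; they are not leboncoin's actual metrics.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Before: the application name is baked into the metric name.
LEGACY: Dict[str, int] = {
    "adsearch.http.responses": 120,
    "payment.http.responses": 45,
    "messaging.http.responses": 30,
}


def legacy_total(samples: Dict[str, int], app_names: List[str],
                 suffix: str = "http.responses") -> int:
    """Requires an up-to-date list of every application to build each name."""
    return sum(samples[f"{app}.{suffix}"] for app in app_names)


# After: one shared metric name, with the application carried as a tag.
Point = Tuple[str, Dict[str, str], int]
UNIFIED: List[Point] = [
    ("http.server.responses", {"service": "adsearch"}, 120),
    ("http.server.responses", {"service": "payment"}, 45),
    ("http.server.responses", {"service": "messaging"}, 30),
]


def unified_total(points: List[Point], metric: str) -> int:
    """A single query over one metric; new applications are included for free."""
    return sum(value for name, _, value in points if name == metric)


def unified_by_tag(points: List[Point], metric: str, tag: str) -> Dict[str, int]:
    """The same metric can still be split per application when needed."""
    totals: Dict[str, int] = defaultdict(int)
    for name, tags, value in points:
        if name == metric:
            totals[tags.get(tag, "unknown")] += value
    return dict(totals)


total_before = legacy_total(LEGACY, ["adsearch", "payment", "messaging"])
total_after = unified_total(UNIFIED, "http.server.responses")
```

Both totals are identical, but the legacy version breaks as soon as an application is added without updating the list, whereas the unified version needs no change.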

OpenTelemetry emerged as the ideal solution to address these needs. Beyond enabling us to consolidate our observability practices across languages, it offered crucial advantages:

Standardization and interoperability: OpenTelemetry aligns us with industry standards, ensuring interoperability with a wide range of observability platforms and fostering a more open and collaborative ecosystem.
Greater control and ownership: OpenTelemetry empowers us with greater control over our telemetry data, enabling us to route and process it according to our specific requirements.

With this migration, we were aiming at:

Improving Naming Conventions: establishing a robust and standardized naming convention for our metrics to enable efficient aggregation and analysis across our diverse services and stacks.
Rationalizing Metrics Strategy: streamlining our overall metrics strategy to eliminate redundancies, and ensuring we collect the most valuable data for understanding our application performance.
Unifying Dashboards: creating shared dashboards maintained by the back-end guild that all teams could easily adapt and use.

How we approached the Migration of our Entire Monitoring Stack

Our migration to OpenTelemetry was a carefully planned, iterative process, driven by the need to minimize disruption and maximize learning along the way. Given our objectives, the interconnected nature of our services, and the extensive use of shared libraries within our monorepo, we adopted the following migration approach:

Phased Rollout: Given our monorepo’s scale (over 900 microservices) and the reliance on critical alerts on existing metrics, a phased migration was essential to minimize disruption and allow for careful validation at each stage.
Prioritizing Critical Libraries: We started with the most widely used libraries to maximize the migration impact and provide immediate benefits to the largest number of teams. For example, migrating the HTTP server library, used by 100% of our services, ensured that all feature teams would see value from the migration early on.

This step-by-step approach was structured as follows:

Phase 1 — Laying the Foundation: Migrating six key libraries.
Phase 2 — Expanding Coverage: Extending the migration to the remaining shared libraries.
Phase 3 — Addressing Team Custom Metrics: Migrating business-specific metrics.
Phase 4 — Migrating Tracing: Migrating our tracing infrastructure to OpenTelemetry.

For each of these 4 phases, we followed the same key steps: conducting a proof of concept, implementing the library, running a pilot phase (in dual-run mode), and then rolling out the changes in production.

During the project, our focus evolved to include a greater emphasis on observability governance. The move towards common metrics highlighted the need for stronger oversight of how metrics were being generated and tagged, as teams previously had complete autonomy in defining their metrics, leading to inconsistencies and inefficiencies.

By establishing clearer guidelines and best practices, we aimed to improve the quality and cost-effectiveness of our observability data.

The Benefits of this Migration

The migration to OpenTelemetry has provided a range of significant benefits, both anticipated and unexpected.

Anticipated benefits

Standardized Naming Conventions: implementing a standardized naming convention based on OpenTelemetry specifications has greatly improved metric aggregation across services. We adapted to evolving standards, emphasizing the importance of staying up to date. The resulting consistent and reliable view of application performance across all services simplified analysis, reporting, and dashboarding, fostering a more collaborative and data-driven culture.
Metric Aggregation: the standardization has significantly simplified metric aggregation, enabling the creation of generic dashboards that visualize data from all migrated services, a major step towards our goal of unified dashboards.
Unified Dashboards: we demonstrated the value of unified dashboards early in the migration, and adoption is progressing well. These dashboards, built using Datadog PowerPacks, allow teams to easily incorporate relevant parts from the unified dashboard into their existing dashboards, blending community-shared insights with their personalized views. While adoption is still in progress, early feedback and usage suggest we’re on the right track.
Leveraging Open Standards: OpenTelemetry’s open-source nature, well-documented and community-maintained framework, and adherence to industry standards have streamlined our instrumentation efforts and given us confidence in the long-term viability of our observability stack.
Vendor Decoupling: while we continue to rely on Datadog, OpenTelemetry has given us greater flexibility in choosing our telemetry backend.

Unexpected benefits

Enhanced Governance: the move to OpenTelemetry allowed us to reassess our observability practices, leading to improved dashboard reusability, reduced individual efforts, and stronger governance around data quality, cost-efficiency, and alignment with our business goals. A dedicated observability team now oversees and monitors custom metric volumes, provides guidance on tagging best practices, and proactively addresses potential cost overruns or inefficiencies.
Increased Observability Awareness: as one of the first major cross-functional observability initiatives in our company, the migration helped teams enhance their observability skills and understanding, fostering a greater appreciation for monitoring across the organization. Teams engage more with observability, as seen in migration workshops and discussions on our dedicated Slack channel. The prioritization of observability in team roadmaps further reinforces this positive trend.
Cost Optimization: the migration unexpectedly revealed inefficiencies in our custom metrics usage. By analyzing our historical tags and their cardinality, we identified and excluded unused or inefficient tags from indexing, reducing the custom metrics bill by 30%.

Focus on Billing

The OpenTelemetry migration has had a positive impact on our observability costs, albeit indirectly. The process of migrating to OpenTelemetry prompted us to take a close look at our Datadog billing and identify areas for optimization.

While the migration itself introduced some temporary cost increases due to dual-running metrics during the pilot phase, these were manageable and outweighed by the long-term benefits.

Our biggest cost savings resulted from better governance of tagging and custom metrics. By analyzing our metric usage, we identified several high-cardinality tags that were significantly increasing our Datadog costs. Removing these unnecessary tags led to a 30% reduction in our custom metrics bill.
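To show why a single high-cardinality tag matters so much, here is a small Python sketch of the kind of analysis described above. It assumes that cost grows with the number of distinct tag-value combinations per metric; the tag names and counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def tag_cardinality(series: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Number of distinct values seen for each tag key."""
    values = defaultdict(set)
    for tags in series:
        for key, value in tags.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


def series_count(series: Iterable[Dict[str, str]], dropped: Iterable[str] = ()) -> int:
    """Distinct tag combinations left once the `dropped` tags are removed."""
    dropped = set(dropped)
    return len({
        tuple(sorted((k, v) for k, v in tags.items() if k not in dropped))
        for tags in series
    })


# Hypothetical metric: 2 services x 2 status codes x 50 user ids.
SERIES: List[Dict[str, str]] = [
    {"service": service, "status": status, "user_id": str(user)}
    for service in ("ads", "search")
    for status in ("200", "500")
    for user in range(50)
]

cardinality = tag_cardinality(SERIES)
before = series_count(SERIES)
after = series_count(SERIES, dropped=["user_id"])
```

In this example, dropping the one tag with 50 values shrinks 200 distinct series to 4, which is the mechanism behind the savings even though the real numbers will differ.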

We anticipate that OpenTelemetry will enable even finer-grained control over our tagging practices in the future, allowing us to further optimize our costs. The migration has opened our eyes to the importance of proactive cost management in observability, and we are now committed to continuously monitoring our usage and exploring new optimization opportunities.

Pain Points Encountered

The migration to OpenTelemetry also presented some challenges. We encountered several technical and organizational hurdles that required careful attention and creative solutions.

Illustration of our encountered pain points

1. Technical Challenges

Aggregation Temporality: One of the most significant technical challenges we faced was related to aggregation temporality. This concept, which was new to us, refers to how metric data is aggregated over time. Through research and experimentation, we eventually identified this as the root cause of our metric loss issues and resolved it by adjusting our configuration. This experience highlighted the importance of a deep understanding of OpenTelemetry’s intricacies. We’ve detailed this issue and its resolution in the next article (it will be published very soon).
Latency Metrics: We had been using gauges for latency metrics, a practice that was not ideal. OpenTelemetry recommends using distribution metrics for latency, which we adopted. However, we initially used the default OpenTelemetry bucket configurations, which proved insufficiently granular. This led to discrepancies in average latency calculations compared to our previous (albeit flawed) approach. After experimenting with custom bucket configurations offering greater granularity, we were able to obtain latency metrics that are more accurate and meaningful. For more details on how we identified and solved this issue, you can read our third article of this series (it will be published very soon).
Collector Choice: We initially hoped to leverage Datadog Agents to send OpenTelemetry metrics from our pods. However, we ultimately opted for an open-source OpenTelemetry collector, which required additional setup and configuration that we had not planned for.
Custom tagging needs: One challenge we’re currently tackling is how to accommodate custom tagging needs that are not yet covered by the standard OpenTelemetry metrics. We’re exploring various strategies, including adding custom tags to existing metrics or creating separate metrics for specific use cases.
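
The bucket granularity issue mentioned under Latency Metrics can be illustrated with a small sketch. It assumes the backend estimates the average from bucket counts alone (using bucket midpoints) rather than from the exact sum, which is a simplification. `DEFAULT_BOUNDS` reproduces the default explicit bucket boundaries from the OpenTelemetry specification. `FINE_BOUNDS`, the sample latencies and the helper names are hypothetical and are not the configuration we ended up with.

```python
from bisect import bisect_left

# Default explicit bucket boundaries defined by the OpenTelemetry specification.
DEFAULT_BOUNDS = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]
# Hypothetical finer boundaries for latencies measured in milliseconds.
FINE_BOUNDS = [0, 5, 10, 15, 20, 25, 30, 40, 50, 60, 75, 100, 150, 200, 250, 500, 1000]


def bucketize(values, bounds):
    """Count values per bucket; bucket i covers (bounds[i-1], bounds[i]], the last one is overflow."""
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect_left(bounds, v)] += 1
    return counts


def estimated_mean(counts, bounds):
    """Estimate the mean from bucket midpoints only (assumes the exact sum is not used)."""
    acc = 0.0
    for i, count in enumerate(counts):
        if i == 0:
            mid = bounds[0]          # underflow bucket: use its upper bound
        elif i == len(bounds):
            mid = bounds[-1]         # overflow bucket: use its lower bound
        else:
            mid = (bounds[i - 1] + bounds[i]) / 2
        acc += count * mid
    return acc / sum(counts)


latencies_ms = [12, 14, 18, 22, 27, 31, 36, 44, 120, 180]  # hypothetical sample
true_mean = sum(latencies_ms) / len(latencies_ms)
default_est = estimated_mean(bucketize(latencies_ms, DEFAULT_BOUNDS), DEFAULT_BOUNDS)
fine_est = estimated_mean(bucketize(latencies_ms, FINE_BOUNDS), FINE_BOUNDS)
print(true_mean, default_est, fine_est)
```

On this sample, the estimate based on the default boundaries drifts further from the true mean than the one based on the finer boundaries. This is the kind of discrepancy that narrower buckets around the typical latency range are meant to reduce.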

2. Organizational Challenges

Securing Team Buy-in: A key challenge was convincing teams to participate in the dual-run pilot phase, which meant demonstrating the benefits of OpenTelemetry and ensuring minimal disruption to their workflows.
Organizational Overhead: Migrating 900+ microservices across a large organization with diverse teams required extensive planning and coordination, as well as buy-in from technical leadership so that teams could allocate sufficient time in their roadmaps. Prioritizing the migration work across all team roadmaps was difficult, as the effort required varied significantly based on team size, existing observability practices, and the number of dashboards and alerts to migrate.
Planning and Resource Allocation: Estimating the migration effort for each team also proved difficult due to varying levels of familiarity with observability tools and the number of dashboards/alerts requiring migration. The actual time spent often exceeded initial estimates, leading to some friction. This highlights the importance of thorough planning and potentially involving representatives from different teams in the estimation process.
Knowledge Sharing and Support: Supporting teams throughout the migration required significant effort. Communicating effectively about the migration progress, challenges, and solutions was crucial. We created extensive documentation, offered regular workshops, established a dedicated Slack channel, and held open desk sessions to help teams adapt to the new observability stack. While these initiatives were helpful, ensuring consistent knowledge sharing and addressing individual team needs remained an ongoing task.

3. Surprises and Lessons Learned

Documentation Gaps: Because OpenTelemetry is still relatively recent, some features lacked detailed documentation. While the general OpenTelemetry documentation is comprehensive, troubleshooting niche problems sometimes requires deeper investigation and community engagement.
Governance: The migration unexpectedly highlighted the need for stronger observability governance. We realized that the previous laissez-faire approach to metric generation and tagging was unsustainable. This realization, while initially adding complexity to the project, ultimately led to significant improvements in our overall observability practices and cost optimization.
Metric Deep Dive: The migration forced us to take a closer look at our metrics, their impact on our Datadog bill, and how we could optimize our usage. This resulted in a 30% reduction in our custom metrics bill through simple, low-cost actions.
OpenTelemetry Limitations: We learned that even though OpenTelemetry provides a robust standard, it does not cover 100% of every company’s specific needs. We need to plan for the possibility of having to add custom tags and potentially additional metrics to meet specific requirements.

4. What We Would Do Differently

If we could do it again, we would certainly:

Improve the Pilot Phase: Strengthening objectives and rigor would have helped identify issues like metric aggregation and other discrepancies earlier.
Enhance Tracking during Migration: Automating progress tracking would provide a clearer view of team status and required support.
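
As an illustration of what a more rigorous pilot could automate, here is a minimal sketch of a dual-run parity check. It compares per-interval values from the legacy pipeline with those from the new one and flags intervals that differ beyond a tolerance. The example data shows how a temporality mismatch (cumulative totals read as per-interval values) would surface. The function names, the tolerance and the data are hypothetical, and the sketch assumes counters that never reset.

```python
def to_deltas(cumulative):
    """Convert a cumulative counter series to per-interval deltas (assumes no counter reset)."""
    previous = [0] + cumulative[:-1]
    return [cur - prev for prev, cur in zip(previous, cumulative)]


def parity_report(legacy, candidate, tolerance=0.05):
    """Return the indexes of intervals whose relative difference exceeds `tolerance`."""
    mismatches = []
    for i, (old, new) in enumerate(zip(legacy, candidate)):
        baseline = max(abs(old), 1e-9)  # avoid division by zero on empty intervals
        if abs(new - old) / baseline > tolerance:
            mismatches.append(i)
    return mismatches


# Hypothetical data: the same traffic seen by both pipelines during the dual run.
legacy_per_interval = [100, 120, 90, 110]
otel_cumulative = [100, 220, 310, 420]

raw_mismatches = parity_report(legacy_per_interval, otel_cumulative)
fixed_mismatches = parity_report(legacy_per_interval, to_deltas(otel_cumulative))
print(raw_mismatches, fixed_mismatches)
```

Running such a comparison on a handful of key metrics for each pilot service would turn "the dashboards look different" into a list of specific intervals and metrics to investigate.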

5. Advice for Others

Our advice to other companies considering an OpenTelemetry migration would be:

Perform a comprehensive assessment of your current observability stack. Identify the types of metrics you collect and determine which ones comply with the OpenTelemetry standard. For those that don’t, consider how you will handle them.
Evaluate your teams’ observability maturity and provide training and support as needed to ensure that teams are comfortable with the tools and concepts involved in the migration.
Plan for the unexpected. Be prepared to address both technical and organizational challenges.

What’s Next

Looking ahead, we envision a future where observability is deeply ingrained in our development and operations practices, empowering teams to build and maintain highly reliable and performant services. Our OpenTelemetry migration is a key stepping stone towards this vision.

While migrating logs to OpenTelemetry is not an immediate priority due to the current volume and complexity of our log data, we recognize the importance of this step in the long term. We plan to address this by implementing stronger governance around logging practices, aiming to reduce log volume and improve data quality before considering a migration.

In the near term, we will focus on migrating our tracing infrastructure to OpenTelemetry. This will provide valuable insights into request flows and dependencies, further enhancing our ability to understand and optimize application performance.

We anticipate that the increased visibility and standardization offered by OpenTelemetry will spark greater interest in observability across the organization. As more teams embrace observability, we expect to see the demand for new tools, dashboards, and alerting capabilities increase. This will further strengthen our overall observability posture and enable us to proactively identify and address potential issues before they impact our users.

Ultimately, we believe that OpenTelemetry will play a crucial role in:

Improving the reliability and performance of our services: by providing teams with the insights they need to understand and optimize their applications.
Simplifying incident management: by enabling faster identification and resolution of issues.
Reducing the cost of observability: by optimizing our tooling and practices.
Improving control over our data: by making us take ownership of observability governance, an aspect we had previously neglected.
Enhancing collaboration: by fostering a shared understanding of application performance across teams.

Our migration to OpenTelemetry has been crucial in building a more standardized, efficient, and scalable observability framework. We are committed to continuously evolving our observability practices and leveraging the full potential of OpenTelemetry to maximize its benefits.

This article provides a general overview of our migration to OpenTelemetry. In the next two articles, we will delve deeper into specific technical challenges we encountered and how we resolved them.

Inside Our Journey to OpenTelemetry: Observability at Scale (1/3) was originally published in leboncoin tech on Medium.