Fake it till you make it… at scale

Published in Adevinta Tech Blog · Sep 29, 2020

By Tangi Vass, Senior Backend Developer

Adaptability, reliability and performance: we want them all!

Startups extensively use a strategy called “Fake it till you make it” to test the interest from potential users of their products while still in the early stages of building them. This is an effective way to go agile. By agile, I mean the repetition of two steps. First, get real user feedback before significantly investing in product development. Then, leverage this feedback to make significant changes to the initial plan without jeopardising the product’s consistency.

Adevinta provides centralised services for its marketplaces all over the world, including leboncoin in France and Subito in Italy. Our services need to be reliable and scalable from day one.

Achieving adaptability, reliability and performance at the same time is quite challenging. So how do we tackle these objectives without compromise?

Case study: Moderation at scale

We recently rolled out a manual moderation service. Millions of classified ads are created on Adevinta’s marketplaces daily! Most ads are automatically moderated but some need manual moderation. By “some”, I mean tens of thousands. Considering its cost, this process needs to be highly effective. So UX should lead the game, right?

But wait! Each marketplace has its own legacy system, mature and full featured, which cannot be replaced even transitionally by a less mature solution. This rules out usual MVP approaches.

Hmm, wait again! Considering a single ad in isolation is error-prone. Moderators need context to do their job well: past ads from the user, metrics, past moderation actions, etc. That’s billions of content items and events. A system at that scale cannot be crafted on the fly. So the tech team should lead the game, right?

That’s where we hit the antagonism of adaptability, reliability and performance objectives.

“Fake it till you make it” to the rescue

A bottom-up approach, from the back-end to the UI, fully implementing the features one after another, would miss feedback loop opportunities at all layers.

So we decided to start with UX mock-ups and progressively pushed the mock frontier down the layers:

  1. UX mock-ups
  2. Unmocked front-end, mocked API gateway
  3. Unmocked API gateway, mocked API (using faker)
  4. Unmocked API, mocked service layer
  5. Unmocked service layer, mocked data access
  6. Partially unmocked data access (real ids but fake attributes to support a consistent navigation)
  7. Full implementation

Along the way, the UX got more and more refined and stable. The extra cost of the temporary mock layers was negligible compared to the savings we made from not having to implement the early changes across the whole stack.
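
To give a flavour of steps 3 and 6, here is a minimal sketch (not our actual code: the model and route are invented for the example) of a partially mocked FastAPI endpoint. The ad id is real, but its attributes are generated with Faker, seeded with the id so that navigation stays consistent across requests.

    # Hypothetical sketch of a partially mocked endpoint: real ids, fake attributes.
    from faker import Faker
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    fake = Faker()

    class Ad(BaseModel):
        id: int
        title: str
        body: str
        user_email: str

    @app.get("/ads/{ad_id}", response_model=Ad)
    async def get_ad(ad_id: int) -> Ad:
        # Seeding with the id keeps the fake data stable across requests,
        # so the front-end can navigate a consistent catalogue.
        Faker.seed(ad_id)
        return Ad(
            id=ad_id,
            title=fake.sentence(nb_words=4),
            body=fake.paragraph(),
            user_email=fake.email(),
        )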

This is how we met the adaptability objective.

(Notre-Dame, virtually refactored into a movie set, from the Apparences movie by menilmonde.com)

Scaling the Fake It approach

The “Fake it till you make it” approach bought us some time to carefully design and incrementally optimise the back-end layers that we would plug only when they were good enough to replace the corresponding mocks. It also provided us with very precise use cases in a timely manner.

The team had experience building and running a similar system. The lessons they learned shaped this design (which was validated with a Proof of Concept):

  • a database (RDS PostgreSQL for its performance and support for transactions, Jsonb data types and resizing facilities without downtime)
  • a consumer module receiving incoming content and storing it into the database
  • an API module exposing the endpoints used by the API gateway
  • a replication module publishing the result of the manual moderation and forwarding changes to ElasticSearch
  • a task worker module processing bulk actions asynchronously
  • a search engine (ElasticSearch)

Each module is independently scalable and the whole system is designed to be highly available (with external retry facilities) and to avoid bottlenecks, especially locks at database level, since this is the only part that is not horizontally scalable.
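
As an illustration of the consumer module, here is a much-simplified sketch (the transport, table and column names are invented, as the article doesn’t detail them): it receives a content event and upserts it into PostgreSQL in a single statement, so external retries stay idempotent and locks stay short-lived.

    # Hypothetical consumer loop: store incoming content idempotently.
    # "events" stands in for whatever stream or queue feeds the module;
    # event["payload"] is assumed to be a JSON string.
    import asyncpg

    UPSERT = """
        INSERT INTO ads (id, payload)
        VALUES ($1, $2::jsonb)
        ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload
    """

    async def consume(events, pool: asyncpg.Pool) -> None:
        async for event in events:
            async with pool.acquire() as conn:
                # Single-statement upsert: no explicit transaction needed.
                await conn.execute(UPSERT, event["id"], event["payload"])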

This nevertheless results in a complex architecture, so we decided to follow a KISS strategy (“Keep It Simple, Stupid!”) and to start without ElasticSearch or the asynchronous task worker. PostgreSQL has decent Full-Text Search capabilities and we could run tasks synchronously within the API module. It actually turned out to be a case of “You Ain’t Gonna Need It” (YAGNI), as we still don’t feel the need for some of the technologies we had planned upfront.
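
For instance, the kind of full-text query that ElasticSearch would have handled can be expressed directly in PostgreSQL. A hedged sketch, with an illustrative schema:

    # Hypothetical full-text search on ads, using PostgreSQL only.
    import asyncpg

    SEARCH = """
        SELECT id, title
        FROM ads
        WHERE to_tsvector('simple', title || ' ' || body)
              @@ plainto_tsquery('simple', $1)
        ORDER BY ts_rank(to_tsvector('simple', title || ' ' || body),
                         plainto_tsquery('simple', $1)) DESC
        LIMIT 20
    """

    async def search_ads(pool: asyncpg.Pool, terms: str):
        # A GIN index on the same expression keeps this fast:
        # CREATE INDEX ON ads USING gin (to_tsvector('simple', title || ' ' || body));
        return await pool.fetch(SEARCH, terms)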

We reinvested the time saved from not introducing non-essential components in carefully crafting the critical parts of the system.

Reliability is NOT ONLY about scalability

As High Availability is mandatory, we run our applications on a Kubernetes cluster. Everything is scalable, redundant and monitored. We do continuous delivery without downtime, which brings a high baseline for uptime. But uptime doesn’t ensure reliability. So how did we ensure reliability?

First, we wanted to prevent our application from breaking on a corner case:

  • We needed an expressive programming language: there can be no bugs in the nuts-and-bolts code we don’t have to write! We chose Python with FastAPI and Pydantic.
  • We let the compiler / linter help us write reliable code: we systematically used static typing or type hints, and pre-commit hooks (see the sketch below)!
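
A tiny, purely illustrative example of what this buys us: with type hints and Pydantic, malformed input is rejected at the boundary with explicit errors, and mypy flags type errors before the code even runs.

    # Illustrative only: Pydantic validates data at runtime,
    # mypy checks the type hints before commit (via pre-commit hooks).
    from typing import Optional

    from pydantic import BaseModel, ValidationError

    class ModerationDecision(BaseModel):
        ad_id: int
        accepted: bool
        reason: Optional[str] = None

    def handle_decision(decision: ModerationDecision) -> None:
        ...  # mypy would reject handle_decision("not a decision")

    try:
        ModerationDecision(ad_id="oops", accepted="maybe")
    except ValidationError as exc:
        print(exc)  # both fields are rejected with explicit error messages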

Second, remember that we were building a complex system. Ever heard of Gall’s law? It states that “A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.” The consequence is that we needed to start simple and grow the system while ensuring all the way that everything was working properly. “All the way” implied continuous integration, “everything is working properly” implied a good test pyramid.

The third — and often overlooked — point is that whatever care we give to our code and tests, our applications have limited resources to run on. When they’re overloaded, the latency of every single request increases dramatically — unless back-pressure is built in.

Back-pressure is the ability of an application to gracefully reject requests in excess of what it can process with acceptable performance. The application then behaves like a weeble rather than like a series of dominoes.

In our design, the application layer is highly scalable with no contention, except that the data access layer may hit the limits of the RDS database (which currently cannot scale horizontally). We prevented this from happening using two timeouts: asyncpg’s client-side command_timeout and its server_settings parameter (used to set PostgreSQL’s server-side statement_timeout), the former being a bit greater than the latter.
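
Concretely, both timeouts live in the connection pool configuration. A sketch with illustrative values and DSN:

    # Back-pressure at the database boundary (values are illustrative).
    import asyncpg

    async def create_pool() -> asyncpg.Pool:
        return await asyncpg.create_pool(
            dsn="postgresql://user:password@host/moderation",
            max_size=20,
            # The server cancels any statement running longer than 2 seconds...
            server_settings={"statement_timeout": "2000"},
            # ...and the client gives up slightly later, so the server-side
            # limit is normally the one that fires.
            command_timeout=3,
        )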

This is how we met the reliability objective.

The technical stack

At Adevinta, each team is free to select the best stack for the job. As we don’t hire developers based on specific technological skills, we have very diverse technical backgrounds within the team.

The front-end developers quickly agreed on React, GraphQL and Cypress.

On the back-end side, we found a satisfying trade-off to the eternal dilemma between speed of development with Python and maintainability with Java:

  • Python with FastAPI / Pydantic for a good mix of speed of development, performance and reliability. FastAPI is fast and asynchronous, being built upon Starlette and Uvicorn. It leverages Pydantic to bring native serialisation/deserialisation of models with automated validation, eliminates most nuts-and-bolts code through default behaviours and annotations, and brings Swagger for free. Pydantic leverages type hints (PEP 484) to bring some static typing capabilities.
  • Asyncpg is a high performance PostgreSQL client library with connection pooling, prepared statements and contextual transactions. It proved to be a very effective alternative to aiopg/psycopg, which didn’t support transactions or prepared statements. We combined it with buildpg to keep our SQL code readable (see the sketch after this list).
  • JsonAPI breaks the closed approach of REST with regard to resources and allows related data to be included, so a complete dataset can be fetched in a single request, reducing the number of calls to the API. It also makes the API normalised, explorable and self-documented. Integration with GraphQL is seamless.
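
To show what “readable SQL” means in practice, here is a small sketch assuming buildpg’s render helper (table and column names are invented): named placeholders are turned into the positional arguments asyncpg expects.

    # Illustrative: named placeholders via buildpg, executed with asyncpg.
    import asyncpg
    from buildpg import render

    async def ads_for_user(pool: asyncpg.Pool, user_id: int, limit: int = 50):
        query, args = render(
            """
            SELECT id, payload
            FROM ads
            WHERE user_id = :user_id
            ORDER BY created_at DESC
            LIMIT :limit
            """,
            user_id=user_id,
            limit=limit,
        )
        return await pool.fetch(query, *args)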

Performance is NOT an afterthought

There’s a famous saying from Donald Knuth: “Premature optimisation is the root of all evil”.

We’ve seen how Gall’s law makes continuous integration compulsory. It’s the same with performance. You cannot build a high performance system without controlling performance throughout the building process and probably changing some data models or algorithms along the way to eliminate the bottlenecks. Some changes may be structural, perhaps even affecting the architecture itself, and would be much more difficult to make later on. Building an MVP and trying to make it fast and scalable as a second step doesn’t work.

So how do we do early — yet not premature — performance optimisation? An optimisation is premature when it is performed before any pain is felt. We don’t optimise processing for the sake of it. We make the real bottlenecks visible and remove them swiftly. That’s continuous performance improvement.

First, we need data, and a significant volume of it, to make measurements relevant. We used an existing stream of test events to feed our integration environment and build up a quickly growing volume. We also sampled it to produce a small data set for the CI test suite.

Then, as “you cannot manage what you cannot measure”, we need performance measurements.

The first tool we found particularly useful was a synthetic dashboard with graphs of all the relevant metrics to make issues visible.

(A part of the Datadog dashboard for the back-end of this application)

The second tool we used, Zipkin, enabled us to zoom in. It records the execution time of every API request and of the subsequent SQL queries, with their arguments, providing a detailed breakdown and the ability to reproduce them.

(Example of a Zipkin trace)

Another useful tool for identifying CPU or I/O intensive queries is PostgreSQL’s pg_stat_statements built-in extension.

(Query on the pg_stat_statements view)
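
A query along these lines (column names are the pre-PostgreSQL-13 ones; recent versions expose total_exec_time and mean_exec_time instead) surfaces the most expensive statements:

    # Top statements by total time spent in the database.
    TOP_STATEMENTS = """
        SELECT calls,
               round(total_time::numeric, 1) AS total_ms,
               round(mean_time::numeric, 1)  AS mean_ms,
               left(query, 80)               AS query
        FROM pg_stat_statements
        ORDER BY total_time DESC
        LIMIT 10
    """

    async def top_statements(pool):
        return await pool.fetch(TOP_STATEMENTS)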

As a result, we identified, without much surprise, the data access layer as the main source of bottlenecks, more specifically some SQL queries. By analysing their execution plans, we were able to optimise some of these queries by several orders of magnitude. For others, computational complexity was the problem and pre-processing the solution.

Performance is NOT raw speed

You might be puzzled by our choice of Python for a project with high performance requirements, or surprised by the fact Python was not the bottleneck here. Python is indeed a very slow language, but speed is rarely what we need at the service layer. Think of Python’s success for Machine Learning applications, which run complex processing on large chunks of data. Why would these applications use one of the slowest languages on Earth? They actually use Python only for orchestration where speed of development prevails and rely on C libraries for resource intensive processing.

We aren’t doing things any differently. We use Python for the conditional logic and data mapping in and out. There’s no need for raw speed here, although we get the best out of Python via asyncio/asyncpg. The performance comes both from the design and from very focused optimisations.

The system we were building would have to handle a large number of concurrent requests. We knew from past experience that database transactions are dangerous in such a context because of the locks they take. We avoided relying extensively on client-side database transactions in order to limit database locks and waits. SQL data-modifying CTEs (Common Table Expressions) proved to be very effective, with an implicit transaction fully restricted to the database scope and to the lifespan of a single query.
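
As an example of the pattern (the schema is invented for the sketch), a single statement can record a moderation decision and update the corresponding ad atomically, without any client-side BEGIN/COMMIT:

    # One round-trip, one implicit transaction, no client-side transaction management.
    APPLY_DECISION = """
        WITH decision AS (
            INSERT INTO moderation_actions (ad_id, accepted, moderator)
            VALUES ($1, $2, $3)
            RETURNING ad_id, accepted
        )
        UPDATE ads
        SET status = CASE WHEN decision.accepted THEN 'live' ELSE 'rejected' END
        FROM decision
        WHERE ads.id = decision.ad_id
    """

    async def apply_decision(pool, ad_id: int, accepted: bool, moderator: str) -> None:
        await pool.execute(APPLY_DECISION, ad_id, accepted, moderator)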

Doing costly computations close to the data may be several orders of magnitude faster than doing them in the service layer, whatever the language. PostgreSQL provides everything we may need to support this, but we still have to check the execution plan (produced by the explain command) of every single query to ensure data is accessed in the most effective way.

Combining an extensive use of SQL CTEs with PostgreSQL’s native support for the Jsonb type (fast jsonb operators, indexes with jsonb_path_ops), we achieved extreme speed with no major bottleneck.
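
For the Jsonb part, the relevant pieces are a GIN index created with the jsonb_path_ops operator class and the containment operator @> that it accelerates. An illustrative sketch (schema invented for the example):

    # jsonb containment query served by a jsonb_path_ops GIN index.
    import json

    CREATE_INDEX = """
        CREATE INDEX IF NOT EXISTS ads_payload_path_ops
        ON ads USING gin (payload jsonb_path_ops)
    """

    FIND_BY_ATTRIBUTES = """
        SELECT id
        FROM ads
        WHERE payload @> $1::jsonb   -- e.g. '{"category": "cars"}'
    """

    async def find_by_attributes(pool, attributes: dict):
        # EXPLAIN on this query should show a bitmap scan on the GIN index.
        return await pool.fetch(FIND_BY_ATTRIBUTES, json.dumps(attributes))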

This is how we met the performance objective.

Conclusion

In this article, we saw — through a real case — how to build a high performance system in an agile way, by progressively moving the mocked layer down the stack and using a continuous performance improvement strategy.

Is that the end of the story? Not really. When we started to store and retrieve the historical records of the content we moderated, we identified lots of new opportunities for our other applications. We’re already building a dedicated system to manage this concern in a more scalable way, but this will be another story… So stay tuned!
