Illustration by Gaëlle Malenfant, Doctolib

Monday, July 12 at Doctolib: a Retrospective

Nicolas Martignole
Doctolib
Sep 2, 2021 · 10 min read


How did the Doctolib site handle millions of connections following a critical press conference on TV? Nicolas Martignole, Principal Engineer at Doctolib, takes you into the virtual crisis room and offers a behind-the-scenes look at a high-traffic site.

Like many people in France, on Monday July 12 at 20:05 I tuned in for the government announcement about the new vaccination regulations. At Doctolib, we were not aware of the contents of this particular announcement. As it turned out, something was coming our way.

With Doctolib, people can conveniently choose time slots for their medical appointments, such as vaccination appointments. There's a complex system at the heart of our application, and its reliability was put to the test by a real-life incident in mid-July.

When Doctolib CEO Stanislas Niox-Chateau appears on TV, our technical teams are informed ahead of time, and we have a small team on-call. I took this screenshot of our NewRelic dashboard on April 28, 2021. Stan was being interviewed about the vaccination campaign on the 8 o’clock news. As you can see, this results in an influx of traffic to the Doctolib site. But that’s nothing we can’t handle. When we are aware of a media event, we spin up some more servers, and call it a day.

When your CEO is on TV

You can think of Doctolib’s architecture as being elastic. It’s perfectly capable of reacting to peaks in load. The system automatically starts or stops cohorts of servers. We know, for example, that nights and weekends are less busy. During these periods, the site runs with fewer servers. Monday morning is typically the busiest time. The site then runs with 4 times more servers for a few hours.

While it only takes a few minutes to go from 200 to 2500 servers — enough to absorb several million visitors without any issues — the system is sometimes slow to react. This is why we try to anticipate peaks in traffic.
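To make the idea concrete, here's a toy sketch of what such a scaling rule boils down to. Only the 200 and 2,500 bounds come from the paragraph above; the per-server capacity figure is invented for the example and is not our real configuration.

```ruby
# Toy illustration of load-based scaling. REQUESTS_PER_SERVER is a made-up
# capacity figure; only the 200/2500 bounds come from the text above.
MIN_SERVERS = 200
MAX_SERVERS = 2_500
REQUESTS_PER_SERVER = 5_000 # hypothetical requests/minute one app server can absorb

def desired_server_count(requests_per_minute)
  needed = (requests_per_minute / REQUESTS_PER_SERVER.to_f).ceil
  needed.clamp(MIN_SERVERS, MAX_SERVERS)
end

desired_server_count(500_000)    # => 200  (a quiet night, stay at the floor)
desired_server_count(10_300_000) # => 2060 (a July 12-sized peak)
```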

20:13

Back to the evening of July 12. At 20:13 I received an SMS alert, as did every team member who had signed up for the crisis task force: as soon as an anomaly is detected, you receive a text message. As one of the Principal Engineers available at that time, I decided to log onto my computer and join the virtual crisis room on Google Meet. There were 4 or 5 of us, and no one seemed all too worried.

Among the early responders were Simoné Veronese and Florian Philippon, two engineers on the Site Reliability Engineering team. They're responsible for keeping our application running smoothly.

Crisis management at Doctolib is simple: the Incident Manager is responsible for coordinating the teams and making decisions, and the Communication Manager is responsible for informing non-technical teams (like Sales & Operations). Any Doctolib employee can join the virtual crisis room, either to take an active role or to observe and learn. I volunteered to document all the discussions and decisions, which would later prove useful for our internal retrospective.

And, as it turns out, for this article.

20:17

The first log lines of incident #1192 start coming in at 20:17. Using observability tools like NewRelic and Datadog, we identify that the traffic spikes are concentrated on the vaccination part of the Doctolib website. We go from about 500,000 to 2.1 million visitors in 15 minutes. We're sustaining a volume 3 times greater than when our CEO appeared on the 8 o'clock news!

The majority of requests come from mobile phones. It’s easy to imagine people looking for a vaccination slot while still sitting in front of the telly. If we don’t react, the volume of visitors will soon saturate the site.

The blue curve in the graph above shows the number of transactions that evening. It reached more than 10 million transactions per minute several times.

As a comparison, the green dotted line shows the Monday evening from the week before. One can clearly see that "something" is happening, and our system is able to detect it and alert the SRE team. It triggered an alert at 20:04. Florian decided to declare a crisis, which allowed him to gather a team and get more help if needed. This is how I received the SMS at 20:13.

The graph below shows a regular Monday, with 1.89 million transactions per minute. On Monday, July 12, we logged peaks of 10.3 million transactions per minute, more than a 5-fold increase over the week before.
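The alert rule itself lives in our monitoring tools, but conceptually it comes down to comparing the current throughput with the same moment a week earlier. A rough, purely illustrative sketch:

```ruby
# Conceptual week-over-week anomaly check; the real rule lives in our
# monitoring tools, and the 3x factor here is only an example.
ALERT_FACTOR = 3.0

def traffic_anomaly?(current_tpm, same_time_last_week_tpm)
  current_tpm > same_time_last_week_tpm * ALERT_FACTOR
end

traffic_anomaly?(10_300_000, 1_890_000) # => true: page the SRE team
```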

The SRE team, led by Simoné and Florian, first applied procedures battle-tested when vaccination centers in Germany drove large amounts of traffic to our site. Florian announced he'd force an increase in the number of servers, without waiting for the auto-scaler to kick in.

Fortunately our monolithic application (Ruby on Rails) is very robust and scales within minutes.

After a few minutes we saw that the measures were taking effect. But with more visitors still coming in, it would not be enough. Imagine a supermarket: you've just opened the doors and customers are rushing in. You make sure that everyone finds the items they need, but now people are queuing at the checkout and there's a bottleneck again. That's exactly the problem we would face a few minutes later. We didn't think of activating a cache on appointment availability, which would have taken the pressure off the read databases that were continuously fetching real-time availability. In our defense: we had bigger fish to fry. And except for the (steep) hosting bill for the evening, this oversight had very little lasting impact.

After increasing the number of application servers, we noticed that the database required more and more resources. At Doctolib we use PostgreSQL, a relational database. Except that Doctolib is not a "classic" client: we use a system of replicated databases in the cloud. Our provider AWS offers the Aurora technology, which makes it possible to copy data thousands of times and spread cohorts of visitors across a cluster of servers.

We went from 6 to 14 Aurora instances of type db.r5.24xlarge. Each instance has 48 cores (96 vCPUs) and 768 GB of RAM, which adds up to 1,344 vCPUs to correctly handle the write traffic of appointment bookings. In case you were wondering, all data is encrypted end to end, at rest and in transit. We use double encryption, which ensures that only the patient and the practitioner can share their information. This is a requirement for hosting health data.
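I won't detail here how reads are routed to those replicas, but for the curious, Rails 6+ ships a standard read/write splitting API. A minimal sketch, with hypothetical database and model names rather than our actual setup:

```ruby
# Minimal Rails 6+ read/write splitting, assuming a "primary" writer and a
# "primary_replica" reader are declared in config/database.yml.
# Database and model names are hypothetical, not Doctolib's actual schema.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Heavy read-only work, like availability searches, can be pinned to the
# Aurora readers explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Availability.where(visit_motive: "vaccination-covid").count
end
```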

We could have used fewer machines by activating a cache that refreshes appointment availability every 30 seconds. But alas, we forgot. The moral of this story is to document important procedures!

A positive outcome is that every user of the site benefitted from the most accurate vaccination slot reservations, because all we had to go on was real-time availability.

As a result of this evening, David Gageot, our Chief Architect, decided to take care of configuring a cache in case of future issues.
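A minimal sketch of such a cache, using Rails' standard low-level cache and hypothetical model names: availabilities would be recomputed at most every 30 seconds per vaccination center, at the cost of a freshly booked slot possibly showing as free for up to 30 seconds.

```ruby
# Sketch of a 30-second availability cache using Rails' low-level cache.
# Availability and center_id are hypothetical names, for illustration only.
def cached_availabilities(center_id)
  Rails.cache.fetch("availabilities/center/#{center_id}", expires_in: 30.seconds) do
    Availability.upcoming_for(center_id).to_a # recomputed at most every 30s per center
  end
end
```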

To understand why this is important: Doctolib allows you to choose your time slot for vaccination, based on real-time availability. While this is a huge advantage for you, it's very complicated to architect correctly. The system could have been written in a way that said "get in line on Monday at 8:30 and you get the next available slot". However, this is not the service we offer to our users: we work with specific time slots, which adds a layer of complexity to our architecture. Booking the 2 injection slots requires pre-booking the first slot for a few moments, while you take the time to find the 2nd vaccination slot.
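You can picture that pre-booking as a short-lived hold on the slot. A rough sketch, with hypothetical column names, duration and error class rather than our actual schema:

```ruby
# Rough sketch of holding the 1st slot while the patient picks the 2nd one.
# Column names, the 5-minute duration and the error class are hypothetical.
class SlotUnavailableError < StandardError; end

class Slot < ApplicationRecord
  HOLD_DURATION = 5.minutes

  def hold_for!(patient)
    with_lock do # row-level lock: two patients can't hold the same slot
      raise SlotUnavailableError if held?
      update!(held_by_id: patient.id, hold_expires_at: HOLD_DURATION.from_now)
    end
  end

  def held?
    hold_expires_at.present? && hold_expires_at.future? # hold simply lapses if never confirmed
  end
end
```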

20:20

At 20:20 there are way too many people on the site. We need to sharply reduce the number of new visitors. The team decides to activate a waiting room on our CDN. This kind of system is usually employed for ticket sales for the Olympic Games or the World Cup! We had installed it for Doctolib Germany a few weeks prior, in order to handle critical increases in traffic.

20:22

One small problem: we don't yet know what the right setting is, because what we're experiencing is truly unique. A first threshold is set at 800,000 concurrent users on the site. We won't know whether the restriction works until a few minutes later.
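The waiting room itself runs at the CDN edge, so there is nothing to change in the Rails application. Conceptually, though, it behaves like a gate in front of the site. A very simplified, illustrative sketch (counter maintenance and the actual queueing are left out):

```ruby
# Conceptual Rack-style sketch of a waiting room: admit visitors while the
# active count is under the threshold, park the rest. The real waiting room
# runs on the CDN, not in Rails; names and details here are illustrative.
class WaitingRoom
  def initialize(app, redis:, limit: 800_000)
    @app = app
    @redis = redis
    @limit = limit
  end

  def call(env)
    if @redis.get("active_visitors").to_i < @limit
      @app.call(env) # under the threshold: let the visitor through
    else
      [503, { "Retry-After" => "30" }, ["You are in the waiting room, please hold on..."]]
    end
  end
end
```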

From our monitoring tools we can see that people are spending more time on the site than a typical visitor would. On the database side, write volumes are soaring: visitors are registering family members who are not yet on the platform. We learned the next day that the vast majority of the appointments were made by people under 30 years old.

20:26

Meanwhile, we see that the situation is not calming down. At about 20:26 the number of confirmed appointments exceeds the ceiling of June 26, which served as our benchmark. The team decides to further reduce the number of concurrent visitors to 400,000. We figure that when users can complete their search without experiencing lag, they'll finish their session quickly, making room for other visitors.

From the notes I gather that Simoné confirmed we went from 8,000 to 26,000 transactions per second, because people could finish their reservations.

20:35

Philippe Vimard, our CTO, joins the virtual crisis room. At 20:35 he counts 18,000 reservations per minute. We're at 75% of the write-database load, so there is still some margin. He gives us carte blanche and recommends that we leave as many servers as possible running, in anticipation of the rest of the evening.

At 20:36 we think the heavy lifting is done, and we suggest switching to a medium crisis status. This allows us to free up resources.

20:53

At about 20:53 we have a system that is saturated yet no longer returning errors, which (ironically) is worrying. When we investigated the next day, it turned out that a partner thought they had suffered an attack and mitigated the traffic for a period of time.

Nothing happens as expected and the traffic remains very steady for a long half hour. We remain in high crisis status, which is quite exceptional at Doctolib.

The traffic then seems to suddenly calm down for a few minutes. It is eerily quiet, a bit too quiet for our liking. Someone pipes up to say that the President's speech has ended and the commercial break has started. And we know what will happen after the break: people will come back!

It’s the eye of the storm and we’re in the middle of it!

21:07

And indeed: as soon as the commercials end, we see another peak in traffic, and many (many!) new reservations a few minutes later. Simoné records a peak of 30,000 requests per second at 21:07.

21:25

Around 21:25, Eric from the Security team checks in. He informs us that the site suffered a small but deliberate attack from a French IP address. The Security team successfully mitigated it without interrupting our service, which just goes to show that such attacks are sadly quite common.

For the next 2 hours, the team continues to monitor activity and adjust the waiting room settings, but traffic doesn't really normalize until 23:30. Anticipating that traffic will spill over into Tuesday, the SRE team leaves the extra servers running. A costly, but no doubt necessary, decision.

21:50

We've managed so far, but we anticipate that the SMS sending platforms will still need several hours to plough through the millions of text messages. Nicolas de Nayer, VP of Engineering, Léo Lanzarotti, Product Director, and David Gageot take over until 1 am. Their task is to prepare our systems for what's bound to be the busiest Tuesday we've ever seen.

The numbers don’t disappoint: 926,000 appointments made in 2.5 hours on Monday evening, and another 1.4 million on Tuesday.

A unique experience

Florian and Simoné made most of the critical decisions, with the help of 14 other Doctolibers (including myself). There was a lot of excitement, naturally; any individual action could affect the workings of Doctolib. Our leaders were very supportive, but ultimately it was the level of autonomy our tech teams enjoy that made everyone confident enough to act.

There are over 300 people on the Tech & Product side of Doctolib, out of the company’s 1,900 employees. I joined Doctolib in March 2021 and what a ride it has been so far! My hope is that by writing more about what happens behind the scenes, you will consider joining us as well!

P.S.: Thanks a lot to Floor Drees and Charlotte Feather for their help with the English translation. The original article in French was also published here. Thanks to Alexandre Ignjatovic for his help and his ideas.

If you want more technical news, follow our journey through our docto-tech-life newsletter.

And if you want to join us in scaling a high-traffic website and transforming the healthcare system, we are hiring talented developers to grow our tech and product team in France and Germany; feel free to have a look at the open positions.

Nicolas Martignole is a Principal Engineer at Doctolib, co-creator of the Devoxx France conference, and author of the blog "le Touilleur Express".