How we send millions of messages per day at Doctolib
For most Doctolib users — whether in France, Germany or Italy — the above message is a familiar sight. This unassuming piece of text encapsulates the feature that started Doctolib a little less than 10 years ago. Today, it empowers hundreds of thousands of medical professionals and millions of patients, allowing them to connect and meet every day.
The message in question is simply known as an appointment reminder, one of the many reminders sent daily as part of our approximately 8.5 million daily transactions (a number that is constantly growing).
That said, let’s talk about how messages are managed at Doctolib!
What is a message?
Think of a message as any communication sent from Doctolib directly to its users. A message can travel through any of three channels: SMS, email, and push notification.
This work happens courtesy of our Delivery System, which passes each message to one of our providers; the provider then relays it to our end users (medical professionals and patients).
Alongside common fields such as content, recipient and sender, the Delivery System requires a few other fields to perform well. For us it’s not just about delivering the right message to the right person; it’s also about selecting the right provider to achieve that.
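To make this concrete, here is a minimal sketch of what such a message could look like; the field names are hypothetical and simply illustrate the routing information that travels alongside the content:

```ruby
# Hypothetical sketch of a message payload, not Doctolib's actual schema.
# The routing fields are what the Delivery System uses to pick a provider.
Message = Struct.new(
  :content,    # the body of the message
  :recipient,  # phone number, email address or device token
  :sender,     # e.g. "Doctolib"
  :channel,    # :sms, :email or :push
  :country,    # :fr, :de or :it
  :capability, # e.g. :text, :voice or :secure
  :priority,   # :high or :low
  keyword_init: true
)
```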
What is a provider?
While we create our messages internally, getting them to their respective users is done by third-party services, or “providers”, as we like to call them. Simply put, we send our messages to the selected providers, which then send them to the users.
For push notifications, providers on the market include OneSignal and Firebase; for emails, Sendgrid and Mailjet, among others.
How do you link a message to a provider?
Three fields come into play when linking a message to a provider:
- Channel: SMS, push notification or email.
- Country: A provider can be used for one country but not for another; financial and legislative concerns play a role in this.
- Target capability: Two providers for the same channel can support different subtypes of message. For example, we can send text or voice SMS, and we need more than a single provider able to handle each.
For each channel we rely on several providers, some of which are only used for a specific country or target capability. As communications are crucial at Doctolib, we chose not to put all our eggs in one basket, which is why we rely on several providers per channel.
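As an illustration, provider selection can be pictured as a filter over a registry. The provider names and data shapes below are invented for the example:

```ruby
# Illustrative registry: each provider declares what it can handle.
PROVIDERS = [
  { name: "sms_provider_a",   channel: :sms,   countries: [:fr, :de, :it], capabilities: [:text] },
  { name: "sms_provider_b",   channel: :sms,   countries: [:fr],           capabilities: [:text, :voice] },
  { name: "email_provider_a", channel: :email, countries: [:fr, :it],      capabilities: [:secure] }
]

# Return every provider able to handle a given message.
def eligible_providers(message)
  PROVIDERS.select do |provider|
    provider[:channel] == message.channel &&
      provider[:countries].include?(message.country) &&
      provider[:capabilities].include?(message.capability)
  end
end
```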
Another important facet of the system’s internal logic is being able to differentiate between urgent messages and less time-sensitive ones. Much like other modern platforms, we make use of OTPs (One-Time Passwords), which have a very short lifespan and, consequently, must be relayed immediately. Reminders, on the other hand, are sent when available resources free up.
As such, we have two priorities for communications within Doctolib:
- HIGH: synchronous messages, mainly used for OTPs.
- LOW: asynchronous messages.
For example, let’s say you want to send an OTP-related email to a user in France. Here’s what it could look like:
- Priority: HIGH
- Country: FR
- Target capability: SECURE
- Channel: EMAIL
The Delivery System will select one of the providers that fulfills the above requirements. Finally, because it is a HIGH priority message, it will be sent on the spot.
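Expressed with the hypothetical sketches above, that OTP email could be built and matched to a provider like this:

```ruby
# The OTP example from above, using the hypothetical Message struct.
otp_email = Message.new(
  content:    "Your one-time password is 123456",
  recipient:  "patient@example.com",
  sender:     "Doctolib",
  channel:    :email,
  country:    :fr,
  capability: :secure,
  priority:   :high
)

provider = eligible_providers(otp_email).first
# priority is :high, so the message is sent on the spot rather than queued
```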
High-level architecture diagram
At Doctolib, our backend is built with Ruby on Rails and split into engines, one for each subdomain but also for wider infrastructure components.
Our Delivery System is one of these infrastructure components, and it exposes a public API.
That public API can be called by the different feature teams from their Rails controllers or from their Rails scheduled jobs. One endpoint is exposed for each channel.
Depending on the priority the feature team sets when calling our API, our engine computes whether to send the message synchronously or to queue it and let one of the channel-specific scheduled jobs send it later.
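Here is a hedged sketch of that decision; send_now stands in for the synchronous provider call, and QueuedMessage is the table the scheduled jobs read from:

```ruby
# Sketch of the priority-based dispatch, not our production code.
def send_now(message)
  # placeholder for the synchronous call to the selected provider
end

def dispatch(message)
  if message.priority == :high
    send_now(message) # synchronous: delivered within the current request
  else
    QueuedMessage.create!(message.to_h) # picked up later by a channel-specific job
  end
end
```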
Regardless of the selected priority, all messages must successfully make it through the DeliverySystem service, which not only ensures that the appropriate provider is selected for each message, but also that every message is properly formatted for each provider (as each provider has its own way of ingesting messages).
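One way to picture that formatting step is a small formatter per provider, each producing the payload its provider expects; the payload shapes here are invented:

```ruby
# Illustrative per-provider formatters: same message, different payloads.
class ProviderAFormatter
  def format(message)
    { to: message.recipient, from: message.sender, body: message.content }
  end
end

class ProviderBFormatter
  def format(message)
    { destination: message.recipient, text: message.content }
  end
end
```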
In addition to all of the above, we run an admin page that allows us to configure our load balancing across every channel, by target capability and country. For example, we can allocate 20% of emails to a provider, 50% to another, and then finally 30% to a third, and we can do all of that by country (France, Italy, Germany). Quite handy.
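Under the hood, that load balancing can be pictured as a weighted draw over the configured ratios. The sketch below mirrors the 20/50/30 example and is illustrative only:

```ruby
# Weighted provider selection over admin-configured ratios (illustrative).
EMAIL_WEIGHTS = {
  fr: { "email_provider_a" => 20, "email_provider_b" => 50, "email_provider_c" => 30 }
}

def pick_email_provider(country)
  weights = EMAIL_WEIGHTS.fetch(country)
  roll = rand(weights.values.sum) # an integer in 0...100 with the ratios above
  weights.each do |name, weight|
    return name if roll < weight
    roll -= weight
  end
end
```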
How do we handle events?
Events represent the various important milestones in a message’s journey.
Here’s what the typical life cycle of a message looks like:
- A Feature Team wants to send a message
- The Delivery System receives the request and sends it to one of our providers. It creates either a “sent” or a “lost message” event.
- The provider receives the message, hands it down to the relevant operator and notifies our servers; in response, we create either a “delivered” or a “provider error” event for the relevant message.
- The operator receives the message and hands it down to its final recipient. The provider notifies us, and we insert a new event for that message: either “received” or “delivery failed”.
Keeping track of these events is crucial for building our monitoring system and alerts; importantly, it also allows us to handle failures.
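As an illustration, a provider callback could be turned into an event roughly like this; the controller and model names are hypothetical:

```ruby
# Hypothetical Rails controller receiving provider callbacks and
# recording them as events on the corresponding message.
class ProviderCallbacksController < ApplicationController
  def create
    message = Message.find(params[:message_id])
    event_name = params[:status] == "delivered" ? "delivered" : "provider_error"
    MessageEvent.create!(message: message, name: event_name, occurred_at: Time.current)
    head :ok
  end
end
```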
How do we handle failures?
No platform can be reliable 100% of the time. From time to time, a provider can have a transient issue that can be solved by simply replaying the HTTP request. But sometimes the issue is more serious: one of our providers could be down for a longer period, and we cannot continue to send communications through them.
To account for that, we built a mechanism that alerts us when something goes wrong, based on our load balancing configuration. In other words, when a provider does not send the expected number of messages at a specific rate, we receive an alert via email and Slack, and we have to manually adjust the load balancing to offset the damage and redirect all messages through healthy providers.
This solution was not ideal because:
- We relied on manual operation, so it was slow and prone to error.
- Lost messages were lost forever.
We therefore introduced two new mechanisms to improve our overall deliverability:
Autopilot!
That’s the name we gave the scheduled job responsible for running health checks on our providers!
It runs regularly, and its purpose is to analyze the number of transmission errors returned by our providers. If the rate of transmission_error reaches a predefined threshold, the Autopilot kicks in and automatically adjusts the load balancing ratio, spreading the workload across healthy providers.
In addition, when a provider is fully deactivated, the Autopilot regularly pings it. Once the provider is up again, the Autopilot progressively increases traffic flow to it. No more slow, manual adjustments.
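A simplified sketch of the Autopilot’s core rule could look like the following; the threshold, data shapes and even redistribution are assumptions, not our production logic:

```ruby
# Illustrative rebalancing: zero out unhealthy providers and spread their
# share evenly across healthy ones.
ERROR_RATE_THRESHOLD = 0.05 # rebalance above 5% transmission errors (assumed)

# stats:   { "provider_a" => { sent: 1000, errors: 80 }, ... }
# weights: { "provider_a" => 50, "provider_b" => 50 }
def rebalance(stats, weights)
  unhealthy = stats.select { |_, s| s[:errors].fdiv(s[:sent]) > ERROR_RATE_THRESHOLD }.keys
  healthy   = weights.keys - unhealthy
  return weights if unhealthy.empty? || healthy.empty?

  freed = unhealthy.sum { |name| weights[name] }
  share = freed / healthy.size # naive even redistribution, integer division
  weights.to_h { |name, w| unhealthy.include?(name) ? [name, 0] : [name, w + share] }
end
```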
The Autopilot has significantly boosted our confidence in the communication engine, enabling us to reduce noise and avert internal crises.
Wider retry mechanism
In the past, failing requests were retried up to 3 times but were eventually lost if the problem persisted.
Now, when a message cannot be sent for some reason, it is enqueued again. Eventually, the corresponding scheduled job reads from the QueuedMessage table and attempts to send the message to a provider again. Thanks to the Autopilot, the probability of sending the message to a reliable provider increases over time, and so does the deliverability of our communications.
Furthermore, in order to avoid retrying indefinitely or even sending a message that’s been stuck for a while and that’s no longer relevant, we introduced one more column in our QueuedMessage table:
- number_of_retries
A communication is now considered lost when number_of_retries exceeds a predefined threshold.
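A hedged sketch of that check, with an assumed threshold and status column:

```ruby
MAX_RETRIES = 5 # assumed threshold, not our real configuration

# Decide whether a queued message should be retried or considered lost.
def retry_or_drop(queued_message)
  if queued_message.number_of_retries > MAX_RETRIES
    queued_message.update!(status: :lost) # give up: the message is lost
  else
    queued_message.increment!(:number_of_retries)
    attempt_send(queued_message) # hypothetical helper hitting a provider again
  end
end
```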
For now, this retry mechanism only applies to our low priority messages, since we do not want to send the same message to our users multiple times. We will therefore soon introduce an expired_at column to avoid sending messages that are no longer relevant (there is no point in reminding you about an appointment that happened in the past) and make our retry mechanism cover 100% of our communications.
What’s next at Doctolib?
We have plenty of ideas to improve our Delivery System. As our CEO says, it is only the beginning, and we are working hard on better solutions to reach the end goal of the Delivery System: delivering all messages within a reasonable time.
Here is an overview of our roadmap to improve the Delivery System at Doctolib:
- Stop sending high priority messages synchronously. Why? We don’t want a synchronous call to an external provider within the life cycle of an HTTP request.
- Introduce new priority levels: OTPs are the most important messages to send, while reminders can be sent when there is bandwidth to handle them. A refined priority hierarchy will give us more granular control over the order in which we send our communications.
- Use a queuing mechanism to dispatch the workload, by channel and priority, to a consumer pool.
Thank you for reading, and if you found this article interesting, stay tuned for the next one!
Cheers!
The purpose of this article is to present and share the work done by Doctolib’s tech team. The information contained in this article is provided for information purposes only (on an “as is” basis with no guarantees of completeness or accuracy) and does not constitute legal advice, nor does it have legal value. Therefore, it cannot contradict in any manner whatsoever any legally binding terms applicable to your relationship with Doctolib.