Maintaining App Performance at Doctolib, an overview

Coralie Collignon
Doctolib
Mar 31, 2020

Implications, rituals and tools

With more than 50 million visitors each month, maintaining the performance of the Doctolib application is a crucial challenge. Not only do thousands of patients book appointments every day, but Doctolib is also a work tool for doctors. The service has to stay up without downtime and scale to handle the traffic.

First, let’s define “performance”. Two sets of performance metrics are closely monitored.

The first metric we look at is the response time. Response time is the amount of time from the moment a user’s browser sends a request until the application indicates that the request has completed. At Doctolib, a golden rule has been set: the response time must stay under 70ms. The faster the response time, the better we can serve our thousands of requests.

The second metric we look at is the load (the volume of transactions processed by the application); more precisely, we look at the requests per minute (RPM). The RPM is simply the number of requests in one minute. Our platform can reach peaks of up to 350K RPM.

Database queries also matter: users expect fast responses when retrieving their data and are waiting for the page to display the information.

At the same time, the platform’s stability and user experience are constantly challenged by our team of developers. Our platform is a Ruby on Rails monolith. More than 70 developers commit daily to the codebase, representing 200 to 300 pull requests (PR) per week. The monolith is deployed every day. This constant activity on the codebase impacts the application’s stability.

How do we ensure that app performance does not degrade over time?

We monitor performance at all times. This culture of maintaining performance starts from the moment a Doctoliber joins the adventure and continues with day-to-day monitoring. This is how we proceed:

  1. we train and raise awareness on the main degradations that could hit the code
  2. we monitor degradations daily
  3. we investigate the biggest degradations and fix them
  4. we use tools in development to anticipate production degradations

Here you will find an overview of what is implemented at Doctolib. More will come in upcoming articles.

Step 1: Train and raise awareness…

Tech camp

When joining Doctolib, each developer is onboarded with a “Tech Camp” to train them on many aspects of the job. That includes infrastructure, pair programming, rituals and application performance among others.

The performance tech camp is a practical workshop, with seven exercises based on actual degradations encountered in production. Each exercise consists of a method with poor performance whose response time should be improved. It covers caveats on both the database side (N+1 queries, the use of indexes and bad query plans) and the Ruby side (the impact of logs, of too much instantiation, of loading a full object).
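
As an illustration of the N+1 caveat, here is the kind of fix practiced during the workshop (with hypothetical Patient and Appointment models, not actual Doctolib code):

# N+1: one query loads the patients, then one extra query runs per patient for its appointments
Patient.limit(50).each { |patient| puts patient.appointments.size }

# Fixed: the appointments are eager loaded in a single additional query
Patient.includes(:appointments).limit(50).each { |patient| puts patient.appointments.size }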

Let’s take the example of logs’ impact on performance.
From the Rails documentation:

“logging will always have a small impact on the performance of your Rails app, particularly when logging to disk.[…] Another potential pitfall is too many calls to Logger in your code.

logger.debug "Person attributes hash: #{@person.attributes.inspect}"

In the above example, there will be a performance impact […]. The reason is that Ruby has to evaluate these strings, which includes instantiating the somewhat heavy String object and interpolating the variables.”
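
The same guide suggests passing a block to the logger so that the string is only built when the corresponding log level is actually enabled. A minimal sketch:

# The block is only evaluated if the debug level is active,
# so the heavy String instantiation and interpolation are skipped otherwise.
logger.debug { "Person attributes hash: #{@person.attributes.inspect}" }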

At the end of the Tech Camp workshop, we leave with some takeaways, such as the ones shown above, and we know more about what affects the performance of our code.

Step 2: Monitor daily…

Duty dev performance check

When onsite, the duty dashboard is displayed to communicate the current state of the master and production branches (Continuous Integration) and of the staging environment (last deployment).

Doctolib Dashboard

Each morning at 10am, the main degradations that happened between 8 and 10am (compared to the same time window 7 days earlier) are fetched from New Relic. They are ranked by biggest degradation (in milliseconds) and displayed on the dashboard.

For example,

[vc] Controller/api/video_chat_events/create +24.4% (+6.1ms) | for 2 days

This is a degradation on the Api::VideoChatEvents#create endpoint. It has been degrading for the past two days: the response time increased by 6.1ms, representing a 24.4% degradation. The team in charge is vc, the Virtual Care team.

So now that we are aware of the main degradations, who will handle this?

Each day, a different team member (the “Duty Dev”) is responsible for the application rollout to production. They are responsible, among other things, for checking these main degradations on New Relic. Their job is to understand each degradation and make sure it is not a progressive deterioration: if there is a persistent degradation, or a doubt about one, their role is to communicate it to the team in charge.

At the end of the day, the duty dev shares the report with the whole tech team so everybody can access the latest information about the rollout.

Step 3: Investigate further…

Once the team in charge is aware of a degradation, a team member investigates the degraded endpoint further, at the Ruby on Rails application level, and tries to improve its performance.

Here are different useful tools to investigate:

New Relic

New Relic is an application performance monitoring (APM) tool. It is a powerful way to investigate a degradation further, as it provides both current and historical information (database query performance, web browser rendering performance and many other useful metrics). It helps us analyze and manage application performance constantly, as it gives detailed information at the transaction level.

Server Logs

When investigating locally, the Rails server logs are very useful. When an endpoint is triggered, some data is displayed, such as:

Completed 200 OK in 7890ms (Views: 1217.3ms | ActiveRecord: 185.1ms | Allocations: 4374234)

It gives a reference to start improving the performance of the endpoint.

Flamegraph

Flamegraph is a gem that displays the execution stack trace. It is helpful for spotting deep call stacks or methods that take a long time, for example.

Visualisation of Flamegraph

Rack Mini-Profiler

Rack Mini Profiler is a middleware that displays a speed badge on every HTML page. It gives useful information about the response time, but also about the number of queries executed on the database. In this example, 4 SQL queries are executed when hitting the index endpoint.

Rack mini-profiler
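
For reference, both the speed badge and the flamegraph visualisation can be enabled locally with Gemfile entries along these lines (a minimal sketch: rack-mini-profiler relies on the flamegraph and stackprof gems and serves a flamegraph when ?pp=flamegraph is appended to a URL):

# Gemfile, development group
group :development do
  gem 'rack-mini-profiler' # speed badge and SQL query count on every HTML page
  gem 'flamegraph'         # flamegraph rendering, exposed via ?pp=flamegraph
  gem 'stackprof'          # sampling profiler used to collect the stack traces
end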

Perf Bounty days

We also have pop-up initiatives to improve performance. A two-day workshop called Perf Bounty Days was organized in November 2019 to tackle some degradations. Ten developers gathered to work together specifically on the degraded endpoints. The goal was to get back under the golden rule for the load of Doctolib’s database servers, which means below 50% of load on the primary and on the secondaries. The workshop was a success, as the changes had a visible impact.

Step 4: Prevent degradation before production…

Degradations found in production are, by definition, caught a bit late. As much as possible, we try to catch them before they reach production.

Bullet

The Bullet gem is designed to help increase an application’s performance by reducing the number of queries it makes. It watches your queries while you develop your application and notifies you when you should add eager loading (N+1 queries), when you’re using eager loading that isn’t necessary, and when you should use a counter cache.

At the development level, Bullet is useful in the logs, as it directly points to a degradation. Below, it indicates that we load more than we should and advises removing includes from the query:

GET /api/accounts
AVOID eager loading detected
ExternalSync::Configuration => [:external_sync_connector]
Remove from your query: .includes([:external_sync_connector])
Call stack
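
A minimal development-only configuration could look like the following (the option names come from the Bullet README; the exact setup at Doctolib may differ):

# config/environments/development.rb
config.after_initialize do
  Bullet.enable        = true
  Bullet.bullet_logger = true # writes warnings to log/bullet.log
  Bullet.rails_logger  = true # also writes them to the Rails log, as shown above
end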

Pull Request review

PR review is a way to challenge the code that is going to production. Reviewers challenge SQL queries and code implementation knowing that performance is key.

GOTCHA Bot

We created a bot called GOTCHA that checks the changed files in every created PR and posts a comment to warn the author about potential performance issues it finds. For performance checks, two GOTCHAs are set up.

One GOTCHA warns us about long-running transactions. Some operations on PostgreSQL are blocked or kept waiting while a transaction is open, so migrations during rollout can get stuck.
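
For intuition, here is a generic PostgreSQL scenario (hypothetical models and migration, not actual Doctolib code): a long transaction holds a lock on a table, so a migration that needs an exclusive lock on the same table has to wait and blocks the rollout.

# Session 1: a long-running transaction keeps a lock on the appointments table
ActiveRecord::Base.transaction do
  Appointment.where(status: 'pending').find_each(&:touch) # runs for several minutes
end

# Session 2, during rollout: this migration needs an ACCESS EXCLUSIVE lock on the same
# table, so it waits for session 1 and blocks the queries queued up behind it.
class AddNotesToAppointments < ActiveRecord::Migration[6.0]
  def change
    add_column :appointments, :notes, :text
  end
end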

Another GOTCHA warns us about running SQL queries on the secondary databases. By default, every SQL query is sent to the primary database. The standby gem can route SQL queries to the secondary (read-only) databases, which is useful to offload the primary from expensive read-only queries.
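
As a sketch, a read-only query can be wrapped in a block so the standby gem sends it to a secondary (the block-based call is taken from the gem’s README; treat the exact usage as an assumption):

# Runs the query inside the block against a read-only secondary instead of the primary
appointment_count = Standby.on_standby { Appointment.where(status: 'confirmed').count }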

The GOTCHAs are informative, not blockers. They help us double-check what we implement in our PRs.

Conclusion

What I have covered in this article are the rituals we implemented at the dev team level. We work together with the DevOps team, who have their own rituals and tools that I haven’t covered here.
We try hard to make every Doctoliber familiar with performance impacts. This goes with training, monitoring and proper tooling.

Long story short, as a dev team, we do our best to maintain Doctolib’s performance so the platform can scale.

What’s next?

If you want to learn more about our tech team, we write a weekly newsletter that you can sign up for here. And if you want to join us, we are hiring!

And special thanks to the Doctolibers who took the time to read and give me feedback on this article!
