In this article, you will learn how we safely upgraded a platform used daily by millions of users…
Doctolib is a service that has become essential for millions of patients and hundreds of thousands of healthcare professionals in France and Germany.
With more than 50 million monthly visits, Doctolib’s infrastructure relies on free and open source technologies such as the PostgreSQL database.
The Doctolib web app uses PostgreSQL as its main transactional database engine. The architecture consists of a primary instance and several read-only secondary instances. Our database contains several terabytes of data (roughly the equivalent of 8 million pictures on Instagram).
To benefit from the latest features available since PostgreSQL 10, such as logical replication and better performance on parallel queries, we had to upgrade the engine on several servers. The upgrade had to fit within a defined maintenance window and keep downtime to a minimum, in order to reduce the inconvenience for hospitals and medical practices.
With its 20 years of existence, PostgreSQL and its community provide all the tools and knowledge needed to perform a major version upgrade (9.6 to 11.6 in our case).
Three questions arose about this operation:
- How long will the upgrade take on the primary and the secondaries?
- Will this upgrade cause application, functional, or performance regressions?
- Will we be able to roll back during the maintenance operation if any issue arises?
Our infrastructure team has a philosophy that we apply during any operation that could impact the quality or the availability of our platform: “Reduce the risk by eliminating the unknown”.
We applied it again when building the strategy for this upgrade to a new PostgreSQL version.
Preparing the upgrade
Measuring the required maintenance window
We had to determine how long the upgrade would take in order to schedule the maintenance operation and warn our users about the downtime. That’s why we cloned our production database server and ran the pg_upgrade procedure on the clone.
This method gave good results: it took only 15 seconds to upgrade our cloned instance.
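As an illustration of such a trial run on a clone, the sketch below assembles a `pg_upgrade` command line in Python. It assumes link mode (`--link`), which the very short run time suggests but which is not stated here for the primary. The binary and data directory paths are hypothetical placeholders. Passing `--check` makes pg_upgrade run its compatibility checks only, without changing any data.

```python
from typing import List

# Hypothetical paths: adjust to the actual installation layout.
OLD_BIN = "/usr/lib/postgresql/9.6/bin"
NEW_BIN = "/usr/lib/postgresql/11/bin"
OLD_DATA = "/var/lib/postgresql/9.6/main"
NEW_DATA = "/var/lib/postgresql/11/main"


def build_pg_upgrade_cmd(check_only: bool = False) -> List[str]:
    """Return the argv for pg_upgrade; nothing is executed here."""
    cmd = [
        "pg_upgrade",
        "--old-bindir", OLD_BIN,
        "--new-bindir", NEW_BIN,
        "--old-datadir", OLD_DATA,
        "--new-datadir", NEW_DATA,
        "--link",  # assumption: hard-link data files instead of copying them
    ]
    if check_only:
        cmd.append("--check")  # compatibility checks only, no data is modified
    return cmd


if __name__ == "__main__":
    # Print the dry-run command first, then the real one.
    print(" ".join(build_pg_upgrade_cmd(check_only=True)))
    print(" ".join(build_pg_upgrade_cmd()))
```

Timing the real command on a clone (for example with `time`) is what gives a figure to plan the maintenance window around.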
It’s essential to run the analyze_new_cluster.sh script that is generated during the upgrade, because table statistics are not carried over from one major version to the next. The script runs an analyze in three stages, so that usable table statistics are available as fast as possible and are then refined.
This table analysis phase took 20 minutes.
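For PostgreSQL versions that generate analyze_new_cluster.sh, the script is essentially a wrapper around `vacuumdb --all --analyze-in-stages`. The sketch below builds that command in Python. The `jobs` parameter is an addition for illustration (it maps to vacuumdb's `--jobs` option) and is not something this article says was used.

```python
from typing import List


def build_analyze_cmd(jobs: int = 1) -> List[str]:
    """Return the argv that regenerates planner statistics after pg_upgrade."""
    # --analyze-in-stages computes statistics only (no vacuum), in several
    # passes, so that rough statistics are available quickly.
    cmd = ["vacuumdb", "--all", "--analyze-in-stages"]
    if jobs > 1:
        # Assumption for illustration: analyze several tables in parallel.
        cmd += ["--jobs", str(jobs)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_analyze_cmd()))
    print(" ".join(build_analyze_cmd(jobs=4)))
```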
For the secondary instances, we chose to copy the upgraded data with the rsync utility in “link” mode. This allowed us to upgrade all the instances in a few seconds while keeping all the data, sparing us a long and impactful rebuild process.
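The pg_upgrade documentation describes this kind of standby upgrade with an rsync invocation that preserves hard links. The sketch below builds such a command in Python. It assumes that the “link” mode mentioned above corresponds to rsync's `--hard-links` option used after a link-mode pg_upgrade. The directories and the standby host name are hypothetical.

```python
from typing import List


def build_rsync_cmd(old_dir: str, new_dir: str,
                    standby_host: str, remote_parent: str) -> List[str]:
    """Return the argv that syncs old and new clusters to one standby."""
    return [
        "rsync",
        "--archive",           # preserve permissions, ownership, timestamps
        "--delete",            # remove files that no longer exist on the primary
        "--hard-links",        # keep the hard links created by link-mode pg_upgrade
        "--size-only",         # skip files whose size is unchanged
        "--no-inc-recursive",  # build the full file list up front so links are detected
        old_dir,
        new_dir,
        f"{standby_host}:{remote_parent}",
    ]


if __name__ == "__main__":
    # Hypothetical layout: both cluster directories share the same parent.
    print(" ".join(build_rsync_cmd(
        "/var/lib/postgresql/9.6",
        "/var/lib/postgresql/11",
        "standby-1.example.com",
        "/var/lib/postgresql",
    )))
```

Both the old and the new cluster directories are passed in a single call, so that rsync can see the hard links between them and recreate those links on the standby instead of transferring the data again.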
Load testing and non-regression testing
It was essential to ensure that the new version did not introduce performance issues before pushing it to production, so we ran a load testing session on our pre-production environment.
For all architecture changes with a high potential impact, we use Gatling and a set of scenarios that mimic the load we have in production (430k requests per minute).
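To give a sense of scale, the production figure quoted above can be converted into the per-second rate a load scenario has to sustain. This is simple arithmetic in Python; the 30-minute test duration is a hypothetical value chosen for the example.

```python
REQUESTS_PER_MINUTE = 430_000  # production load quoted in the article


def requests_per_second(per_minute: float) -> float:
    """Convert a per-minute request rate into a per-second rate."""
    return per_minute / 60


def total_requests(per_minute: float, duration_minutes: float) -> float:
    """Total number of requests issued over a test of the given duration."""
    return per_minute * duration_minutes


if __name__ == "__main__":
    print(round(requests_per_second(REQUESTS_PER_MINUTE)))  # about 7167 per second
    # Hypothetical 30-minute load test at production rate.
    print(total_requests(REQUESTS_PER_MINUTE, 30))
```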
To ensure that the new version did not introduce functional regressions, we deployed it for a full week on our staging environment. This platform is widely used by Doctolib’s teams to test new features and changes to the platform.
Performing a major upgrade on the system at the very core of our platform pushed us to anticipate what could go wrong (data corruption or a sync issue, for example) and what we could do if anything went sideways.
The safest solution was to prepare several secondary instances and leave them on the previous version, so that backup servers were ready to take over.
In case of trouble, we would promote one of those backup secondaries to be the new primary, attach the other secondary instances to it, and reconfigure the whole web application to use it as well.
This backup strategy gave us a good safety net to start the upgrade operation.
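The rollback path described above can be written down as an ordered plan. The sketch below is only a model of that reasoning in Python, not the tooling we used: the `Instance` type, the instance names, and the step wording are hypothetical. It also assumes that only secondaries still on the old version can follow the promoted server.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    version: str  # e.g. "9.6" or "11.6"
    role: str     # "primary" or "secondary"


def plan_rollback(instances: List[Instance], old_version: str = "9.6") -> List[str]:
    """Return the ordered rollback steps; nothing is executed here."""
    # Only secondaries that were deliberately left on the old version qualify.
    spares = [i for i in instances
              if i.role == "secondary" and i.version.startswith(old_version)]
    if not spares:
        raise RuntimeError("no secondary left on the old version: cannot roll back")
    new_primary, followers = spares[0], spares[1:]
    steps = [f"promote {new_primary.name} (for example: pg_ctl promote -D <datadir>)"]
    steps += [f"repoint {f.name} to replicate from {new_primary.name}" for f in followers]
    steps.append(f"reconfigure the web application to use {new_primary.name}")
    return steps


if __name__ == "__main__":
    fleet = [
        Instance("pg-a", "11.6", "primary"),
        Instance("pg-b", "9.6", "secondary"),
        Instance("pg-c", "9.6", "secondary"),
    ]
    for step in plan_rollback(fleet):
        print(step)
```

Writing the plan as data makes it easy to review before the maintenance window, and it fails loudly if no old-version secondary is left to promote.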
The upgrade of our database infrastructure was a success: we managed to execute it within the scheduled time frame. There were no issues or surprises, and we knew exactly how long each operation would take.
Even though it is not a silver bullet, our philosophy of reducing the risk by eliminating the unknown has always helped us reduce the number of bad surprises.
This method is key to keeping an excellent uptime for our platform.