Incident Management at BlaBlaCar

Jean Baptiste Favre
BlaBlaCar
Nov 29, 2021 · 6 min read

How does BlaBlaCar manage an ever-increasing number of members and the technical challenges that come with it?

What is it?

Incident management is about being efficient when resolving incidents. That means dedicating enough technical and human resources without wasting them (having too many people involved often results in less efficiency), and keeping up the pace of the incident response to ensure the quickest resolution.

Why is it important?

Because when scaling up, the product gets richer, offering more features and hence becoming more complex, while an exponentially growing number of end-users puts more and more pressure on the apps & infrastructure. And if you choose to build a Service-Oriented Architecture (SOA), the number of services will also increase. With the tech team scaling accordingly to support the product expansion, you’re likely to end up with more teams and to spread services’ ownership across them.

Scaling a company efficiently requires implementing more synchronization to keep everyone on the same page. And synchronization is especially key during incident response.

What did we do at BlaBlaCar?

At BlaBlaCar, we have been applying a simple principle for a few years now: “You build it? You run it!”.

As a consequence, we moved from a single on-call rotation operated by the Foundations team to multiple ones.

All teams running critical production applications or components (from infrastructure to phone apps) have to organise an on-call rotation on their perimeter, to ensure 24x7 availability. Considering the tech team size, we have 15+ people on call at the same time. Despite a Service-Oriented Architecture, dependencies between components are frequent, leading to a serious need for synchronisation when facing an incident.

So, we need to get organized. But how? And up to which point?

As a strong requirement, we consider that processes should always be implemented to provide support and to accelerate incident resolution, not to slow it down.

To ensure that we are improving our processes, we need to implement KPIs to measure current incident management performance and then decide whether to take action or not.

This decision will depend heavily on your current performance when facing incidents, your business needs, and the cost of getting to the next level. Incident management should not be a dogmatic framework but a product & tech feature.

Define KPIs

When talking about Incident Management, the main KPIs we use are:

  • Number of incidents: quite obvious, but it’s still good to keep an eye on it. As a second step, adding a severity level (and ensuring the quality of this severity assignment over time) will make this indicator more meaningful.
  • MTTA: Mean Time to Acknowledge
    Average time between when an incident is triggered and when it is acknowledged by a user
  • MTTR: Mean Time to Recover (or Resolution)
    Average time between when an incident is triggered and when it is resolved (see the sketch below)
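
To make these definitions concrete, here is a minimal sketch of how such KPIs could be computed from incident timestamps. It is purely illustrative: the incident record shape and the sample data are hypothetical, not our actual tooling (alerting platforms usually expose these metrics out of the box).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    # Hypothetical record; real alerting tools expose similar timestamps.
    severity: str            # e.g. "SEV1", "SEV2"
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime

def mtta(incidents: list[Incident]) -> timedelta:
    """Mean Time to Acknowledge: triggered -> acknowledged."""
    seconds = mean((i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)

def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Recover: triggered -> resolved."""
    seconds = mean((i.resolved_at - i.triggered_at).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)

# Hypothetical sample data.
incidents = [
    Incident("SEV2", datetime(2021, 11, 1, 10, 0), datetime(2021, 11, 1, 10, 4), datetime(2021, 11, 1, 11, 30)),
    Incident("SEV1", datetime(2021, 11, 8, 22, 15), datetime(2021, 11, 8, 22, 17), datetime(2021, 11, 8, 23, 0)),
]
print(f"incidents={len(incidents)} MTTA={mtta(incidents)} MTTR={mttr(incidents)}")
```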

Anticipate

“Hope for the best, plan for the worst”

Despite the strength of the saying, it lacks one crucial piece of information: how deep should we plan? Well, not too deep. The deeper you go, the less you’ll be able to adapt quickly.

“We need a plan, we also need a backup plan, and we need to have an idea of what to do when both plans fail”

This quote is much better, in my opinion. It shows that we need a plan to start with, but also that we’ll need to complement it. Let’s call this first plan ‘on-call rotation’.

The on-call rotation is here to ensure, by design, that competent people are available to fix issues whenever they arise.

However, we also need a backup plan because, well, things tend to never go the way we expect them to go, right? Let’s call it ‘runbooks’.

Runbooks are pieces of documentation that describe (more or less) simple known issues and how to solve them.
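
As an illustration (a hypothetical example, not an actual BlaBlaCar runbook), a runbook entry can be as small as a known symptom, a couple of quick checks, and the remediation steps. Sketched here as a simple data structure, with made-up alert and service names:

```python
# Hypothetical runbook entry, kept close to the alert definition.
runbook_payment_api_5xx = {
    "alert": "payment-api high 5xx rate",        # made-up alert name
    "symptoms": "SLO burn-rate alert, spike of HTTP 500 on payment endpoints",
    "quick_checks": [
        "Was payment-api deployed recently?",
        "Is the external payment provider reachable?",
    ],
    "remediation": [
        "Roll back the latest deployment if it correlates with the spike",
        "Otherwise, fail over to the secondary payment provider",
    ],
    "escalation": "Page the payments on-call if not resolved within 15 minutes",
}
```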

But… What if on-call people can’t solve the issue on their own? What if no runbook allows them to fix it?

Level up your organisation

That’s where we need an idea of what to do: get organized to efficiently face situations where plans A and B have already failed.

To face this kind of situation, we worked at different levels of the organisation:

  • at Team level: a contact in each and every team owning a production service. It’s the on-call person we just talked about.
  • at Engineering level: a process describing the global organization when dealing with an incident that can’t be solved at the team level.

At BlaBlaCar, we identified the following potential roles:

  • Incident Commander: their responsibility is to help drive the incident to resolution. They can be helped by a Deputy, a Scribe and a Communications Manager during major incidents.
  • Subject Matter Experts: they’re the on-call people. They’re in charge of troubleshooting what’s going on and finding the quickest and best solution to restore the service.

These are the two primary roles you should consider enabling when facing an incident.

Depending on the severity of the incident, you can also consider enabling secondary roles to support the Incident Commander:

  • Deputy: a direct support to the Incident Commander, keeping an eye on the clock and acting as a potential “hot standby” Incident Commander.
  • Scribe: records the incident response minutes and timeline. This can be done by the Incident Commander or the Deputy, but it needs to be a dedicated person in case of a major incident.
  • Communications Manager: responsible for providing visibility on the incident response to our internal stakeholders.

The process doesn’t need to be too precise. But you have to identify roles and responsibilities in order to:

  • Allow Subject Matter Experts to focus on restoring the service, shielding them from stakeholders’ legitimate requests.
  • Ensure there are no gaps or duplicates in the above roles.
  • Make it clear that having multiple Subject Matter Experts involved is fine, especially for major incidents: the Incident Commander ensures only one of them is assigned to a specific task.

During incident management, roles and responsibilities have to be crystal clear.

There is no hope without this.

Incident Response workflow

Now that we have an organization, how do we articulate it to drive the incident toward resolution?

Incident Management workflow

A first workflow to implement can be as simple as:

Assess the incident severity. This will help define the roles to be enabled. The severity can be reassessed later.
Then, when Subject Matter Experts, and possibly other roles, have joined the incident response, we are able to start iterating:

  • Size-up: it’s about identifying symptoms and incident scope.
    This is where Incident Commander can reassess incident severity and onboard (or offboard) Subject Matter Experts depending on the need.
  • Stabilize: it’s all about taking actions to solve the issue
  • Update: communicate about the actions we’ve decided to take
  • Verify: follow up on task completion

We follow this simple workflow until the incident is solved, which is assessed during a Verify step: are all SLOs and metrics’ graphs back to their normal state? Are the alerts resolved?
If the answers are positive, the Incident Commander announces the end of the incident.
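
As a rough sketch of this loop (purely illustrative, assuming a hypothetical severity scale and treating each step as a placeholder for what is actually a human or team action):

```python
from enum import Enum

class Severity(Enum):
    SEV1 = 1   # major incident: enable the secondary roles
    SEV2 = 2
    SEV3 = 3

def roles_for(severity: Severity) -> set[str]:
    """Hypothetical mapping from severity to the roles to enable."""
    roles = {"Incident Commander", "Subject Matter Experts"}
    if severity is Severity.SEV1:
        roles |= {"Deputy", "Scribe", "Communications Manager"}
    return roles

def run_incident(assess_severity, size_up, stabilize, update, verify):
    """Size-up / Stabilize / Update / Verify loop until the incident is solved.

    The callables stand in for human decisions and actions, not real APIs.
    """
    severity = assess_severity()
    print("Enabled roles:", roles_for(severity))

    resolved = False
    while not resolved:
        severity = size_up(severity)   # reassess scope and severity, on/offboard experts
        actions = stabilize(severity)  # decide the actions to restore the service
        update(actions)                # communicate the actions we've decided to take
        resolved = verify()            # SLOs back to normal? Alerts resolved?

    print("Incident Commander announces the end of the incident")
```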

After the incident

Now that the incident is over, is it time to go back to developing features? Not quite, actually.

We’re missing the final step, and the most important one: delivering a post-mortem.

Writing a post-mortem allows for continuous improvement. It’s an opportunity to step back, have a global look at the incident, and identify what went well and what went wrong.

It’s a matter of discovering what we learned about our services, architecture, and organization as a team.

Finally, it’s time to decide on improvements and share everything internally so that other teams can learn as well and, maybe, anticipate future incidents.

But this is quite another story and it deserves a full article on its own. You can read the nice article from Nicolas Beytout about how our Product team leverages failure to understand why continuous improvement is worth it.
