Increase operational coverage with Skills Matrices

Maxime Fouilleul
BlaBlaCar
Oct 28, 2022

“Skills Matrix” is the kind of buzzword you would pull out from the management toolbox. It promises better team performance, project management, prioritization, etc. The theory is simple: It brings transparency to the competence coverage in a given scope for a given team. But in practice, how do you create and use the Skills Matrix as a tool to increase operational coverage and perform better as a team?

In this article, I share the BlaBlaCar Databases Reliability Engineers (DBRE) team’s experience with this topic. Our mission is “to provide reliable services and expertise that enable BlaBlaCar engineers to master their databases at scale in an easy and safe way”. To do so, we need to package and support a set of database software for the BlaBlaCar backend teams to use. Packaging a database product is a project driven by one or two engineers, but once released, the implementation becomes a “team responsibility”: our users expect support from the whole DBRE team on a 24/7 basis, whether or not the engineers who packaged or developed the product are available.

Let’s simplify the problem in one illustration:

And for those who prefer bullet points:

ℹ️ Every team is responsible for a given scope

ℹ️ That scope surely requires different skills and knowledge to be supported

ℹ️ We often get to a point where skills are spread among individuals (sub-scope owners)

🚌 From time to time the staffing will not match the need

😨 Despite this, the team remains responsible for the given scope

🤔 How do we ensure that we can provide a good level of support as a team?

The Skills Matrix

The main outcome of a Skills Matrix is that it highlights the “team weaknesses” without pointing fingers at individuals, allowing the team to make decisions and take action to (pro)actively fill the gaps. The result is an increase in confidence and overall team performance in its scope.

Here is the DBRE Level 1 Skills Matrix. We can easily spot what areas are ok or at risk for our scope according to our current staffing.

For example, we can highlight that MariaDB or Elasticsearch coverage is full (4/4) but PrepareClusterBoostrap for Kafka is not mastered as the competence is limited to one person:

DBRE Skill Matrix — Level 1
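To make the “4/4” notation concrete, here is a minimal Python sketch of how such coverage figures could be computed from individual self-assessments. The engineer names and the sample data are hypothetical; only the component and skill names come from the example above. This is an illustration of the idea, not the tooling the DBRE team actually uses.

```python
from collections import Counter

# Hypothetical team and self-assessments (illustrative data only).
TEAM = ["alice", "bob", "carol", "dave"]

# Each entry means: this engineer is confident doing this skill on this component.
CONFIDENT = [
    ("alice", "MariaDB", "Connect&Read"),
    ("bob", "MariaDB", "Connect&Read"),
    ("carol", "MariaDB", "Connect&Read"),
    ("dave", "MariaDB", "Connect&Read"),
    ("alice", "Kafka", "PrepareClusterBoostrap"),
]


def coverage(team, confident):
    """Return {(component, skill): "confident/team_size"} for every assessed cell."""
    members = set(team)
    # A set removes duplicate declarations before counting.
    unique = {(c, s, e) for e, c, s in confident if e in members}
    counts = Counter((c, s) for c, s, _ in unique)
    return {cell: f"{n}/{len(members)}" for cell, n in counts.items()}


result = coverage(TEAM, CONFIDENT)
print(result[("MariaDB", "Connect&Read")])          # 4/4
print(result[("Kafka", "PrepareClusterBoostrap")])  # 1/4
```

A cell that reads 1/4 is exactly the kind of gap the matrix is meant to surface: the skill exists in the team, but it depends on a single person.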

Ok, so how to create it?

One key element is that building a Skills Matrix should be a team effort. I deliberately used the word “weaknesses” in the section above, and I am fairly confident that if a purely top-down initiative sets out to “reveal team weaknesses”, getting buy-in will be challenging.

The Skills Matrix, once done, will be used throughout the daily life of the team: to prioritize learning sessions and ownership handover projects, set team habits (pair programming, mobbing…), influence the staffing plan, etc. It must be the result of collective work with strong commitment from the entire team.

Ok, let’s go.

Step 1 — Meet with the team for a workshop session

The first step, therefore, consisted of bringing every teammate together for a workshop, pitching the objectives, and starting to brainstorm.

Step 2 — Define scopes/components supported by the team

Before talking about skills, we need to define what we are talking about. What is the scope of the matrix? What components/topics do we need to be good at?

For the DBRE team, we decided to choose Database products as components:

  • Cassandra
  • CloudSQL — MySQL
  • CloudSQL — PostgreSQL
  • Elasticsearch
  • Kafka (brokers)
  • MariaDB
  • MemoryStore — Redis
  • RabbitMQ

I recommend being iterative as the scope may/will expand over time. The list above is missing some of the DBRE’s supported software that we are using today… Starting small will bring more confidence to the team while discussing the remediation strategy. The goal is not to be perfect, but to be better. 💪

Step 3 — List key actions that should be mastered for each component

Once the iteration scope was clarified, we discussed and decided which “skills” define our jobs across all these components.

We tried to be generic to allow us to put the result in a two-dimensional array (skills/components). Sometimes this simplification is not possible, for example, when actions/responsibilities related to one component are not applicable to one another. This is not a blocker and I will show you examples at the end of the article that worked perfectly with such heterogeneity.
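As a rough illustration of the two-dimensional layout and of how a non-applicable cell can be handled, here is a small Python sketch. The skill `RebalancePartitions`, the numbers, and the choice of which cell is not applicable are all hypothetical; `None` is simply one possible way to mark a skill that does not apply to a component.

```python
from typing import Dict, List, Optional, Tuple

COMPONENTS: List[str] = ["MariaDB", "Kafka (brokers)"]
# "RebalancePartitions" is a made-up skill used only for this example.
SKILLS: List[str] = ["Connect&Read", "RebalancePartitions"]

# MATRIX[skill][component] = number of confident engineers,
# or None when the skill does not apply to that component.
# All values are illustrative, not BlaBlaCar data.
MATRIX: Dict[str, Dict[str, Optional[int]]] = {
    "Connect&Read": {"MariaDB": 4, "Kafka (brokers)": 3},
    "RebalancePartitions": {"MariaDB": None, "Kafka (brokers)": 1},
}


def applicable_cells(matrix) -> List[Tuple[str, str]]:
    """List the (skill, component) pairs that actually need coverage."""
    return [
        (skill, component)
        for skill, row in matrix.items()
        for component, value in row.items()
        if value is not None
    ]


def render(matrix, skills, components) -> str:
    """Render the matrix as a plain-text grid, showing n/a for empty cells."""
    width = max(len(s) for s in skills)
    lines = [" " * width + " | " + " | ".join(components)]
    for skill in skills:
        cells = []
        for component in components:
            value = matrix[skill].get(component)
            text = "n/a" if value is None else str(value)
            cells.append(text.ljust(len(component)))
        lines.append(skill.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


print(render(MATRIX, SKILLS, COMPONENTS))
```

Keeping non-applicable cells explicit, rather than leaving them at zero, avoids reporting a gap where no skill is expected.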

In the table below you see our DBRE “skills” defined with a Level, Name, and Description:

Level 1, Level Owner? We decided to divide our skills into two levels and set different team expectations for each:

  • 🏅 Level 1 actions are basic tasks that should be mastered by each team member. The first purpose is to respond to our users as quickly as possible (ping on Slack, Jira Task, on-call questions, etc.). Having these actions mastered empowers team members and increases confidence and ownership (being more comfortable handling run tasks, reviewing pull requests, having a chat with developers, etc.). Most of those activities must be mastered to be on-call.
  • 🎖 Owner actions allow the component to be actively supported and should be mastered by at least 2 team members. The tasks are related to improving packaging, tooling, providing advisory, and mastering particularly risky tasks, such as data recovery.
    Those activities should not be needed during the on-call period as they won’t be mastered by all the team members.
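The two expectations above translate naturally into a per-level target: the whole team for Level 1, at least two engineers for Owner. The sketch below shows one way to flag the cells that miss their target. The `Cell` type, the level labels, the `DataRecovery` skill name, and the sample counts are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    component: str
    skill: str
    level: str      # "level1" or "owner" (labels chosen for this sketch)
    confident: int  # engineers who declared themselves confident


def required(level: str, team_size: int) -> int:
    """Target headcount per level, following the two rules described above."""
    if level == "level1":
        return team_size  # every team member
    if level == "owner":
        return 2          # at least two team members
    raise ValueError(f"unknown level: {level}")


def gaps(cells: List[Cell], team_size: int) -> List[Cell]:
    """Return the cells whose coverage is below the target for their level."""
    return [c for c in cells if c.confident < required(c.level, team_size)]


# Illustrative data only; "DataRecovery" is a hypothetical Owner skill name.
cells = [
    Cell("MariaDB", "Connect&Read", "level1", 4),
    Cell("Kafka", "PrepareClusterBoostrap", "level1", 1),
    Cell("MariaDB", "DataRecovery", "owner", 2),
    Cell("Cassandra", "DataRecovery", "owner", 1),
]

for cell in gaps(cells, team_size=4):
    print(cell.component, cell.skill, cell.level, cell.confident)
```

With this sample data, the Kafka Level 1 cell and the Cassandra Owner cell are reported, while the MariaDB cells meet their targets.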

Step 4 — Time to assess!

Now that we have our components and skills, we let our engineers say if they are confident or not in doing “that skill for that component”:

Note: Be sure to take the local context into account in the assessment. For instance, an engineer might have been comfortable validating “Connect&Read” for “MariaDB” in a previous job but not on our platform. We are not trying to assess the engineer’s general competency, but their capacity to operate the stack in BlaBlaCar’s context.

And this is it.

Once computed we have a working document that highlights our safe areas and weaknesses:

Level 1

Level Owner

What is next?

For Level 1, the objective is to have all engineers onboarded. At DBRE, we decided to set up a weekly 1h30 slot to exchange and reinforce team skills in the areas that need it.

Every Wednesday morning we have our now-famous DBRE Classroom, which is used to exchange, share screens, write runbooks, explore upstream documentation, etc.: anything that can turn yellow cells green in our Skills Matrix!
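One possible way to pick the topic of the next session is to rank Level 1 cells by how many engineers are still missing. The function name, the ranking rule, and the sample counts below are assumptions for illustration; the article does not prescribe how topics are chosen.

```python
from typing import Dict, List, Tuple

CellKey = Tuple[str, str]  # (component, skill)


def classroom_backlog(level1_counts: Dict[CellKey, int], team_size: int) -> List[Tuple[CellKey, int]]:
    """Rank incomplete Level 1 cells by the number of engineers still to onboard."""
    missing = {
        cell: team_size - confident
        for cell, confident in level1_counts.items()
        if confident < team_size
    }
    # Largest gap first; ties are broken alphabetically for a stable order.
    return sorted(missing.items(), key=lambda item: (-item[1], item[0]))


# Illustrative counts of confident engineers per Level 1 cell.
counts = {
    ("MariaDB", "Connect&Read"): 4,
    ("Kafka", "PrepareClusterBoostrap"): 1,
    ("Cassandra", "Connect&Read"): 3,
}

for (component, skill), to_onboard in classroom_backlog(counts, team_size=4):
    print(f"{component} / {skill}: {to_onboard} engineer(s) to onboard")
```

Fully covered cells drop out of the list, so the backlog shrinks as sessions turn cells green.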

For Level Owner, the objective is to have at least two engineers onboarded to avoid having a “Single Point of Failure” on a domain.

We implemented a more complete model than the regular meetings we use for Level 1. This model is a real learning program requiring a strong commitment from the current “Owner” and a “New Owner”. Here are the key elements of the program:

📚 “Owner” selects upstream learning materials (videos, articles, certifications, etc.)

✅ Engineering Manager validates the selection (team bandwidth, priorities, budget, etc.)

🤓 “New Owner” watches/reads the selected content to have a better tech background on the scope

👯 “Owner” and “New Owner” meet 1:1 every week for a one-hour exchange on the scope

👯 “New Owner” and “Owner” pair on run tasks for the scope

💪 “New Owner” takes the next roadmap projects on complex tasks for the scope

Leverage the Skills Matrix in your daily team life

As an SRE team, the quote “what gets measured gets improved” makes a lot of sense to us. Exposing the status of our operational coverage as metrics allowed us to implement small changes that boosted our confidence in several domains:

  • Documentation: Having the list of components you are responsible for and the associated skills to operate them helped us to structure our runbooks to be as exhaustive as possible.
  • Mean Time To Response: After working to de-risk cells in the Skills Matrix, more engineers have the knowledge, which increases the chances of making a good analysis and reacting correctly to a given situation.
  • Better team balance: Highlighting and correcting the lack of redundancy (especially for the “Level Owner” skills) allowed us to reduce the pressure on some engineers, giving them time for other subjects, which benefits both the balance of the team and its professional and personal development.
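Exposing operational coverage as a metric can be as simple as tracking, per level, the share of cells that meet their target. The sketch below assumes the targets described earlier (the whole team for Level 1, two engineers for Owner); the level labels and sample counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def coverage_ratio(cells: Iterable[Tuple[str, int]], team_size: int) -> Dict[str, float]:
    """Share of cells per level whose confident headcount meets the target.

    `cells` is an iterable of (level, confident_count) pairs.
    """
    targets = {"level1": team_size, "owner": 2}
    totals: Counter = Counter()
    covered: Counter = Counter()
    for level, confident in cells:
        totals[level] += 1
        if confident >= targets[level]:
            covered[level] += 1
    return {level: covered[level] / totals[level] for level in totals}


# Illustrative snapshot of a matrix for a team of four.
snapshot = [
    ("level1", 4), ("level1", 1), ("level1", 4), ("level1", 3),
    ("owner", 2), ("owner", 1),
]

for level, ratio in coverage_ratio(snapshot, team_size=4).items():
    print(f"{level}: {ratio:.0%} of cells at target")
```

Recomputing this ratio after each assessment round gives a simple trend to follow over time.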

Will it work for you?

As I mentioned in the intro, this is not a BlaBlaCar DBRE creation. Skills Matrices have been used for a long time in many companies and they can surely be useful to help develop your team’s skills too.

I hope this practical case was helpful and will inspire you to try and adopt this tool in your context!

Special thanks to Nicolas Salvy, Guillaume Wuip, and Ricardo Lage for the review!
