Fighting bugs at BlaBlaCar
We are trying to create the best experience for our members on an ecosystem of various products: long and short distance carpooling, insurance for drivers and passengers, and car leasing.
Over the years we have gathered a great community of 65+ million users that provides rich feedback regarding their journeys with us. On the engineering side, we care deeply about the quality of service we provide.
We have multiple processes to ensure quality on our software:
- Dogfooding for employees (nearly all of us are using the platform)
- A great QA team
- Canary deployments
- Decent test suites
Unfortunately this is not enough.
Issues reported by our community are gathered by our beloved Member Relations Team. They do an outstanding job analyzing all these issues and providing insights to the Engineering team regarding the most recurring and painful issues.
A few months ago, we decided to make a major change to how we handle all these topics in the tech team. We learned a lot from this experience and wanted to share what worked and what did not.
Before going into details, let’s see what motivated this change. More than two years ago, we had around 60 to 80 developers split into product teams, working on a PHP monolith. It eventually reached the milestone of 1 million lines of code, 65k+ commits, 170+ different contributors and 10 releases per day.
We basically implemented everything you can think of into this PHP monster: APIs, web apps, backoffices, marketing tools, documentation, landing pages, etc.
The monolith was getting less and less future-proof for various reasons:
- Poor code isolation, introducing many side effects
- For a small change, the whole platform needed to be deployed
- Lack of true ownership
- Lack of good Backend connectors (Cassandra, Kafka, Protobuf…)
So we started to look for solutions.
As you can guess, we chose Service-Oriented Architecture as a long-term goal. The plan to get there was pretty clear. The first service teams were created and started to break the monolith piece by piece. These teams were focusing on building the platform’s future with performance, reliability and strong ownership in mind.
Over time, we allocated more and more human resources to these teams to make things move as fast as possible.
In 2017 nearly every engineer left their product roadmap duties to work on this technical migration, resulting in poor ownership of daily releases and lots of other problems:
- Lots of bugs were introduced by new features or technical changes, and it was not always obvious what had introduced an issue in the first place.
- A large number of existing bugs, either too small to prioritize or too complex to solve.
- Some scaling issues, especially on the data side.
- Poor reliability due to design issues, with no one to fix them.
- Lots of legacy code and outdated business rules.
This monolith was already considered outdated and we thought we could just let it die slowly. But it was still handling most of our business. This looked fine on the tech side, since engineers were working full time on shaping the future, but it was not really suitable for our members or for the rest of the company.
We decided to build a task force to respond to this situation. A group of individuals that would take care of the following topics:
- Backend and frontend bugs that are out of scope of service and other teams
- Product/Business rules simplification
- Tech evolution of the monolith
As you can guess, it’s not easy to find engineers volunteering for this kind of work, as it is often frustrating and/or requires several weeks of work to solve an issue. This was at first a temporary solution until we could really drop the monolith. We wanted full-time engineers on the team so they could build expertise and have a long-term vision for the monolith, but there were not enough volunteers to fill our needs. So we gathered help from all the engineers who had previously worked on the monolith, requesting their time in rotation.
The team began with one manager, two full-time backend engineers and three part-time engineers (2 backend, 1 frontend) borrowed from other teams for 2-week periods. To kickstart the team, we reviewed all the unresolved bug tickets and created a backlog. The full time engineers also gathered all the potential evolutions into a long term roadmap. Theoretically, this way of working was promising:
- The full-time developers would focus on the long term, with tech projects. They would also assist the team by providing tools and methods to work faster.
- The part-time developers would handle the run/bug backlog and respond to critical production issues.
The team has been running since the end of 2017, and it has been pretty efficient. We adjusted a few details during this period, but nothing major. The team works with very light processes: a Slack channel, a Kanban board and a daily standup.
We chose to have a dedicated space in the office for the whole team, thinking that team communication would be better and that the part time engineers would be better focused on their tasks.
Over time, we added a weekly meeting between the full-time engineers and the manager to review all the new issues, with the following questions in mind:
- Is it reproducible?
- Is it relevant?
- Does it belong to the team?
- Does the criticality make sense?
- Is it really important?
This was set up to make sure part-time engineers stayed focused on the most critical tasks and did not waste time asking for information or investigating non-existent bugs.
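As an illustration, the weekly review checklist can be sketched as a small triage function. This is a hypothetical sketch: the `Ticket` fields, criticality scale and decision labels are assumptions for the example, not BlaBlaCar’s actual tooling.

```python
# Hypothetical sketch of the weekly triage checklist as code.
# Field names and decision labels are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Ticket:
    reproducible: bool
    relevant: bool
    owned_by_team: bool
    criticality: int  # assumed scale: 1 (low) .. 4 (critical)

def triage(ticket: Ticket) -> str:
    """Apply the review questions in order and return a decision."""
    if not ticket.reproducible:
        return "close: cannot reproduce"
    if not ticket.relevant:
        return "close: not relevant"
    if not ticket.owned_by_team:
        return "reassign: belongs to another team"
    # Re-evaluate criticality so part-time engineers work on what matters most.
    if ticket.criticality >= 3:
        return "backlog: high priority"
    return "backlog: normal priority"
```

Running every incoming ticket through such a filter keeps the backlog limited to issues that are reproducible, relevant and actually owned by the team.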
Results & learnings
So far, results are pretty encouraging. The backlog has shrunk significantly and the overall quality of our service has improved. We could share some numbers, but this would be pure vanity since they would not be relevant to other companies.
This way of working has taught us a lot. Here are the most important points:
Having an operational team is awesome
The team is on the front line in case of incident. From an employee perspective, having someone who acknowledges and investigates critical issues is healthy for everyone. This may sound obvious, but the simple fact of saying ‘Yes, we saw your ticket, we will have a look’ is really reassuring, even if the issue is not fixed right away.
By receiving most of the issues, this team became the most accurate at evaluating the criticality of issues and tasks.
Full-time people are a real benefit …
They build real expertise on the platform. This means they have an overview of the platform’s quality and can develop the most relevant tools to take it to the next level. It also allows them to detect complex issues and pitfalls.
Becoming real specialists, they are able to help other teams in case of any need.
Following all the issues, they also gather a really clear overview of the platform’s health. This is not about how Medium-driven your technical stack is, nor about how good your code coverage is. It’s about how you deliver value to the community. They know what is working, what is not, and above all what is painful in day-to-day operations.
… but be careful regarding focus
The fact that they are full-timers means they are identified as the go-to people, which means they get a lot of pressure from the outside. It’s hard to focus on anything but the short term.
Moreover, as the most experienced, they are the most qualified to investigate production issues and will be called first in case of emergency.
Fixing bugs means more bugs
After a few months, the amount of incoming issues went up: our support team started to report more issues since we were allocating more bandwidth to them.
Everything cannot be fixed
At some point we realized that our backlog would never entirely disappear and that’s fine. During 10 months, we encountered various scenarios that are not always simple to handle:
- Some tasks are not worth fixing, because they won’t be relevant in the near future, or represent too much work for their low criticality.
- Some bugs cannot be fixed by design.
- Some issues can be really tough. Every engineer who tries to fix them gets stuck in the legacy and fails at some point.
It’s OK not to fix everything. It’s better to refuse tickets than to let them die in your backlog, because at least people get an explanation.
The rotation system is a good experiment …
It spreads the pain across the whole engineering team, which is obviously a sane way to work: you break it, you fix it. All engineers get to see what our members are experiencing, how legacy systems behave and what really matters on the operational side of things.
On the human side, the part-time developers shared various feedback:
- It lets them work and share time with different people
- It’s a great way to take a break from their regular job
- It opens their minds to various topics
- It helps to get a bigger picture
- It also allows better communication between teams
… but can be hard to manage
Rotations do affect the other teams and their delivery capacity. And over time, we lost some resources in the rotation pool due to normal turnover (since we were only considering people who had worked on our monolith). At one point, some teams had to provide an engineer for every rotation, consuming 20% of their time. To fix this, we thought about hiring contractors, but most of the time they don’t suit our needs: they are not used to dealing with production environments and all the performance aspects.
Handing over issues between rotations is also a complex topic. When developers cannot finish their work on an issue, it’s really important that they leave clear notes behind for the next person to take over. At the beginning, this was not really effective!
On the manager side, it’s pretty tough to balance long-term projects and production issues; we always tend to prioritize short-term emergencies over long-term improvements. We ended up using Scrum instead of Kanban to set sprint goals. That helped a lot with focusing on the long term.
Another difficult point is handling the agenda of rotation members. It’s actually pretty tough to pull an engineer away from their main work. You have to deal with holidays, trainings, important meetings, last-minute absences… and engineers should be warned soon enough so they can organize within their teams.
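To illustrate the scheduling headache, here is a minimal sketch of a round-robin rotation planner that skips engineers who are unavailable for a given slot. The function, its 2-week slot labels and the escalation fallback are illustrative assumptions, not our actual process or tooling.

```python
# Illustrative sketch only: round-robin assignment over a rotation pool,
# skipping engineers marked unavailable (holidays, trainings, meetings...).
from itertools import cycle

def plan_rotations(pool, slots, unavailable):
    """Assign one engineer per rotation slot, round-robin over the pool.

    pool: list of engineer names
    slots: list of slot labels (e.g. "2018-W01/02" for a 2-week period)
    unavailable: dict mapping slot -> set of engineers who cannot take it
    """
    schedule = {}
    candidates = cycle(pool)
    for slot in slots:
        blocked = unavailable.get(slot, set())
        # Try each pool member at most once per slot to avoid looping forever.
        for _ in range(len(pool)):
            engineer = next(candidates)
            if engineer not in blocked:
                schedule[slot] = engineer
                break
        else:
            schedule[slot] = None  # nobody available: escalate to managers
    return schedule
```

Even this toy version shows why the real thing is hard: every constraint you add (notice periods, team capacity, turnover in the pool) makes a fair schedule harder to produce.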
It’s not always great for developers
It’s not fun to work only on bugs: developers spend a lot of time understanding an issue, trying to reproduce it, investigating it and finding a solution. These tasks can get very frustrating, especially when the person is not used to them. And sometimes, the task is just boring. Moreover, most developers are problem solvers and are really affected by their individual performance at closing issues. If they don’t succeed, they lose their motivation!
These tasks have little to do with their normal work on other projects, which introduces context switching that is not easy to manage. That said, some developers see it as a way to break free from their roadmap for a few days.
It does not solve production health issues
By production health, we mean less critical issues that you can witness yourself on the platform or discover in the monitoring tools: performance problems, imperfect implementations, etc.
This could be a good approach to a better quality of service, and it sometimes allows you to find problems before they become visible to users.
But this is really time-consuming without proper tooling, especially on a large codebase. That said, the long-term goals of the team include some monitoring and performance improvements.
What it looks like today
The team is still performing well and the bug backlog is now really reasonable. We now have only one part-time engineer left, and this is enough for us. Our monolith has better stability and a smaller codebase, which allows us to consider it a service like any other in the company. The team is responsible for its operation and evolution.
This article follows the path of Handling bugs at Doctolib, written by Jessy Bernal, which presents another organization that could totally work for you ;)