The Duty Guy: the key to empowering engineers

Nicolas De Nayer
Nov 21, 2017 · 6 min read
© Illustration by Bailey McGinn

It all began two years ago, when the Doctolib tech team started to grow. We started as four engineers, and now we are nearing forty. We set up a process so that every day one engineer is on duty for the whole team. Thanks to him, everyone else can stay focused: the Duty Guy handles all the small obstacles of the day and is in charge of releasing the new features at the end of the day. But there is more to it than that!


In the beginning, the two founders were in charge of every step of product development. As part of this, they controlled all the daily pushes to production. However, this had a harmful effect on the developers: they knew they had a safety net and that their mistakes would ultimately be caught and fixed by the founders. The two founders stayed every day until after 8pm to double-check every new line of code, wait for the traffic to calm down, revert code in case of problems, and so on.

Start with the rollout…

We decided that everyone on the team would be responsible for pushing new code to production. To do so, we would have a guy on duty.

It is a rotating role, which means that every day a different developer is in charge of pushing the button to release to production.
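
A rotation like this can be sketched in a few lines of Ruby. This is only an illustration, not our actual scheduling code: the `duty_for` helper, the placeholder roster, the reference start date, and the choice to skip weekends are all assumptions made for the example.

```ruby
require 'date'

# Hypothetical roster; the names are placeholders.
ENGINEERS = %w[alice bob carol dave].freeze

# Number of working days (Monday to Friday) from `start` up to,
# but not including, `date`.
def working_days_between(start, date)
  (start...date).count { |d| !(d.saturday? || d.sunday?) }
end

# Round-robin over the roster, advancing one person per working day.
# `start` is an assumed reference Monday on which the first person is on duty.
def duty_for(date, roster: ENGINEERS, start: Date.new(2017, 1, 2))
  raise ArgumentError, 'no duty on weekends' if date.saturday? || date.sunday?

  roster[working_days_between(start, date) % roster.size]
end
```

With this roster, `duty_for(Date.new(2017, 1, 3))` returns `"bob"`, and the rotation wraps back to `"alice"` on the fifth working day.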

The first process of the Duty Guy was pretty simple and focused on the rollout:

if !today_rollout.done? && is_it_after_4pm?
  follow_checklist(:production_rollout)
end

…and iterate!

It worked very well: after the first couple of weeks everyone had had a chance to take part in a rollout, and after a month or two the process became routine. The duty role has evolved quite a lot since then. We even have an algorithm for it now:

Disclaimer: we use Sentry for error tracking, to monitor and fix crashes. In the code below, a "sentry" is an error reported by Sentry.

begin
  while self.duty? do
    # Keep master green
    while continuous_integration_status == 'fail' do
      follow_checklist(:troubleshoot_continuous_integration)
    end
    # Ensure no critical errors will be released
    staging.sentries.each do |sentry|
      if sentry.created_at > 2.days.ago
        owner = identify_the_most_adequate_person
        owner.poke!
        sentry.assign(owner)
      else
        conclusions = investigate(sentry)
        Jira.create!(conclusions)
      end
    end
    # Ensure no poop has been released
    production.sentries.unassigned.each do |sentry|
      owner = identify_the_most_adequate_person
      owner.poke!
      sentry.assign(owner)
    end
    # Ensure last rollout did not degrade performance
    if !today_performance_check.done?
      follow_checklist(:platform_performance)
    end
    # Give customers more happiness!
    if !today_rollout.done? && is_it_after_4pm?
      follow_checklist(:production_rollout)
    end
  end
ensure
  add_entry_in_duty_log_book
end
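
The algorithm above is pseudocode, but the staging triage rule at its heart is easy to make runnable. Here is a minimal sketch in plain Ruby. The `ErrorReport` struct, the `triage` method, and the symbols it returns are hypothetical stand-ins, not a real Sentry or Jira API. Note that `2.days.ago` reads like ActiveSupport syntax; the sketch uses plain `Time` arithmetic instead so that it runs without any gem.

```ruby
# Hypothetical stand-in for an error reported by the tracker.
ErrorReport = Struct.new(:title, :created_at)

TWO_DAYS = 2 * 24 * 60 * 60 # two days, in seconds

# Mirrors the staging branch of the pseudocode: an error newer than two days
# is handed to an owner, an older one is investigated and tracked in a ticket.
def triage(report, now: Time.now)
  if report.created_at > now - TWO_DAYS
    :assign_to_owner
  else
    :investigate_and_file_ticket
  end
end
```

Passing `now:` explicitly keeps the rule easy to test: a report created one hour before `now` yields `:assign_to_owner`, while one created three days before yields `:investigate_and_file_ticket`.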

It might seem over-processed and lacking in fun (AKA brain challenges), but it is not. Almost everything that could be automated has been.

Inside the different checklists, there are tasks that still require a lot of engineers’ brain CPU. For instance, the detection of a performance regression is automated, but how to tackle it still requires human analysis.

Example of an entry in the duty log book

When I say iterate, I mean we did. A lot.

Rome was not built in a day. We ended up with this complex algorithm after hundreds of iterations. Maybe the key to success is that the process is not only repeated every day, but repeated each time by a different team member with fresh eyes. This means that as soon as something is not working, we adapt.

It might be that a task becomes too repetitive or too long, so we automate it: for example, finding the endpoints whose response time degraded the most after the rollout.
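
To give an idea of what such an automation can look like, here is a minimal sketch. It assumes response times are already available as two hashes mapping an endpoint to its average response time in milliseconds, one measured before the rollout and one after. The method name, the data shape, and the sample figures are assumptions made for the example, not our actual tooling.

```ruby
# Returns up to `limit` endpoints whose response time increased the most,
# as [endpoint, delta_ms] pairs ordered from most to least degraded.
def top_degraded_endpoints(before, after, limit: 3)
  deltas = after.each_with_object([]) do |(endpoint, ms_after), acc|
    ms_before = before[endpoint]
    next if ms_before.nil? # new endpoint: nothing to compare against

    delta = ms_after - ms_before
    acc << [endpoint, delta] if delta.positive?
  end
  deltas.max_by(limit) { |_endpoint, delta| delta }
end

# Made-up sample data, in milliseconds.
before = { '/search' => 120, '/booking' => 200, '/login' => 80 }
after  = { '/search' => 180, '/booking' => 190, '/login' => 95, '/new' => 300 }

degraded = top_degraded_endpoints(before, after)
# degraded is [["/search", 60], ["/login", 15]]
```

Endpoints that got faster, or that did not exist before the rollout, are left out, so the person on duty only sees what needs attention.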

Or people would forget that they were on duty, so we created a dashboard which is displayed for every team.

Sometimes we went too far, especially by adding too many responsibilities to the role. For instance, at one point we asked those on duty to improve the slowest transaction of the platform as shown by NewRelic. It was clearly too complex and too long a task, so we rolled back to a simpler version, always following the KISS principle.

Also, it might be interesting to note that we do not have a formal ritual for continuous improvement. I am still the owner of this process, and as soon as a developer has an idea or finds a step painful, he comes to me, we talk about it, and if we agree to make the change, we share it at the next tech-time (a bi-monthly meeting for all our engineers to shine and share).

Benefits

The Duty Guy is at the core of our engineering team and we do see a lot of benefits from it.

For starters, we can now rely on every engineer to respect best practices, and we can breathe easy knowing that production is being monitored like a newborn child.

Developers face fewer interruptions, such as a broken build or a burst of errors in production, since they are protected by the Duty Guy.

But the most profound benefit is about giving more context to the developers. When a developer is wearing the Duty Guy’s hat, he is on the front line of the code’s impact. At any point, a single line of code might slow down the whole platform, or a broken feature could generate thousands of calls to the support team. You build it, you run it. At Doctolib we believe in developers having total ownership of the whole feature lifecycle, from idea to production.

It is even more advantageous for new joiners: as soon as you arrive, you are added to the list of Duty Guys. Because of this, only two weeks after her arrival, Mélanie knew how to use NewRelic to monitor a transaction in production. It is also a nice sign of trust that, when you arrive, you are given the power to push the launch button of a spaceship like Doctolib!

It works, and it works well

Of course, some developers can get ticked off when they realize that they are on duty, but overall they understand the reasons behind it.

Testimonials:

“It helps to be exposed to what other teams are doing and how they are doing it.”

“Pretty much everything is automatised, when everything is ok we don't waste too much time!”

“For a new joiner it is really beneficial: meeting the whole team, quickly discovering the tools, seeing parts of the application that we would never see otherwise.”

“I like the principle; it is empowering.”

“To always have someone on duty helps the others focusing on their tasks.”

Next

The Doctolib DevOps team is just starting its own Duty Guy. In the coming weeks, we hope the two duties will work even more closely together and help each other. They could shake hands at the beginning of the day and pair program on tasks such as setting up the log rotation policy, adding a new machine to the continuous integration, or fixing a log in production.

Finally, we are still wondering how we will scale the Duty Guy. Will it work when we are fifty engineers? How can a developer remain connected with the process if he performs it only every three months?
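
To put rough numbers on that question, here is a back-of-the-envelope sketch. It assumes an even rotation, five working days per week, and no holidays or absences. The method name and the `on_duty_per_day` parameter are made up for the example.

```ruby
WORKING_DAYS_PER_WEEK = 5

# Weeks between two duty days for the same engineer, assuming an even rotation.
def weeks_between_duties(team_size, on_duty_per_day: 1)
  working_days = team_size.to_f / on_duty_per_day
  working_days / WORKING_DAYS_PER_WEEK
end

weeks_for_forty = weeks_between_duties(40) # 8.0
weeks_for_fifty = weeks_between_duties(50) # 10.0
```

Under these assumptions, a team of fifty with a single person on duty per day means roughly ten weeks between two turns, and putting two people on duty per day would halve that interval.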

Find out more about what we are working on here

Written by Nicolas De Nayer, VP of Engineering @ Doctolib, the #1 booking platform and management software provider for doctors in Europe
