The Duty Dev: the key to empowering engineers

Published in Doctolib · 6 min read · Nov 21, 2017

It all began two years ago when the Doctolib tech team started to grow. We started as four engineers, and now we are nearing forty. We set up a process so that every day one engineer is on duty for the whole team. Thanks to that person, everyone else can stay focused: the Duty Dev handles all the small obstacles of the day and is in charge of releasing the new features at the end of the day. But there is more to it than that!

In the beginning, the two founders were in charge of every step of product development. As part of this, they controlled all the daily pushes to production. However, this had harmful effects on the developers: they knew they had a safety net, and that their mistakes would ultimately be caught and fixed by the founders. The two founders stayed past 8pm every day to double-check every new line of code, wait for the traffic to calm down, revert code if there was a problem, and so on.

Start with the rollout…

We decided everyone on the team would have the responsibility to push new code to production. To do so we would have a dev on duty.

It is a rotating role, which means that every day a different developer is in charge of pushing the button to production.
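As an illustration only, a daily rotation like this can be sketched as a round-robin over a roster, one developer per working day. The roster names, the start date, and the `duty_dev_for` helper are assumptions made for this sketch, not a description of the tooling we actually use.

```ruby
require 'date'

# Hypothetical roster; the names are placeholders, not the real team.
ROSTER = %w[alice bob carol dave].freeze

# Count working days (Monday to Friday) from `start` up to and including `day`.
def working_days_between(start, day)
  (start..day).count { |d| !(d.saturday? || d.sunday?) }
end

# Pick the duty dev for `day` by walking the roster round-robin,
# one developer per working day. Returns nil on weekends.
# The default start date is an arbitrary Monday chosen for the example.
def duty_dev_for(day, roster: ROSTER, start: Date.new(2017, 11, 20))
  return nil if day.saturday? || day.sunday?

  index = (working_days_between(start, day) - 1) % roster.size
  roster[index]
end

puts duty_dev_for(Date.new(2017, 11, 21))
```

With a roster of N developers, each person comes back on duty every N working days, which is what keeps the role fresh for everyone on a small team.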

The first process of the Duty Dev was pretty simple and focused on the rollout:

if !today_rollout.done? && is_it_after_4pm?
 follow_checklist(:production_rollout)
end

…and iterate!

It worked very well: after the first couple of weeks everyone had had a chance to take part in a rollout, and after a month or two the process became routine. The duty role has evolved quite a lot since then. We even have an algorithm for it now:

Disclaimer: we use Sentry for error tracking, to monitor and fix crashes. In the code below, a "sentry" is an error reported by that tool.

begin
 while self.duty? do
  # Keep master green
  while continuous_integration_status == 'fail' do
   follow_checklist(:troubleshoot_continuous_integration) 
  end
  # Ensure no critical errors will be released
  staging.sentries.each do |sentry|
   if sentry.created_at > 2.days.ago
    owner = identify_the_most_adequate_person
    owner.poke!
    sentry.assign(owner)
   else
    conclusions = investigate(sentry)
    Jira.create!(conclusions)
   end
  end
  # Ensure no poop has been released
  production.sentries.unassigned.each do |sentry|
   owner = identify_the_most_adequate_person
   owner.poke!
   sentry.assign(owner)
  end
  # Ensure last rollout did not degrade performance
  if !today_performance_check.done?
   follow_checklist(:platform_performance)
  end
  # Give customers more happiness!
  if !today_rollout.done? && is_it_after_4pm?
   follow_checklist(:production_rollout)
  end
 end
ensure
 add_entry_in_duty_log_book
end

It might seem over-processed and lacking in fun (that is, brain challenges), but it is not. Almost everything that could be automated has been.

Inside the different checklists, there are tasks that still require a lot of engineers’ brain CPU. For instance, the detection of a performance regression is automated, but how to tackle it still requires human analysis.

Example of an entry in the duty log book

When I say iterate, we did. A lot.

Rome was not built in a day. We ended up with this complex algorithm after hundreds of iterations. Maybe the key to success is that not only is the process repeated every day but it is done so by a different team member with fresh eyes. This means that as soon as something is not working, we adapt.

It might be that a task is becoming too repetitive or too long, so we choose to automate it. One example is finding the endpoints whose response time degraded the most between before and after the rollout.
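To give an idea of what such an automation can look like, here is a minimal sketch that ranks endpoints by how much their response time grew across a rollout. The endpoint names, the timings, and the `top_degraded_endpoints` helper are all invented for illustration; a real version would pull the numbers from a monitoring tool rather than from hard-coded hashes.

```ruby
# Hypothetical samples: mean response time in milliseconds per endpoint,
# measured before and after a rollout. Values are invented for illustration.
BEFORE_MS = { '/search' => 120.0, '/booking' => 200.0, '/login' => 80.0 }.freeze
AFTER_MS  = { '/search' => 180.0, '/booking' => 210.0, '/login' => 75.0 }.freeze

# Return up to `limit` endpoints whose response time grew the most, as
# [endpoint, relative_increase] pairs sorted from most to least degraded.
# Endpoints missing from either sample, or that got faster, are skipped.
def top_degraded_endpoints(before, after, limit: 3)
  before
    .select { |endpoint, ms| after.key?(endpoint) && ms.positive? }
    .map { |endpoint, ms| [endpoint, (after[endpoint] - ms) / ms] }
    .select { |_, increase| increase.positive? }
    .max_by(limit) { |_, increase| increase }
end

top_degraded_endpoints(BEFORE_MS, AFTER_MS).each do |endpoint, increase|
  puts format('%-10s +%.0f%%', endpoint, increase * 100)
end
```

Ranking by relative rather than absolute increase is a design choice of this sketch: it keeps a fast endpoint that doubled in latency from being hidden behind a slow one that barely moved.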

Or people would forget that they were on duty, so we created a dashboard which is displayed for every team.

Sometimes we went too far, especially by adding too many responsibilities to the role. For instance, at one point we asked those on duty to improve the slowest transaction of the platform as shown by NewRelic. It was clearly too complex and too long a task, so we rolled back to a simpler version, always following the KISS principle.

Also, it might be interesting to note that we do not have a formal ritual for continuous improvement. I am still the owner of this process: as soon as a developer has an idea or finds a step painful, they come to me, we talk about it, and if we agree to make the change we share it at the next tech-time (a bi-monthly meeting for all our engineers to shine and share).

Benefits

The Duty Dev is at the core of our engineering team and we do see a lot of benefits from it.

For starters, we can now rely on every engineer to respect best practices, and we can breathe easy knowing that production is being monitored like a newborn child.

Developers face fewer interruptions, such as a broken build or a burst of errors in production, since the Duty Dev shields them.

But the most profound benefit is giving more context to the developers. When a developer wears the Duty Dev’s hat, they are on the front line of the code’s impact. At any point a single line of code might slow down the whole platform, or a broken feature could generate thousands of calls to the support team. You build it, you run it. At Doctolib we believe in developers having total ownership of the whole feature lifecycle, from the idea to production.

It is even more advantageous for new joiners: as soon as you arrive, you are added to the list of Duty Devs. Because of this, only two weeks after her arrival, Mélanie knew how to use NewRelic to monitor a transaction in production. It is also a nice sign of trust that when you arrive you are given the power to push the launch button of a spaceship like Doctolib!

It works, and it works well

Of course, some developers can get ticked off when they realize that they are on duty, but overall they understand the reasons for doing it.

Testimonials:

“It helps to be exposed to what other teams are doing and how they are doing it.”

“Pretty much everything is automated; when everything is ok we don't waste too much time!”

“For a new joiner it is really beneficial: meeting the whole team, quickly discovering the tools, seeing parts of the application that we would never see otherwise.”

“I like the principle; it is empowering.”

“Always having someone on duty helps the others focus on their tasks.”

Next

The Doctolib DevOps team is just starting its own Duty Ops. In the coming weeks we hope the two duties will work even more closely together and be able to help each other. They could shake hands at the beginning of the day and pair program on tasks such as setting up the log rotate policy, adding a new machine to the continuous integration, or fixing a log in production.

Finally, we are still wondering how we will be able to scale the Duty Dev. Will it work when we are fifty engineers? How can a developer remain connected with the process if they perform it only once every three months?

Written by Nicolas De Nayer