How On-Call improved my Work-Life Balance

Fredrik Fischer
Published in Nordnet Tech
May 24, 2023
Photo by Jared Rice on Unsplash

I remember my first week at Nordnet as structured chaos! A crucial part of our system was broken, and our most senior developer sat in front of the computer screen, deeply focused on fixing the problem.

As the clock kept ticking, more and more people gathered around the lone developer, like layers forming on an onion.

Photo by Nathan Dumlao on Unsplash

After an hour or two the problem was fixed and the bank was functional again. All credit to my fellow developers, but a lot of things were not very clear, especially to me as the new developer:

  • Who is responsible for taking action once a critical error occurs?
  • Who reported the error and when did it occur in the first place?
  • How do we know that someone is available to take action immediately, including outside office hours? (The bank is always open.)
  • Which systems of the bank is our team responsible for?

We handled the situation well at the time, but there was still some uncertainty about the process and the status of our systems. Sure, we had logs, metrics and some alerts, but that was not good enough in a business where customer trust is essential.

Fast-forward to today, and I feel much more confident in our systems. I can run errands over lunch, hit the gym or pick up my kids early once in a while. I trust that, 24/7, at least one person in our team will notice and take the lead if a critical error occurs. A side effect is that the quality of our systems has increased, and I feel confident releasing software updates multiple times a day.

How do operations responsibilities work today?

Over the last couple of years, Nordnet has moved from an on-premises setup to a cloud setup, and we have also changed how we work with operations. The mindset we have today is based on the notion:

“You build it, you run it” (from A Conversation with Werner Vogels)

This approach means that we as developers are responsible for the code we write and the systems we build. The responsibility goes beyond ordinary business hours and requires us to have a developer available 24/7. This, in turn, has a lot of positive effects on the quality of our systems and the speed of our development. To mention some of them:

  • Frequent releases and small changes mean that each change is specific and easy to evaluate
  • A deployment pipeline which allows easy rollback
  • An extensive monitoring and metrics setup that gives full transparency into the application status
  • Extensive test suites that make us comfortable deploying to production at any time (see: Coding for rapid releases, don’t forget the basics!)
  • Accurate and intelligent alerts that notify us if something goes wrong (see: Mastering Alerts in Cloud Applications)

This setup makes it easy to be the developer on call, as we know the inner workings of all the systems we build and have tools and pipelines that help us debug. Debugging is rarely needed, though, as the systems are kept to a very high standard.

  • The on-call developer can rest assured that a notification will be sent out if something goes wrong.
  • The on-call duty is compensated with a fixed sum that is not tied to the number of alerts responded to or fixed. This gives the team and Nordnet a shared motivation to make sure that the applications are always functional.

An alert going off and developers spending time on urgent problems is not a good situation to be in, as we want to focus our development efforts on new features.

The mindset we have regarding proactive and reactive work is inspired by Stephen Covey’s Urgent/Important matrix, which sorts work into four quadrants depending on whether it is urgent and whether it is important.

When an alert is triggered, this is a case of Urgent and Important work, and the highest priority is to fix the issue. This is reactive work and we want to avoid this as much as possible.

It is therefore important to fix the issue as quickly as possible to get back to a stable state in the Not Urgent/Important quadrant, and afterwards to make sure we:

  • Analyse the root cause of the incident: was it a one-off or is it a recurring issue?
  • Prioritise actions in the team to prevent the issue from happening again.

This way of working iteratively increases the quality of our applications and maximises the output of our development teams.
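
As a rough illustration of the quadrant thinking described above, the classification can be sketched in a few lines of Python. This is only a sketch: the `Quadrant` enum, the `classify` function and the short descriptions attached to each quadrant are my own hypothetical names and wording, not something taken from our tooling.

```python
from enum import Enum


class Quadrant(Enum):
    # The descriptions are illustrative assumptions, not official definitions.
    URGENT_IMPORTANT = "reactive work: fix it now (a triggered alert)"
    NOT_URGENT_IMPORTANT = "proactive work: root-cause analysis and prevention"
    URGENT_NOT_IMPORTANT = "interruptions: limit or delegate"
    NOT_URGENT_NOT_IMPORTANT = "distractions: drop"


def classify(urgent: bool, important: bool) -> Quadrant:
    """Place a piece of work in one of the four quadrants."""
    if important:
        return Quadrant.URGENT_IMPORTANT if urgent else Quadrant.NOT_URGENT_IMPORTANT
    return Quadrant.URGENT_NOT_IMPORTANT if urgent else Quadrant.NOT_URGENT_NOT_IMPORTANT


# A triggered alert is urgent and important; the follow-up work to prevent
# it from happening again is important but no longer urgent.
alert = classify(urgent=True, important=True)
follow_up = classify(urgent=False, important=True)
```

The goal is to spend as much time as possible on work like `follow_up`, so that work like `alert` becomes rare.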

Conclusion

Having operations responsibilities essentially means that we need to take responsibility for our development efforts and their consequences.

This creates an intrinsic driver for all developers to write quality systems and applications. It is now an integral part of our daily way of working and of how we design systems. The work is ongoing, as there is always room for further improvement.

Poor code quality or an immature process for dealing with alerts could quickly spiral out of control. This could result in developers spending more time responding to alerts and fixing bugs, and less time building new features.

“Why didn’t you write the same quality systems before?”

“Do we need to have an on-call setup in order to achieve this?”

Our experience is that the on-call rotation provides a great incentive to improve system quality, because the responsibility gives us a holistic, end-to-end view of ownership. It makes it clear that good code quality reduces the amount of time spent operating the systems.

  • As a part of the on-call rotation we have a weekly schedule with one developer always ready to take action on incoming alerts, allowing the rest of the team to relax outside of office hours.
  • The alerts are set up by those of us who designed and built the system, which means that we know exactly what triggered an alert and, in most cases, how to approach the issue.
  • The scope of the on-call duty only covers the applications built by the team and therefore ensures clear visibility and responsibility.
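
To make the weekly schedule mentioned above a bit more concrete, here is a minimal sketch of a round-robin rotation in Python. It is not how our scheduling actually works; the function name `on_call_for`, the anchor date and the example names are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical anchor: a Monday from which whole weeks are counted.
# Counting from a fixed date keeps the rotation from resetting at year end.
ROTATION_START = date(2023, 1, 2)


def on_call_for(day: date, team: list[str]) -> str:
    """Return the team member on call during the week that contains `day`."""
    if not team:
        raise ValueError("team must not be empty")
    weeks_since_start = (day - ROTATION_START).days // 7
    return team[weeks_since_start % len(team)]


# Example with made-up names: each developer is on call one week in three.
team = ["Ada", "Bo", "Cy"]
this_week = on_call_for(date(2023, 5, 24), team)
```

With a single name per week, everyone else on the team knows that they are not the one expected to react, which is what makes the time off duty restful.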
