*Beep* *Beep* “PagerDuty Caller PagerDuty Caller”

Jessica Greene
14 min read · Feb 18, 2020


Last June I started participating in the out-of-hours on-call rotation. I volunteered for this opportunity to get closer to the infrastructure our code runs on and to learn how to handle incident response. By sharing my experience in this article, I hope to give insight into how on-call can be implemented in a small team and to demystify some of the expectations for those who have not yet had this experience.

Picture credit PagerDuty Product Design pagerduty.com

At Ecosia we are nurturing a DevOps culture, meaning all developers take ownership of the services they work on, including ensuring that monitoring and logging are in place, as well as fallback implementations, so that if one service goes down it does not bring everything else down with it.

As a developer new to the industry, this has been a fascinating topic to learn about, and, not having any other point of reference, for better or worse I have embraced it fully. To get to know what it entails I read the ‘Site Reliability Engineering’ book and then the fantastic ‘Effective DevOps’ by Ryn Daniels and Jennifer Davis. Both books gave me insight into how these practices had been implemented in various environments and helped me understand the practices the developer team had chosen at Ecosia. They also gave me the chance to learn some of the vocabulary around this area of development and the principles behind it.

“Traditionally siloed technical teams interact through complex ticketing systems and ritualistic request procedures, which may require director-level intervention. A team taking a more DevOps approach talks about the product throughout its lifecycle, discussing requirements, features, schedules, resources, and whatever else might come up. The focus is on the product, not building fiefdoms and amassing political power.”
Mandi Walls, Building a DevOps Culture

When I joined Ecosia, two engineers shared the out-of-hours on-call responsibilities, and the rest of the dev team rotated a primary on-call shift during office hours. When one of the two left the company, and while their replacement was being hired and onboarded, other engineers needed to take a primary on-call role out of office hours to relieve some of the pressure on the remaining developer, who took a secondary on-call role. This meant that while the primary on-call engineers handled most incidents, they could escalate if they came across something they were unable to mitigate, either because they had exhausted their knowledge of the system or because they had insufficient permissions. As this was not going to be a permanent solution, and it was planned that I would have three weeks of on-call over a three-month period, it felt manageable and like an opportunity to find out more about what on-call entailed.

Tooling — knowing when things happen and finding out what is going on

At the time we had roughly 30 individual applications, which we hosted on our own Kubernetes setup. Being able to observe applications and the environment they are running in is key when handling an incident. At Ecosia we are able to view the services, their logs, information about their deployments, the pods they run in, and the nodes those pods run on with tools such as kubectl, kubectx, kubens, and stern.
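As a sketch of what an inspection session with these tools might look like (the cluster context, namespace, and service names here are made up for illustration, not our actual setup):

```shell
# Point kubectl at the right cluster and namespace
# (kubectx/kubens are convenience wrappers around kubectl's config commands)
kubectx production-cluster
kubens web

# List the pods behind a service; -o wide also shows the node each pod runs on
kubectl get pods -l app=example-service -o wide

# Inspect the deployment and recent events for a misbehaving pod
kubectl describe deployment example-service
kubectl describe pod example-service-5d9c6bbf45-x2x7k

# Tail logs across all of the service's pods at once with stern
stern example-service --since 15m
```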

We also have metrics set up via Prometheus, which we then observe on dashboards built with Grafana. These allow us to see trends in the behavior of an application, such as request times or error rates, and to gain insight into what is going on, not only during incidents but also in general. Another tool we use is Pingdom, a service which regularly pings our applications from various servers to check whether they are reachable and how long external requests take.

With these observability tools we are also able to set rules and thresholds for the expected behavior of our applications, and to set up alerts for when our applications fail to meet them. For example, if an application is emitting a high error rate for a prolonged period of time, we want to know about it. Non-critical alerts go to either our shared dev email or one of our Slack channels, while alerts deemed critical are sent to PagerDuty, a service that routes alerts to the on-call phone. With it we are also able to schedule the on-call rotation, see the status of an alert, and set rules for escalation times, for example what to do if an alert is not acknowledged.
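As a rough sketch, a Prometheus alerting rule for a sustained high error rate could look something like this (the metric name, job label, threshold, and annotation values are illustrative assumptions, not our actual configuration):

```yaml
groups:
  - name: example-service
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return 5xx, sustained for 10 minutes
        expr: |
          sum(rate(http_requests_total{job="example-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="example-service"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical   # critical alerts are routed to PagerDuty; others to Slack/email
        annotations:
          summary: "example-service error rate above 5% for 10 minutes"
          runbook: "https://example.com/runbooks/example-service"
```

The `for: 10m` clause is what makes this a "prolonged period" alert: the condition must hold continuously before the alert fires, which helps keep short blips from paging anyone.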

Runbooks — where to find help

When an alert comes through, it is important to try to mitigate the incident as soon as possible, meaning getting the service up and running again quickly so there is as little user impact as possible. This may mean reverting recent changes or turning off non-critical services. After the incident has been mitigated, we can start to look for the root cause if it is not already known.

Some incidents may recur or behave similarly to previous incidents we have encountered. Additionally, if a certain service is down, it helps to know the steps that will bring it back to stability. For this we have created runbooks, or playbooks as they are sometimes called. These document steps for how to respond to certain incidents, as well as links to the dashboards and documentation for the service in question.

Our runbooks help us quickly find the information we need to try to mitigate incidents and/or find out more about what is happening. This is really useful, as receiving a page can be quite stressful, and when it happens during the night the on-call engineer may not be as focused as they would be otherwise.

Training — the dungeons and dragons of on-call

As well as getting familiar with the tools mentioned above, we decided to do some in-house training, as the other developers joining the on-call rotation and I were new to the process and also worked mainly on application code as opposed to platform or infrastructure code. The engineer leading the session came up with scenarios (some of which had happened previously at Ecosia) and gave us the opportunity to talk through how we would try to mitigate them. For this we didn’t touch our laptops, but instead asked questions out loud such as ‘I look at the error rate panel in the Grafana dashboard; what do I see?’

We were also able to recreate some incidents on our staging cluster, where we could then practice using the various tools and techniques we had learnt. Of course there is no way to truly replicate an incident, but this helped us prepare and, in a group setting, discuss what could be done. For me this took away a lot of the fear and anxiety about being on-call. I also learnt a lot about how our applications were being served and gained confidence in how to interact with them.

Life out & about while on call — first night, saunas, evenings out, my partner and other people’s reactions

Before going on-call I felt quite anxious about how it would be. Mainly I wondered about my ability to wake up if an alarm went off during the night, and whether my sleepy brain would be at all capable of identifying the problem, let alone fixing it. I was also concerned whether my phone was set up correctly to receive the alerts and override the volume settings I have for all other notifications. I actually left the volume up on my other notifications for a while because I was so nervous they would not come through (even after testing this several times with self-made alerts in PagerDuty).

In PagerDuty you can set how your notifications are handled, and I had it set up so that I would get a push notification immediately and then a phone call two minutes later. PagerDuty would escalate to the next level (the secondary on-call engineer) if I did not respond within three minutes in total, so it was important to have my phone nearby at all times.

At the time we had quite a few noisy alerts (alerts that went off but resolved within a couple of minutes), so on the first evening my phone was going off a lot. I later found this was due to a configuration error and was able to fix it. Honestly, though, those noisy alerts did offer some comfort: at least the phone worked!

At night I had my phone set to do-not-disturb, which blocked the push notifications, but I had whitelisted the numbers from PagerDuty so that calls would come through even in do-not-disturb mode. This meant that alerts which resolved within one to two minutes never reached me during the night, and that was in fact all of them, so I didn’t get called once.

This should have meant a peaceful night’s sleep, but I must admit my anxious brain woke me up more than a few times to check that no call had indeed come through and I had simply missed it. This got easier after the first night, but for the whole week it was always somewhere in the back of my mind.

In our training we had discussed how being on-call might affect our lives, so I already had a fairly good idea of what to expect. For the weeks I was scheduled on I made sure to have no weekend trips or big nights out planned, though this didn’t stop me going out. We were told we were not expected to be teetotal during the time and that it was at our discretion; I personally felt fine having a few drinks in the evenings. I also felt fine being out during my time on call. I either took my laptop with me in my handbag or stayed close enough to our building that I would be able to get home quickly should an incident occur. It was the weekend of 48 Hours Neukölln in Berlin, so there was plenty going on outside our doorstep.

I am very proud to tell people where I work; I feel incredibly lucky to work at a company having such a positive impact on the world and with such a range of interesting and passionate people. So I was very happy to inform people I was now taking on responsibility for covering the on-call rotation. It was often at this point that the phone would go off in my hand (mainly those noisy alerts I mentioned before) and I would have to apologise and step to one side. Most people were interested to know what it entailed and how I was finding it; it’s part of the reason I decided to write this blog post.

Before going on the rotation there was one person I had to discuss it with, and that was my partner, mainly because there was a real risk that my phone going off in the middle of the night would wake him as well as me. He has been incredibly supportive during my change into tech (in fact he encouraged me to do it in the first place), and with this it was no different. He also understood that I needed to be mindful of what I took on during my rotation time, and that I would be absent if a page came through.

One activity that I hadn’t prepared for: I normally go to the gym on Saturdays and then to the sauna. This posed the problem of what to do if a page came through. While working out it wouldn’t be a problem to dip into the changing rooms and pull my laptop out, but the sauna would be harder, as I would not have my phone inside with me. I debated not going at all but then reconsidered. I decided it would still be possible if I left the phone on high volume outside the cabin, where I would still hear it. I also based this decision on the fact that it was the weekend and no developers were working, so it was unlikely that a page would come through. Indeed none did, and I was able to have my sauna :D

Handling an incident

In the end I didn’t have to handle any incidents with high customer impact while I was on primary on-call, though I did assist during a couple by taking notes and checking logs while another developer led the incident management.

One night I was woken by a page about a service being down. I immediately got my laptop out and started looking at the logs and monitoring for that service to see if I could figure out how to mitigate the issue and get the service back up. I tried a few techniques from our training, such as restarting the pods the service was running in, to no avail. For each thing I tried or learnt I made a note in our on-call Slack channel, so I could keep track of the incident, and so that if I needed to escalate, the secondary engineer would be able to get up to speed quickly.
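A pod restart like the one I attempted can be done in a couple of ways with kubectl (the service name here is illustrative); deleting the pods works because the Deployment’s ReplicaSet recreates them automatically:

```shell
# Delete the service's pods; the Deployment spins up replacements
kubectl delete pod -l app=example-service

# On kubectl 1.15+, trigger a rolling restart of the whole Deployment instead
kubectl rollout restart deployment/example-service

# Watch the rollout until the new pods are ready
kubectl rollout status deployment/example-service
```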

At the time we didn’t have a concrete definition of each service’s escalation priority, and as I couldn’t get the service back up I decided to escalate. The secondary on-call engineer joined me in the Slack channel and I explained the situation. They suggested a few things we could try, but pointed out that this was actually only affecting a secondary service, which was not business critical and could therefore be left until business hours to resolve.

In the end, as the priority level for this service was low, we decided to leave resolving the issue until the following morning, when we were sure that with fresher minds we would find a way to mitigate it. This proved to be a good course of action: with a more awake mind I was quickly able to find a solution, and what definitely helped were the notes I had made while investigating the incident!

Having a clear idea of what is business critical and what can wait until the next working day is very important, as is having a clear idea of when to escalate. When I decided to escalate, I felt I had exhausted all the resources I had and would be better equipped to tackle the incident with support. Luckily I was not the last step in the escalation policy and had another engineer to escalate to. It would have been much more stressful if I had been fully responsible for handling the incident alone.

Assessing the process after the fact — blameless postmortems

While handling incidents when they happen is a key part of being on-call, it is also very important to have a self-evaluating model so we can improve future incident handling. We do this by creating a document we call a postmortem, which records how the incident was handled, how the problem was mitigated, the root cause if known, and, importantly, any action items that need to be taken. At Ecosia we normally also schedule a meeting to go over the postmortem document. In these meetings, those who handled the incident plus some other engineers (perhaps those who own the affected service, or others in the on-call rotation) go through the timeline of the incident and look for improvements we can make to our process, either to prevent certain things from happening or to be able to mitigate problems quicker. For example, if we notice it is taking too long to escalate an issue to the secondary engineer, we may look at how we define when an incident needs to be escalated.

An important aspect of these documents and meetings is that they are blameless, meaning they do not name individuals or assign responsibility to them. Instead, engineers are referenced by their role, and the idea is to see how the process can be improved rather than laying blame on the specific people involved.


Compensation — money and time back

While I didn’t take part in the on-call rotation primarily for monetary reasons, I think it is important to discuss compensation. On-call at Ecosia does not mean you are working 24/7; however, there is an expectation that you are available and reachable during the entire on-call shift, and if for some reason you will not be available, you are responsible for finding someone to cover. This is definitely intrusive, and my personal opinion is that no amount of money can truly compensate for my free time. What I appreciate at Ecosia, though, is that as well as monetary compensation we were given time back for any incident handling that happened outside of working hours. The time awarded was rounded up to the nearest hour and doubled: if the pager went off and I had to spend 20 minutes mitigating an incident, I would be compensated with 2 hours. We recorded this compensated time in a spreadsheet and were encouraged to take it within the week after the incident, though if that wasn’t possible we were entitled to take it at a later point.
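The rounding rule above can be sketched in a few lines of Python (the function name is mine, for illustration, not an Ecosia tool):

```python
import math

def time_back(minutes_spent: int) -> int:
    """Hours of time-back compensation for out-of-hours incident handling:
    round the time spent up to the nearest hour, then double it."""
    hours = math.ceil(minutes_spent / 60)
    return hours * 2

# A 20-minute mitigation rounds up to 1 hour, doubled to 2 hours of time back.
print(time_back(20))  # → 2
print(time_back(90))  # → 4
```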

The policy at Ecosia for on-call compensation is currently in development, as we look to set up a new rotation with a growing team and a growing number of services. Before taking on an on-call position I would definitely ask what compensation has been arranged, including time compensation, and also about possible turnover time (i.e. if you’re woken up at 2am by a page, at what time are you expected to be in the office the next day).

Summing it up

If you have the opportunity to be involved in an on-call rotation, it can give you a lot of insight into the infrastructure your code runs on and into how to handle incidents. I would, however, advise you to think about the impact it may have on your life outside work. This will depend a lot on how your company handles on-call. Things to consider: how often you will be on-call and for how long, in what amount of time you are expected to acknowledge an alert, and who you can escalate to if you are really stuck. A week of on-call at a time was definitely strenuous for me, and I preferred to have at least two weeks between shifts. Don’t forget to also think about how this commitment may affect family members or people you live with, particularly if you may get paged late at night.

I’d also like to acknowledge that being part of an on-call rotation is not possible or enjoyable for everyone. While my experience was overall positive, there are definitely plenty of people who do not report the same. There are very good reasons why this may not be something you can or want to take part in, and a lot will depend on your company’s on-call programme (compensation, escalation policy, availability expectations, whether you are able to opt out). You might instead ask to join training sessions, if your company runs them, or support during working-hours on-call shifts to pick up some of this knowledge.

At the moment I am no longer in the on-call rotation, as we are rearranging how on-call is set up at Ecosia, but I would consider doing it again in the future should the opportunity arise.



Jessica Greene

Backend/Data Engineer @Ecosia 🌳 interested in IoT, ML, GO, Python, data visualisation Co-organiser @PyLadiesBer ❤ she/her Previously Roaster @THEBARNBERLIN ☕