Our team at Soluto comprises developers, product managers, and UX designers. We work together on a few dozen mobile applications and on our large-scale microservices backend. As part of our DevOps culture, we encourage small and frequent releases, so changes to our backend are very frequent. Add to that the strict SLAs our customers require us to meet, and we have a need for excellent monitoring.
For a few years now we have been using a developer on-call rotation, where the on-call developers are responsible for responding to production incidents. Most alerts are routed from Icinga through PagerDuty to the on-call developer, who is responsible for investigating issues, escalating incidents, and informing the relevant people. Icinga's configuration files live on the Icinga server's file system and are written in Icinga's own DSL, so changing the configuration is a cumbersome process: it requires connecting to the Icinga machine via SSH and editing several DSL files by hand. Since we did not want to exhaust our developers, we decided that only a few people would handle changes to the monitoring configuration.
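To give a sense of what these hand-edited files look like, here is a small, purely illustrative service definition in the classic Icinga DSL (the host and check names are hypothetical, not our actual configuration):

```
define service {
    use                  generic-service
    host_name            api-backend-1
    service_description  HTTP health check
    check_command        check_http!-u /health
}
```

Every such object lives in a file on the Icinga server itself, which is exactly why editing them required SSH access rather than a normal code-review workflow.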
The path we took for changing monitoring configurations had a few pitfalls. First, the fact that only a few people could make changes runs counter to DevOps and slowed the process down: instead of updating the monitoring configuration at the same time as the code, a developer had to wait for someone else to make the change for them. Second, since developers were not the ones actually making the changes, they did not feel as obligated to keep the monitoring configuration up to date as they did to fix their code or their tests. This created a gap between the code that was deployed and what was actually being monitored: sometimes we were missing monitoring checks, and other times we got false positives, because the checks were not coordinated with code changes. Third, and most importantly, this approach committed only a few people to the monitoring effort instead of the entire team.
We decided we needed to tackle monitoring head-on. The solution had to involve the entire team without making too many changes to our current workflow and the way we develop the product. All changes had to be audited and reversible, so that everyone could make mistakes and fix them quickly. The solution, of course, came from automation. We started using Puppet with the Icinga2 Puppet module to manage all changes to our monitoring configuration. Puppet Code Manager lets us keep the entire configuration in a Git repository, and Git handles the heavy lifting of auditing and gives us the ability to revert a change in case of a misconfiguration. The Puppet module also changes the configuration syntax from Icinga's DSL to JSON-like objects, which all our developers are familiar with.
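As a rough sketch of what this looks like in practice, a check can be declared as a Puppet resource and reviewed like any other code change. The resource name, parameters, and file paths below are illustrative assumptions, not our actual manifests:

```puppet
# Hypothetical service check declared via the Icinga2 Puppet module.
# Hostnames, URIs, and the target path are placeholders for illustration.
icinga2::object::service { 'api-health':
  display_name  => 'API health check',
  host_name     => 'api-backend-1',
  check_command => 'http',
  vars          => {
    'http_uri' => '/health',
  },
  target        => '/etc/icinga2/conf.d/api-health.conf',
}
```

Because this lives in a Git repository, adding a check is now a pull request rather than an SSH session, and a bad change is one `git revert` away.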
The changes we made improved our monitoring coverage significantly and helped us keep the monitoring configuration in sync with code changes. Yet we still felt that something was missing. We had better monitoring and more accuracy, but all alerts were still routed to the on-call developer, who got swamped by the additional alerts. To commit the team further to the monitoring effort and avoid exhausting the on-call developer, we decided to route most alerts to the team responsible for them instead of to the on-call developer. Every microservice in our backend has a team that owns it, and that ownership now includes responding to alerts during work hours. Outside of work hours we route only urgent alerts to the on-call developer, who can decide to escalate the alert to the responsible team.
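The routing policy above can be sketched as a small decision function. This is a minimal illustration of the logic, not our actual routing code; the work-hours window, function name, and return values are assumptions made for the example:

```python
from datetime import datetime, time

# Assumed work-hours window for the example.
WORK_START = time(9, 0)
WORK_END = time(18, 0)

def route_alert(owning_team: str, urgent: bool, now: datetime) -> str:
    """Decide who receives an alert, per the policy described above:
    the owning team during work hours; outside of them, only urgent
    alerts go to the on-call developer, who may escalate further."""
    in_work_hours = (
        now.weekday() < 5 and WORK_START <= now.time() <= WORK_END
    )
    if in_work_hours:
        return owning_team            # the owning team handles it directly
    if urgent:
        return "on-call developer"    # may escalate to the owning team
    return "deferred"                 # non-urgent: wait for work hours
```

The key design point is that the on-call developer becomes the exception path rather than the default sink for every alert.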
These changes made our team and our monitoring much better. The team is now more aware of the monitoring effort, and everyone takes part: our Puppet Git repository has over a thousand commits from almost every person on the development team. That includes our product managers (!), who now configure and change monitoring settings so they know when one of their team's KPIs changes. Product managers, though, mostly use our Mixpanel Nagios plugin.
Looking forward, we would like to make the process even better by adding tests to the Puppet Git repository. These tests will validate every change to the monitoring configuration and shorten the feedback loop for each deployment. We will also need to migrate to the new Puppet Icinga module to stay aligned with the rest of the community.