How to Monitor Failed Puppet Runs: Graylog and Sensu for Logging and Alerting
Our company has nearly one thousand Linux-based virtual machines, serving all kinds of applications and services. We handle the administration for pension funds and host web portals for their participants, so our infrastructure has to be available at all times. As a DevOps engineer, I can’t realistically log into every machine to check whether it’s doing OK. It’s therefore good practice to send log messages to a log management platform that provides search, dashboard and alerting functionality. This saves a lot of time and minimizes system downtime. I already had some experience with Splunk, but we wanted an open source solution (much more affordable) with customizable alerts. We ultimately chose Graylog.
In this article I will explain how we used Graylog to monitor the success or failure of Puppet runs, and how we implemented alert handling on the Sensu platform to address the problem of alert fatigue.
The Use Case: Using Log Data to Identify Failed Puppet Runs
We use Puppet to describe our “infrastructure as code” and to provision our systems. Each system executes a Puppet run every 30 minutes to keep it in line with the latest code changes. Despite thorough testing, we would release a new version of a Puppet module and essentially hope for the best. Because of the rapidly growing number of machines, verifying whether a Puppet run succeeded or failed on all of them quickly became an impossible job to accomplish using manual methods.
Because we were slowly falling behind and becoming blind to whether a Puppet run was a success or not, we decided to implement a log management solution to help us gain visibility using log data. We set up a Graylog stack and forwarded our Puppet log data from each host to Graylog via Syslog inputs. We configured Graylog streams for real-time Puppet log data and set up alerts to trigger upon errors. Every team member was notified via email using the standard Email Alert Callback every time we experienced a Puppet failure, as shown below.
This setup gave us a big productivity boost, because we no longer had to do manual monitoring by watching logs or dashboards. We were now automatically notified when errors occurred.
The Problem: Alert Fatigue
We were happy to have solved the automatic error notification issue, and life was good. But after running the Graylog setup for a while, it became clear that we were receiving too many notifications. When a Puppet run failed, every team member would receive the same e-mail for every host that suffered from the issue, every 30 minutes, until the problem was fixed. Our mailboxes were overflowing with alerts, which made it difficult to identify new alerts and led to alert fatigue.
What we needed was alert deduplication and, if possible, implementation of a schedule so that the right resource gets notified depending on the day of the week.
The Solution: Deduplication in Sensu
Graylog has a Marketplace with official and community-contributed plugins to help you integrate Graylog with other tools. There is an existing plugin for PagerDuty, which provides this kind of functionality (and much more) out of the box. PagerDuty isn’t free or open source, though, so cost was an issue, and we were already using Sensu for monitoring. Sensu is an open-source framework that runs on-premises and is very flexible. Clients (a native client comes with the framework) send check results in JSON format to the Sensu server, which passes them to a variety of handlers that trigger actions based on the specifics of the event(s).
You can write your own handler or use the existing Sensu handlers. Sensu has handlers for many of the same services that Graylog’s alarm callback plugins cover, such as PagerDuty and Slack. Just take a look here to get an idea. By using Sensu as an intermediate platform, you actually have even more opportunities to integrate with other systems. Sensu also comes with the Uchiwa dashboard, which can be used to visualize event data.
Now, let’s get back to deduplication. The Sensu server keeps track of the number of occurrences of unique events. In a handler script, it’s possible to send an alert by e-mail only on the first occurrence. This is exactly what we needed.
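The first-occurrence rule can be sketched in a few lines of Python. This is only an illustration, not our actual handler (Sensu handlers are usually written in Ruby). It assumes the event arrives as JSON with an `occurrences` counter and an `action` field, and the function name `should_send_email` is made up for this example.

```python
def should_send_email(event: dict) -> bool:
    """Return True only for the first occurrence of a newly created event."""
    # Assumption: "action" is "create" for a new or ongoing problem,
    # and "occurrences" counts how often this unique event has been seen.
    is_problem = event.get("action", "create") == "create"
    return is_problem and event.get("occurrences", 1) == 1


# Hypothetical events for the same host and check.
first = {"action": "create", "occurrences": 1}
repeat = {"action": "create", "occurrences": 4}

print(should_send_email(first))
print(should_send_email(repeat))
```

With a rule like this, the failed Puppet run that repeats every 30 minutes produces a single e-mail instead of one per run.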
Sensu Alarm Callback plugin: Getting Events from Graylog to Sensu
To get the events from Graylog into Sensu in the first place, I started by developing the Sensu Alarm Callback plugin for Graylog, which is now available for download on the Marketplace.
The plugin connects to a RabbitMQ instance, which the Sensu server relies on. Normally, the Sensu server receives events from a Sensu client containing a host name and a check name. The combination of the two makes the event unique. Graylog generally represents many hosts, so you can configure the plugin to generate 1) the host name, based on the [source] of the log message, and 2) the check name, based on the [stream] name. You can define which Sensu handler(s) should take action on the events, and the subscribers (team members) that should be notified.
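To make the mapping concrete, here is a rough Python sketch of how a log message could be turned into a Sensu-style check result. It is not the plugin’s code (Graylog plugins are written in Java), and it only builds the JSON body without connecting to RabbitMQ. The payload layout, the status value of 2 for “critical”, the sanitizing of the stream name, and the helper name `build_check_result` are all assumptions made for illustration.

```python
import json
import re
import time


def build_check_result(source, stream_name, message, handlers, now=None):
    """Build a check-result dict from a Graylog message (illustrative only)."""
    # Assumption: check names may only contain letters, digits, "_", "." and "-",
    # so any other run of characters in the stream name becomes "_".
    check_name = re.sub(r"[^\w.-]+", "_", stream_name.strip())
    return {
        "client": source,  # host name, taken from the message source
        "check": {
            "name": check_name,  # check name, taken from the stream name
            "output": message,
            "status": 2,  # assumed to mean "critical"
            "handlers": list(handlers),
            "issued": int(now if now is not None else time.time()),
        },
    }


# Hypothetical host, stream and message.
payload = build_check_result(
    "web01.example.com",
    "Puppet failures",
    "Could not retrieve catalog from remote server",
    ["scheduled_mailer"],
    now=1500000000,
)
body = json.dumps(payload)
print(body)
```

Because the host name and check name together identify an event, two failing hosts in the same stream show up as two separate events, while repeated failures on one host count as occurrences of the same event.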
Customizing Your Sensu Handler
By default, Sensu keeps track of occurrences of unique events. I modified an existing mailer handler so that it only sends an email for the first occurrence of the event, by overriding the filter_repeated method. While I was editing the script, I also implemented a schedule so only the team member who is “on call” on a specific day receives the e-mails. You can make this as advanced as you need. If you’re looking for an example, feel free to dive into the source.
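The schedule itself can be very simple. The Python sketch below shows the idea of picking a recipient by day of the week. It is not the code from our handler, and the rota, the addresses and the function name `on_call_recipient` are all hypothetical.

```python
import datetime

# Hypothetical rota: weekday number (Monday = 0) mapped to an e-mail address.
ON_CALL = {
    0: "alice@example.com",
    1: "bob@example.com",
    2: "carol@example.com",
    3: "dave@example.com",
    4: "erin@example.com",
}

# Hypothetical fallback for days without an entry (here: the weekend).
DEFAULT_RECIPIENT = "team@example.com"


def on_call_recipient(day: datetime.date) -> str:
    """Return the address of whoever is on call on the given day."""
    return ON_CALL.get(day.weekday(), DEFAULT_RECIPIENT)


print(on_call_recipient(datetime.date.today()))
```

A handler would call something like this just before sending the e-mail, so only one mailbox receives the alert.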
The Final Setup
The final setup looks something like below. Alongside PagerDuty and Slack, the e-mail handler probably doesn’t add much, but all three are included to illustrate that you can give Sensu many handlers to take care of your events. You can also see that we repurposed an existing Sensu server that handles events from clients other than Graylog. Please note that although these are all Linux nodes at the moment, the OS doesn’t really matter, as the Sensu tooling is also available for Windows, for instance.
Final Note: Marking Alerts as Resolved
So what happens when a Puppet run fails? Graylog will raise an alert based on our criteria and call the Sensu Alarm Callback plugin to send an event to the Sensu server. Sensu triggers the scheduled_mailer handler and the team member who is on call receives an e-mail. An overview of the active alerts is available in the Uchiwa dashboard.
Once the issue has been solved, the alert must be marked as resolved in the dashboard. This is necessary in case a new Puppet failure occurs on the same machine. If the alert is not marked as resolved, the number of occurrences will simply keep increasing and no new e-mail will be sent. The content of the e-mail corresponds with the output field of the event, as shown below.
Thanks for reading all the way to the end. I hope this provides some inspiration. As you can see, Graylog is an excellent solution for your logging and alerting needs, and using it along with Sensu makes it even more powerful. It’s awesome that Graylog is so easy to extend with your own plugins. The possibilities of alert handling based on logging are nearly infinite.
If you have any questions or suggestions after reading this article, just leave a comment below. If you found this useful or interesting for someone else, don’t hesitate to share it.