Using Pingdom and Slack for Real Time Monitoring of Production Systems
Once your software application has gone live, it’s important that it stays up, and active monitoring is an integral part of the equation. The importance of monitoring systems needs little emphasis, so I will get straight to explaining how to use Pingdom and Slack for real-time monitoring of various production-grade services. A large part of this is derived from my learnings building technology and product at Tyroo Technologies, the largest mobile ad network (after Google and Facebook) in India and South East Asia.
An ad network, or for that matter any technology product, comprises various components that work together seamlessly, synchronously or asynchronously. A slowdown or degradation in any one of these services can have a larger impact on the overall health of your platform. The system administration and development teams at Tyroo knew this and wanted to deploy a solution (and there are multiple ways to solve this) that would be affordable (we are a startup!), easy to set up, and capable of alerting a large group of users in real time. Our choice was to mix and match Pingdom (a web performance management tool) and Slack (an awesome collaboration tool).
Why do we love Pingdom?
Pingdom has been our all-time favorite uptime and downtime monitoring tool for various reasons. Some of them are:
- They have monitoring agents close to all our top markets, as well as to the markets we are strategically interested in. This matters because checks run from near your users reflect what those users actually experience.
- Their pricing is lean. For roughly $66 you get 80 checks, about 350 SMS notifications, and the ability to pick test locations.
- When we compared it against various other services in our own checks, we found Pingdom to be the quickest at detecting downtime.
- Pingdom also offers its own notifications when it detects downtime on your service. If you are a one-person army, go for it. If you want to work on an incident as a team and collaborate on the finer points, Slack rules!
And what’s so special about Slack?
We are fans of Slack, and our technology and product teams use it as their primary collaboration tool. We chose Slack for alerting us to service degradation or downtime for the following reasons:
- It has the ability to reach all of our team members.
- It can provide real-time notifications in case of an issue. Everyone on the team has to have the Slack mobile app installed on their phone. ;)
- Slack allows its users to receive notifications from selected channels only. This is critical, as it ensures that only some notifications are pushed to your phone, keeping the other not-so-urgent and spammy messages (“Who all want to eat pizza for lunch today?”) away.
- Integrations with other software tools are what make Slack awesome.
Integration — Making Pingdom and Slack Talk to each other
In Slack, head to Browse Apps, select Pingdom, and start by adding a new configuration to this integration.
1. Set up the channel that should be notified whenever Pingdom sends a webhook notification to Slack. In our case, we have an #alerts channel that gets notified.
2. Further down you will also see a webhook URL. Copy it; you will paste it into Pingdom in the steps below.
3. Log in to your Pingdom account. From your Pingdom dashboard, click on Alerting in the left navigation, and then choose Alerting Endpoints from the sub-options.
4. Press the Add Alerting Endpoint button and on the following page, give your endpoint a name.
5. Press the Add Contact Method button and select URL / Webhook from the dropdown list. Then add https://hooks.slack.com/services/T038W8T6H/B039X7346 as the Webhook URL (replace this example with your own webhook URL). Press the Add button to close the dialog.
6. Make sure that you have New message format selected, and then press the Save Settings button.
7. Next, click on Alert Policies in the sub-navigation along the left. You can add a new Alert Policy, or edit an existing one.
8. In the Assign To text field, type the name of the Alerting Endpoint you created in Steps 4–6 above. In the Delay dropdown list, choose how long you’d like to wait after an incident occurs before being notified in Slack.
9. Press the Add step button to add this method to your Alert Policy. Press the Save button when you’re done.
10. Click Dashboard in the left navigation, and then click on the site that you’d like to monitor in Slack.
11. In the Beep Manager at the bottom of the page, choose the Alert Policy that you configured in Steps 7–9 above. Press the Modify Check button at the bottom when you’re done.
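Before waiting for a real incident, it can be useful to check that messages reach your channel at all. The snippet below is only a sketch, and it rests on an assumption: that the URL you copied behaves like a standard Slack incoming webhook, which accepts a JSON body with a `text` field. The Pingdom-specific integration URL may accept only Pingdom’s own payload, in which case create a separate incoming webhook for this test. `WEBHOOK_URL` is a placeholder, and the function names are made up for illustration.

```python
import json
import urllib.request

# Placeholder: substitute the webhook URL Slack generated for you.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a simple text message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str, text: str) -> int:
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    print(send_test_message(WEBHOOK_URL, "Test alert: please ignore"))
```

If the test message shows up in #alerts, the Slack side is working and any missing alerts are more likely a Pingdom configuration issue.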
Adding custom scripts to this whole mix
While all your services available over HTTP can be monitored using the setup mentioned, not all services are available over HTTP for various reasons (including security). To tackle this very challenge, we have built several custom scripts that work as follows:
- Set up a script that is accessible over HTTP.
- Pingdom agent polls this script.
- The script performs the check internally (for example, one of our checks monitors queue lengths: if the length of a queue crosses the threshold we have set, an alert should be raised).
- If the check passes, the script responds with 200 OK. Otherwise it responds with 500, which Pingdom treats as a failure and turns into an alert that is passed on to our Slack setup as discussed earlier.
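The steps above can be sketched roughly as follows. This is not our production script, just a minimal illustration using Python’s standard `http.server` module. The threshold value, the port, and the `get_queue_length` function are all hypothetical; you would replace `get_queue_length` with a real lookup against your own queue.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical threshold: tune it to what "too long" means for your queue.
QUEUE_THRESHOLD = 1000


def get_queue_length() -> int:
    """Placeholder: replace with a real query against your queue or broker."""
    return 0


def check_status(queue_length: int, threshold: int = QUEUE_THRESHOLD) -> int:
    """Return 200 while the queue is within the threshold, 500 once it crosses it."""
    return 200 if queue_length <= threshold else 500


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            status = check_status(get_queue_length())
        except Exception:
            # If the check itself fails, report the service as unhealthy.
            status = 500
        body = b"OK" if status == 200 else b"CHECK FAILED"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # The port is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Point a Pingdom HTTP check at this URL and the rest of the alerting chain stays exactly the same. Since the script exposes an internal signal over HTTP, keep the response body free of sensitive details and consider restricting who can reach it.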
At Tyroo we use several tools and scripts (APMs, custom scripts, Nagios, etc.) to monitor our entire infrastructure and software services stack. The mix of Pingdom and Slack is our most cost-efficient and proactive setup, and it also helps us watch the watchers (the scripts inside our infrastructure that alert us to degraded infrastructure or other issues).