Monitoring, beyond pinging the healthcheck

Amir Fatemi
Published in Developers Writing · Sep 6, 2018 · 5 min read

It’s a lazy afternoon. You’ve already started thinking about tonight’s Netflix pick, or maybe about something low-carb for dinner. Suddenly someone starts shouting in Slack that your website is down! While you’re still wondering whether someone just forgot to connect to the internet again, your on-call phone starts ringing. Oh boy, it’s not a pleasant moment: a countdown starts, and the laid-back office turns into Gordon Ramsay’s Hell’s Kitchen. Messages like “the server is down” are flying around and everyone is in panic mode. But you have to calm down and start reacting.

Obviously, you start with the web app server: ssh into the machine, run htop first, and if resources look fine, dive into the logs, hoping to find an error line that points you to the culprit. But sometimes, depending on the complexity of your infrastructure, you have to dig through more layers. Maybe it wasn’t your app; maybe Redis crashed, or maybe a bad Nginx config is to blame. Whatever it is, you will find it eventually, it just adds another white hair to your head.

I was in the same situation a week ago, but I didn’t ssh into any servers or even check the logs. The first thing I did was open our Datadog dashboard, and I noticed right away that our MySQL instance was at almost 100% CPU usage, which was slowing down response times. But why? To answer that question you open your application performance monitoring dashboard, where every query your app makes is recorded (together with endpoint hits, errors, etc.). Again, just after loading the page, something caught my eye immediately: right at the top of the query list sat one SELECT statement with a couple of unions, joins, and other stuff that can destroy your afternoon. Then you look at the code to see what is really happening there. Once I did, I saw it was newly deployed code. Good enough, time to revert the changes and call it a day.
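For context, the “top queries” view an APM gives you can also be pulled straight out of MySQL itself. The sketch below (connection details and the monitoring user are placeholders, not from our setup) asks performance_schema for the statements that have consumed the most time, which is where a SELECT like the one above would surface:

```python
# Minimal sketch: list the most expensive statement digests in MySQL.
# Host, user, and password are illustrative placeholders.
import pymysql

conn = pymysql.connect(host="db.internal", user="monitor",
                       password="***", database="performance_schema")

query = """
    SELECT DIGEST_TEXT,
           COUNT_STAR            AS calls,
           SUM_TIMER_WAIT / 1e12 AS total_seconds,
           AVG_TIMER_WAIT / 1e12 AS avg_seconds
    FROM events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 5
"""

with conn.cursor() as cur:
    cur.execute(query)
    for digest, calls, total_s, avg_s in cur.fetchall():
        # The offending SELECT with its unions and joins shows up right at the top.
        print(f"{calls:>8} calls  {total_s:>10.1f}s total  {avg_s:.3f}s avg  {(digest or '')[:80]}")

conn.close()
```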

Still, there is one question we have to ask: why did it happen a few hours after deployment? We can connect the dots by looking at the APM dashboard alongside the DB instance’s CPU usage metrics. At exactly the time our CPU usage was climbing, an API endpoint was being bombarded by our partner’s requests; without scaling and proper throttling, the MySQL instance at some point raised a white flag. Now we can create action items and avoid this situation for good.

It is impossible to troubleshoot that fast in a complex infrastructure without ongoing metrics and insight into what is running where, especially when the team is under pressure due to downtime. That is why it is important to establish proper monitoring. Regardless of which tool you use, what matters is knowing what you are monitoring and how you can turn that data into valuable information about your operation. Beyond a simple health endpoint ping, we need to gather much more information, not only about the status of your services and components but also about the instances and containers they run inside. But what is interesting to monitor?
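To make the “beyond pinging the healthcheck” idea concrete, here is a rough sketch of a health endpoint that reports on the dependencies the service actually needs (Redis and MySQL, as in the story above) instead of just answering “I’m alive”. Flask and the hostnames are my own illustrative choices, not something from the original setup:

```python
# Sketch: a healthcheck that reports dependency status, not just process liveness.
# Hostnames and credentials are placeholders.
from flask import Flask, jsonify
import pymysql
import redis

app = Flask(__name__)

def check_redis():
    try:
        return bool(redis.Redis(host="cache.internal").ping())
    except Exception:
        return False

def check_mysql():
    try:
        pymysql.connect(host="db.internal", user="health",
                        password="***", connect_timeout=2).close()
        return True
    except Exception:
        return False

@app.route("/health")
def health():
    checks = {"redis": check_redis(), "mysql": check_mysql()}
    status = 200 if all(checks.values()) else 503
    return jsonify(status="ok" if status == 200 else "degraded", checks=checks), status

if __name__ == "__main__":
    app.run(port=8080)
```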

Hosts

Whether it’s a bare-metal server or a cloud-based virtual machine, it is always vital to keep track of your machine’s resources. Gathering data about your machines helps a lot with capacity planning and, in a critical situation, with detecting resource misuse or downtime. To monitor a machine, look at the following:

  • Resource utilization & health: CPU, memory, and disk usage of your machine; a big help for capacity planning and tuning
  • Network traffic: it gives you insight into the network layer of your machine and can be a life saver in a fast and tense troubleshooting effort
  • Process and memory usage per process: it is good to know what kind of processes you are running on your machine and how much each one uses, so next time someone runs a bitcoin mining cluster in your infra, you can kill it right away
Monitoring dashboard in Datadog (source)
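If you want a feel for what these host-level numbers are before wiring up an agent like Datadog’s, here is a minimal sketch that prints the same categories using the psutil library (my choice for illustration, not something the post prescribes):

```python
# Sketch: the three host-level categories above, collected with psutil.
import psutil

# Resource utilization & health
print("cpu %:   ", psutil.cpu_percent(interval=1))
print("memory %:", psutil.virtual_memory().percent)
print("disk %:  ", psutil.disk_usage("/").percent)

# Network traffic
net = psutil.net_io_counters()
print("net sent/recv (MB):", net.bytes_sent // 2**20, "/", net.bytes_recv // 2**20)

# Process and memory usage per process: the top memory consumers,
# handy for spotting that surprise bitcoin miner.
procs = sorted(psutil.process_iter(attrs=["pid", "name", "memory_percent"]),
               key=lambda p: p.info["memory_percent"] or 0, reverse=True)
for p in procs[:5]:
    print(p.info["pid"], p.info["name"], f'{p.info["memory_percent"] or 0:.1f}%')
```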

Application

Besides the machine, another layer to focus on is the running application itself. This kind of monitoring is usually referred to as application performance monitoring (APM). Monitoring applications not only helps you detect and react to incidents faster, it also provides great insight into your application’s performance and can be used to make the application more efficient and faster. APM generally provides the following metrics:

  • Transactions: information such as how many times a particular SQL query has been executed and the average load time of each execution
  • Requests: similar to transactions, it tracks the number of requests to a particular endpoint of the application
  • Errors: keeps track of runtime errors; a powerful metric for setting up alerts when the error rate increases
  • Latency and average load: the speed of your application is always an important parameter of a steady operation. Keeping track of latency and average load (requests, queries, etc.) can unfold issues before your customers pick up the phone and call your support team
New Relic’s application monitoring dashboard (source)
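You don’t need a commercial APM to start recording these numbers. Below is a minimal sketch of counting requests and errors and timing latency per endpoint with the Prometheus client library (Prometheus comes up again further down; the metric and endpoint names here are made up for illustration):

```python
# Sketch: requests, errors, and latency per endpoint, exposed for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests per endpoint", ["endpoint"])
ERRORS = Counter("app_errors_total", "Runtime errors per endpoint", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["endpoint"])

def handle(endpoint):
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():
        try:
            time.sleep(random.random() / 10)   # stand-in for real work (queries, transactions)
        except Exception:
            ERRORS.labels(endpoint=endpoint).inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)                    # metrics served at :8000/metrics
    while True:
        handle("/api/partners")
```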

Gathering all of this information from your machines, containers, and applications creates a valuable resource about your infrastructure. It not only helps you build better alerting mechanisms but can also be used to run a more optimized (read: cheaper) infrastructure.

Share it with the dev team too

As a developer, I never followed up on the performance of my code after deploying it. As long as it was bug-free I was happy with it, unless someone started complaining about a decrease in the software’s speed. It wasn’t because I didn’t care about performance but because I didn’t have any information about how my code performed. With modern monitoring tools such as Prometheus, and the beautiful dashboards that tools like Grafana provide, there is no more excuse.

We can easily access massive amounts of data about our operation and our software’s performance. That makes it much easier to hunt down heavy queries or identify where we can cache some requests to reduce the load.

In the end, the more you know about your infrastructure, the easier it is to make effective decisions and run a reliable operation.

If you are interested in knowing more about metrics and monitoring practices, check out Brian Brazil’s article for better insight.

Amir Fatemi

Software Engineer at heart, SRE guy by accident and former tech community builder