Monitoring, beyond pinging the healthcheck

Amireza Fatemi
Sep 6, 2018 · 5 min read

It’s a lazy afternoon. You’ve already started thinking about tonight’s Netflix pick, or maybe about something low-carb for dinner. Suddenly someone starts shouting in Slack that your website is down! While you are thinking that maybe someone forgot to connect to the internet again, your on-call phone starts ringing. Oh boy, not a pleasant moment. A countdown starts and turns the laid-back office into Gordon Ramsay’s Hell’s Kitchen. People are sending messages like “server is down” and everyone is in panic mode. But you have to calm down and start reacting.

Obviously you start with the web app server: SSH into the machine, run htop first, and if resources look OK, you dive into the logs, hoping to find an error line that points you to the culprit. But sometimes, depending on the complexity of your infrastructure, you have to dig into more layers. Maybe it wasn’t your app but Redis that crashed, or maybe it was a bad nginx config. Whatever it is, you will find it at some point; it just adds another white hair to your head.

I was in the same situation a week ago, but I didn’t SSH into any servers or even check the logs. The first thing I did was open our Datadog dashboard, and right away I noticed that our MySQL instance was at almost 100% CPU usage, which had slowed down response times. But why? To answer that question you need to open your application performance monitoring dashboard, which records every query your app makes (together with endpoint hits, errors, etc.). Again, just after loading the page, something caught my eye immediately. Right at the top of the query list, I could see one SELECT statement with a couple of unions, joins and other stuff that can destroy your afternoon. Now you have to look at the code to see what is really happening there. Once I was there, I noticed it was newly deployed code. That was good enough: time to revert the changes and call it a day.

Still, there is one question we have to ask: why did it happen a few hours after deployment? We can connect the dots by looking at the APM dashboard and the DB instance’s CPU usage metrics. At exactly the same time that CPU usage was increasing, an API endpoint was being bombarded by requests from our partner. Without scaling and proper throttling, of course, the MySQL instance at some point raised the white flag. Now we can create action items and avoid this situation for good.
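To make the “proper throttling” action item a bit more concrete, here is a minimal sketch of a token bucket, one common way to cap how fast a single client can hit an endpoint. It is not what we deployed; the class name, the parameters and the injectable clock are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Hypothetical per-client throttle: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added back per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    accepted = sum(bucket.allow() for _ in range(50))
    print(f"accepted {accepted} of 50 back-to-back requests")
```

In a real service you would probably keep one bucket per partner or API key and answer rejected requests with HTTP 429, but where exactly the limit lives (application, nginx, API gateway) depends on your setup.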

It is impossible to troubleshoot that fast in a complex infrastructure without ongoing metrics and insight into your infrastructure, especially when the team is under pressure due to downtime. That is why it is important to establish proper monitoring. Regardless of which tool you use, what matters is what you monitor and how you turn that data into valuable information about your operation. Beyond a simple ping of the health endpoint, we need to gather much more information, not only about the status of our services and components but also about the instances and containers they run inside. But what is interesting to monitor?

Hosts

Whether it is a bare-metal server or a cloud-based virtual machine, it is always vital to keep track of your machine’s resources. Gathering data about your machines helps a lot with capacity planning, and in critical cases it helps you detect resource misuse or downtime. To monitor your machine, look at the following:

  • Resource utilization & health: CPU, memory and disk usage of your machine. This helps a lot with capacity planning and tuning
  • Network traffic: gives you insight into the network layer of your machine, and can be a lifesaver in a fast and tense troubleshooting effort
  • Processes and resource usage per process: it is good to know which processes are running on your machine and how much each one uses, so the next time someone runs a bitcoin mining cluster in your infra, you can kill it right away
Monitoring dashboard in Datadog (source)
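As a rough illustration of the first bullet, the sketch below collects a couple of host metrics with nothing but Python’s standard library and checks them against limits. It is only a toy: a real agent (Datadog’s, or a Prometheus exporter) gathers far more, including memory, network and per-process figures. The function names and the threshold values are made up for this example.

```python
import os
import shutil


def host_snapshot(path="/"):
    """Collect a few basic host metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    snapshot = {
        "cpu_count": os.cpu_count(),
        "disk_used_percent": round(disk.used / disk.total * 100, 1) if disk.total else 0.0,
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


def over_threshold(snapshot, limits):
    """Return the names of metrics whose value exceeds the given limit."""
    return sorted(
        name for name, limit in limits.items()
        if snapshot.get(name) is not None and snapshot[name] > limit
    )


if __name__ == "__main__":
    snap = host_snapshot()
    print(snap)
    # Hypothetical limits: 90% disk usage, 1-minute load above 4.
    print(over_threshold(snap, {"disk_used_percent": 90, "load_1m": 4}))
```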

Application

Besides the machine, another layer that needs attention is the running application itself. This kind of monitoring is usually referred to as application performance monitoring (APM). Monitoring applications not only helps you detect and react to incidents faster, but also provides great insight into your application’s performance, which you can use to make it faster and more efficient. APM generally provides the following metrics:

  • Transactions: information such as how many times a particular SQL query has been executed and the average time each execution takes
  • Requests: similar to transactions, this tracks the number of requests to a particular endpoint of the application
  • Errors: keeps track of runtime errors. It’s a powerful metric for setting up alerts when the error rate increases
  • Latency and average load: the speed of your application is always an important parameter in steady operation. Keeping track of latency and average load (requests, queries, etc.) can uncover issues before your customers pick up the phone and call your support team
New Relic’s application monitoring dashboard (source)
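To give a feel for how those metrics are derived, here is a small sketch that turns raw request records into per-endpoint request counts, error rates and latency figures. The record layout (endpoint, status, latency_ms), the function names and the 5% alert threshold are assumptions for illustration; an APM product collects and aggregates this for you.

```python
import math
from collections import defaultdict
from statistics import mean


def summarize_requests(records):
    """Aggregate per-endpoint request count, error rate and latency."""
    by_endpoint = defaultdict(list)
    for rec in records:
        by_endpoint[rec["endpoint"]].append(rec)

    summary = {}
    for endpoint, recs in by_endpoint.items():
        latencies = sorted(r["latency_ms"] for r in recs)
        errors = sum(1 for r in recs if r["status"] >= 500)
        # Nearest-rank 95th percentile of the sorted latencies.
        p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
        summary[endpoint] = {
            "requests": len(recs),
            "error_rate": errors / len(recs),
            "avg_latency_ms": mean(latencies),
            "p95_latency_ms": latencies[p95_index],
        }
    return summary


def endpoints_to_alert(summary, max_error_rate=0.05):
    """Return endpoints whose error rate is above the (hypothetical) threshold."""
    return sorted(e for e, s in summary.items() if s["error_rate"] > max_error_rate)


if __name__ == "__main__":
    records = [
        {"endpoint": "/orders", "status": 200, "latency_ms": 100},
        {"endpoint": "/orders", "status": 500, "latency_ms": 300},
        {"endpoint": "/health", "status": 200, "latency_ms": 5},
    ]
    summary = summarize_requests(records)
    print(summary)
    print(endpoints_to_alert(summary))
```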

Gathering all of this information from your machines, containers and applications creates a valuable resource about your infrastructure. It not only helps you create better alerting mechanisms, but can also be used to build a more optimised (cheaper) infrastructure.

Share it with dev team too

As a developer, I never followed up on the performance of my code after deploying it. As long as it was bug free I was happy with it, unless someone started complaining that the software had become slower. It wasn’t because I didn’t care about performance, but because I didn’t have any information about how my code performed. With modern monitoring tools such as Prometheus and the beautiful dashboards that tools like Grafana provide, there are no more excuses.

We can easily access massive amounts of data about our operation and our software’s performance. That makes it much easier to hunt down heavy queries or identify where we can cache some requests to reduce load.
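Hunting heavy queries usually comes down to ranking them by the total time they consume, which is what the top of an APM query list shows. A minimal sketch of that ranking, assuming you have (query, duration in ms) samples from somewhere such as a slow query log; the function name and the sample data are hypothetical:

```python
from collections import defaultdict


def heaviest_queries(samples, top=3):
    """Rank queries by total time spent and return (query, calls, total_ms) tuples."""
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for query, duration_ms in samples:
        totals[query]["calls"] += 1
        totals[query]["total_ms"] += duration_ms
    # Sort by total time, heaviest first.
    ranked = sorted(totals.items(), key=lambda item: item[1]["total_ms"], reverse=True)
    return [(query, stats["calls"], stats["total_ms"]) for query, stats in ranked[:top]]


if __name__ == "__main__":
    samples = [
        ("SELECT ... FROM orders", 10.0),
        ("SELECT ... UNION ...", 500.0),
        ("SELECT ... FROM orders", 15.0),
        ("SELECT ... UNION ...", 700.0),
    ]
    for query, calls, total_ms in heaviest_queries(samples):
        print(f"{total_ms:8.1f} ms over {calls} calls: {query}")
```

A query that is cheap per call but executed thousands of times can rank above a single slow one, and those are often the best candidates for caching.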

In the end, the more you know about your infrastructure, the easier it is to make effective decisions and run a reliable operation.

If you are interested in learning more about metrics and monitoring practices, check out Brian Brazil’s article for better insight.

Developers Writing

Developers may not need to blog; but here your words are not wasted.

Amireza Fatemi

Written by

Software Engineer at heart, SRE guy by accident and former tech community builder
