How we monitor the health of our applications and infrastructure

By nic_ngoo
Published in Kaodim Engineering · 10 min read · Mar 5, 2019
Photo by Jair Lázaro on Unsplash

How do you know when there is something wrong with your website or mobile apps? Is it when a customer complains through customer support? Or is it when you wake up one morning and see a nasty review left on the App Store? Or worse still, there is no news at all, so you go days or weeks without realising your application is not working as it should, and you only find out when you look at the balance sheet at the end of the month and see that you’re bleeding money.

In this post, I will share what and how we monitor our applications at Kaodim as we uphold 2 of our Core Principles: Obsess Over Every Detail Of The Customer Experience, and Hold Ourselves To The Highest Standards Of Quality & Performance.

Why do we need to invest in monitoring solutions?

At Kaodim, our business is not only to serve the end customers who use our web and mobile applications to conveniently search for and book a variety of services, but also to serve the service providers (or vendors) who receive and complete those services offline. We have a presence in Malaysia, the Philippines, Singapore and Indonesia, serving thousands of transactions per day.

The Kaodim platforms are available on the web as kaodim.com, kaodim.sg, gawin.ph and beres.id, and on mobile as the Kaodim User and Kaodim Vendor apps for both Android and iOS in Malaysia and Singapore, the Gawin User and Gawin Vendor apps in the Philippines, and the Beres User and Beres Vendor apps in Indonesia.

Our customers on the web interact with 2 applications: the landing page for searching and booking services, and the dashboard application for managing service requests, accessing the help center and so on. Additionally, there are 4 backend services that have to be monitored and supported as well.

Table summarizes all of the applications for Kaodim that need to be monitored

So in total, our Engineering team has to look after 4 backend services, 8 web applications and 12 mobile applications, even though each application shares the same code base across all 4 countries apart from its localisation files. We still monitor all 12 versions of our mobile applications because we have found crashes that occurred only in certain countries.

Any downtime and deviation from expected business flows are extremely damaging to the trust that our customers and service providers place in Kaodim, not to mention our ability to keep growing.

How do we approach application monitoring and alerting?

Our engineers need to find out about any issues before they happen and address them, or if not possible, find out as early as possible and contain the issues, before widespread damage is done.

Here are some of the principles that we are introducing at Kaodim Engineering to ensure effective monitoring and actionable alerts, taking inspiration from 2 publications on this subject — Google Site Reliability Engineering Book and Best Practices for Setting SLOs and SLIs For Modern, Complex Systems.

  • We need to monitor for general availability (uptime) of all of our applications as well as issues that contribute to critical business functionality loss
  • Monitoring should allow us to see the trend as well as narrow in on specific issues for troubleshooting
  • Alerting should have as little noise as possible so we don’t fall into the trap of ignoring false alarms
  • Alerting tools should be automated and the alert messages specific, so that engineers are only alerted when something is really wrong and don’t need to take a long time to find out exactly what’s happening
  • We cannot possibly monitor every single event, so it is important to truly understand what is critical for your business and only spend resources monitoring the critical events
  • We set warning thresholds and alert engineers so that they find out about impending issues before they actually cause business loss
  • We use dashboards and time series graphs to spot trends and also set outlier alerting for any spikes that are out of the ordinary
  • Every engineer has incident response and resolution as part of their yearly KPIs. We evaluate their ownership in actively addressing production issues.
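To make the warning-threshold and outlier principles above a little more concrete, here is a minimal Python sketch. It is not our production alerting logic: the function names, the critical threshold of 85 and the z-score limit of 3 are illustrative assumptions, and only the 65% warning level echoes an alert described later in this post.

```python
from enum import Enum
from statistics import mean, stdev


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value: float, warning: float, critical: float) -> Severity:
    """Map a metric reading to a severity; higher values are assumed to be worse."""
    if value >= critical:
        return Severity.CRITICAL
    if value >= warning:
        return Severity.WARNING
    return Severity.OK


def is_outlier(history: list[float], value: float, z_limit: float = 3.0) -> bool:
    """Flag a reading more than z_limit standard deviations from the recent mean."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    spread = stdev(history)
    if spread == 0:
        return value != history[0]  # flat history: any change is unusual
    return abs(value - mean(history)) / spread > z_limit


# Hypothetical readings: memory pressure in percent, then requests per minute.
print(classify(70.0, warning=65.0, critical=85.0))
print(is_outlier([100, 102, 98, 101, 99], 180))
```

The warning level exists to buy time: it fires while the metric is still below the point where customers would notice, which is what lets engineers act before business loss occurs.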

What are we monitoring and alerting on?

Prior to this, there were already error alerts and monitoring tools in place, but there were obvious gaps: some of the services were not being monitored, and we did not have common Service Level Indicators (SLIs) defined. So we took a step back and decided to revisit our monitoring and alerting strategy.

The first step for us was to determine all the metrics that are important for telling the health of our applications. We followed Google’s strategy of combining black-box and white-box monitoring. Black-box monitoring looks at the boundaries of our application as a whole, for example whether the system is down or not working correctly. White-box monitoring allows us to look deeper into the applications for imminent problems such as slow-running queries or logs showing repeated retries.

Google’s recommendation is that the Four Golden Signals (Latency, Traffic, Errors and Saturation) are the minimum of what needs to be monitored. But we wanted more metrics as a starting point, including ones specific to our web and mobile client-side platforms. Below are the key black-box type metrics we looked at to give us a starting point.
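As a side illustration of the Four Golden Signals themselves (this is not our metric list), the sketch below derives all four from one window of request records. The `Request` shape, the choice of the 95th percentile for latency, and worker utilisation as the saturation measure are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile_nearest_rank(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted list; 0.0 if the list is empty."""
    if not sorted_values:
        return 0.0
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct% of n
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests: list[Request], window_s: float,
                   busy_workers: int, total_workers: int) -> dict[str, float]:
    """Summarise one time window into latency, traffic, errors and saturation."""
    durations = sorted(r.duration_ms for r in requests)
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile_nearest_rank(durations, 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": server_errors / len(requests) if requests else 0.0,
        "saturation": busy_workers / total_workers,
    }
```

In practice an APM tool computes these for you; the point of the sketch is only to show that each signal is a simple aggregate over the same stream of requests.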

Why do we use the tools that we use?

We use a combination of 3rd party open-source, free and paid tools to accomplish our objective. One of the decisions we had to make as a lean startup was build vs. buy. We try to be up and running as quickly as possible without taking valuable engineers’ time to build custom monitoring from open-source tools.

As a startup in a hyper-growth stage, our engineers’ time is best spent on building new features that benefit our ecosystem. But as a team we also perform our due diligence and evaluate the tools available, as there are so many options to choose from.

Some of the tools, like Slack, are already used by the rest of the company, while AWS Cloudwatch is the easiest way to monitor our AWS resources, where the majority of our production workloads are hosted. Firebase is not only one of the best options for mobile apps but also free, and its SDK is easily integrated into our mobile builds.

However, we made the decision to invest considerably in New Relic, one of the industry leaders in application monitoring SaaS. It is simple to set up and powerful, and its support for multiple platforms is one of the reasons for our decision. Having our backend, web and API monitoring under one tool keeps management easier.

Alerting channels

  • Slack is where all of our alerts go at this moment. All engineers have Slack on their mobile devices and are required to enable notifications for all critical alerting channels. We divide our Slack channels into critical and non-critical channels for historical auditing purposes
  • Email notification is considered a secondary alerting channel but still active
  • Others. There is future consideration for SMS alerts and paging tools like PagerDuty for Highest Severity events, but so far Slack works thanks to the discipline of our engineers responding.

Monitoring Tools

  • AWS Cloudwatch Metric, Alarm and Dashboard for all of our AWS-hosted services
  • New Relic APM for application performance and error monitoring. It provides transaction traces for white-box monitoring and alerting, and pinpoints slow-running queries that we can continually improve on.
  • New Relic Synthetics for API performance monitoring and ping tests for uptime. Our API monitoring uses a test script to send a request; for a test to pass, an HTTP 200 response is not enough, as we have also configured New Relic to look for the correct response content.
  • New Relic Browser for web client performance monitoring and errors
  • Google PageSpeed Insights to analyze our landing page and dashboard web pages on demand
  • Firebase Crashlytics to report on mobile app crashes and errors
  • Firebase Performance to get insight into our mobile app performance
  • Raygun is a tool we use considerably for error reporting on our Ruby on Rails backend. We find that it provides better error alerting than New Relic APM, so we’re keeping it.
  • Nagios is an open-source monitoring tool we have had in place since the early days to monitor host-level processes and network connectivity
  • Monit for monitoring and automated keep-alive of specific services such as Sidekiq queue job processing and Phusion Passenger. Alerts will be sent if there is any error
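The idea behind the API checks in the list above (a test passes only when both the status code and the response content are right) can be sketched in plain Python. This is not a New Relic Synthetics script, which has its own scripting environment; the function names and the "required keys" rule are assumptions chosen for illustration.

```python
import json
import urllib.error
import urllib.request


def evaluate(status: int, body: str, required_keys: list[str]) -> tuple[bool, str]:
    """Pass only if the status is 200 AND the JSON body contains the expected keys."""
    if status != 200:
        return False, f"unexpected status {status}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    missing = [key for key in required_keys if key not in payload]
    if missing:
        return False, f"missing keys: {', '.join(missing)}"
    return True, "ok"


def check(url: str, required_keys: list[str], timeout_s: float = 10.0) -> tuple[bool, str]:
    """Fetch the URL and evaluate the response; network failures count as failed checks."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", "replace")
            return evaluate(resp.status, body, required_keys)
    except urllib.error.HTTPError as exc:  # non-2xx responses raise HTTPError
        return False, f"unexpected status {exc.code}"
    except OSError as exc:  # DNS failures, refused connections, timeouts
        return False, f"request failed: {exc}"
```

Checking content matters because an API can happily return HTTP 200 with an error page or an empty payload, and a status-only ping test would report that as healthy.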

Below are some example screenshots of our monitoring dashboards. Other than the first step of figuring out the important metrics to measure, the dashboards and alerting notifications are probably the most time-consuming part to set up.

New Relic Synthetics monitor for API and Ping tests
New Relic APM monitoring of our main Backend service
New Relic Browser monitoring of our web pages
PostgreSQL RDS monitoring on AWS Cloudwatch Dashboard
Main Backend Service EC2 Instances monitoring on AWS Cloudwatch Dashboard
AWS Elasticache Redis monitoring on AWS Cloudwatch Dashboard
Firebase Crashlytics monitoring for mobile applications

These dashboards are displayed on a large TV in the Engineering standup area to ensure everyone has visibility and is aware of what’s happening with our applications. Every morning during standup, it is impossible to miss the graphs if there is a spike, something is showing red or crash-free users have dipped below our threshold.

Kaodim Engineering Dashboard

The alerts are configured to send Slack messages to alerting-only channels. We use a combination of custom webhooks and native integrations to achieve this. Tools like New Relic, Firebase and Raygun offer native integrations, while AWS and open-source tools need a little bit of work. How exactly we configure these tests and alerting policies is a subject for future posts, but feel free to reach out to me if you’d like to learn more.

The example below shows our New Relic alerting Slack channel where GET 5xx errors above critical threshold are posting messages.

You can create a custom Lambda function to send Cloudwatch alarms to Slack. This guide from Slack shows you how. The example below is our ‘warning’ alert when our Elasticsearch JVMMemoryPressure goes above 65%, and engineers are supposed to pay attention before functionality is impaired.
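For readers who want to see the rough shape of such a function, here is a minimal sketch rather than our production code. It assumes the alarm publishes to an SNS topic that triggers the Lambda, and that a Slack incoming webhook URL is stored in an environment variable; the variable name `SLACK_WEBHOOK_URL` and the message layout are hypothetical.

```python
import json
import os
import urllib.request


def build_slack_message(sns_message: str) -> dict:
    """Turn the JSON alarm notification delivered through SNS into a Slack payload."""
    alarm = json.loads(sns_message)
    return {
        "text": (
            f"[{alarm['NewStateValue']}] {alarm['AlarmName']}\n"
            f"{alarm['NewStateReason']}"
        )
    }


def lambda_handler(event, context):
    """Entry point for an SNS-triggered Lambda; posts one Slack message per record."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical variable name
    for record in event["Records"]:
        payload = build_slack_message(record["Sns"]["Message"])
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=5).close()
```

Keeping the message formatting in its own function makes it easy to test without calling Slack, and to route warning and critical alarms to different channels later.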

Final Words and Next Steps

As you can see, having visibility into these metrics of our applications provides us with a starting point and a baseline to continually improve on. Quoting Peter Drucker, the father of management consulting, “If you can’t measure it, you can’t improve it”.

Hopefully this post gives you some ideas on what to monitor and how to use a combination of free and paid tools to monitor your applications. No matter what industry you’re in, the first step is always to understand what is important to your customers and list the metrics to monitor. Then perform an in-depth analysis of the tools you’ve shortlisted. All of these tools will provide you with a free trial period so you can install their SDKs and JS snippets and set up alerting. This gives you a good sample period to play around and weigh the pros and cons.

The next step for us is to take the last 30 days of data on each metric to establish our Service Level Objectives (SLOs), which define what to expect from our applications. Having SLOs will help us keep raising the bar of our performance with our customers in mind.
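As a back-of-the-envelope illustration of how a 30-day sample can inform an availability SLO, the sketch below compares a measured baseline with a candidate target. The function names and the numbers are hypothetical, not our actual data or targets.

```python
def availability(successful_checks: int, total_checks: int) -> float:
    """Fraction of uptime checks that passed in the sample window."""
    if total_checks == 0:
        raise ValueError("no checks recorded in the window")
    return successful_checks / total_checks


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability target tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Hypothetical sample: one check per minute for 30 days, 100 of them failed.
baseline = availability(43_100, 43_200)
candidate_target = 0.999
print(f"baseline {baseline:.4%}, meets target: {baseline >= candidate_target}")
print(f"a {candidate_target:.1%} target allows about "
      f"{allowed_downtime_minutes(candidate_target):.1f} minutes of downtime")
```

Translating a percentage into minutes of tolerated downtime makes it easier to judge whether a candidate target is realistic given the measured baseline.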

We are also looking to evaluate New Relic Infrastructure and Mobile to see if they are worth the investment and whether we should consolidate everything under 1 tool. The nice thing about having everything in New Relic is the power of Insights, which allows you to build an all-in-one dashboard showing everything in a single pane of glass.

I am sure you have your own ways of monitoring and alerting and I’d love to learn how you’re doing it. If you have any suggestions or questions, I’d be happy to hear them so drop me a message at nic@kaodim.com.

The Kaodim Group consists of kaodim.com (Malaysia), kaodim.sg (Singapore), gawin.ph (Philippines) and beres.id (Indonesia).

The Kaodim Group is the #1 services marketplace in Southeast Asia, providing a faster, more dependable way to hire professional services you need from plumbers, photographers, cleaners, movers, caterers, wedding planners, yoga instructors and many more.

We are transforming small and medium-sized businesses like never before. They receive thousands of requests for their services on their smartphones, tablets and computers every day, allowing them to make an instant connection with new clients at an unprecedented rate.

Wanna be a part of our awesome team? Discover excitement! Join the Krew: http://careers.kaodim.com/

nic_ngoo
Kaodim Engineering

Tech Leader | IoT and Serverless Tinkerer | Ex Amazonian