Stay Ahead: Monitor Your Applications and Infrastructure

Sanket Singh
Attentive.ai Engineering
7 min read · May 24, 2024

How do you know when there’s a problem with your website or mobile app? Is it when a customer reaches out to the support team with a complaint, or do you get a phone call late at night to fix something?

Or perhaps it’s when you wake up to a harsh review on the App Store. Worse yet, there’s no feedback at all, leaving you in the dark for days or weeks until customers quietly abandon your app over its poor performance.

In this post, I will share how we monitor our application—Accelerate—as we uphold one of our Core Principles: Customer First.

Why do we need monitoring solutions?

Our customers use Accelerate day in and day out, making it an essential part of their daily operations. Even a few minutes of downtime can hamper our clients’ businesses, which can translate directly into lost revenue.

Ensuring the reliability and continuous availability of our service is paramount to supporting the success of our customers. Any downtime or deviation from expected business flows is extremely damaging to the trust that our customers place in Attentive.

How do we approach application monitoring and alerting?

At Attentive, the engineering team maintains nine backend services that power Accelerate. We aim to catch potential issues before they surface and address them or, when that isn’t possible, detect them as early as possible and contain them before widespread damage is done.

We split our monitoring strategy into two different parts: Infrastructure Monitoring and Application Layer Monitoring.

Infrastructure Monitoring: This focuses on the foundational components of our systems, such as servers, networks, messaging queues, and databases. The metrics we focus on at the infrastructure layer are:

Resource Utilization: This gives us insight into how efficiently our compute and database resources are being used and helps identify bottlenecks that may impact the performance of our applications. We track resource utilization metrics such as the following (a sketch of reading one of these metrics programmatically follows the list):

  • CPU Utilization: This tells us how much of our total processing power the application is using. It also helps us trace the events that spike CPU usage and may become bottlenecks in the future.
  • Memory Utilization: Monitoring memory usage helps us identify potential memory leaks.
  • Disk Utilization: Disk monitoring measures the I/O operations our application performs and how long each read/write takes, both of which impact overall performance.
  • Network Traffic: This helps us monitor the amount of data transmitted and received and identify packet loss caused by network congestion.
  • Network Connections: This tracks the current number of connections per database, which helps us understand database load and manage connections better.
Resource Utilization Monitoring Dashboard
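
Beyond the dashboards, these metrics can also be read programmatically, which is handy for ad-hoc analysis. Below is a minimal sketch, not our production tooling, that fetches the last hour of Compute Engine CPU utilization with the google-cloud-monitoring Python client; the project ID is a placeholder.

import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-gcp-project"  # placeholder; substitute your own project

client = monitoring_v3.MetricServiceClient()

# Query the last hour of Compute Engine CPU utilization.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    instance = series.resource.labels.get("instance_id", "unknown")
    latest = series.points[0].value.double_value  # points are newest-first
    print(f"instance {instance}: CPU utilization {latest:.1%}")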

Pub/Sub Metrics: To ensure the health of our Pub/Sub system, we monitor metrics like the following (a sketch of an alerting policy built on one of these metrics follows the list):

  • Publish Message Count: This helps us monitor topic length and whether messages are being processed in a timely manner. It also makes it easy to see when we need to scale subscribers to handle the load.
  • Average Message Size: Larger messages can increase processing time and memory usage; monitoring this tells us when we need to reassess the message payload.
  • Oldest Unacked Message Age: This indicates a delay in message processing; a high value may mean that subscribers are overwhelmed or not keeping up with the message rate.
  • Publish to Ack Delta: High delta time indicates processing latency, which can be a sign of saturation in the subscriber processing pipeline.
GCP Pub/Sub Monitoring Dashboard
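
Dashboards alone don’t wake anyone up, so metrics like these are paired with alerting policies. The snippet below is a hedged sketch, not our production configuration, of creating a Cloud Monitoring alert on oldest unacked message age using the same Python client; the threshold, duration, and display names are illustrative.

from google.cloud import monitoring_v3

PROJECT_ID = "my-gcp-project"  # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

# Fire when any subscription's oldest unacked message is older than 10 minutes
# for 5 consecutive minutes, a sign that subscribers are not keeping up.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Oldest unacked message age > 10 min",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
            'AND resource.type = "pubsub_subscription"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=600,  # seconds
        duration={"seconds": 300},
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Pub/Sub backlog growing",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[condition],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print(f"Created alert policy: {created.name}")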

Application Layer Monitoring: This zeros in on the performance and reliability of our backend services. We measure the following (an illustrative snippet follows the list):

  • Latency: Monitoring latency helps us identify how fast our systems are. It is an important metric because it directly impacts user experience.
  • Throughput: This is the rate at which requests are processed by our backend services and helps us understand the volume of incoming traffic.
  • Transaction Traces: These provide detailed insights into the performance of individual requests and allow us to identify bottlenecks that affect a significant portion of our users.
  • Error Tracking: Error tracking is essential for maintaining the reliability of backend services. Real-time visibility into exceptions allows us to identify and resolve issues promptly, minimizing downtime.
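
Most of these signals come from our APM agent out of the box, but the underlying idea is simple. As a purely illustrative sketch that assumes a Flask service rather than describing our actual stack, the hooks below record per-request latency and surface error responses:

import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
log = logging.getLogger("request_metrics")

@app.before_request
def start_timer():
    # Remember when the request started so latency can be computed later.
    g.start_time = time.perf_counter()

@app.after_request
def record_metrics(response):
    start = getattr(g, "start_time", None)
    if start is not None:
        latency_ms = (time.perf_counter() - start) * 1000
        # In production these numbers would be shipped to an APM or metrics
        # backend; logging them here just illustrates the signals we care about.
        log.info(
            "method=%s path=%s status=%s latency_ms=%.1f",
            request.method, request.path, response.status_code, latency_ms,
        )
    return response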

What tools are we using?

We use a combination of third-party, open-source, free, and paid tools to accomplish our objective. Because there are so many options to choose from, we perform due diligence to evaluate the available tools before committing to one.

  • GCP Cloud Monitoring metrics, alerting policies, and dashboards for all of our GCP-hosted services.
  • New Relic APM for application performance. It provides transaction traces for white-box monitoring and pinpoints slow-running queries that we can continually improve on. Before choosing New Relic, our team evaluated other options like Grafana + Prometheus and Datadog, ultimately selecting New Relic because it met all our requirements.

New Relic’s distributed tracing helps us track and observe requests as they flow through distributed systems. Requests pass through various services to reach completion, and these services could be in a variety of places: containers, serverless environments, virtual machines, different cloud providers, or on-premises. We can see the path of an entire request across different services and quickly pinpoint failures or performance issues. The screenshot below shows one of our web transactions, which passes through four different microservices and databases.
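
New Relic instruments common web frameworks automatically, and the agent also exposes hooks for custom spans. The snippet below is a minimal sketch, with hypothetical function names, showing how the New Relic Python agent can wrap background work so that it also shows up in transaction and distributed traces:

import newrelic.agent

# Initialize the agent from its config file; "newrelic.ini" is a placeholder path.
newrelic.agent.initialize("newrelic.ini")

@newrelic.agent.function_trace(name="reports:fetch_raw_events")
def fetch_raw_events():
    ...  # hypothetical step: read raw events from the database

@newrelic.agent.function_trace(name="reports:aggregate_events")
def aggregate_events():
    ...  # hypothetical step: roll events up into report rows

# Background jobs are not web transactions, so wrap them explicitly
# to make them visible in APM and distributed traces.
@newrelic.agent.background_task(name="reports:nightly_rollup")
def nightly_rollup():
    fetch_raw_events()
    aggregate_events()

if __name__ == "__main__":
    nightly_rollup()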

  • Sentry for error monitoring. It provides instant visibility into errors and exceptions occurring in our application (a minimal setup sketch follows).
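
Getting started with Sentry is mostly a one-time initialization in each service. Here is a minimal sketch with the Python SDK; the DSN is a placeholder and the sample rate is illustrative:

import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    environment="production",
    traces_sample_rate=0.1,  # sample 10% of transactions for performance data
)

def risky_operation():
    # Hypothetical stand-in for application code that might fail.
    raise ValueError("something went wrong")

# Unhandled exceptions are reported automatically once the SDK is initialized;
# handled ones can be sent explicitly.
try:
    risky_operation()
except Exception as exc:
    sentry_sdk.capture_exception(exc)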

Below are screenshots of some of our monitoring dashboards.

Accelerate NewRelic Summary Dashboard
Sentry Alerts for errors popping up in our application

How monitoring helped us scale and optimize performance

We were observing slowness in our system, so we leveraged New Relic’s detailed transaction traces to pinpoint the exact APIs that were causing the bottleneck.

  • We identified the slowest APIs by applying the “most time-consuming” filter in New Relic.
NewRelic Transactions
  • The traces highlighted specific slow-running queries. With this granular insight, our engineering team was able to optimize the problematic APIs, streamline database queries, and refine our codebase (a generic illustration of this kind of fix follows the before/after screenshots below).
Transaction Trace
  • As a result, we significantly improved response times, enhancing the overall speed and efficiency of the application. In one of our APIs, we reduced the response time from ~40s to ~2s, a reduction of 95%.
Before Optimization — Response time ~40s
After Optimization — Response time ~2s
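
The screenshots above show the real traces; the snippet below is a generic illustration, not code from our repository, of the kind of fix such traces point to. Assuming a hypothetical Django-style model where Job has a foreign key to Customer, removing an N+1 query pattern can account for exactly this kind of order-of-magnitude improvement:

# Hypothetical models for illustration: Job has a foreign key to Customer.
from myapp.models import Job  # assumed import path

# Before: one query for the jobs, plus one extra query per job to load its
# customer (the classic N+1 pattern that shows up as many small DB spans).
def job_summaries_slow():
    return [f"{job.name}: {job.customer.name}" for job in Job.objects.all()]

# After: select_related joins the customer in the same query, collapsing
# N+1 database round trips into one.
def job_summaries_fast():
    return [
        f"{job.name}: {job.customer.name}"
        for job in Job.objects.select_related("customer")
    ]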

Even after reducing API response times, we struggled with performance issues during periods of high load. To diagnose this, we looked into our infrastructure monitoring tools.

Through detailed metrics, we identified that CPU utilization was consistently hitting its peak during these high-load times, causing system slowdowns.

GCP Monitoring Dashboard that helped figure out CPU Usage peak

We recognized the CPU as the bottleneck and increased its capacity by scaling up our server instances. This enhancement provided the necessary processing power to handle the increased load, resulting in smoother performance and improved reliability under heavy usage.

Conclusion

Having visibility into these metrics gave us a starting point and a baseline from which to continually improve.

I hope this post gives you some insight into monitoring your applications with a combination of free and paid tools, and into which metrics to focus on.

Regardless of your industry, the first step is always to identify what is important to your customers and list the key metrics to monitor. Then, conduct an in-depth analysis of the tools you’ve shortlisted.

I am sure you have your own ways of monitoring and alerting, and I’d love to learn how you’re doing it. If you have any suggestions or questions, feel free to share them in the comments below.
