Live error tracking @redbus

Telemetry, Enhanced stack trace, APM, Tracing and more.

Amit Kumar
redbus India Blog
8 min read · Jan 6, 2021


Elastic Stack 7.10.0 new features @redBus By Amit Kumar

Why do we need Live Error Tracking?

For a B2C application in the real world, there are likely 2,050+ combinations of real desktop and mobile web browsers. To track performance, corner cases, and load time across all of them, we need a live error tracking mechanism.

Our users come from Tier-1, Tier-2 and Tier-3 cities, which adds the complexity of different browser, device, and network combinations. Hence we needed strong live error tracking, customised for our setup.

Motives

  1. To give developers debug information for a better understanding of the issue. This includes device, browser, referrer, latency, headers, frequency of errors, and a lot more.
  2. To reduce the turnaround time for any production issue by 60-80%.
  3. To make error analysis easy, so that we can redefine our test cases accordingly.
  4. To segregate third-party errors from infrastructure errors.
  5. To capture the referrer URL in order to trace the error scenario.
  6. To control the level of logging, so that additional information is available for better debugging.

Now that we understand the importance of error tracking, let’s dive into the details of the approach we followed at redBus.

Elastic Stack 7.10.0

A good logging system is one of the most important aspects to take care of in a project. There are multiple kinds of logs to collect, such as server logs, error logs, and network traffic logs. In this post we discuss one popular logging setup, known as the ELK Stack.

The microservice-based architecture at redBus brings benefits in terms of maintenance, independent releases, and scaling systems individually, but building a centralised logging system for it is a challenge!

Centralised logging system, FTW!

The logs for all these systems live in different places and can be of any type (application-level logs, user logs, server-side logs…). Finding the logs related to an issue means bringing together, in one place, the logging systems of different applications, which may be built on entirely different tech stacks.

We have explored different ways to achieve this, and here we discuss the latest approach we have tried at redBus: the Elastic Stack (ELK Stack).

Before diving into the details of the ELK Stack implementation at redBus, let’s first briefly explore the internals of the Elastic Stack.

What is Elastic Stack?

In generic terms, the Elastic Stack is a combination of different software components that work together to provide a centralised logging system for different sets of logs, e.g. application logs, Docker logs, server logs, Nginx logs, etc.

Courtesy: logz.io

4-pillars of Elastic Stack

  1. Elasticsearch - An open source search and analytics engine developed by Elastic.
  2. Kibana - A visualisation tool that sits on top of Elasticsearch to present the data stored in it.
  3. Logstash - A connector that helps add data to Elasticsearch from different sources.
  4. Beats - Agents installed on the machines from which we want to collect logs of specific types.

Now let’s see how these pieces connect together, and work like magic!

As depicted in the picture, this is how data flows through the full setup. On each server we install the types of Beats needed for the different kinds of logs, and push the data to Elasticsearch through Logstash.

Kibana sits on top of this setup, where you can visualise and analyse your logs for debugging purposes.

Beats and Logstash are not mandatory; you can simply use Elasticsearch and Kibana for your logging.
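Whichever path the logs take, what ends up in Elasticsearch is a set of JSON documents. As a minimal sketch (not our production shipper), the Python snippet below builds the newline-delimited body that Elasticsearch's `_bulk` API expects: one action line followed by one document line per event. The index name `web-errors` and the event fields are assumptions made up for illustration, and the snippet only builds the payload; it does not send it anywhere.

```python
import json


def build_bulk_body(index, events):
    """Build an NDJSON body for the Elasticsearch _bulk API.

    Each event becomes two lines: an "index" action line naming the
    target index, then the JSON document itself.
    """
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical client-side error events (field names are assumptions).
sample_events = [
    {"message": "TypeError: x is undefined", "browser": "Chrome", "os": "Android"},
    {"message": "Script error.", "browser": "Safari", "os": "iOS"},
]

bulk_body = build_bulk_body("web-errors", sample_events)
print(bulk_body)
```

In a real setup this body would be POSTed to the cluster's `_bulk` endpoint with the `application/x-ndjson` content type, which is roughly the work Beats and Logstash take off our hands.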

Let’s explore the top features of elastic which we are using at redBus

A few of these features were launched in Elastic Stack 7.10.0; the others come from previous versions.

  1. User level data gathering,
  2. Third-party monitoring,
  3. Real-time error monitoring both at the server and the client,
  4. Anomaly Detection setup,
  5. Visualise your analytics and APM data in canvas,
  6. Custom Setup in Kibana for funnel visualisation

USER LEVEL DATA GATHERING (USER EXPERIENCE)

As you can see above, we can collect all the user-level information we need to measure the performance of the web application across devices, operating systems, browsers, and so on. We also get “page views” data and the distribution of requests at geo level. The best part is that this feature is real-time, which helps us detect and diagnose issues as they happen in production.

Below is a sample analysis of a specific page based on its URL.

We can perform this analysis on any endpoint in our system in a similar fashion.

THIRD-PARTY MONITORING

When you work on a web-based application, many third-party scripts fire on load of the application. GA, GTM and Gamooga, to name a few, are used for collecting user-level information or for analytics. We need to keep an eye on these scripts in production to protect the speed of the web application during initial load (TTI).

Below is a comparison of homepage load time, with and without third-party scripts, for the current redBus mobile web application.

This graph clearly shows that the impact of third-party scripts is very high, so we need a proper mechanism to monitor them and optimise further. For this, Elastic 7.10.0 provides a RUM agent.

As per Elastic Stack, Real User Monitoring (or RUM) captures the user interaction with clients such as web browsers. The JavaScript Agent is Elastic’s RUM Agent.

Below is a sample of monitoring one of the third-party scripts, with tracing enabled.

With the help of ELK monitoring, we have been able to suggest enhancements and improvements to our third-party vendors.
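To make the idea concrete, here is a minimal Python sketch of the kind of aggregation this monitoring enables: grouping resource load durations by third-party host. It is not the RUM agent's own logic. The first-party domain `example.com`, the `url` and `duration_ms` field names, and the sample numbers are all assumptions for illustration. Because resources load in parallel, the summed durations are only indicative and do not add up to page load time.

```python
from collections import defaultdict
from urllib.parse import urlparse

FIRST_PARTY_DOMAIN = "example.com"  # assumption: replace with your own domain


def is_third_party(url, first_party_domain=FIRST_PARTY_DOMAIN):
    """Return True when the URL's host is outside the first-party domain."""
    host = urlparse(url).hostname or ""
    return not (host == first_party_domain or host.endswith("." + first_party_domain))


def third_party_time_by_host(resources):
    """Sum resource durations per third-party host, slowest host first."""
    totals = defaultdict(float)
    for resource in resources:
        if is_third_party(resource["url"]):
            totals[urlparse(resource["url"]).hostname] += resource["duration_ms"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical resource timings for one page load.
sample_resources = [
    {"url": "https://www.example.com/app.js", "duration_ms": 120.0},
    {"url": "https://www.googletagmanager.com/gtm.js", "duration_ms": 300.0},
    {"url": "https://www.google-analytics.com/analytics.js", "duration_ms": 180.0},
    {"url": "https://www.googletagmanager.com/gtag/js", "duration_ms": 50.0},
]

for host, total_ms in third_party_time_by_host(sample_resources):
    print(f"{host}: {total_ms:.0f} ms")
```

A ranking like this is a useful starting point for a conversation with a vendor, since it points at the specific host that costs the most time.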

REAL-TIME ERROR MONITORING (BOTH SERVER AND CLIENT)

This is one of the most powerful features of the Elastic Stack. It has helped us monitor issues on individual user devices in production and capture details like:

  1. Browser versions
  2. OS Details
  3. Device information
  4. more…

This is sample data collected from the client side in production. We get all the details of the browser where the error is happening and the type of error, including third-party errors, in real time.

Server-side errors:
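Once such events are collected, the frequency of each error per browser and OS is a simple aggregation. The Python sketch below shows the idea on a handful of made-up events; the field names `browser`, `os`, and `message` are assumptions for illustration and not the exact schema Elastic APM stores.

```python
from collections import Counter


def error_frequency(events):
    """Count errors per (browser, os, message), most frequent first."""
    counts = Counter(
        (
            event.get("browser", "unknown"),
            event.get("os", "unknown"),
            event.get("message", "unknown"),
        )
        for event in events
    )
    return counts.most_common()


# Hypothetical client-side error events.
sample_errors = [
    {"browser": "Chrome 87", "os": "Android 10", "message": "TypeError: x is undefined"},
    {"browser": "Chrome 87", "os": "Android 10", "message": "TypeError: x is undefined"},
    {"browser": "Safari 14", "os": "iOS 14", "message": "Script error."},
]

for (browser, os_name, message), count in error_frequency(sample_errors):
    print(f"{count}x {message} on {browser} / {os_name}")
```

In Kibana the same breakdown comes from aggregating on the corresponding fields; the sketch only illustrates what is being counted.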

ANOMALY DETECTION SETUP

This is one of the features that is very useful for analysing production traffic at runtime; it lets you detect any deviation from the expected range.

As you can see in the image below, there is a sudden drop on 16th November, with an anomaly score of 61, which falls in the ‘critical’ category.

This helps us analyse changes in the expected traffic or sales and act on them!
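To illustrate what "outside the expected range" means, here is a deliberately simple Python sketch. It is not the model Elastic's machine learning uses; it swaps in a basic rolling mean and standard deviation band. The window size, the multiplier `k`, and the sample traffic numbers are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, k=3.0):
    """Flag points that fall outside mean +/- k * stdev of the previous `window` points.

    Returns a list of (index, value, lower_bound, upper_bound) tuples.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        lower, upper = mu - k * sigma, mu + k * sigma
        if not (lower <= values[i] <= upper):
            anomalies.append((i, values[i], lower, upper))
    return anomalies


# Hypothetical daily traffic with a sudden drop on the eighth day.
daily_traffic = [100, 102, 98, 101, 99, 103, 100, 40, 101]

for index, value, lower, upper in find_anomalies(daily_traffic):
    print(f"day {index}: {value} is outside [{lower:.1f}, {upper:.1f}]")
```

The band here plays the role of the grey area in the Kibana chart: values inside it are treated as expected, and values outside it are flagged.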

The other important part of this detection is alerts!

Let’s explore the alerts setup here.

You get this popup when you start the datafeed for this job. On selecting “Create watch after datafeed has started”, you enable alerts that trigger when the graph goes beyond the grey area.

Different types of alerts can be configured with Elastic 7.10.0, triggered based on the criteria below:

Kibana has a very smooth integration for sending these alerts:

VISUALISE YOUR SALES AND APM DATA IN CANVAS

There are many situations where you want to share business information with your managers. Usually this is done in Excel or in a PPT. With Elastic, we can present this information to the management team in Canvas. The good part is that it is connected to real-time backend data, so changes are reflected immediately on the canvas or dashboards.

CUSTOM SETUP IN KIBANA FOR FUNNEL VISUALISATION

There are different use cases where we need to add custom data to APM in order to make our data visualisation more accurate.

Elastic APM provides a way to add custom labels to our logs, which makes our data analysis easier.
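As a rough sketch of how labelled data turns into a funnel, the Python snippet below counts how many distinct sessions reached each step. The label name `funnel_step`, the `session_id` field, and the step names are hypothetical and chosen only for illustration; they are not the labels we use in production, and the counting here stands in for what a Kibana visualisation would do on the indexed data.

```python
def funnel_counts(events, steps):
    """Count distinct sessions seen at each funnel step, in step order."""
    sessions_per_step = {step: set() for step in steps}
    for event in events:
        step = event.get("labels", {}).get("funnel_step")
        if step in sessions_per_step:
            sessions_per_step[step].add(event["session_id"])
    return [(step, len(sessions_per_step[step])) for step in steps]


# Hypothetical labelled events from three sessions.
funnel_steps = ["search", "seat_select", "payment"]
labelled_events = [
    {"session_id": "s1", "labels": {"funnel_step": "search"}},
    {"session_id": "s1", "labels": {"funnel_step": "seat_select"}},
    {"session_id": "s2", "labels": {"funnel_step": "search"}},
    {"session_id": "s3", "labels": {"funnel_step": "search"}},
    {"session_id": "s3", "labels": {"funnel_step": "seat_select"}},
    {"session_id": "s3", "labels": {"funnel_step": "payment"}},
]

for step, sessions in funnel_counts(labelled_events, funnel_steps):
    print(f"{step}: {sessions} sessions")
```

Counting distinct sessions rather than raw events keeps a user who repeats a step from inflating that step's number.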

FINAL WORDS

To conclude, I would suggest going through the ELK Stack documentation in detail; Elastic has a wide range of documentation covering everything we need to know.

This is not the full picture of the wide range of features ELK offers, but in this article we have tried to highlight the most impressive ones, which we use widely at redBus.

Follow me on https://amitkumar-v.medium.com/
