The Core of Providing Site Reliability: Observability

Steven Sim
10 min read · Apr 9, 2022


The Cloud (source: Unsplash)

With the advent of Cloud Computing and the Internet in general, servers are mostly no longer hosted on-premises unless there is an explicit requirement to do so. But that raises a question: since your server is in the cloud and you don’t even know where it is physically located, how are you going to keep it up and running at all times? And most importantly, how do you know whether your application is healthy or not? Let’s go through the three most important pillars of observability.

The Pillars of Observability

The Pillars of Observability (source: Dynatrace)

Let’s get started with a simple question: “What is currently happening in my application?” That is probably the first question you will ask when you encounter a problem or an error. These three pillars of observability can help you answer exactly that.

Logs

Log (source: Unsplash)

Logs can be defined as “notes” on what your application has done. They can be as simple as print("registering new user") or produced with a proper logger library (Log4J, Logback, etc.) for your programming language. Logs are probably the first thing you should check whenever you encounter an error in your application, which also means your application should be well-logged (neither too many nor too few logs). A good self-check question is: “Do the logs I’m currently seeing help me enough in knowing what’s happening with the application?” If the answer is yes, your logging is probably already good enough.
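For instance, in Python you can move from bare print calls to the standard logging module with only a few lines. Here is a minimal sketch; the logger name, messages, and register_user function are just illustrative:

```python
import logging

# Configure once at application start-up: timestamp, severity level, and message.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("accounts")


def register_user(username: str) -> None:
    logger.info("registering new user %s", username)
    try:
        ...  # the actual registration logic would go here
    except Exception:
        # logger.exception() logs at ERROR level and attaches the stack trace
        logger.exception("failed to register user %s", username)
        raise
```

Compared to print, this gives you timestamps, severity levels, and stack traces for free, which is exactly the information you will want later.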

The most important logs that you should always save are error logs, but make sure they are not leaked anywhere, to keep your app safe! For example, in my SE project, logs tagged as ERROR had to be explicitly configured to be emitted in production mode (you can see our full logging settings here). These logs are then saved to journald on our server for future viewing using journalctl.
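Our actual settings are in the repository linked above; purely as an illustration, a Django LOGGING configuration along those lines could look like the sketch below (the handler layout and the use of the DEBUG flag are assumptions, not our literal config). Because the app runs as a systemd service, whatever it writes to stdout/stderr ends up in journald automatically.

```python
# settings.py (sketch): in production (DEBUG=False) only ERROR-level records
# reach the console handler; systemd then captures stdout/stderr into journald.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "DEBUG" if DEBUG else "ERROR",
        },
    },
    "root": {
        "handlers": ["console"],
        "level": "DEBUG" if DEBUG else "ERROR",
    },
}
```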

In the real case of our Software Engineering Project course, logs have helped us a lot in providing visibility into errors. For example, we faced an unknown error when accessing the Student Activity Log feature in our application. The first thing we did was check the logs, and fortunately for us we found something very valuable. (You can read the complete story here.)

The logs told us that a table was missing, meaning a migration had not run properly.

From there we knew exactly what the problem was: a database migration had failed earlier and gone unnoticed. So we checked the pipeline that ran the migration and made the needed fixes.

But to be honest, digging through logs on the CLI is painful, and giving SSH access to every developer who needs it is a security hole. The good news is that there is already a solution, the log visualizer, but before we get to it we need to know two more concepts: the log forwarder and the log database.

  • Log forwarder: an application that runs on your main app’s VM to scrape and process logs, then forward them to a log database or to another forwarder (e.g. Filebeat, Fluentd, etc.)
  • Log database: the database that stores all the log data; it is purpose-built to make querying time-series data fast and easy (e.g. Elasticsearch)
  • Log visualizer: the frontend application used to access and search through the logs stored in a log database (e.g. Kibana)

With that in mind, we’ll use Filebeat as a log forwarder to save logs into Elasticsearch, which we will then query through Kibana for visualization.

First, let’s create a Filebeat configuration file to set this up.
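A rough sketch of what such a filebeat.yml can look like is shown below; exact option names vary a little between Filebeat versions, and the renamed field names plus the Elasticsearch host and credentials are placeholders rather than our real values.

```yaml
# filebeat.yml (sketch): read journald entries from our systemd service,
# keep only the fields we care about, and ship them to Elasticsearch.
filebeat.inputs:
  - type: journald
    id: siskripsi-app
    include_matches.match:
      - _SYSTEMD_UNIT=siskripsi_app.service

processors:
  - rename:
      fields:
        - from: "journald.custom.level"    # placeholder source field names
          to: "log.level"
        - from: "journald.custom.logger"
          to: "log.logger"
  - include_fields:
      fields: ["message", "log.level", "log.logger"]

output.elasticsearch:
  hosts: ["https://elasticsearch.example.com:9200"]
  username: "${ES_USERNAME}"
  password: "${ES_PASSWORD}"
```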

As you can see, we first take the logs from a systemd service called siskripsi_app.service and process them (which includes renaming two fields and excluding every field other than the three we specify). Lastly, we send them to the specified Elasticsearch database.

Now let’s look at what the logs look like in Kibana (I’m using version 8.2.0).

What our logs look like in Kibana

You can see that we now have access to Elasticsearch’s powerful search feature, so let’s try searching for some errors.

Search results for “Error”
Surrounding logs on one of the error logs (this is a feature on Kibana)

You can see that this is a lot better than just using journalctl since everything is done through a GUI instead.

Also be aware of which kinds of logs you are exporting (access logs? custom logs? error logs? debug logs? etc.), because, coming back to the original question, too many or too few logs will only become a problem later on, exactly when you really need to look for information!

Metrics

Human “Metrics” using an Electrocardiogram (Shutterstock)

Metrics can be defined as time-series data that tell you, in numbers, what has happened or what is happening in your application. Some of the metrics you should monitor for your application are listed below (a small sketch of reading them on a VM follows the list):

  • CPU usage
  • Memory usage
  • Disk usage
  • Error rates, etc.
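The cloud dashboards shown next collect all of these for you; just to make the idea concrete, here is a tiny sketch of reading the same numbers yourself on a VM, using the third-party psutil package (which our project does not actually use):

```python
import psutil

# One snapshot of the basic host metrics; a real monitoring agent collects
# these periodically and ships them to a time-series database.
cpu_percent = psutil.cpu_percent(interval=1)        # % CPU used over 1 second
memory_percent = psutil.virtual_memory().percent    # % of RAM in use
disk_percent = psutil.disk_usage("/").percent       # % of the root disk used

print(f"cpu={cpu_percent}% mem={memory_percent}% disk={disk_percent}%")
```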

A good analogy is a heart rate monitor used in hospitals: if a patient’s heart rate goes below a certain threshold, immediate action has to be taken. The same goes for your application: if your server suddenly hits 99% CPU usage, you should immediately find out what is currently happening, using logs, thread dumps, etc., since it can be destructive to your availability.

A good way to see these metrics is through CloudWatch if you use AWS, or Google Cloud Monitoring on GCP. For example, our Software Engineering Project course’s staging server metrics at the time of writing look like this:

VM Metrics shown in GCP’s built-in Dashboard

But we are still missing the part where we get alerted when a certain threshold has been reached. Of course, we wouldn’t want to spend our lives checking the metrics every 10 minutes; apply the Hollywood Principle instead: “don’t call us, we’ll call you”. We can do this using Grafana. The high-level idea is to pull the metrics from GCP or AWS into Grafana and have Grafana evaluate them at a fixed interval, alerting us whenever a certain threshold is met.

Let’s start with how the dashboard itself looks in Grafana after we export the live data from GCP

Grafana Dashboard with the same data as before

Now, as an example, I will set up an alert so that whenever the memory usage reaches a certain point, it sends me an email telling me that the memory usage is already unacceptable and needs manual handling. We’ll set the threshold to 40% just to generate an alert for demonstration (in real cases, you should be alerted when usage reaches around 90% instead).

Grafana settings to set up an alert

Explanation: the rule evaluates the memory usage percentage every 1 minute and alerts me directly by email whenever the average memory usage is above 40% (the “for 0m” means alerting immediately instead of waiting to see whether the high memory usage persists for a certain time).

Email I received from the Grafana App about the alert I set up before

As you can see, not only are we monitoring the system, we are also alerted, through email in this case. There are better channels than email, though, for example PagerDuty (a paid service), which can also send you an SMS or even a phone call when a certain amount of time has passed without anyone acknowledging the event.

Tracing

Bottleneck in real life (source: Integrify)

Nowadays most applications are web-based, which means data has to travel through computer networks. And as some of you already know, transmitting data over a network introduces a lot of delay. This brings up a common question:

What’s causing this part of my application to be very slow?

Of course, the most primitive approach is to use an internal timer and log the elapsed time between the start and the end of an operation, something like the sketch below. But is there a better way to do this? Looking through a gazillion log lines of raw elapsed-time numbers is boring, and the hassle of setting up timers all over your code is extra effort on its own.
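For reference, the manual-timer approach looks roughly like this (a sketch, not code from our project; query_activity_logs is a hypothetical slow operation):

```python
import logging
import time

logger = logging.getLogger(__name__)


def fetch_activity_logs(student_id: int):
    # The "primitive" approach: wrap the operation in a timer and log the result.
    start = time.perf_counter()
    try:
        return query_activity_logs(student_id)  # hypothetical slow operation
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("fetch_activity_logs took %.1f ms", elapsed_ms)
```

Multiply this by every view and every query you care about, and it quickly becomes both noisy and tedious, which is exactly what tracing solves.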

Luckily, we now have something called auto-instrumentation tracing, which means you only need to write a few lines of code at the app’s boot-up to get complete data on what is taking so long in a single request, and there are good tracing visualizer UIs that you can use.

In this case we will use these tools for creating traces in our SE project:

  • OpenTelemetry: the tracing data generator in our Django web app; specifically, we’re going to use the auto-instrumentation variant
  • Jaeger Collector: the tracing data collector; the collected data will be visualized in the UI later
  • Jaeger Query: the tracing data visualizer

Here’s a good visualization of how a tracing architecture is put together with OpenTelemetry and Jaeger:

OpenTelemetry and Jaeger Architecture (by Yuri Shkuro)

Installing OpenTelemetry in a Django web app is very simple: all we have to do is install the required libraries through pip and then inject a bit of code into the manage.py file. Since we are also using PostgreSQL in our web app, we instrument the DB requests as well to make sure we get tracing data for the database too (you can see the full installation guide here). Roughly, these are the only changes we made to enable traces:

Django’s manage.py with OpenTelemetry instrumentation to Jaeger
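In outline, that kind of manage.py instrumentation looks like the sketch below; the service name, settings module, and Jaeger agent address are placeholders, and exact package names can differ slightly between OpenTelemetry releases.

```python
#!/usr/bin/env python
"""manage.py (sketch) with OpenTelemetry auto-instrumentation for Django and psycopg2."""
import os
import sys

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_tracing() -> None:
    # Send spans to a local Jaeger agent; host and port here are placeholders.
    provider = TracerProvider(resource=Resource.create({"service.name": "siskripsi-app"}))
    provider.add_span_processor(
        BatchSpanProcessor(JaegerExporter(agent_host_name="localhost", agent_port=6831))
    )
    trace.set_tracer_provider(provider)

    # Auto-instrument Django request handling and psycopg2 (PostgreSQL) queries.
    DjangoInstrumentor().instrument()
    Psycopg2Instrumentor().instrument()


def main():
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "siskripsi.settings")  # placeholder module
    setup_tracing()
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)


if __name__ == "__main__":
    main()
```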

With that, let’s generate some sample tracing data by using the application normally, so we can see an example breakdown of a single request to our app. For demonstration purposes, we are going to use a very slow database connection: a database with low resources, located in a region far away from the server itself. After trying it out myself, I ended up with a question:

“Why is the main page very fast before I log in, but after logging in everything seems to be very slow?”

And if we compare the tracing data for these endpoints:

  • the / path, which doesn’t use any DB queries
  • the /auth/login path, which uses DB queries to save the session and to look up credential data
  • the /log/ path, which uses a DB query to get the Activity Logs of the logged-in student

You can see that the obvious reason the /auth/login and /log/ requests are slow is the DB connection.

Traces generated for the / path; the request only took around 10ms to finish.
Traces generated for the /auth/login path; you can see that there is a huge connection delay before the DB queries even start, and the time taken by the DB queries themselves is already unacceptable. The request took 5 seconds to finish.
Traces generated for the /log/ path; although better than /auth/login, 3 seconds is still too long to be acceptable.

As you can see, there is a huge delay before our server even reaches the DB server, and since the DB server itself is very slow, the DB queries also take a long time to finish.

Therefore, one of the solutions is to move the DB to a closer location with better resources. Here’s a demonstration after moving the DB server:

New traces for the /auth/login path after moving the DB to a better location with better resources; the request now only takes about 200ms, with visibly faster queries.
New traces for the /log/ path; the whole request now finishes in only 45ms.

Extra: Effective Incident Management

Just like the three pillars of Scrum (transparency, inspection, and adaptation), the pillars of observability only help if you actually act on them: having an observability system without being aware of what it tells you is useless. As shown in the metrics part, Grafana has a built-in email alert feature, but based on my experience, most people don’t really turn on notifications for their own work email.

But most importantly, if you have a team, who is going to manage all these incidents? If the whole team is in charge, it will hurt development velocity. One answer is to have an on-call rotation: “on-call” simply means whoever is in charge of managing incidents if something happens, which means everyone who is not on-call can focus on development rather than being flustered by incidents.

One good way to do this is to create a weekly on-call schedule that rotates the on-call PIC (person in charge) to a different team member every week. With that, it is at least already clear who is in charge of discovering and digging into an incident if one were to happen.

In PagerDuty, a popular incident management tool, you can set up time-based on-call rotations so that only the current PIC gets notified as soon as possible.

Conclusion

“Ponder and deliberate before you make a move,” said Sun Tzu in The Art of War. This also applies to providing site reliability: you need to understand the situation correctly before you can make the right fix, and the three observability pillars are one of the ways to gain that understanding.
