Improve your business with Observability on AWS

Published in Storm Reply · 9 min read · Apr 4, 2024

As technology keeps changing, understanding and keeping an eye on complex systems matters more than ever: that’s where observability comes in. It helps us figure out not only how software works on the inside but also what is happening across the entire system and the technologies it relies on. In this article, we’ll explain observability in simple terms, talk about why it’s so useful in today’s digital world, and show how to implement it on AWS. Whether you’re already familiar with tech stuff or just curious, stick around as we explore how observability can make a big difference.

What is observability?

Observability means being able to observe and understand how a system operates by examining its external outputs, such as logs, metrics, traces, and events. In software development and system architecture, observability is essential for efficiently monitoring, debugging, and troubleshooting distributed systems, and for gathering evidence of the user experience in terms of performance and usability. This matters even more in today’s landscape, where systems are increasingly complex, services are containerized, and workloads run on cloud infrastructure. Coupled with the accelerated pace of software delivery and the fragmentation of teams, this makes observability indispensable in every system.

Why observability?

If you are wondering why a company should invest in observability, there are several answers:

Business Continuity: Observability automates monitoring of business Key Performance Indicators (KPIs), leading to faster incident resolution and higher user satisfaction. It shows the current business status and helps increase sales and meet SLA commitments.
Cost Optimization: By identifying and eliminating resource waste, observability reduces cloud infrastructure expenses, for example by informing scaling decisions and exposing overprovisioned resources.
Competitive Advantage: Observability allows organizations to react swiftly to market changes and customer needs, fostering rapid innovation and service improvement. This agility can expand market share and revenue by reducing time-to-market.
Enhanced User Experience: Smooth-running applications and services enhance user experience, directly impacting business value by increasing customer satisfaction and loyalty.

Observability Pillars

Traditionally, the industry defines observability in terms of logs, metrics, and traces, but in more complex cloud environments it can also include metadata, user behaviour, topology and network mapping, and access to code-level details.

Metrics answer the question “Do I have a problem?”. They can include various quantitative measures such as response times, error rates, throughput, and resource utilisation.

Logs answer the question “What is causing the problem?”. They are textual records that keep a historical track of important events, errors, or messages, and they are used to troubleshoot issues, debug code, or audit system behaviour.

Traces answer the question “Where is the problem?”. They are sequential records of events that provide a detailed timeline of the operations performed. They are especially helpful in systems made of many smaller parts (like microservices), where a request can pass through many different components.
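To make the three pillars concrete, here is a minimal sketch of a single request producing one metric, one log line, and one trace span that share the same trace ID. It uses only the Python standard library; the `Telemetry` class, the `handle_request` function, and the field names are hypothetical and do not correspond to any AWS API.

```python
import json
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    """In-memory stand-in for a telemetry backend (hypothetical)."""
    metrics: list = field(default_factory=list)  # "Do I have a problem?"
    logs: list = field(default_factory=list)     # "What is causing the problem?"
    spans: list = field(default_factory=list)    # "Where is the problem?"


def handle_request(telemetry, service, operation, work):
    """Run work() and record one metric, one log line, and one span for it."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status, error = "ok", None
    try:
        work()
    except Exception as exc:  # record the failure instead of propagating it
        status, error = "error", repr(exc)
    duration_ms = (time.perf_counter() - start) * 1000.0

    # Metric: a number that can be aggregated and alarmed on.
    telemetry.metrics.append(
        {"name": "request_duration_ms", "value": duration_ms, "status": status}
    )
    # Log: a structured record carrying the error details.
    telemetry.logs.append(json.dumps({
        "trace_id": trace_id, "service": service,
        "operation": operation, "status": status, "error": error,
    }))
    # Span: where and for how long the request ran.
    telemetry.spans.append({
        "trace_id": trace_id, "service": service,
        "operation": operation, "duration_ms": duration_ms, "status": status,
    })
    return trace_id
```

Because all three records carry the same `trace_id`, a spike in the metric can be followed to the matching log line and span, which is the kind of correlation the tools described in the rest of the article provide out of the box.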

Observability on AWS

AWS provides all the observability tools needed to cover the pillars described before.

There are two different approaches you can follow to build observability in the AWS Cloud: outside-in and inside-out. The first starts from the digital experience and then moves into application and infrastructure issues; the inside-out approach follows the opposite flow.

Infrastructure: It is necessary to observe the status of the infrastructure components, for example CPU or memory consumption, disk or network usage, or the execution time of a Lambda function. For this purpose, AWS provides tools such as CloudWatch logs, metrics, monitors, and dashboards to quickly visualise this kind of data.
Application: To inspect the service more deeply, we need application logs and data. AWS provides tools such as AWS X-Ray insights, CloudWatch ServiceLens, Container Insights, Lambda Insights, and Application Insights to achieve this goal. With these services it is possible to collect and analyse traces, logs, and metrics coming directly from the application.
Digital Experience: To analyse and observe the digital experience of a user, AWS provides tools such as CloudWatch Synthetics, to run simulated tests of API calls; CloudWatch RUM (Real User Monitoring), to inspect the real behaviour of end users; and CloudWatch Evidently, which allows developers to introduce experiments and feature management in their application code.

Observability and SRE Practice on AWS

SLA, SLO, and SLI are parameters used in SRE practice to measure how available and reliable a system is, and they can be applied at different levels.

An SLA (Service Level Agreement) normally involves a promise about availability to someone using your service. For example, the promise that the application will be up and running 99.99% of the time.
SLOs (Service Level Objectives) are measurable targets that help you ensure service quality for your customers. For example, the responsiveness of a system, the error rate of a service, or the percentage of budget respected.
SLIs (Service Level Indicators) are direct measurements of your service’s behaviour, for example the actual time the application has been running or the number of errors.
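As a worked example of how these three parameters relate, the sketch below turns an availability target into the downtime it allows and checks a measured SLI against an SLO. The function names and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent, window_days):
    """Downtime an availability target tolerates over a time window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def availability_sli(up_minutes, total_minutes):
    """Measured availability, as a percentage of the window."""
    return 100 * up_minutes / total_minutes


def meets_slo(sli_percent, slo_percent):
    """True when the measured indicator reaches the objective."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    window_days = 30                       # assumed measurement window
    total = window_days * 24 * 60          # 43,200 minutes
    budget = allowed_downtime_minutes(99.99, window_days)
    print(f"99.99% over {window_days} days allows {budget:.2f} minutes of downtime")

    sli = availability_sli(up_minutes=total - 10, total_minutes=total)
    print(f"10 minutes of downtime gives an SLI of {sli:.3f}%")
    print("SLO met" if meets_slo(sli, 99.99) else "SLO missed")
```

With these numbers, a 99.99% target over 30 days leaves roughly 4.3 minutes of downtime, so a single 10-minute outage is enough to miss the objective.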

To meet these KPIs, some metrics can be used to monitor the health of the entire system: these are the Golden Signals.

Latency: the time between the request and the response.
Traffic: the total number of requests across the network.
Errors: the number of requests that fail.
Saturation: the load on your network and servers.
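The four signals can be derived from plain request records. The sketch below is one way to do it, under stated assumptions: latency is summarised as a nearest-rank 95th percentile, and saturation is approximated as traffic divided by a known capacity, which is a simplification since real saturation usually comes from resource metrics such as CPU or memory. The `Request` type and the `golden_signals` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    failed: bool


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarise a window of requests into the four golden signals."""
    if not requests:
        raise ValueError("at least one request is needed")

    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * n + 99) // 100)

    traffic_rps = n / window_seconds
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": traffic_rps,
        "error_rate": sum(1 for r in requests if r.failed) / n,
        # Assumption: capacity_rps is the request rate the system can sustain.
        "saturation": traffic_rps / capacity_rps,
    }


if __name__ == "__main__":
    # 20 requests in a 10-second window, two of them failed.
    sample = [Request(latency_ms=10.0 * i, failed=(i > 18)) for i in range(1, 21)]
    print(golden_signals(sample, window_seconds=10, capacity_rps=4))
```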

All these concepts are available on AWS, which lets you configure and monitor them. In CloudWatch, you can find a section dedicated to SLOs under the Application Signals page. From there you can configure your own SLOs and SLIs based on service operations or CloudWatch metrics.

Example of observability in an AWS EKS cluster

Let’s see how we can implement observability in an AWS account, and specifically in an EKS cluster, following the inside-out approach described previously.

1. Once the EKS cluster is configured, we can install the “Amazon CloudWatch Observability” add-on, which is a plug-and-play feature for the service. With a few clicks, select the cluster, go to the “Add-ons” tab, and search for “Amazon CloudWatch Observability”. This makes it possible to collect data from the cluster.

2. Once the add-on is installed, AWS automatically enables Container Insights, which provides built-in dashboards with the main metrics and logs for our containers. You can find them under the “Dashboard” section in the CloudWatch service by selecting “Automatic Dashboard”.

3. Now we must go deeper, to the application level, with CloudWatch Container Insights, which provides cluster-, node-, and pod-level metrics and logs. When you deploy services inside your cluster, you can also see a map of these services and all the relations between them. And all of this is automatically generated for you!

*The service map is available under the Application Signals section in CloudWatch.*

Moreover, you can exploit the potential of AI to inspect and analyse logs. At the time of writing, in the US East (N. Virginia) and US West (Oregon) regions, you can write queries in natural language to filter logs and retrieve the most relevant information by writing a simple sentence!

4. Finally, we have to create visibility into the customer’s digital experience by defining canaries and SLOs. Amazon CloudWatch Synthetics allows the creation of canaries, which monitor endpoints and APIs by simulating customer behaviour. This means you can keep checking what the customer experience is like even when no customers are using the service, which helps detect potential issues before customers face them. Service Level Objectives are crucial for tracking your services’ performance over time and ensuring they meet your expectations, particularly if you have agreements with customers (SLAs). Application Signals simplifies the process by automatically collecting metrics like latency and availability, which can be used as SLIs.
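To illustrate the idea behind a canary, here is a minimal sketch of a synthetic check written with the Python standard library. It is not the CloudWatch Synthetics API: the `run_check` and `availability_percent` functions, the latency threshold, and the example URL are all hypothetical. The probe is injectable so the logic can be exercised without network access.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]
    latency_ms: float
    error: Optional[str] = None


def http_status(url: str, timeout: float = 5.0) -> int:
    """Default probe: perform a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_check(url: str,
              probe: Callable[[str], int] = http_status,
              max_latency_ms: float = 1000.0) -> CheckResult:
    """Call the endpoint once, as a simulated customer would."""
    start = time.perf_counter()
    try:
        status = probe(url)
    except Exception as exc:  # timeouts, DNS failures, HTTP errors
        latency_ms = (time.perf_counter() - start) * 1000.0
        return CheckResult(url, False, None, latency_ms, repr(exc))
    latency_ms = (time.perf_counter() - start) * 1000.0
    ok = 200 <= status < 300 and latency_ms <= max_latency_ms
    return CheckResult(url, ok, status, latency_ms)


def availability_percent(results: List[CheckResult]) -> float:
    """Share of successful checks: a simple availability SLI."""
    if not results:
        raise ValueError("no check results")
    return 100.0 * sum(1 for r in results if r.ok) / len(results)


if __name__ == "__main__":
    # Hypothetical endpoint; replace it with one of your own APIs.
    result = run_check("https://clinic.example.com/api/visits")
    print(result)
```

Running such a check on a schedule and feeding the results to `availability_percent` gives an availability figure that can be compared with an SLO even when no real customer is using the service.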

How we can troubleshoot an incident with AWS Observability

We saw how to implement an observability platform leveraging AWS services; now we are ready to understand how to use it to troubleshoot a real issue.

Imagine you provide a digital service for veterinary clinics. Using your service, a clinic can manage all its visits, view past ones, see and manage the list of doctors working in the clinic, and register and manage new pets. If the service is not working, the entire clinic cannot carry on with its daily activities, so the IT chief decides to implement observability on AWS.

Example 1 — From Service MAP

First of all, you can go to the CloudWatch service to get more information about the status of all the services in the system.
In “Service Map”, under the “Application Signals” section, you can see the relations between the different microservices. Simply by clicking on a service, you can see the golden signals, the request volume, the error rate, unhealthy checks, and more. From here you can see whether there is some kind of issue in your application.
If an SLI is unhealthy for a particular service, you can click on the name of the failing SLO to analyse the issue in more depth.
Under the SLO section you can see the operation that is causing the error. By clicking on it, you can get more details such as latency, errors, and faults.
By clicking a point in any graph, you can see the correlated traces.
You can see the traces that are failing, click on one, and see all the spans of the trace. From here you can analyse exceptions, stack traces, and error messages to identify the root cause.
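The last step, reading the spans of a failing trace to find where the error started, can be sketched in a few lines. The heuristic below is an assumption made for illustration, not how CloudWatch works internally: a failing span whose children all succeeded is treated as the likely origin, because errors tend to propagate upwards to the callers. The `Span` type and the service names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    error: Optional[str] = None


def root_cause_spans(spans: List[Span]) -> List[Span]:
    """Return the failing spans that have no failing child span."""
    # IDs of spans that have at least one failing child.
    parents_of_failures = {s.parent_id for s in spans if s.error is not None}
    return [
        s for s in spans
        if s.error is not None and s.span_id not in parents_of_failures
    ]


if __name__ == "__main__":
    trace = [
        Span("a", None, "frontend", error="HTTP 500"),
        Span("b", "a", "visits-service", error="HTTP 500"),
        Span("c", "b", "pet-db", error="connection timeout"),
        Span("d", "a", "vets-service"),
    ]
    for span in root_cause_spans(trace):
        print(f"{span.service}: {span.error}")
```

In this made-up trace the frontend and the visits service both report an error, but only the database span has no failing child, so its error message and stack trace are the ones to inspect first.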

Example 2 — From User Experience

To check whether some APIs are failing and to catch issues before users do, go to the CloudWatch service and open “Synthetics” under the “Application Signals” section.
Here you can see which Synthetics canaries are failing and inspect them to get insight into which step is causing issues. You can see all the “Steps Executed” and check the one that fails.
In the “trace” tab you can see the trace that is failing; by clicking the “Go to trace map” button you can see all the spans of the trace.
By selecting the span that fails, you can explore the “Exception” tab, where you can find the error message and stack trace.

Conclusion

We learned what observability is, what its key features are, why a company should implement it, and the main value it brings, on both the technical and the business side.

Moreover, we saw together how Amazon Web Services can help organizations troubleshoot their applications, improve performance, and enhance the user experience by leveraging cloud-native services integrated with your workload.

I hope this brief article helps you implement your observability roadmap. If it has piqued your curiosity, feel free to suggest in the comments any topics you would like to explore further!