Observability in Stocks

Our application that allows stores to check the physical stock of products and perform actions related to provisioning

Alejandro Capdevila
Mercadona Tech
7 min read · May 3, 2023

Introduction

This article explores the concept of observability and its importance in application monitoring. Using the Grafana platform as an example, it discusses how tracing and logging can provide insight into application performance and identify issues that may be impacting performance or reliability. It also offers a step-by-step guide for setting up a custom Context class to manage trace_ids and for using decorators to log function executions automatically. Finally, it discusses the benefits of using tools like Prometheus and Grafana to collect and visualize application performance data in real time, enabling teams to proactively identify and address issues before they impact users.

So… What is Stocks?

Stocks is an application that allows stores to check the physical stock of products and perform related actions. Unfortunately, the database is not under our control; another department of Mercadona manages it. Consequently, we rely on API calls to access the information we require. This means that the flow of information is as follows:

V6 → MERCADONA → MERCADONA DB → V6.

Orchestration

A bit of context

Our goal here is to have information on the state of our application. To do so, the first step is to start tracing all our application’s actions. A trace is a series of connected spans representing a single logical operation. And a span represents a single unit of work within a trace, such as a server request or a database query.
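
To make those two terms concrete, here is a tiny sketch of how a trace groups spans under a shared identifier. It is purely illustrative, not the instrumentation we actually run:

from dataclasses import dataclass, field
from typing import List
from uuid import uuid4

@dataclass
class Span:
    # a single unit of work inside a trace (an HTTP request, a DB query, ...)
    name: str
    start: float
    end: float

@dataclass
class Trace:
    # one logical operation, identified by a trace_id and made of connected spans
    trace_id: str = field(default_factory=lambda: str(uuid4()))
    spans: List[Span] = field(default_factory=list)

# e.g. a store checking the stock of a product
trace = Trace()
trace.spans.append(Span(name='GET /stock/<product_id>', start=0.0, end=0.4))
trace.spans.append(Span(name='call to the Mercadona stock API', start=0.1, end=0.35))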

By collecting and analyzing those traces, you can recreate the state of your application at a specific moment and understand how different components of your application interact with each other, identifying issues that may be impacting the performance or reliability of your application.

But be aware that tracing adds overhead to your application. Collecting traces requires additional processing and storage resources, which can impact the performance of your application.

Tracing requires planning and coordination. To get the most value out of tracing, you must carefully plan which spans to include in your traces and how to instrument your code to generate those spans. You must also coordinate with other teams or systems that may generate traces in your environment.

To the code and far beyond

Now that we know the theory, let’s get down to work. To link all the different spans in a trace, you can use a unique identifier for each trace called trace_id. One way to do this is to use a middleware that processes all the requests our API receives and sets a trace_id header on the request:

from uuid import uuid4

class TraceIdMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # request.headers is read-only in Django, so we store the value in
        # request.META, which is where incoming headers live (as HTTP_<NAME>)
        request.META['HTTP_TRACE_ID'] = str(uuid4())
        response = self.get_response(request)
        return response

The next step is to check if the request already provides a trace_id header and use it if it exists:

from uuid import uuid4

class TraceIdMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        trace_id = request.headers.get('trace_id')
        if not trace_id:
            trace_id = str(uuid4())
        # again, write to request.META because request.headers is read-only
        request.META['HTTP_TRACE_ID'] = trace_id
        response = self.get_response(request)
        return response

One problem here is that we are delegating all the responsibility of managing that trace_id to the middleware, when it shouldn’t have it. Also, we are setting the trace_id in a request header. That means that if we want to make use of it, we have to pass the request from one layer of our application to another.

As we are kind of lazy coders, adding that overhead to all of our functions seems a bit of an overkill (and also really tedious work). Instead, we can create our own custom Context class that has all the logic to manage the trace_id and makes use of the threading.local object to store that id during the lifetime of a request.

from uuid import uuid4
import threading

local = threading.local()

class Context:
    @staticmethod
    def set_trace_id(id):
        trace_id = id
        if not trace_id:
            trace_id = str(uuid4())

        local.trace_id = trace_id

    @staticmethod
    def get_trace_id():
        # None if nothing has been set yet (e.g. outside a request)
        return getattr(local, 'trace_id', None)

class TraceIdMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # use the incoming header if present; Context generates one otherwise
        Context.set_trace_id(request.headers.get('trace_id'))
        response = self.get_response(request)
        return response

And finally, if you’re using a Django app, you just have to do these final modifications:

from django.utils.deprecation import MiddlewareMixin

from context import Context

class TraceIdMiddleware(MiddlewareMixin):
    # MiddlewareMixin already provides __init__ and __call__, and its
    # __call__ takes care of invoking process_request for us
    def process_request(self, request):
        Context.set_trace_id(request.headers.get('trace_id'))

And don’t forget to set up your middleware class in the Django settings file:

MIDDLEWARE = [
    # ...
    'myapp.middleware.TraceIdMiddleware',
    # ...
]

This will enable the middleware for all requests in your Django project.
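
A nice side effect of the Context class is that any layer of the application can now read the trace_id without the request object being passed around. For example, an API client could forward it to the stock endpoints we depend on. This is just a hedged sketch: the URL and function name are made up for the example, and it assumes an HTTP client like requests:

import requests

from context import Context

def fetch_physical_stock(product_id):
    # no request object needed: the trace_id lives in the thread-local Context
    headers = {'trace_id': Context.get_trace_id()}
    # illustrative URL, not the real Mercadona stock endpoint
    response = requests.get(f'https://stock.example.internal/products/{product_id}', headers=headers)
    return response.json()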

Using the trace_id

Okay, now we have a trace_id for all requests that go through our API, but we have to use it somehow. One possible solution is for our logger to use that trace_id every time we log anything. If you are making use of the logging library included in Python, this is how to set up a filter that makes use of the trace_id:

import logging
from context import Context


class TraceIDFilter(logging.Filter):

    def filter(self, record):
        record.trace_id = Context.get_trace_id()
        return True

And then, add this filter to the logger config. A very basic config looks like this:

from logging import config

config.dictConfig({
    # ...
    'filters': {
        'trace_id': {
            '()': 'src.my_app.logging_filters.TraceIDFilter'
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'filters': ['trace_id'],
        },
    },
    'root': {
        'handlers': ['console'],
        'level': 'WARNING',
    },
    # ...
})

And you’re good to go. All the logs from the same request will share the same trace_id.
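
One detail worth spelling out: for the id to actually appear in the output, the handler also needs a formatter that references the trace_id attribute the filter sets. Here is a minimal, self-contained sketch; the format string and module paths are illustrative, not our production config:

import logging

from context import Context
from src.my_app.logging_filters import TraceIDFilter

handler = logging.StreamHandler()
handler.addFilter(TraceIDFilter())
# %(trace_id)s is the attribute set on every record by TraceIDFilter
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s [trace_id=%(trace_id)s] %(message)s'))
logging.basicConfig(level=logging.INFO, handlers=[handler])

Context.set_trace_id(None)  # normally the middleware does this; here it just generates a fresh id
logging.getLogger(__name__).info('checking physical stock')
# 2023-05-03 10:00:00,000 INFO [trace_id=8b56f1d4-...] checking physical stock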

Let’s start logging something

So now it’s time to use our trace_id everywhere. One of the first things that can be done is to start logging the start and the end of the execution of our functions. Something like this:

def my_function(*args, **kwargs):
    logging.info('Starting my_function execution')
    # code goes here
    logging.info('Finished my_function execution')
    return result

And automatically, those log lines will include the trace_id we set up before. But that would mean we would have to manually add them to all our functions, views, actions, etc. So here is where the decorator pattern comes to the rescue:

import logging
from functools import wraps

logger = logging.getLogger(__name__)

def logged_method(func):
    # build a readable name like 'views.my_view' from the function's module
    file_name = func.__module__.split('.')[-1]
    func_name = f'{file_name}.{func.__qualname__}'

    @wraps(func)
    def wrapper(*args, **kwargs):
        logger.info(f'Starting {func_name} execution')
        result = func(*args, **kwargs)
        logger.info(f'Finished {func_name} execution')

        return result

    return wrapper

And that’s it. Now we just need to decorate our functions:

from django.http import HttpResponse

@logged_method
def my_view(request):
    return HttpResponse('Hello, World!')

And the output will look like this:

Trace ID: 8b56f1d4-4c4a-4ba9-9d0f-5a5b5f5e5d5c
Message: Starting my_view execution
Start time: a_time
End time: another_time

Trace ID: 8b56f1d4-4c4a-4ba9-9d0f-5a5b5f5e5d5c
Message: Finished my_view execution
Start time: a_time
End time: another_time
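
One caveat with the decorator as written: if the wrapped function raises, no closing line is ever logged. A hedged variant (not the exact code we run) that also records failures with the same trace_id could look like this:

import logging
from functools import wraps

logger = logging.getLogger(__name__)

def logged_method(func):
    func_name = f"{func.__module__.split('.')[-1]}.{func.__qualname__}"

    @wraps(func)
    def wrapper(*args, **kwargs):
        logger.info(f'Starting {func_name} execution')
        try:
            result = func(*args, **kwargs)
        except Exception:
            # the failure is logged with the same trace_id, then re-raised
            logger.exception(f'Failed {func_name} execution')
            raise
        logger.info(f'Finished {func_name} execution')
        return result

    return wrapper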

Next steps

Prometheus

There is always room for improvement, and even though Elastic APM is a potent tool when your primary concern is monitoring the performance of your applications, it has some limitations in more complex environments. It does not scale as well, especially when storing infrastructure metrics in large setups.

Here is where Prometheus shines. Prometheus is a tool for monitoring and alerting that collects data about your systems and services, stores it, and allows you to query it to understand how your systems are performing. Combined with tools like Grafana, you can identify and fix problems with your systems, ensure that they are running smoothly, and get notified if there are any issues.

But why stick with just one? We can get the best of both worlds. We just need to set up Elastic APM to monitor the application performance and use Prometheus to collect and store metrics from the servers and other infrastructure components. Then we can make use of tools like Grafana to visualize both types of data in the same dashboard.

To do so, we need to:

  1. Set up a Prometheus server or servers to collect and store metrics from your infrastructure.
  2. Install and configure exporters on your servers to send metrics to Prometheus.
  3. Configure your Prometheus server to scrape metrics from the exporters.

And with that, you are ready to have complete control over the status of your systems. In case of failure, you will be able to replay all the steps that led your systems to fail and thus understand exactly what happened.
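
On the application side, exposing custom metrics for Prometheus to scrape can be as small as the following sketch, built with the official prometheus_client package. The metric name, port and values are made up for the example:

import random
import time

from prometheus_client import Gauge, start_http_server

# hypothetical metric; in Stocks the interesting numbers come from the API itself
pending_provisions = Gauge('stocks_pending_provisions', 'Provisioning actions waiting to be processed')

if __name__ == '__main__':
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        pending_provisions.set(random.randint(0, 10))  # placeholder value for the sketch
        time.sleep(15)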

Alerts

The team has connected the API with Prometheus and built its alerting on top of Prometheus’ own metrics. They use a wrapper around the Prometheus Python library to create these metrics, following a facade pattern to keep external libraries out of their domain and avoid any potential issues caused by updates.
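
As a rough idea of what such a facade can look like, here is a hedged sketch on top of prometheus_client; the metric and label names are invented for the example, not our real ones:

from prometheus_client import Counter

class Metrics:
    """Facade over prometheus_client so the rest of the code never imports it directly."""

    _stock_requests = Counter(
        'stocks_api_requests_total',  # invented metric name
        'Requests made to the external stock API',
        ['outcome'],
    )

    @staticmethod
    def stock_request_succeeded():
        Metrics._stock_requests.labels(outcome='success').inc()

    @staticmethod
    def stock_request_failed():
        Metrics._stock_requests.labels(outcome='failure').inc()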

Grafana

As a part of their observability efforts, the team has set up a panel in Grafana that allows them to monitor the performance of their applications in real time. The panel is connected to a TV, making it easy for the team to monitor how their applications are working at all times. This approach enables the team to proactively identify and address any issues, ensuring that their system runs smoothly and reliably.

Grafana panel of the namespace status

The absence of data in the visualizations we have set up in Grafana is a positive sign. This is because the visualizations are based on the data collected by our monitoring tools, which are set up to track every instance of system failure. Thus, the lack of data in the visualizations indicates that our system has been running smoothly without any significant issues. When there are problems, the monitoring tools will generate data, which will then be reflected in the visualizations.

This approach allows us to proactively identify and address issues in our system rather than simply reacting to problems after they occur. By tracking system failures and performance metrics in real time, our team is able to quickly identify and resolve any issues that arise, ensuring that our system runs smoothly and reliably.

Co-written by: Alejandro Capdevila & Juanjo Ponz
