Leveraging Correlation IDs to Increase Backend System Observability

Arda
Insider Engineering
6 min read · Jun 26, 2023
Photo by Luca Bravo on Unsplash

Developing and managing distributed systems, especially those utilizing a microservices architecture, is far from easy. Unlike monolithic architectures where everything is bundled into one, distributed systems divide applications into smaller, independent services that communicate with each other. This modularity increases flexibility and scalability, but it also makes debugging and troubleshooting considerably harder, because a single request can now touch several services, each with its own logs.

When we truly understand the complexities of managing distributed systems, we start to realize that we need to switch from “monitorable” systems to “observable” systems. These concepts might sound the same, but in the world of distributed systems the difference is significant and can drastically affect how easy the system is to manage and troubleshoot.

Preparing Our Mind for Observability

A monitorable system can be compared to a pre-set dashboard that provides you with a series of predefined metrics or logs about the system’s state, similar to a web server showing you data like server load, response times, and error rates. It provides a high-level snapshot of the system’s health, which can be vital for identifying potential issues. However, these metrics are typically predefined, focusing on the system behaviors that are expected or known.

On the other hand, observability is similar to being able to dynamically query your system to learn about its state in ways that weren’t predefined. Imagine that you can query your system instead of only looking at some specific data. It allows you to dive deeper into your system and investigate unexpected issues. When a distributed microservices system is in production, we can’t foresee all the potential issues we might encounter. To truly have an observable system, we need the capacity to ask it arbitrary, unanticipated questions to uncover the root cause of problems.

The cornerstone of an observable system is a strong logging system. Logs are critical for tracking a workflow from beginning to end, understanding what went wrong in the event of an error, and debugging issues effectively.

Example Architecture and Logs Without Correlation IDs

Let’s consider an example of a distributed microservices architecture. Suppose we have three microservices: order, shipment, and warehouse.

An Example Architecture

Each service logs its operations, but without a means to distinguish between different transactions, it can be nearly impossible to trace the journey of a specific order through these services.

Consider a case where both UserX and UserY order the same item: item1.

UserX and UserY order item1

After each order is received from users, order service creates a job for shipments.

Shipment service queries warehouse service

The shipment service synchronously queries the warehouse service as a final check of item availability.

Warehouse service responds to requests from the shipment service

Let’s say the warehouse service experiences a database outage and can’t query its own database, and thus can’t respond to other services.

Overall workflow

Both the warehouse and shipment services log error messages about the failure, but without a concrete reference to a purchase workflow/transaction.

Also keep in mind that these services may log thousands of identical messages about this failure, possibly for the same items.

As the engineer of this system, how easily can you correlate all the logs (especially errors) regarding a single workflow? Is it easy to identify why the order request is successful but not the shipment? Many people do not think so. The question then becomes: how can we truly leverage the potential of a robust logging system in a distributed microservices architecture?

Correlation IDs

The answer to the question above lies in the concept of “correlation IDs.”

A correlation ID is a unique identifier generated at the beginning of a transaction. This ID is then propagated across the components and services in the system and attached to every single log event associated with that specific transaction. With a correlation ID, we can trace the entire path of a transaction across multiple microservices, databases, messaging queues, and other components of a distributed system.

Where to Generate?

A correlation ID is usually generated at the start of a transaction or request. It should be done as early as possible. You can simply use UUIDs for the format of the correlation ID.
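As a minimal sketch of this step, assuming a Python service, the ID can be created with the standard `uuid` module before any other work happens. The `handle_incoming_order` function and its return shape are hypothetical and only illustrate where the generation would sit.

```python
import uuid


def new_correlation_id() -> str:
    """Return a fresh, random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def handle_incoming_order(user: str, item: str) -> dict:
    """Hypothetical entry point: the ID is created before any other work."""
    correlation_id = new_correlation_id()
    # Everything that happens for this order from here on can carry the ID.
    return {"correlation_id": correlation_id, "user": user, "item": item}
```

Generating the ID at the very first entry point means even the earliest log lines of the transaction can be tied to it.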

How to Propagate?

The correlation ID needs to be passed to every part of the system that the transaction interacts with. It often involves including the ID in HTTP headers or message queue properties.

You can use the X-Correlation-ID header in HTTP requests and a similar property in message queues.
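A rough sketch of header-based propagation, assuming headers are available as a plain mapping: a service reuses the ID it received and only generates a new one when none is present. The helper names `extract_correlation_id` and `outgoing_headers` are illustrative, not part of any framework.

```python
import uuid
from typing import Dict, Mapping, Optional

CORRELATION_HEADER = "X-Correlation-ID"


def extract_correlation_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's ID if present, otherwise start a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == CORRELATION_HEADER.lower() and value:
            return value
    return str(uuid.uuid4())


def outgoing_headers(
    correlation_id: str, extra: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Build headers for a downstream call that carry the same ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers
```

The resulting dictionary can be passed to whichever HTTP client the service already uses; for a message queue, the same value would go into a message property or attribute instead.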

How to Attach Correlation ID to Logs?

Your logging system should be modified to automatically include the correlation ID in every log message related to the transaction. After this, every log message in your system will have some form like this:

[4f9ab9fb-3a78-430a-9c41-ac1c442740aa] Order service received request for item 1
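One possible way to get this format, assuming Python's standard `logging` module, is a logging filter that copies the current ID from a context variable onto every record. The names `correlation_id_var`, `CorrelationIdFilter`, and `build_logger` are made up for this sketch, and the in-memory stream only stands in for a real log destination.

```python
import contextvars
import io
import logging

# Holds the ID of the transaction currently being processed.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True  # never drop the record, only enrich it


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(correlation_id)s] %(message)s"))
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


stream = io.StringIO()
logger = build_logger("order-service", stream)
correlation_id_var.set("4f9ab9fb-3a78-430a-9c41-ac1c442740aa")
logger.info("Order service received request for item 1")
print(stream.getvalue(), end="")
```

Using a context variable rather than a global keeps the ID scoped to the current thread or asyncio task, so concurrent requests do not see each other's IDs, and application code does not need to pass the ID to every logging call.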

How to Query?

To retrieve logs for a specific transaction, you need a centralized log management system where you can search for a specific correlation ID.

Examples include Amazon CloudWatch, Datadog, and the Elasticsearch-Logstash-Kibana (ELK) stack.

The Example Architecture and Logs With Correlation IDs

Now, let’s revisit our earlier example of the order, shipment, and warehouse microservices, but this time with correlation IDs.

When an order is placed, we generate a correlation ID at the beginning of the transaction. As the request traverses the order, shipment, and warehouse services, the correlation ID is logged with each operation.

Order Service

Order service receives requests and assigns correlation IDs for each transaction. It creates a shipment job and attaches the correlation ID to the job.

Shipment service receives jobs

The shipment service receives the job and the correlation ID along with it. It attaches the correlation ID to the synchronous HTTP request to the warehouse service (possibly using the X-Correlation-ID header).
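The hand-off from the order service to the shipment service, and onward to the warehouse service, could be sketched as follows. This is a simplified illustration: an in-process `queue.Queue` stands in for a real message broker, the job fields are assumptions, and the warehouse call is represented only by the headers that would be sent.

```python
import json
import queue
import uuid

# Stand-in for a real message broker (assumption for this sketch).
shipment_jobs: "queue.Queue[str]" = queue.Queue()


def create_shipment_job(user: str, item: str) -> str:
    """Order service: assign a correlation ID and attach it to the job."""
    correlation_id = str(uuid.uuid4())
    job = {"correlation_id": correlation_id, "user": user, "item": item}
    shipment_jobs.put(json.dumps(job))
    return correlation_id


def consume_shipment_job() -> dict:
    """Shipment service: read the job and forward the same ID downstream."""
    job = json.loads(shipment_jobs.get_nowait())
    # Headers the shipment service would send on its HTTP request
    # to the warehouse service.
    job["warehouse_request_headers"] = {"X-Correlation-ID": job["correlation_id"]}
    return job
```

The important property is that the ID is never regenerated along the way: the value assigned by the order service is the one that reaches the warehouse service.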

Warehouse service does lookups

The warehouse service receives the request and attaches the correlation ID to its logs.

Overall workflow with correlation IDs

Even if the logs are cluttered with transactions, we can filter logs based on the correlation ID, allowing us to see a clear, chronological trace of the transaction through the system. This way, we can easily identify any issues and bottlenecks.

Logs filtered with correlation ID: abc2

Now, we can easily understand the whole flow of the transaction and identify the exact source of the problem.
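In practice this filtering is done in the log management system, but the idea can be shown with a small sketch over structured log entries. The entries in `SAMPLE_LOGS` are invented for illustration, as are the field names; only the ID `abc2` comes from the example above.

```python
from typing import Dict, Iterable, List

# Made-up sample entries; timestamps are ISO 8601 so they sort as strings.
SAMPLE_LOGS = [
    {"timestamp": "2023-06-26T10:00:03Z", "correlation_id": "abc2",
     "service": "shipment", "message": "warehouse request failed"},
    {"timestamp": "2023-06-26T10:00:00Z", "correlation_id": "abc1",
     "service": "order", "message": "received order for item1"},
    {"timestamp": "2023-06-26T10:00:01Z", "correlation_id": "abc2",
     "service": "order", "message": "received order for item1"},
    {"timestamp": "2023-06-26T10:00:02Z", "correlation_id": "abc2",
     "service": "warehouse", "message": "database unavailable"},
]


def trace(logs: Iterable[Dict[str, str]], correlation_id: str) -> List[Dict[str, str]]:
    """Return the entries of one transaction in chronological order."""
    matching = [entry for entry in logs if entry["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda entry: entry["timestamp"])


for entry in trace(SAMPLE_LOGS, "abc2"):
    print(f'{entry["timestamp"]} [{entry["correlation_id"]}] '
          f'{entry["service"]}: {entry["message"]}')
```

Read top to bottom, the filtered entries show the order being accepted, the warehouse lookup failing, and the shipment service reporting the failed request, all for the same transaction.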

Next Steps

If correlation IDs alone are not sufficient for your system observability needs, you can look into distributed tracing and its tools. Jaeger, for example, provides service dependency analysis, distributed transaction monitoring, and deep troubleshooting opportunities.

Conclusion

Managing and developing distributed systems may be a challenge, but with tools like correlation IDs, we can achieve true system observability. By using correlation IDs, we can unravel the journey of a transaction. In addition to other benefits of correlation IDs, debugging and troubleshooting become a more straightforward, manageable task.

If you liked this article, consider reading Insiders’ Email Product Monorepo Journey by Doğukan Aydoğdu, where he explains how we structure our Go projects and shares key learnings.
