Observability: The hidden stories in your data

Lee Priest
Engineers @ The LEGO Group
11 min read · Jan 23, 2023
  • Do you really know what your services are up to?
  • Are they keeping their best stories to themselves?

Observability can be key to uncovering any secrets your services may be harbouring, and ensuring your services are context-rich and easily observable can save you from many headaches. From unstructured versus structured logging, to unique identifiers, to third-party integrations and data journeys, let’s dive into the world of observability and how it can benefit your services.

What is observability?

Observability is one of those topics in the engineering world that gets a bad rap for being boring and not something people are too eager to discuss. But what do we mean when we talk about observability? Well, Wikipedia defines observability as:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

We can dig a little deeper and also look at software observability:

In distributed systems, observability is the ability to collect data about program execution, internal states of modules, and communication between components.

Basically, observability is a measure of how well you can see how your service is doing by looking at what it is sending out.

With these definitions in mind, we can look towards our own services and ask ourselves a series of questions. Some of these questions can be:

  • What is my API doing?
  • How is my service performing?
  • Is the goal of the service being achieved?

Looking at your service, you should be able to answer those questions confidently if it is properly observable.

Monitoring != Observability?


A common misconception surrounding observability is that if you have a bunch of cool-looking graphs with pretty colours and data points on them, your observability job is done. Sadly, this isn’t the case.

Monitoring your service like this is great and provides useful insights into how things are performing. What this approach doesn’t provide is the context around what’s actually going through and coming out of your service.

This is where we need to make a distinction that monitoring does not equal observability. Rather we should look at monitoring as being a part of observability. We should dive deeper than metrics and be able to explore relevant information at key milestones in any service journey.

Logging is your friend!

Logging is another area of engineering that gets a bad rap for not being the most exciting thing in the world. It is often seen as an afterthought and something that is more of a ‘necessary evil’ than something carefully considered.

Admittedly, you’re not likely to set the world alight with your logging strategy. However, that’s not to say that logging isn’t essential to fully understanding your service.

Observability isn’t possible without logging. There are a myriad of benefits to adopting a solid logging strategy. Some of these benefits include:

  • Gaining key insights into your service
  • Troubleshooting issues more easily
  • Seeing a full journey of data through an entire process

When implementing logging within your codebase there are a few ways that it can be done. It is important to consider how your logs will be consumed, either by an external service or by human readers. Things like log formatting, structure and the data and messages included in the logs will all determine how effective your logging strategy is.

Unstructured Logging

One quick and easy form of logging is unstructured logging. Logging in this way usually entails a large dump of strings, variables or anything else you may want to log. This method of logging can be appealing as it is quick and easy to set up. If speed of setup and a small time investment are what you’re after, then this method of logging may be for you.

Although this method of logging is nice and easy to set up, it is not without its drawbacks. First, let’s consider the actual logs themselves. If we are just dumping a massive file of strings then at some point we’re going to have to go through these and work out what’s actually going on.

The time you saved on the initial setup is then offset by how long it takes to decipher the wad of strings you’ve dumped. Ideally, to make your logging strategy nice and efficient, it should be easily readable by us humans.

If you are also looking at sending your logs through to a third-party service for things like visualisation on graphs and other fun things, you may find that this method of logging doesn’t always play nicely. You may have to get quite inventive to get the two working together.
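As a rough illustration (the user, basket and message here are entirely made up), an unstructured log call often ends up looking something like this:

```typescript
// A hypothetical unstructured log call: quick to write, but the output is a
// free-form string that you (or a third-party service) have to untangle later.
const userId = "12345"; // made-up example values
const basketTotal = 59.99;

console.log(
  "Checkout started for user " + userId + " with basket total " + basketTotal + " at " + new Date().toISOString()
);
// e.g. "Checkout started for user 12345 with basket total 59.99 at 2023-01-23T..."
```

It works, and it took seconds to write, but every question you want to ask of this log later means picking apart a one-off string.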

Structured Logging

On the opposite end of the logging spectrum is structured logging. The clue to this method is in the name. This approach to logging sets up an object with a defined structure that you then populate with relevant data.

Approaching your logging like this can provide many benefits. Having a clear and concise structure for your logs makes them more human-readable. In turn, this also eliminates the time sink of trying to decipher a massive dump of strings like in unstructured logs. The use of a clear and concise structure also ensures that the logs will play nicer with any third-party services that you may choose to use.

Of course, no approach is perfect, and there are some trade-offs to weigh up with structured logging. Having a structure to your logs means you will need to spend some time defining what structure your logs should follow. This should be carefully considered, as you don’t want to be filling up your logs with data that may not be helpful.

The initial setup of structured logging will likely take longer than that of unstructured logging. You will also need to invest time to ensure that only relevant data is logged. However, I personally think this time investment is worth it. If I’m trying to find the cause of an issue by going through logs, I want to be able to quickly home in on the relevant data. I don’t want to have to first decipher what’s in the logs and then search for anything relevant.
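To make the idea concrete, here is a minimal sketch of a structured log entry in TypeScript. The service name and fields are assumptions for illustration, not a prescribed schema:

```typescript
// A minimal sketch of a structured log entry for a hypothetical
// user-registration service. The shape is defined up front and populated with
// only the fields we have decided are relevant.
interface LogEntry {
  timestamp: string;
  service: string;
  message: string;
  data?: Record<string, unknown>;
}

const entry: LogEntry = {
  timestamp: new Date().toISOString(),
  service: "user-registration", // hypothetical service name
  message: "User registration call submitted",
  data: { region: "eu-west-1", attempt: 1 },
};

// Emitting one JSON document per line keeps the logs readable by humans and
// parseable by third-party services.
console.log(JSON.stringify(entry));
```

Because every entry shares the same shape, both a human scanning the output and a third-party tool indexing it know exactly where to find each piece of information.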

There are levels to this logging stuff

One thing that is particularly helpful when logging is having set levels that your logs can be categorised into. Within the Loyalty squad at The LEGO Group, we have three set levels of logs; these are:

  • Info — General information at a relevant place or milestone
  • Warn — Used for logging things like expected errors. (E.g. A user trying to log in to an account that doesn’t exist)
  • Error — Used for logging any error that occurs that can cause the service to fail

Having these levels helps us quickly and easily see what’s going on. We can keep track of how many actual errors we are seeing or if there’s an increase in those login attempts to accounts that don’t exist. Logging general information at key milestones also allows us to track journeys through services and see a full journey from start to finish.

All in all, combining structured logging with this levelled logging approach allows us to gain a good overall picture of our services and APIs.
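As an illustration of that combination, here is a small sketch of a levelled, structured logger. It shows the idea only; it is not the Loyalty squad’s actual implementation:

```typescript
// A sketch of a three-level structured logger (info/warn/error).
type LogLevel = "info" | "warn" | "error";

function emit(level: LogLevel, message: string, data?: Record<string, unknown>): void {
  console.log(
    JSON.stringify({
      level,
      timestamp: new Date().toISOString(),
      message,
      ...(data ? { data } : {}),
    })
  );
}

const logger = {
  info: (message: string, data?: Record<string, unknown>) => emit("info", message, data),
  warn: (message: string, data?: Record<string, unknown>) => emit("warn", message, data),
  error: (message: string, data?: Record<string, unknown>) => emit("error", message, data),
};

// Usage mirroring the three levels described above.
logger.info("User registration flow started", { journeyId: "abc-123" });
logger.warn("Login attempt for an account that does not exist", { journeyId: "abc-123" });
logger.error("Registration endpoint returned 500", { journeyId: "abc-123", statusCode: 500 });
```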

Ok, we get it, you like logging...but why should I?

If you’ve made it this far through the post, first of all, thank you; secondly, it must mean that you’ve got at least some interest in this observability stuff. Hopefully, you’re thinking that everything we’ve touched on so far sounds good and makes at least some sense. Let’s have a quick recap on some of the benefits you’ll see by employing a solid logging strategy:

  • Clarity on what your service is doing
  • More in-depth understanding
  • Keeping services running is more than looking at server metrics
  • Dashboard creation for visualisation
  • Categorisation and prioritisation

Third-Party Services


From reducing the time it takes to find and track errors, to tracing entire data journeys, visualising key metrics and logs, and giving engineers more clarity and context around exactly what is going on within their services, third-party services can provide a massive boost to observability.

Dashboards

The use of third-party services can help massively with the goal of increased observability. The ability to set up dashboards on a per-service level is a great way to help with visualisation. Visualising your service metrics is great for monitoring things like memory usage, latency and much more. Plus, who doesn’t like looking at loads of pretty graphs with lots of colours on!?

Dashboards aren’t just limited to displaying machine metrics and looking pretty, though. You can set up dashlets to show things like your most frequent info/warn/error logs, counts of any particularly critical logs, and even make business-relevant data easily accessible. These are just a few of the helpful features dashboards provide.

Say it with me: structured logs!

Our good friend structured logging gets to fully flex its muscles here too. Third-party services can often hook into your log structure and provide the functionality to search through logs and find information based on queries you write. This becomes an insanely powerful tool to have at your disposal when trying to debug an issue or even just find some information you need.

Alerts

Another handy feature of third-party services is that of alerts. Alerts can be set up to send out notifications as soon as a particular condition is hit. For example, if you have a log for when a 500 response is returned from a particular endpoint you would be able to set up an alert to notify your squad as soon as it happens. Alerts can be integrated into your favourite chat apps too, nice!

It is also possible to leverage data from your logs to include in the notifications sent. This can be a massive help in instances of production errors where a fast response is critical.
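What this looks like in practice varies from product to product, so treat the following as a purely hypothetical sketch of an alert definition; the query syntax, field names and channel are made up to illustrate the idea:

```typescript
// An illustrative alert definition: match error-level logs where an endpoint
// returned a 500, and pull useful fields from the matching log into the
// notification sent to the squad's chat channel.
const registration500Alert = {
  name: "registration-endpoint-500s",
  condition: 'level:"error" AND data.statusCode:500', // hypothetical query syntax
  notifyChannel: "#loyalty-alerts",                    // hypothetical chat channel
  includeFields: ["data.journeyId", "data.endpoint", "message", "timestamp"],
};

console.log(JSON.stringify(registration500Alert, null, 2));
```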

What should be logged?


As we are now all aboard the observability hype train, let's take a look at exactly what we should be logging. It is tempting to just throw everything, including the kitchen sink, into our structured logs. However, a more nuanced approach can be significantly more beneficial. There are a few things we should consider before we start running riot with our log calls.

What is the goal of the service?

At first thought, it may sound like a pretty obvious thing. But having a clear understanding of exactly what the goal of your service is can be massively helpful. Having this clear understanding will have a direct influence on your logging strategy.

Consider the milestones within your service. These are the key things the service is used for. Let’s use the example of an online store that handles user registrations. What key milestones might this have? These could be:

  • User registration flow started
  • User registration call submitted
  • User registration response received
  • User registration success/failure
  • User successfully registered

All of those milestones listed can help provide a clear picture of the journey. Logging at key milestones can also help when debugging as you will easily be able to see if a journey ends before it’s supposed to or if any request/response isn’t as it should be.
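Here is a rough sketch of what logging those milestones could look like, assuming a hypothetical registration endpoint and a simple stand-in for the structured, levelled logger from earlier:

```typescript
import { randomUUID } from "node:crypto";

// Minimal stand-in for the structured, levelled logger sketched earlier.
const log = (level: string, message: string, data?: Record<string, unknown>) =>
  console.log(JSON.stringify({ level, timestamp: new Date().toISOString(), message, ...data }));

// A hypothetical registration flow, logging at each key milestone.
async function registerUser(email: string): Promise<void> {
  const journeyId = randomUUID();

  log("info", "User registration flow started", { journeyId });

  log("info", "User registration call submitted", { journeyId });
  const response = await fetch("https://api.example.com/register", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email }),
  });
  log("info", "User registration response received", { journeyId, statusCode: response.status });

  if (response.ok) {
    log("info", "User successfully registered", { journeyId });
  } else {
    log("error", "User registration failed", { journeyId, statusCode: response.status });
  }
}
```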

What do you want to achieve with your logs?

This is another question that may sound obvious at first. However, there are many things that can be achieved by setting up a solid logging strategy. Some of these could be:

  • Improved error tracing
  • Improved understanding of your processes
  • Keeping tabs on data flow
  • Sharing insights with other teams or the wider business
  • All of the above

Be unique!

Unique identifiers can be one of your best friends when it comes to observability, and logging in particular. These things are powerful, potentially Superman-level powerful.

Adding unique identifiers into your logs and carrying them throughout all the logs in a particular journey allows for easy grouping of logs. You’re not limited to just having one unique identifier either, add as many as are relevant.

When you combine these unique identifiers with structured logging and third-party services, the world is yours to explore. Writing a query targeting a specific unique identifier with a particular value would then return logs showing an entire journey from start to finish. This eliminates a lot of work trying to find individual logs, as you can follow the journey through in much the same way you would through your code or infrastructure diagrams.
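A small sketch of the idea is below; the x-journey-id header name is an assumption used for illustration, not a standard:

```typescript
import { randomUUID } from "node:crypto";

// Reuse an identifier from the incoming request if one exists, otherwise mint one.
function getJourneyId(headers: Record<string, string | undefined>): string {
  return headers["x-journey-id"] ?? randomUUID();
}

// Returns a log function with the journeyId already bound, so every log line
// produced during this journey carries the same identifier.
function journeyLogger(journeyId: string) {
  return (level: string, message: string, data?: Record<string, unknown>) =>
    console.log(JSON.stringify({ level, journeyId, message, ...data }));
}

const log = journeyLogger(getJourneyId({}));
log("info", "User registration flow started");
log("info", "User registration response received", { statusCode: 201 });
```

In your third-party tool, a query along the lines of journeyId:"abc-123" (the exact syntax depends on the product) would then return that whole journey in order.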

What not to log


If you take one thing away from this post please let it be:

Do not add anything to your logs that can be used to identify anyone!

This type of data can span quite a range of field types. It might not always be as obvious as things like email addresses or phone numbers. Anything that could be related back to a user shouldn’t make its way into your logs. Take a user’s bank account number, for example: while it’s not immediately usable to identify someone, it is still classed as ‘relatable data’ and can be used maliciously.

Another thing to bear in mind is that your logs can end up in multiple locations. Let’s say your service is running on AWS and you have an integration set up with a third-party data visualisation/logging service. This means that your logs are accessible in both AWS and that third-party service. This increases the spread of your logs, so ensuring the data is free of PII and relatable data helps minimise the fallout if your logs are accessed by anyone with bad intentions.
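One common mitigation is to redact known sensitive fields before anything is handed to the logger. A rough sketch follows, with an illustrative (and deliberately incomplete) field list:

```typescript
// Strip identifying fields before they reach the logs. What counts as
// identifying or relatable data depends on your service and your obligations,
// so treat this list as an example rather than a complete ruleset.
const SENSITIVE_FIELDS = new Set(["email", "phoneNumber", "bankAccountNumber", "address"]);

function redact(data: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data)) {
    safe[key] = SENSITIVE_FIELDS.has(key) ? "[REDACTED]" : value;
  }
  return safe;
}

// The email never reaches the log output, but the non-identifying context does.
console.log(
  JSON.stringify(redact({ email: "someone@example.com", registrationStep: "submitted", statusCode: 201 }))
);
```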

Tl;dr

We have covered a lot throughout this post and visited some key areas of consideration when thinking about the observability of your services. I will leave you with some succinct bullet points to end on:

  • Observability is important!
  • Dashboards are both pretty and useful!
  • Don’t just rely on metrics!
  • Structured logging for the win!
  • Only log relevant information!
  • Keep track of your data’s journey!
  • Do not log any data that can be used to identify anyone!
