The simple change I made for

Observability From a Different Perspective

8 bytes were enough to provide new insights

Gal Ashuach

Published in

Outbrain Engineering

6 min read · 2 days ago

Traditional software development is fun. You work hard towards a tight deadline, a version is released, and you can put your feet up. Rest is never an option in the SaaS world, a world of always-on, complex distributed systems with multiple daily deployments. There is a constant need for observability: endless monitoring to ensure all systems are working properly. We all monitor the basics like latency, resource utilization, and error rate, but root cause analysis based on simple metrics like these isn’t always trivial. Some production issues may be hiding in the business logic.

“True observability comes from understanding how your system actually works… Simply gathering more data isn’t the magical solution we’ve been promised”. Aphinya Dechalert / Observability is an Illusion

In this blog post, I will share with you a story about a simple solution we came up with for adding a missing piece to the observability puzzle our system desperately needed.

A little about Outbrain

At Outbrain, we connect businesses with engaged audiences at an impressive scale: three hundred and fifteen billion monthly global impressions, handled by more than ten million requests every minute, with an average latency of a few hundred milliseconds. My team is responsible for optimizing the automated bidding of our marketers’ campaigns toward the goals they set. While most teams are oriented toward business goals like revenue and growth, our perspective is aligned with the clients, and most of our KPIs are related to customer satisfaction, like click-through rate, cost per action, and return on ad spend.

“One of our clients complains his dog came back home with rice grains in his beard,” said my team lead a few months back.

Well, not exactly in those words, but let’s go with that analogy.

I don’t want to scare you with too many terms and details from the ad tech industry (and disclose business logic), so for the rest of the story, you can assume I work for an exceptional doggy daycare. You can imagine my team is responsible for dog-treating robots. Dogs come into a room filled with our robots, and the robots take care of their needs: feed them, play with them, take them for a walk. Once their needs are met, the dogs move on and a new group of dogs comes in. You can think of it as a boutique service for premium dogs; no need to focus on the fact that we have hundreds of centers that treat about forty million dogs a minute, with an average latency of about sixty milliseconds per batch of dogs.

Back to our story

I was tasked with investigating why one of our dogs kept coming home with rice grains in his beard. We don’t usually serve rice at the daycare, so that might signal a problem. I tried our regular observability tools, but they were not helpful: I could confirm that the ‘grams of rice cooked per dog’ metric was stable, and I didn’t see any interesting events in the log, like a leaking bag of rice.

It is important to mention that our robots are designed to give the best treatment to these dogs, and while rice is not recommended as a long-term diet, in some cases feeding boiled rice to a dog makes sense: rice is known to be a great treatment for an upset stomach, and if we run out of dog food, rice is a great temporary fallback. However, if the security guard complains his jambalaya went missing for the third time this week, we might have a problem.

My KISS solution

I was looking for a way to get a clear and complete context for a dog’s behavior in each visit. The simplest way I could think of was to keep a per-dog object and record in it every relevant branch in the code (ifs, switches, usages of dog-specific configuration, etc.). Finding all the relevant places in the code and sketching the abstract code was super easy.

The next step was deciding upon implementation. My requirements for the new mechanism were simple. First, it needs to be efficient as it would be used for every dog in every batch (a very high scale). Second, it needs to be simple so my colleagues can understand it and would want to use it to track their contextual insights, rice-related or not. Third, it has to be easily extendable so they won’t give up during adoption (of code, not dog).

Time to implement

To make it efficient, I started with the most efficient building block: a primitive field. I decided 64 bits per dog should be enough. Next, I bound each bit to a significant binary decision (similar to the implementation of EnumSet): isSick? waterBowlProvided? wentForAWalk? requiredFoodMissing? servedWithRegularFood? Some decisions might be more intuitive to represent as a multi-valued field, for example two bits storing typeOfFoodProvided: regular (00), rice (01), prescriptionDiet (10), instead of three one-bit flags (wasRegularFoodProvided, wasRiceProvided, wasPrescriptionDietFoodProvided). Still, I decided that having a simple and user-friendly interface is more important at this point than saving a few bits here and there.
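A minimal sketch of the one-bit-per-decision idea, assuming a Java enum whose ordinal picks the bit. The class name `FlowIdSketch`, the `Decision` constants, and the class-load-time size check are all hypothetical illustrations, not the actual Outbrain code.

```java
/** Hypothetical sketch: one bit per binary decision, packed into a single long. */
final class FlowIdSketch {

    /** Illustrative decisions; each constant owns the bit at its ordinal. */
    enum Decision {
        IS_SICK,
        WATER_BOWL_PROVIDED,
        WENT_FOR_A_WALK,
        REQUIRED_FOOD_MISSING,
        SERVED_WITH_REGULAR_FOOD,
        SERVED_WITH_RICE;

        // Assumed guard: fail at class-load time if the enum outgrows one long.
        static {
            if (values().length > Long.SIZE) {
                throw new IllegalStateException("More than 64 decisions do not fit in one long");
            }
        }

        long mask() {
            return 1L << ordinal();
        }
    }

    // The per-dog state; 0 means no decision has been marked yet.
    private long bits;

    void mark(Decision decision) {
        bits |= decision.mask();
    }

    boolean isMarked(Decision decision) {
        return (bits & decision.mask()) != 0;
    }

    long flowId() {
        return bits;
    }

    public static void main(String[] args) {
        FlowIdSketch visit = new FlowIdSketch();
        visit.mark(Decision.WATER_BOWL_PROVIDED); // bit 1
        visit.mark(Decision.SERVED_WITH_RICE);    // bit 5
        System.out.println(visit.flowId());       // prints 34 (2 + 32)
    }
}
```

Marking a decision is a single bitwise OR on a primitive, so the per-dog cost stays constant no matter how many decisions are tracked.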

The last part was logging this number, which I called “flow ID”. I decided to add a flow ID column to our visitors' log. We have a big notebook that lists every visit to our daycare, and adding a single new column was enough to support my new mechanism.

Now, whenever a customer complains, we can recreate every crucial part of his dog’s visit using an endpoint I added that converts a flow ID back to a list of conditions and their boolean values.
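The conversion from a flow ID back to named conditions could look roughly like this. It is a sketch under assumptions: the `Decision` enum and its bit order are invented for illustration, and the real endpoint wiring is omitted.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical decoder: turns a flow ID back into decision -> boolean pairs. */
final class FlowIdDecoder {

    /** Illustrative decisions; the bit position is the enum ordinal. */
    enum Decision {
        IS_SICK,
        WATER_BOWL_PROVIDED,
        WENT_FOR_A_WALK,
        REQUIRED_FOOD_MISSING,
        SERVED_WITH_REGULAR_FOOD,
        SERVED_WITH_RICE
    }

    static Map<Decision, Boolean> decode(long flowId) {
        // EnumMap iterates in ordinal order, so the output order is stable.
        Map<Decision, Boolean> result = new EnumMap<>(Decision.class);
        for (Decision decision : Decision.values()) {
            boolean set = (flowId & (1L << decision.ordinal())) != 0;
            result.put(decision, set);
        }
        return result;
    }

    public static void main(String[] args) {
        // 34 has bits 1 and 5 set: water bowl provided, served with rice.
        decode(34L).forEach((decision, value) ->
                System.out.println(decision + " = " + value));
    }
}
```

Decoding only works against the same enum definition that produced the ID, which is why changing the decision list invalidates older flow IDs.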

Pros

Efficiency: my new implementation is highly efficient. I allocate a single long and set its bits, at most 64 times per dog.

Simplicity: each decision is listed in an enum (with static enforcement of the size limit), and marking a decision is as simple as calling a method with the right enum value.

Extendability: assuming there are fewer than 64 critical decisions in the flow, adding a new one is as simple as adding a value to the enum. There is no need to add columns or change widgets and queries.

Error detection: I can define for each flow ID whether that combination of decisions indicates an error. Not sick + food in stock + rice grains in beard? Error. Sick + served with regular food? Also an error.

Integration with other tools: just before the dogs leave the daycare, we have a single (yet significant) number that can be used with our contextless tools: a counter for each flow ID, logging of erroneous flows, etc.
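The error-detection and per-flow counter ideas above can be sketched as follows. The bit positions, the class name `FlowIdChecks`, and the in-memory counter map are assumptions made for illustration; a real setup would presumably report to its existing metrics system instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical checks on a flow ID: error rules plus a counter per flow ID. */
final class FlowIdChecks {

    // Assumed bit assignments, for illustration only.
    static final long IS_SICK = 1L << 0;
    static final long REQUIRED_FOOD_MISSING = 1L << 3;
    static final long SERVED_WITH_REGULAR_FOOD = 1L << 4;
    static final long SERVED_WITH_RICE = 1L << 5;

    /** True when the combination of decisions matches one of the error rules. */
    static boolean isErroneous(long flowId) {
        boolean sick = (flowId & IS_SICK) != 0;
        boolean foodMissing = (flowId & REQUIRED_FOOD_MISSING) != 0;
        boolean regularFood = (flowId & SERVED_WITH_REGULAR_FOOD) != 0;
        boolean rice = (flowId & SERVED_WITH_RICE) != 0;

        boolean riceWithoutReason = rice && !sick && !foodMissing;
        boolean regularFoodWhileSick = sick && regularFood;
        return riceWithoutReason || regularFoodWhileSick;
    }

    // One counter per distinct flow ID; LongAdder keeps concurrent increments cheap.
    private static final Map<Long, LongAdder> COUNTS = new ConcurrentHashMap<>();

    static void record(long flowId) {
        COUNTS.computeIfAbsent(flowId, id -> new LongAdder()).increment();
    }

    static long count(long flowId) {
        LongAdder adder = COUNTS.get(flowId);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        long flowId = SERVED_WITH_RICE; // healthy dog, food in stock, rice served
        record(flowId);
        System.out.println(isErroneous(flowId)); // prints true
    }
}
```

Because the whole context is one number, the same value can serve as a metric tag, a log field, and the input to the error rules.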

Cons

No backward compatibility: a price I pay for keeping it simple. Whenever the decision list changes, older flow IDs are no longer relevant. For my use case, backward compatibility isn’t an issue, as we usually address customers’ complaints in a matter of hours: either we’ll have new flow IDs shortly, or the issue is solved. If you do need to support multiple versions, you might want to use a more advanced tool to store the decisions, like a Protobuf message.

64 decisions limitation: the solution I described is limited to 64 bits (like RegularEnumSet), but if needed it can be easily extended with an additional long or an array of longs (like JumboEnumSet). While I don’t see a need for anything remotely close to 64 bits any time soon, this issue can be solved if we do get there.
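One way to go past 64 decisions is sketched below, using `java.util.BitSet` as a stand-in for a hand-rolled array of longs. This is a swapped-in technique chosen for brevity, not the approach the text describes, and the class name `WideFlowId` is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical wide flow ID: decisions beyond bit 63 spill into extra longs. */
final class WideFlowId {

    private final BitSet bits;

    WideFlowId() {
        this(new BitSet());
    }

    private WideFlowId(BitSet bits) {
        this.bits = bits;
    }

    void mark(int decisionIndex) {
        bits.set(decisionIndex);
    }

    boolean isMarked(int decisionIndex) {
        return bits.get(decisionIndex);
    }

    /** Little-endian words: word 0 holds decisions 0..63, word 1 holds 64..127, and so on. */
    long[] toWords() {
        return bits.toLongArray();
    }

    static WideFlowId fromWords(long[] words) {
        return new WideFlowId(BitSet.valueOf(words));
    }

    public static void main(String[] args) {
        WideFlowId id = new WideFlowId();
        id.mark(3);  // lands in word 0
        id.mark(70); // lands in word 1, bit 6
        System.out.println(Arrays.toString(id.toWords())); // prints [8, 64]
    }
}
```

The trade-off is that the flow ID stops being a single number, so logging it takes more than one column or an encoded string.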

Context is king

Outbrain has many state-of-the-art observability tools: centralized log monitoring with Elasticsearch, advanced dashboards in Grafana, and distributed tracing with Jaeger. The problem with most of those tools is that they operate at the level of a single line of code. Logging is great for errors like ‘all the dogs ran away’, and metrics are great for calculating ‘CPU per dog’ or ‘average number of dogs in a batch’, but neither can be directly used to summarize a complex flow. Tracing, on the other hand, does have context: it stores the entire flow of a request as it traverses Outbrain’s microservices. However, tracing focuses on the request (a batch of dogs) and not on the dogs themselves. Request-level metrics (batch latency, rice consumption per dog, etc.) are vital for understanding the quality of your serving, ehhh, services in the dog case. Still, I needed a more precise solution to track the behavior of a single dog over time.

The bottom line

I love this story because of the simplicity of the solution. I tried to find the minimal solution that would meet the requirements. Many programmers, including myself sometimes, tend to overengineer things and pick complex solutions without actual need or reason, as if to show off their skills or knowledge. Sometimes, less is more, and in this case, I am glad I found a simple solution that does the job.

“Everything should be made as simple as possible, but not simpler.” — Albert Einstein