Background

About a year ago, my team took over responsibility for an event-driven system written in Go. One of our first steps was to understand what the application does and how it does it. The app already had stdout logging, Rollbar notifications, Datadog statsd metrics, and pager alerts. The previous team had built the app from the ground up and understood its common failure modes. Without that kind of knowledge, we needed much more detail to understand what the application was doing. Initially, diagnosing issues was all but impossible.

Logging

Our first step was to add more logging coverage to make processing more observable. We added comprehensive info-level logging describing each step of processing, and more detail to each entry, but this required a lot of repetition and the formatting of the logs was inconsistent.

Previous projects I’d worked on had extremely detailed logging: in some cases, the entire system state could be reproduced just from the logs. We didn’t take logging to that extreme, but I took the stance that more logging is always better.

We were able to get all of our logging into Stackdriver, which made a huge difference. Being able to tail all of your aggregated logs is a luxury I don’t want to give up.

Error Handling

In Go, every error should be handled explicitly. Go doesn’t have exceptions, so at least some error handling code will always be intermixed with the business logic. While many errors in the project were being handled, a lot of minor errors and some edge cases weren’t. The errors also didn’t carry much detail: seeing a ‘not found’ error doesn’t help much without knowing what wasn’t found. Part of the issue is that Go’s default error type is just a wrapper for a string.

I created a structured error type which included a generic map of data, so that at the top level you could diagnose what went wrong. This added detail made it easier to determine which errors were fatal, and which indicated that processing should be retried. This distinction is particularly important in an event-driven system, as the asynchronous processing model makes it much harder to provide feedback in error cases. Without these detailed errors, event processing could be started and seemingly never finish.

This implementation had the same problem with repetition as our logging, but it was still an improvement.
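As a rough illustration (not the exact implementation), a structured error type along these lines carries a cause, a map of diagnostic data, and a flag indicating whether processing should be retried:

package errs

import "fmt"

// Error is a structured error carrying arbitrary diagnostic data.
// The field names and the Retryable flag here are illustrative only.
type Error struct {
	Message   string
	Cause     error
	Data      map[string]interface{} // arbitrary diagnostic details
	Retryable bool                   // whether processing should be retried
}

func (e *Error) Error() string {
	if e.Cause != nil {
		return fmt.Sprintf("%s: %v (data: %v)", e.Message, e.Cause, e.Data)
	}
	return fmt.Sprintf("%s (data: %v)", e.Message, e.Data)
}

// Unwrap lets errors.Is and errors.As inspect the underlying cause.
func (e *Error) Unwrap() error { return e.Cause }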

Context

We were having some problems with application shutdown and realised that Go’s context package could help. It’s also fairly core to the GCP Pub/Sub libraries which we were using heavily. Introducing it required a lot of changes to thread context through all of our request handlers and message processors. It was hard to see the benefit at first, but it meant we had much more control over our services: now when we had a new deployment or scaling event, we could be sure that all services were running the new code and all scaled-down servers would terminate quickly.
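As an illustrative sketch only (the project ID and subscription name are placeholders, not ours), context-driven shutdown with the GCP Pub/Sub client looks roughly like this: a shared context is cancelled on SIGTERM, Receive returns, and the worker exits cleanly.

package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"

	"cloud.google.com/go/pubsub"
)

func main() {
	// ctx is cancelled when the process receives SIGTERM/SIGINT,
	// e.g. during a deployment or scale-down event.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription") // placeholder subscription
	// Receive blocks until ctx is cancelled, then returns once in-flight
	// messages have been handled, so shutdown is quick and predictable.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		process(ctx, m) // all processing takes the same ctx
		m.Ack()
	})
	if err != nil {
		log.Printf("receive stopped: %v", err)
	}
}

func process(ctx context.Context, m *pubsub.Message) { /* ... */ }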

Combined with improvements to failure/retry handling, a more reliable shutdown meant that we had much more dependable and predictable processing.

Structured Logging & Log Levels

At a certain point, the sheer volume of log entries in production became too high; in many cases, the majority of entries were just noise. We also wanted a better way to search for particular logs, for example by user ID or by server.

Having multiple log levels let us keep much more detail in development while restricting prod logs to what’s generally interesting. Structured logging makes logs simpler to read and write by giving the details a consistent format, while still allowing free-form text for the log message.

The logrus package provides both these features and was simple to configure.
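A minimal logrus setup along these lines gives both pieces: JSON-formatted entries with structured fields, plus a level threshold. The field names here are just examples.

package main

import log "github.com/sirupsen/logrus"

func main() {
	log.SetFormatter(&log.JSONFormatter{}) // machine-readable entries in production
	log.SetLevel(log.InfoLevel)            // debug entries are dropped in prod

	log.WithFields(log.Fields{
		"user_id": 42,
		"server":  "worker-3",
	}).Info("message processed")

	log.Debug("verbose detail only visible in development")
}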

Detailed Context

The details being manually added to each log line and each error were virtually identical within any particular worker. I found myself describing those details as the ‘context’ of that processing. I then realised that the Go context could be used to hold all those details. I wrote a simple function to add details to the context:

func WithData(ctx context.Context, data ...interface{}) context.Context

The data part is interpreted as a list of key/value pairs, which requires less boilerplate than the equivalent map[string]interface{} literal value.

Because context is passed down through the call stack, the details added to it also follow the call stack, with each function call potentially adding more specific details. As this is unstructured metadata, it’s not used for any business logic. This ensures that we have the complete detailed context available at any point in the code.
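A possible implementation of WithData (the key type and the Data accessor are assumptions, not the exact code) stores the accumulated pairs in a map under a private context key:

package ctxdata

import "context"

type ctxKey struct{}

// WithData returns a new context with the given key/value pairs merged into
// any data already present.
func WithData(ctx context.Context, data ...interface{}) context.Context {
	merged := map[string]interface{}{}
	for k, v := range Data(ctx) {
		merged[k] = v
	}
	for i := 0; i+1 < len(data); i += 2 {
		if key, ok := data[i].(string); ok {
			merged[key] = data[i+1]
		}
	}
	return context.WithValue(ctx, ctxKey{}, merged)
}

// Data returns the accumulated details, or an empty map if none were added.
func Data(ctx context.Context) map[string]interface{} {
	if m, ok := ctx.Value(ctxKey{}).(map[string]interface{}); ok {
		return m
	}
	return map[string]interface{}{}
}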

Detailed Errors and Logging

The next step was to combine the previous pieces into a common abstraction, with a view to reducing code repetition. I added new functions for creating errors and for logging, all with essentially the same interface, which could use the detailed context created with the previously described WithData function.

Error:

func New(ctx context.Context, message string, cause error) error

Logging:

func Info(ctx context.Context, message string)

func Error(ctx context.Context, message string, err error)

Rollbar error notifications:

func NotifyError(ctx context.Context, message string, err error)

This massively reduced the amount of boilerplate required for creating detailed errors and logging. For example, a single call to WithData with the message_id when a message starts processing means that all errors and logs relevant to that message include the message_id.
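Roughly, the pieces fit together like this. The package paths and names here are hypothetical, and I’m assuming Info reads the accumulated data back out of the context and logs it as logrus fields:

package logging

import (
	"context"

	"cloud.google.com/go/pubsub"
	log "github.com/sirupsen/logrus"

	"example.com/app/ctxdata" // the WithData/Data helpers sketched earlier
)

// Info logs a message together with every detail stored in the context.
func Info(ctx context.Context, message string) {
	log.WithFields(log.Fields(ctxdata.Data(ctx))).Info(message)
}

// HandleMessage shows the single WithData call when processing starts.
func HandleMessage(ctx context.Context, m *pubsub.Message) {
	ctx = ctxdata.WithData(ctx, "message_id", m.ID)
	Info(ctx, "processing started") // this and all later entries carry message_id
	// ... actual work happens here, passing ctx down the call stack ...
	Info(ctx, "processing finished")
}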

This has also encouraged us to use context in more places, which will help us to complete the migration to using it throughout the application.

This means that when an error occurs and is passed back up the call stack to the error handling, it includes all of the relevant metadata for how and why the error happened, because it carries the context.Context from the location where the error was created.

SLIs

Because this is an event-driven system with asynchronous processing, we found it hard to see how the performance of different parts of the system affected our end users. By adding Service Level Indicators (SLIs) for a couple of high-level features, we were able to reveal the end-to-end latency of different kinds of processing.

I added a small amount of timing metadata to each message so that it was threaded through all stages of processing. In most cases this metadata was just the time at which the event was first received. Then, for each of the high-level features, at the point where processing was considered finished, we recorded the overall processing time to Datadog. We also added other kinds of metadata so we could categorise the different kinds of usage and how they affect performance; for example, the number of users a customer has affects some features more than others.
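Recording the SLI might look something like this sketch, assuming the datadog-go statsd client and an illustrative metric name:

package slis

import (
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

// RecordLatency reports how long a message took from first receipt to
// completion, tagged by feature so different kinds of processing can be
// compared on a dashboard.
func RecordLatency(client *statsd.Client, feature string, receivedAt time.Time) {
	elapsed := time.Since(receivedAt)
	_ = client.Timing("processing.end_to_end", elapsed, []string{"feature:" + feature}, 1)
}

The timestamp travels with the message as metadata, so whichever worker finishes the feature can call this with the original receipt time.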

We can now visualise the direct user impact of changes to the system, and work towards defining Service Level Objectives as well.

Observability

Now when we look at any log statement, error, nested error or Rollbar notification, we can see all the relevant details. We can also create very high level, customer-focused Datadog dashboards to visualise the system.

By increasing our use of metrics and alerts, we have made the application much more observable. We can now track down issues and trace individual messages as they are processed through the system. Issues that previously took days to diagnose can now be spotted quickly and clearly. Collecting data to support proposed new features and architectural changes is straightforward and quick.

We’ve taken an iterative approach to improving the observability of the system; we’ve taken big steps forward, but we’re far from finished.

Thanks to Steven McPhillips and Adel Smee

Written by Casper Szymiczek-Graley, Zendesk Engineering
