Distributed tracing in Serverless with X-Ray, Lambda, SQS and Golang

Filip Lubniewski
4 min read · Apr 4, 2020

Introduction

Recently I needed to introduce AWS X-Ray into a project at work to improve its observability. Following a workflow through CloudWatch logs to find out which part is faulty is neither convenient nor quick, so X-Ray should fit here perfectly. How do we do it? Let’s find out.

Essentially, the workflow was a chain of Lambda functions communicating through SQS queues. Each function processes its part and passes the result to the next SQS queue, where another function picks it up. This hand-off repeats a few more times, so the whole workflow spans several functions and queues.

This approach is quite convenient because returning an error from a function invoked by SQS can cause the message to be retried. We based our logic heavily on AWS infrastructure, and that’s a good thing: in the cloud, infrastructure is the new king, and we don’t need to implement retries ourselves.

Problem statement

Moving on to X-Ray: to use it with Lambda we just need to enable tracing for each function, and SQS has supported X-Ray since August 2019. I thought this setup would work out of the box. Unfortunately, it didn’t. But first things first: here is a short glossary and a few important facts that I should mention before jumping into the solution.

X-Ray glossary:

Trace ID: a unique ID that is shared across the components of an application. It lets X-Ray work out how these components interact with each other and group their segments into a single trace.

Segment: a traceable component of the application, for example a Lambda function or an SQS queue.

These and other X-Ray concepts are explained here in more detail.

  1. A Lambda function with tracing enabled receives an X-Ray trace ID. In Go it’s passed to the handler as one of the context.Context values, and you can also find it in the default Lambda logs in CloudWatch. To see the full workflow on the X-Ray diagram across multiple functions and queues, we need to propagate exactly the same trace ID to every downstream component, and every segment needs to carry it.
  2. X-Ray segments are created on behalf of the Lambda function, so they can’t be accessed or modified from the function code. We can only create subsegments inside the already created parent segment. It was discussed here.
  3. Even though SQS supports X-Ray, the segment created on behalf of a Lambda function invoked by SQS does not reuse the trace ID passed from the upstream function. It gets a completely different trace ID, which in our case means that we won’t see a nice diagram in the X-Ray console with the segments following one another. Yet.
  4. Since SQS supports X-Ray, each message carries the trace ID from the sending function as one of its attributes.
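
Fact 1 can be sketched in plain Go. This is a minimal sketch, not the runtime's actual code: it assumes the Go Lambda runtime stores the raw trace header in the handler context under the string key "x-amzn-trace-id" and mirrors it in the _X_AMZN_TRACE_ID environment variable. Both names are assumptions to verify against your runtime version, and traceHeader and simulatedInvocationContext are hypothetical helpers introduced only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"os"
)

// lambdaTraceHeaderKey is the context key under which the Go Lambda runtime
// is ASSUMED to store the raw trace header. Verify it for your runtime.
const lambdaTraceHeaderKey = "x-amzn-trace-id"

// traceHeader returns the raw X-Ray trace header for the current invocation.
// It checks the handler context first and falls back to the
// _X_AMZN_TRACE_ID environment variable (also an assumption).
func traceHeader(ctx context.Context) (string, bool) {
	if v, ok := ctx.Value(lambdaTraceHeaderKey).(string); ok && v != "" {
		return v, true
	}
	if v := os.Getenv("_X_AMZN_TRACE_ID"); v != "" {
		return v, true
	}
	return "", false
}

// simulatedInvocationContext is a hypothetical helper for local runs: it
// builds a context that looks like the one a handler would receive.
func simulatedInvocationContext(header string) context.Context {
	return context.WithValue(context.Background(), lambdaTraceHeaderKey, header)
}

func main() {
	ctx := simulatedInvocationContext(
		"Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1")
	if h, ok := traceHeader(ctx); ok {
		fmt.Println("trace header:", h)
	}
}
```

Inside a real handler you would pass the ctx you receive from Lambda instead of the simulated one.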

What is the solution?

Based on the above facts I came up with the following solution:

  1. The first function in the chain should have tracing in active mode so that sampling takes place. The downstream functions don’t need it, so they should stay in pass-through mode.
  2. Every downstream function should explicitly create a new segment in its handler using the AWS X-Ray SDK.
  3. We can leverage the trace ID that is passed as one of the SQS message attributes. With it, we can set on our newly created segment the trace ID that travelled through all of the previous components. Using the same trace ID for every segment connects them with arrows on the X-Ray diagram, including the first function, to which we owe the sampling decision.
  4. We also need to set two more fields on our segment: its parent ID, because it originates from an already instrumented function, and a boolean flag saying whether the trace is being sampled at all.

Here is an example of what your downstream function handler may look like:

Downstream function handler
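
The core of such a handler can be sketched without any dependencies. This is a simplified illustration under stated assumptions, not the SDK-based handler itself: sqsRecord, segment, parseTraceHeader and newDownstreamSegment are hypothetical stand-ins written for this sketch. In a real handler the record would come from the SQS event, the parsing would be done by header.FromString, and the segment would be created through the X-Ray SDK (presumably with something like xray.BeginSegment). The sketch also assumes the trace header is stored in the record's AWSTraceHeader system attribute.

```go
package main

import (
	"fmt"
	"strings"
)

// sqsRecord is a stand-in for one SQS record; only attributes are modeled.
type sqsRecord struct {
	Attributes map[string]string
}

// traceHeader holds the three values the handler needs from the header.
type traceHeader struct {
	TraceID  string
	ParentID string
	Sampled  bool
}

// segment is a stand-in for an X-Ray segment created by the handler.
type segment struct {
	Name     string
	TraceID  string
	ParentID string
	Sampled  bool
}

// parseTraceHeader is a simplified stand-in for header.FromString. It reads
// the Root, Parent and Sampled fields and ignores everything else.
func parseTraceHeader(s string) traceHeader {
	var h traceHeader
	s = strings.TrimSpace(s)
	s = strings.TrimSpace(strings.TrimPrefix(s, "X-Amzn-Trace-Id:"))
	for _, part := range strings.Split(s, ";") {
		kv := strings.SplitN(strings.TrimSpace(part), "=", 2)
		if len(kv) != 2 {
			continue // skip malformed or empty parts
		}
		switch kv[0] {
		case "Root":
			h.TraceID = kv[1]
		case "Parent":
			h.ParentID = kv[1]
		case "Sampled":
			h.Sampled = kv[1] == "1"
		}
	}
	return h
}

// newDownstreamSegment copies the upstream trace ID, parent ID and sampling
// decision from the record onto a new custom segment (steps 2 to 4).
func newDownstreamSegment(name string, rec sqsRecord) segment {
	h := parseTraceHeader(rec.Attributes["AWSTraceHeader"])
	return segment{Name: name, TraceID: h.TraceID, ParentID: h.ParentID, Sampled: h.Sampled}
}

func main() {
	// Only one record is used here for brevity.
	rec := sqsRecord{Attributes: map[string]string{
		"AWSTraceHeader": "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1",
	}}
	seg := newDownstreamSegment("downstream-function", rec)
	fmt.Printf("%+v\n", seg)
}
```

With the real SDK, remember to close the segment when the handler returns so that it is sent to X-Ray.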

To clarify the snippet above: for brevity, we access only the first record of the SQS event. You may have to process multiple records, and each of them could have a different trace ID depending on where it comes from. Records that come from the same Lambda execution will have exactly the same trace ID.
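
One possible way to handle a batch, shown here only as a sketch, is to bucket the records by their raw trace header and then create one custom segment per bucket. The sqsRecord type and the groupByTraceHeader helper are hypothetical and written for this illustration, and the AWSTraceHeader attribute name is an assumption about where the header lives in the record.

```go
package main

import (
	"fmt"
	"sort"
)

// sqsRecord is a stand-in for one SQS record in the incoming batch.
type sqsRecord struct {
	MessageID  string
	Attributes map[string]string
}

// groupByTraceHeader buckets records by their raw trace header value, so
// records that share a header can be traced under the same segment.
// Records without the attribute end up under the empty key.
func groupByTraceHeader(records []sqsRecord) map[string][]sqsRecord {
	groups := make(map[string][]sqsRecord)
	for _, r := range records {
		key := r.Attributes["AWSTraceHeader"]
		groups[key] = append(groups[key], r)
	}
	return groups
}

func main() {
	records := []sqsRecord{
		{MessageID: "m1", Attributes: map[string]string{"AWSTraceHeader": "Root=1-aaa;Parent=p1;Sampled=1"}},
		{MessageID: "m2", Attributes: map[string]string{"AWSTraceHeader": "Root=1-bbb;Parent=p2;Sampled=1"}},
		{MessageID: "m3", Attributes: map[string]string{"AWSTraceHeader": "Root=1-aaa;Parent=p1;Sampled=1"}},
	}
	groups := groupByTraceHeader(records)

	// Sort the keys so the printed order is stable across runs.
	headers := make([]string, 0, len(groups))
	for h := range groups {
		headers = append(headers, h)
	}
	sort.Strings(headers)
	for _, h := range headers {
		fmt.Printf("%s -> %d record(s)\n", h, len(groups[h]))
	}
}
```

Grouping by the whole header rather than by trace ID alone keeps the parent ID unambiguous within each bucket, at the cost of possibly creating more segments than strictly necessary.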

header.FromString is a function from the header package of the AWS X-Ray SDK for Go. The trace header is passed in an SQS message attribute and looks like the following string:

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1

header.FromString simply parses it into an easily accessible structure with TraceID, ParentID and SamplingDecision fields.

How it looks in X-Ray

To wrap up, here is how it finally looks in the X-Ray console. The green arrow highlights the segment that was created on behalf of Lambda, which is our first function in active mode. The yellow arrows point to our custom segments, which are the downstream functions in pass-through mode. They don’t have the Lambda logo, as the first segment does, because they’re completely custom segments.

X-Ray console

Let me know if you know any other, simpler way of doing this. If not, there is definitely room for improvement here for the AWS team.

Thanks for reading, see you next time!

If you’re interested in learning more about serverless and Go, you should definitely check out my previous article: How to leverage AWS Lambda timeouts with Go context cancellation.

Thanks for reviewing this article go to my great colleagues Tomasz Czubocha and Marcin Sodkiewicz!

Filip Lubniewski

Software Engineer @ OLX Group — Creating Serverless applications with Go on AWS