When We Say “Full Observability”, We Really Mean “Full Tracing”

Microservices are one of the hottest topics in the software industry these days. Organizations of all sizes are considering microservices when designing new applications, and many even want to migrate their legacy architectures to microservice-based ones.

Figure 01 — https://martinfowler.com/articles/microservices.html

As applications shift to microservices, they become more distributed. Monitoring and troubleshooting are still first-class requirements, but the way we monitor has to change. Monitoring each application or service individually is no longer enough, because many services in a business flow interact with each other synchronously or asynchronously. An error or delay in one service might be caused by any of its upstream or downstream services. Therefore, all the services in a distributed flow should be monitored together, from end to end.

I won’t talk about the advantages of microservices or how microservices should be monitored in a distributed environment, as there are already tons of posts on the net. What I want to talk about is Thundra and its new “full tracing” capability, which is a requirement of our motto, “Full Serverless Observability”. We already had very advanced local tracing capabilities, and we were aware of how important “distributed tracing” is to this mission. Therefore, we added “distributed tracing” support to our tracing capabilities, and today I am proud to announce Thundra’s brand new and unique “Full Tracing” feature.

To do this, we built “Otto”, our own distributed tracing engine compatible with the OpenTracing API, to provide a general infrastructure for distributed tracing. Then we integrated it with many AWS services, such as:

  • Lambda
  • SQS
  • SNS
  • Kinesis
  • Firehose
  • S3
  • DynamoDB
  • API Gateway

Thundra’s distributed tracing has some unique capabilities that, in the serverless era, Thundra is the first and only tool to support:

  • “Multiple upstream transaction” support: There might be multiple upstream invocations when an invocation is triggered with a batch of messages. Let’s say that you have a Kinesis stream and you are writing records to it from multiple invocations. Kinesis then triggers another Lambda function invocation with a batch of messages, and that batch might contain records from different upstream invocations. In this case, for fully distributed tracing, you should be able to link the downstream invocation with each of the upstream invocations that pushed a record to the stream. We are proud to say that we are the first and only tool that can handle this deep challenge.
  • “Business transaction” support: Sometimes invocations are not related to each other physically (directly or indirectly) but logically. For instance, consider a blog system with a flow that ends in a service writing a blog post to DynamoDB to be published. A day later, a moderator approves that blog post for publishing, which triggers another flow of invocations. These two flows are linked logically, as they are for the same blog post, so it would be better if the developer could see the entire flow in the same picture. As mentioned before, our own distributed tracing engine “Otto” provides a distributed tracing infrastructure that is independent of any 3rd party services, including AWS. So, on top of “Otto”, you can even link logically related transactions yourself in a customizable way. This feature will be covered in a separate blog post, as it deserves its own.
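
To make the “multiple upstream transaction” idea more concrete, here is a minimal sketch in Python. It is not how “Otto” is implemented; it only assumes that every record carries the trace and span identifiers of the invocation that wrote it. The `Record`, `Invocation` and `link_batch` names and the `trace_id`/`span_id` fields are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A stream record; trace_id and span_id are hypothetical propagated fields."""
    payload: str
    trace_id: str  # trace of the upstream invocation that wrote the record
    span_id: str   # span of that upstream invocation


@dataclass
class Invocation:
    """A downstream invocation with links to every upstream it follows from."""
    name: str
    upstream_links: list = field(default_factory=list)  # (trace_id, span_id) pairs


def link_batch(name, batch):
    """Build a downstream invocation linked to each distinct upstream in the batch."""
    invocation = Invocation(name)
    seen = set()
    for record in batch:
        link = (record.trace_id, record.span_id)
        if link not in seen:  # several records may come from the same upstream
            seen.add(link)
            invocation.upstream_links.append(link)
    return invocation


# Three records in one batch, written by two different upstream invocations.
batch = [
    Record("team-1", trace_id="trace-a", span_id="span-1"),
    Record("team-2", trace_id="trace-b", span_id="span-7"),
    Record("team-3", trace_id="trace-a", span_id="span-1"),
]
downstream = link_batch("team-notification-transformer", batch)
print(downstream.upstream_links)
# prints [('trace-a', 'span-1'), ('trace-b', 'span-7')]
```

The point of the sketch is that a single downstream invocation ends up with more than one parent link, which a strict one-parent trace tree cannot express.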

With the “Full Tracing” capability, Thundra users can monitor their Lambda functions from the high level (which services and resources a function interacts with) down to the deep level (even the value of a local variable at a specific line during execution).

I can hear you saying, “You are talking too much. Talk is cheap, show me the demo.” (just another variation of Linus Torvalds’s famous “Talk is cheap. Show me the code” :) ) So I am leaving the stage to Thundra.

Thundra in Action

In our demo scenario, we have a flow for saving team information. The entry point of the flow is the team-save service, a Lambda function triggered by AWS API Gateway through a POST request on the team path.

When we go to the “Traces” page, we can see all the individual traces with information such as:

  • The trigger of the origin (AWS API Gateway, AWS SQS, AWS SNS, …)
  • The origin (aka entry point) Lambda function which is the first executed Lambda function in the flow
  • The start time of the trace and end-to-end duration of the entire trace
  • All of the resources (AWS DynamoDB, AWS S3, Redis, …) that any Lambda function in the entire trace interacts with
  • The types of any errors thrown from any Lambda function in the entire trace
  • The duration breakdown of all the executed Lambda functions in the entire trace with respect to their duration in the overall transaction.
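
As a rough illustration of what such a duration breakdown means, the sketch below computes each function’s share of the summed function durations in a trace. It is only an assumption about how a breakdown could be calculated, not Thundra’s actual formula; the `duration_breakdown` helper, the `team-stats` function name and the millisecond values are made up for the example.

```python
def duration_breakdown(function_durations_ms):
    """Return each function's share (in percent) of the summed function durations."""
    total = sum(function_durations_ms.values())
    if total == 0:
        # Avoid dividing by zero when nothing was measured.
        return {name: 0.0 for name in function_durations_ms}
    return {
        name: round(100.0 * duration / total, 1)
        for name, duration in function_durations_ms.items()
    }


# Hypothetical durations for the Lambda functions of one trace.
durations = {
    "team-save": 120,
    "team-notification-transformer": 60,
    "team-stats": 20,  # made-up function name, for illustration only
}
print(duration_breakdown(durations))
# prints {'team-save': 60.0, 'team-notification-transformer': 30.0, 'team-stats': 10.0}
```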

Step by step, we will go through the whole trace from the highest level to the deepest level.

Figure 02 — Trace listing

We can also filter the traces by their origin triggers and entry Lambda functions. As we are only interested in the team-save Lambda function, let's filter by it.

Figure 03 — Trace filtering

There are some traces where DemoIllegalAccessException was thrown. Let's focus on one of them to understand the cause of the error.

When we click one of the erroneous traces, we can see the trace map as shown below.

Figure 04 — Trace map

As you can see, the failed Lambda function is highlighted with red lines. When we click on that Lambda function, we can see that there are multiple invocations of it, which are retries of the same trigger from S3. As you may know, event-type (asynchronous) invocations are automatically retried by AWS Lambda a few times, with some delay between attempts (typically a few minutes), until one succeeds.

Figure 05 — Invocation error and retries

In this case, the origin of the error was an S3 operation; the error was injected intentionally by the Thundra agent for this demo.

You can have a look at the following resources to learn how to use Thundra to inject chaos into your Lambda functions:

At the top right, you can see the 3 links to the retried invocations, highlighted with red dashed lines. By clicking them, you can see the details of each retried invocation; all of them have the same request ID, and all of them have failed.
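
Because retried invocations share the same request ID, retries can be recognized by grouping invocations on that ID. The following is a minimal sketch of that idea, not Thundra’s implementation; the dictionary keys (`request_id`, `started_at`, `erroneous`) and the sample values are hypothetical.

```python
from collections import defaultdict


def group_retries(invocations):
    """Group invocations by request ID and keep only the IDs seen more than once."""
    by_request = defaultdict(list)
    for invocation in invocations:
        by_request[invocation["request_id"]].append(invocation)
    return {
        request_id: sorted(group, key=lambda inv: inv["started_at"])
        for request_id, group in by_request.items()
        if len(group) > 1  # a single invocation means there was no retry
    }


# Hypothetical invocation records; started_at is seconds since the first attempt.
invocations = [
    {"request_id": "req-1", "started_at": 0, "erroneous": True},
    {"request_id": "req-2", "started_at": 5, "erroneous": False},
    {"request_id": "req-1", "started_at": 60, "erroneous": True},
    {"request_id": "req-1", "started_at": 180, "erroneous": True},
]
retries = group_retries(invocations)
for request_id, attempts in retries.items():
    print(request_id, len(attempts), all(a["erroneous"] for a in attempts))
# prints: req-1 3 True
```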

Just as you can go from a Lambda function node in the trace map to its invocation, you can also go from an invocation to the trace that owns it.

Figure 06 — Go to invocation detail

Let’s say that we have clicked the team-notification-transformer Lambda function node (highlighted with red lines at the bottom left) in the trace map.

From the “Invocation” link at the top right, we can go to the invocation of that Lambda function executed in the current trace.

Figure 07 — Go to traces from invocation

At the top, you can see the links to all the associated traces that this invocation belongs to. As I mentioned earlier in this post, Thundra’s distributed tracing engine “Otto” has “multiple upstream transaction” support. This means that when a batch of incoming messages comes from different traces, the invocation can be linked to each of the upstream invocations that the messages in the batch belong to. This is a unique feature of Thundra’s distributed tracing support. In our case, as shown in “Figure-05”, there are 4 Firehose records triggering the `team-notification-transformer` Lambda function in batches, and each of them comes from a different trace. With Thundra, you can trace back each of the incoming Firehose records individually.

What should you do to have this awesome feature?

We support distributed tracing feature for Java, Node.js, Python and Go runtimes for now. In order to have the trace map with your functions, you need to update your agent versions as follows:

  • For Java, the agent library version must be `2.2.0` or higher. The layer version needs to be `10` or higher.
  • For Node.js, the agent library version must be `2.3.0` or higher. The layer version needs to be `11` or higher.
  • For Python, the agent library version must be `2.3.0` or higher. The layer version needs to be `7` or higher.
  • For Go, the agent library version must be `2.1.0` or higher.
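
If you want to check these minimums in a script, a small helper along the following lines could do it. This is just a sketch: the version numbers come from the list above, but the `agent_ok` and `layer_ok` helpers and the runtime keys are hypothetical, and it assumes that agent versions are plain dotted numbers such as `2.3.0`.

```python
# Minimum versions taken from the list above.
MIN_AGENT_VERSION = {"java": "2.2.0", "node": "2.3.0", "python": "2.3.0", "go": "2.1.0"}
MIN_LAYER_VERSION = {"java": 10, "node": 11, "python": 7}  # no layer version is listed for Go


def parse_version(version):
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def agent_ok(runtime, agent_version):
    """True if the agent library version meets the minimum for the runtime."""
    return parse_version(agent_version) >= parse_version(MIN_AGENT_VERSION[runtime])


def layer_ok(runtime, layer_version):
    """True if the layer version meets the minimum for the runtime."""
    minimum = MIN_LAYER_VERSION.get(runtime)
    if minimum is None:
        raise ValueError(f"no minimum layer version is listed for {runtime}")
    return layer_version >= minimum


print(agent_ok("node", "2.10.0"))  # prints True (2.10.0 is newer than 2.3.0)
print(agent_ok("java", "2.1.9"))   # prints False
print(layer_ok("python", 7))       # prints True
```

Comparing the parsed tuples rather than the raw strings matters here, because as text `"2.10.0"` would sort before `"2.3.0"`.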

So, what’s next?

With the release of distributed tracing, Thundra becomes the first and only serverless monitoring solution that provides full tracing capability for your functions. But this is not enough for us. We will improve this feature by supporting more advanced filters and queries on your traces. You will be able to find traces which

  • interact with specified services, such as a DynamoDB table or an API Gateway endpoint
  • include any error or a specific error
  • have an end-to-end duration for the whole transaction that is longer or shorter than a given limit
  • have a specified invocation/trace tag with a given value (e.g., finding the traces associated with a particular user)
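
To give a feel for what these filters mean, here is a small in-memory sketch in Python. It is not Thundra’s query API, which is not described in this post; the `Trace` class, the `find_traces` function, its parameters, and the `user.id` tag are all hypothetical and only mirror the four kinds of filters listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """A simplified, hypothetical view of a trace for filtering purposes."""
    trace_id: str
    duration_ms: int
    resources: set = field(default_factory=set)
    errors: set = field(default_factory=set)
    tags: dict = field(default_factory=dict)


def find_traces(traces, resource=None, error=None, any_error=False,
                min_duration_ms=None, max_duration_ms=None, tag=None):
    """Return the traces that satisfy every filter that was given."""
    matches = []
    for trace in traces:
        if resource is not None and resource not in trace.resources:
            continue
        if error is not None and error not in trace.errors:
            continue
        if any_error and not trace.errors:
            continue
        if min_duration_ms is not None and trace.duration_ms < min_duration_ms:
            continue
        if max_duration_ms is not None and trace.duration_ms > max_duration_ms:
            continue
        if tag is not None and trace.tags.get(tag[0]) != tag[1]:
            continue
        matches.append(trace)
    return matches


traces = [
    Trace("t1", 850, {"DynamoDB", "S3"}, {"DemoIllegalAccessException"}, {"user.id": "42"}),
    Trace("t2", 120, {"API Gateway"}, set(), {"user.id": "42"}),
    Trace("t3", 2300, {"DynamoDB"}, set(), {"user.id": "7"}),
]
print([t.trace_id for t in find_traces(traces, tag=("user.id", "42"))])
# prints ['t1', 't2']
print([t.trace_id for t in find_traces(traces, resource="DynamoDB", min_duration_ms=1000)])
# prints ['t3']
print([t.trace_id for t in find_traces(traces, any_error=True)])
# prints ['t1']
```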

What more can I say? Stay tuned!

You can sign up to our web console and start experimenting. Don’t forget to explore our demo environment — no sign up needed. We are very curious about your comments and feedback. Join our Slack channel or send a tweet or contact us from our website!

Originally published at https://blog.thundra.io.