Welcome Tech
Published in

Welcome Tech

Adding traceability to Elixir applications with Spandex: our experience

At Welcome to the Jungle, our backend applications are mainly written using the Elixir language. Logs are exported to Datadog, but tracing was still missing in our toolbox. This was something we needed to fix in order to identify bottlenecks more easily, as well as the context of errors. Together, traces, logs, and metrics are known as the three pillars of observability.

So in November 2021, we decided to give the Spandex library a try after reading an interview with its author on the Datadog website. We wanted it to help us add tracing to one of our Elixir applications and report the information to our Datadog instance.

Configuring Spandex in our application was straightforward enough, thanks to a clear ReadMe file: The first traces appeared quickly without any need for configuration on the Datadog side.

After a successful initial setup, we identified one main improvement that needed to be done, which was to be able to distinguish the different layers. Also, there were a few issues: a lack of links between traces and logs, incorrect Datadog statuses with HTTP errors, and trace losses in asynchronous tasks.

In this article, we will describe how we fixed these issues using existing pull requests, reported issues, and articles written by Spandex maintainers.

Improving the identification of layers with colors

We could easily see our traces on Datadog (see Figure 1 below), but at this point it was not clear how to distinguish which one belonged to which layer (mainly the API, core functions, and database).

Figure 1: Our first traces on the Datadog website

Since Datadog can apply a different color for each service, we were able to improve the visualization of the traces by setting different service names for each layer of our application.

We decided to apply a service name to our database, one to our web API layer, and a default service name (core) for all functions between the API and the database. We used a root service name (the application name) and added an extension for each layer.

Figure 2: The same trace as in Figure 1, but with a different color for each layer/service

Thanks to this setting, it was easier to understand our traces and evaluate which one could be improved by looking at the breakdown, as seen at the far right of Figure 2.

The lack of links between traces and logs

When we looked carefully at Figure 2, we could see that no logs were associated with this trace (at the bottom, we have an underlined Logs (0) tab in blue). However, when we looked at the Datadog logs, we clearly had a log entry that should have been linked to this trace (see Figure 3 below).

Figure 3: The same request at the same time, but with no reference to the corresponding trace

To fix this problem, we had to deal with two issues relating to the export of trace/span IDs: exporting IDs with the right keys and converting integers to strings.

Using the right keys to export IDs

Datadog expects to receive log entries in the JSON format, with one line per entry. This is not the way the default Elixir Logger works, but a customized backend can be set up to export logs in the right format. For that, we chose to use LoggerJSON, which implements all the functions required to transform any Elixir structure into a JSON structure. Moreover it provides an adapter to export log entries directly in the Datadog format.

However, LoggerJSON didn’t have any specific rules for exporting trace and span IDs. Since Spandex stores them in the log metadata using the trace_id and span_id keys, the same keys were used to export them in JSON.

While exploring the Datadog documentation, we found this sentence: “If this isn’t working as expected, ensure the name of the logs attribute that contains the trace id is dd.trace_id and verify it is properly set in reserved attributes.”

So to fix this issue, we patched LoggerJSON to export trace and span IDs with the right prefix. Please note: this patch is now merged in LoggerJSON.

Datadog could now see our IDs, but it was still unable to link our traces to our logs.

Converting integers to strings

Our traces had trace and span IDs. Our logs had trace and span IDs, too. However, a careful analysis of our logs and traces on the Datadog website revealed that the IDs in our logs were very different from the IDs in our traces: The IDs in our logs looked shorter than the ones in our traces.

We had to look at the generated logs on our server and tried different conversions to understand that the Datadog JSON parser was unable to correctly decode big integers. The IDs in our traces were not affected by this problem because they are exported through a different protocol called MessagePack.

We discovered the solution in a pull request from December 2020 that had not been integrated into Spandex’s latest release when we started our integration in November 2021: We needed to convert integers into strings. This became available with Spandex 3.1.0.

It is useful to be aware that different Datadog tools do not necessarily interpret data in the same way, especially integers. If your integers in Datadog have nothing to do with those in your app, you are able to fix the problem by converting them to strings. The solution was hidden in the Datadog documentation…in a comment inside a code example:

# To preserve precision during JSON serialization, use strings for large numbers

With these two fixes, our IDs were correctly exported, in the right formats, and Datadog was able to link our traces with our logs and vice versa (see Figure 4 below).

Figure 4: In the Logs tab, we can see a log entry linked to the displayed trace

Thanks to this fix we are able to see all the logs related to a trace and vice versa. We can also identify in what context a log has been made. This is very useful for understanding the context when an error occurs.

Invalid Datadog statuses with HTTP errors

Everything was working correctly at this point until an error occurred on an endpoint. The exception was caught correctly, but the main span was still reported as an OK event by Datadog, even though the HTTP status was 500 (see Figure 5 below).

Consequently we couldn’t trigger alert messages, count how many entries were in ERROR states, or filter our traces to sift out just the errors.

Figure 5: The HTTP status is 500 (far right), but the Datadog status is OK (far left)

We found a fix for this anomaly in Spandex’s issues: Issue 24 provides a solution to the same problem — set the error status field and add an explanation before sending the result. This issue had already been closed when we found it, which is why it didn’t appear on the default issues list.

Having applied those recommendations, Datadog now correctly reports HTTP errors (see Figure 6 below).

Figure 6: The HTTP status is 500 (far right) and the Datadog status is now Error (far left)

Trace losses in asynchronous tasks

All routes were now working correctly and reported spans from the API to the database through core functions were OK. Well, every route except for one (see Figure 7 below).

Figure 7: Oops, tracing ended at the API level. What happened?

For this endpoint, we needed to spawn a process to do a background job. Since Spandex stores its data (trace IDs and span IDs) within the process dictionary, this new process didn’t have any information relating to the current span and thus could not continue the trace.

Therefore we needed to fetch the current context before starting the new process and asking Spandex to continue the trace inside it. The ReadMe file doesn’t fully describe how to handle this kind of issue but we found an article written by a maintainer of Spandex that provides the missing details using a full example.

Having added a function to handle asynchronous tasks, traces are now correctly continued even when a background job has started (see Figure 8 below).

Figure 8: Our endpoint spawned a process. Even though the main process (in red) ended, another process (in green) continued to work after the endpoint process terminated

Conclusion

In this article, we have described the main issues we encountered during our integration of the Spandex library and how we solved them. They weren’t new issues: Spandex maintainers were already aware of them or they had been encountered by other users in the past — we “just” had to find the solutions and add them to our integration.

The methods we used to fix these unexpected behaviors are not limited to the Spandex library and could be used while integrating any other existing library.

In our case, we improved our initial setup by reading pending and unmerged pull requests to find someone who had experienced the same problems and already fixed them. We also read declared issues to find a solution — including closed ones!

Other important resources are articles that have been written by library maintainers or their presentations in conferences. Users also write feedback articles to share their usage: These are goldmines for finding out practical information.

Finally, when we find a bug, we consider it important enough to fork the project, to propose a fix, and to contribute back to the project we use every day in our applications. We did that for LoggerJSON and we currently have an open pull request for Spandex.

And as all the applications we are working on at Welcome to the Jungle needed to be integrated into Datadog, we ended up building an internal library to share the knowledge described in this article with all the squads in the tech team. But that’s another story we may tell you at a later date — so stay tuned!

Written by David Julien, Lead back-end developer @ WTTJ

Edited by Anne-Laure Civeyrac

Illustration by David Adrien

Join our team!

--

--

--

This is the Welcome to the Jungle tech blog, where you’ll find articles about the machinery behind our Elixir/React applications and how we come up with our most creative designs, details of the latest releases of our open-source libraries, and much more!

Recommended from Medium

Review Apps is a game changer

GameDev : Physics and Collisions (Part I)!

My 2021 Tech Highlights

Estimating Stock Returns with Fama-French Three-Factor Model in Python

Building Supply chain platform

Exception Handling in Java - 1

How To Use Notions New Grouping Feature. 6 Examples

The Coding Interview

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
WTTJ Tech

WTTJ Tech

More from Medium

An opinionated Elixir Style Guide #2

Signup, login, and logout app in Deno using Postgres database

A faster loading store for OrbitDB

Why I don’t like API frameworks together with Serverless