Adding traceability to Elixir applications with Spandex: our experience
At Welcome to the Jungle, our backend applications are mainly written using the Elixir language. Logs are exported to Datadog, but tracing was still missing in our toolbox. This was something we needed to fix in order to identify bottlenecks more easily, as well as the context of errors. Together, traces, logs, and metrics are known as the three pillars of observability.
So in November 2021, we decided to give the Spandex library a try after reading an interview with its author on the Datadog website. We wanted it to help us add tracing to one of our Elixir applications and report the information to our Datadog instance.
Configuring Spandex in our application was straightforward enough, thanks to a clear ReadMe file: The first traces appeared quickly without any need for configuration on the Datadog side.
After a successful initial setup, we identified one main improvement that needed to be done, which was to be able to distinguish the different layers. Also, there were a few issues: a lack of links between traces and logs, incorrect Datadog statuses with HTTP errors, and trace losses in asynchronous tasks.
In this article, we will describe how we fixed these issues using existing pull requests, reported issues, and articles written by Spandex maintainers.
Improving the identification of layers with colors
We could easily see our traces on Datadog (see Figure 1 below), but at this point it was not clear how to distinguish which one belonged to which layer (mainly the API, core functions, and database).
Since Datadog can apply a different color for each service, we were able to improve the visualization of the traces by setting different service names for each layer of our application.
We decided to apply a service name to our database, one to our web API layer, and a default service name (core) for all functions between the API and the database. We used a root service name (the application name) and added an extension for each layer.
Thanks to this setting, it was easier to understand our traces and evaluate which one could be improved by looking at the breakdown, as seen at the far right of Figure 2.
The lack of links between traces and logs
When we looked carefully at Figure 2, we could see that no logs were associated with this trace (at the bottom, we have an underlined Logs (0) tab in blue). However, when we looked at the Datadog logs, we clearly had a log entry that should have been linked to this trace (see Figure 3 below).
To fix this problem, we had to deal with two issues relating to the export of trace/span IDs: exporting IDs with the right keys and converting integers to strings.
Using the right keys to export IDs
Datadog expects to receive log entries in the JSON format, with one line per entry. This is not the way the default Elixir Logger works, but a customized backend can be set up to export logs in the right format. For that, we chose to use LoggerJSON, which implements all the functions required to transform any Elixir structure into a JSON structure. Moreover it provides an adapter to export log entries directly in the Datadog format.
However, LoggerJSON didn’t have any specific rules for exporting trace and span IDs. Since Spandex stores them in the log metadata using the trace_id and span_id keys, the same keys were used to export them in JSON.
While exploring the Datadog documentation, we found this sentence: “If this isn’t working as expected, ensure the name of the logs attribute that contains the trace id is dd.trace_id and verify it is properly set in reserved attributes.”
So to fix this issue, we patched LoggerJSON to export trace and span IDs with the right prefix. Please note: this patch is now merged in LoggerJSON.
Datadog could now see our IDs, but it was still unable to link our traces to our logs.
Converting integers to strings
Our traces had trace and span IDs. Our logs had trace and span IDs, too. However, a careful analysis of our logs and traces on the Datadog website revealed that the IDs in our logs were very different from the IDs in our traces: The IDs in our logs looked shorter than the ones in our traces.
We had to look at the generated logs on our server and tried different conversions to understand that the Datadog JSON parser was unable to correctly decode big integers. The IDs in our traces were not affected by this problem because they are exported through a different protocol called MessagePack.
We discovered the solution in a pull request from December 2020 that had not been integrated into Spandex’s latest release when we started our integration in November 2021: We needed to convert integers into strings. This became available with Spandex 3.1.0.
It is useful to be aware that different Datadog tools do not necessarily interpret data in the same way, especially integers. If your integers in Datadog have nothing to do with those in your app, you are able to fix the problem by converting them to strings. The solution was hidden in the Datadog documentation…in a comment inside a code example:
# To preserve precision during JSON serialization, use strings for large numbers
With these two fixes, our IDs were correctly exported, in the right formats, and Datadog was able to link our traces with our logs and vice versa (see Figure 4 below).
Thanks to this fix we are able to see all the logs related to a trace and vice versa. We can also identify in what context a log has been made. This is very useful for understanding the context when an error occurs.
Invalid Datadog statuses with HTTP errors
Everything was working correctly at this point until an error occurred on an endpoint. The exception was caught correctly, but the main span was still reported as an OK event by Datadog, even though the HTTP status was 500 (see Figure 5 below).
Consequently we couldn’t trigger alert messages, count how many entries were in ERROR states, or filter our traces to sift out just the errors.
We found a fix for this anomaly in Spandex’s issues: Issue 24 provides a solution to the same problem — set the error status field and add an explanation before sending the result. This issue had already been closed when we found it, which is why it didn’t appear on the default issues list.
Having applied those recommendations, Datadog now correctly reports HTTP errors (see Figure 6 below).
Trace losses in asynchronous tasks
All routes were now working correctly and reported spans from the API to the database through core functions were OK. Well, every route except for one (see Figure 7 below).
For this endpoint, we needed to spawn a process to do a background job. Since Spandex stores its data (trace IDs and span IDs) within the process dictionary, this new process didn’t have any information relating to the current span and thus could not continue the trace.
Therefore we needed to fetch the current context before starting the new process and asking Spandex to continue the trace inside it. The ReadMe file doesn’t fully describe how to handle this kind of issue but we found an article written by a maintainer of Spandex that provides the missing details using a full example.
Having added a function to handle asynchronous tasks, traces are now correctly continued even when a background job has started (see Figure 8 below).
In this article, we have described the main issues we encountered during our integration of the Spandex library and how we solved them. They weren’t new issues: Spandex maintainers were already aware of them or they had been encountered by other users in the past — we “just” had to find the solutions and add them to our integration.
The methods we used to fix these unexpected behaviors are not limited to the Spandex library and could be used while integrating any other existing library.
In our case, we improved our initial setup by reading pending and unmerged pull requests to find someone who had experienced the same problems and already fixed them. We also read declared issues to find a solution — including closed ones!
Other important resources are articles that have been written by library maintainers or their presentations in conferences. Users also write feedback articles to share their usage: These are goldmines for finding out practical information.
Finally, when we find a bug, we consider it important enough to fork the project, to propose a fix, and to contribute back to the project we use every day in our applications. We did that for LoggerJSON and we currently have an open pull request for Spandex.
And as all the applications we are working on at Welcome to the Jungle needed to be integrated into Datadog, we ended up building an internal library to share the knowledge described in this article with all the squads in the tech team. But that’s another story we may tell you at a later date — so stay tuned!
Written by David Julien, Lead back-end developer @ WTTJ
Edited by Anne-Laure Civeyrac
Illustration by David Adrien