Enabling Diagnostics for Rust Code

Victor Ermolaev
Published in The Startup
Sep 18, 2020

Debugging in Rust is difficult. It is possible, but the tools for it are nowhere near the equivalents for Java or C#, partly because those languages have virtual machines underneath.

Tracing is closely related to debugging, although not the same. Tracing lets the programmer see how the program reacts to external stimuli (such as network messages) and examine the program's life cycle. Crucially, tracing flags the situations where proper debugging must take place.

Developers often resort to plain old println! statements to trace their own code. In any reasonably large program this approach results in a barely manageable pile of println! statements that creates more confusion than clarity. Let's agree this isn't a viable option and investigate what else the Rust ecosystem has to offer.

The log method

The next level of sophistication is the log crate. It is an abstraction layer for logging and distinguishes five levels of importance (in descending order): error!, warn!, info!, debug! and trace!. The crate requires a concrete logger implementation to be paired with it, often env_logger. env_logger is simple, clean, and nice to use. It takes one line to initialize:

env_logger::init();

and lets you quickly configure the level of log messages through an environment variable:

RUST_LOG=trace cargo run

This configuration is extremely flexible and allows separate levels per module. For example:

RUST_LOG=module1=trace,module2=info cargo run

will ensure the logger only outputs messages from each listed module at or above the specified importance.

This method is sufficient in many cases; for example, large projects such as Hyperledger Indy use it to the fullest: entry into and exit from a function are accompanied by log statements printing out the input parameters and the function result.
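As a rough illustration (not taken from Indy itself; the function and its parameters are made up), such entry/exit logging looks something like this, assuming log and env_logger are listed in Cargo.toml:

use log::{debug, info};

// Hypothetical function; the name and parameters are illustrative only.
fn process_request(request_id: u64, payload: &str) -> usize {
    debug!("process_request >>> request_id: {}, payload: {:?}", request_id, payload);

    let result = payload.len(); // stand-in for real work

    debug!("process_request <<< result: {}", result);
    result
}

fn main() {
    env_logger::init();
    info!("application started");
    process_request(42, "ping");
}

Run with RUST_LOG=debug to see both the entry and exit lines.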

Asynchronous code

Before going further, let us take a short detour. Many (if not all) network applications accept new messages or requests before earlier ones have finished processing. In such cases, asynchronous code is a real painkiller. Writing async code in Rust became a breeze once the language reached 1.39 and, most notably, the general-purpose async runtime tokio and the HTTP library hyper aligned themselves with the newly introduced functionality. This body of work had a tremendous impact on how async Rust code is written. Importantly, tokio introduces the concept of a task: a piece of work executed in parallel in a multi-threaded context; a task can be paused and passed around between threads. Tasks are paused at await calls until the call completes, e.g., an event has occurred or the required data have arrived.
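To make the task concept concrete, here is a minimal sketch (not part of the article's project; it assumes a current tokio 1.x, where the delay function is called sleep, whereas the snippet later in this article uses the older tokio 0.2 delay_for):

use std::time::Duration;

#[tokio::main]
async fn main() {
    // Spawn a task; it may run on another worker thread of the runtime.
    let handle = tokio::spawn(async {
        // The task is paused at this await point until the timer fires;
        // meanwhile the thread is free to drive other tasks.
        tokio::time::sleep(Duration::from_millis(100)).await;
        "done"
    });

    // The current task is paused here until the spawned task completes.
    let result = handle.await.expect("task panicked");
    println!("{}", result);
}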

The tracing method

I gave one example of how a large project uses the log crate to generate execution traces. It works, so why would one look for another option? There are three reasons to look further:

  1. Size. Although the trace log grows only linearly with the size of the code base, it may become unwieldy for manual inspection.
  2. Code pollution. Logging statements, although useful, often reduce overall readability, especially when used to print out function arguments and evaluation results.
  3. Asynchronous code. Parallel execution messes tracing up: log lines from different tasks interleave, rendering the trace intractable.

The sum of these points begs the question: what's the alternative? Enter the tracing crate. It provides instrumentation for collecting structured diagnostics from your application. Its core functionality rests on the premise that all logging happens within spans. A span is a continuous stretch of time with a defined start and end. Spans are entered and left, and other spans can be entered and left within a span. This property defines an inclusion relation on the set of all spans. The inclusion relation structures log messages in the time domain but, more prominently, defines a causal relationship between spans. The latter is especially important in async code, where tasks can be paused and picked up again.

In the above diagram, Task A is associated with Span A, Task B with Span B, etc. Denote the respective task starting times by t(·). In the time domain the starting times are totally ordered, t(A) < t(B) < t(C) < t(D). In the causal domain the order relation is partial; using the set-inclusion sign liberally, it can be represented as A ⊆ B, B ⊆ C, B ⊆ D: B is caused by A, C by B, etc.
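To make the span mechanics concrete, here is a minimal, self-contained sketch (it uses only the tracing and tracing-subscriber crates with a stdout subscriber and is not taken from the article's project):

use tracing::{info, info_span};

fn main() {
    // A simple subscriber that prints events to stdout.
    tracing_subscriber::fmt::init();

    // Enter an outer span; it stays "entered" while the guard is alive.
    let outer = info_span!("request", id = 1);
    let _outer_guard = outer.enter();
    info!("handling request");

    {
        // An inner span entered within the outer one: inner ⊆ outer.
        let inner = info_span!("db_query");
        let _inner_guard = inner.enter();
        info!("querying the database");
    } // the inner span is left here

    info!("request handled");
} // the outer span is left here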

The tracing crate has an open-ended interface that allows different collectors, or subscribers, of the tracing messages to be plugged in. The simplest subscriber logs directly to standard output. That is barely an improvement; luckily, it's not the only option. To really stand out, the traces:

  1. Must be persistent, and thus available even when the respective application is not running.
  2. Must be conveniently accessible from different hosts, not necessarily from the host executing the application.

One subscriber fulfilling these requirements is tracing-opentelemetry.

Let’s dive in!

Create an async function doing some heavy work:

use std::time::Duration;

struct Dummy;

#[tracing::instrument(skip(_dummy))]
async fn heavy_work(id: String, units: u64, _dummy: Dummy) -> String {
    for i in 1..=units {
        // `delay_for` is the tokio 0.2 API; in tokio 1.x it is `tokio::time::sleep`.
        tokio::time::delay_for(Duration::from_secs(1)).await;
        tracing::info!("{} has been working for {} units", id, i);
    }

    id
}

Let’s unpack.

  1. Instrumentation. The macro #[tracing::instrument], roughly speaking, creates a span for this function and enters it. The macro uses the function name and its parameters for the span's metadata. This requires the parameters to implement the Debug trait; to exclude a parameter from the metadata, use skip in the instrumentation attribute (a few more attributes are sketched below; the full list can be found here).
  2. tracing::info! does the actual tracing.
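As a hedged sketch of a few commonly used attributes (check the tracing documentation for the authoritative list; the function below is made up):

use tracing::instrument;

// `name` overrides the span name, `level` sets its verbosity,
// `skip` omits an argument from the recorded fields, and `fields`
// adds extra key-value pairs to the span.
#[instrument(name = "process", level = "debug", skip(payload), fields(payload_len = payload.len()))]
fn process(id: u64, payload: Vec<u8>) -> usize {
    tracing::debug!("processing");
    payload.len()
}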

Let's reuse a tracer initialization function init_tracer; it's good enough to start developing and will likely be adjusted when more customization is required.
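As a rough sketch of what such a function can look like (assuming the tracing-subscriber, tracing-opentelemetry and opentelemetry-jaeger crates; the exact builder methods, e.g. install_simple vs. install, vary between releases, and the service name below is made up):

use tracing_subscriber::layer::SubscriberExt;

fn init_tracer() -> Result<(), Box<dyn std::error::Error>> {
    // Export spans to a local Jaeger agent (by default over UDP on port 6831).
    let tracer = opentelemetry_jaeger::new_pipeline()
        .with_service_name("heavy_work_demo")
        .install_simple()?;

    // Bridge `tracing` spans into OpenTelemetry and register the subscriber globally.
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = tracing_subscriber::Registry::default().with(telemetry);
    tracing::subscriber::set_global_default(subscriber)?;
    Ok(())
}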

Finally, the main function:

use futures::stream::{FuturesUnordered, StreamExt};
use num_traits::ToPrimitive; // one possible provider of `to_u64` for the sampled f64 values
use rand_distr::Distribution;

#[tokio::main]
async fn main() {
    init_tracer().expect("Tracer setup failed");
    let root = tracing::span!(tracing::Level::TRACE, "lifecycle");
    let _enter = root.enter();

    let rng = rand::thread_rng();
    let normal = rand_distr::Normal::new(5.0, 1.0).unwrap();

    let mut finished_work = normal
        .sample_iter(rng)
        .take(10)
        .map(|t| heavy_work(t.to_string(), t.to_u64().expect("Must convert"), Dummy))
        .collect::<FuturesUnordered<_>>();

    while let Some(id) = finished_work.next().await {
        tracing::info!("{} has completed", id);
    }
}

For the topic at hand, this function accomplishes two things: it initializes a tracer and enters the root span. The rest is a wrapper that calls the function doing the heavy work and receives the results.

The Rust compiler de-sugars async functions into a state machine which is polled at execution time. What looks like a single function call to the programmer is, in fact, an act of repeated function invocation. For tracing this means that the respective span will be entered and left many times before the underlying state machine completes. This behavior also presents a challenge to tracing, because it complicates the treatment of span inclusions: instead of one cohesive span corresponding to an async piece of code, there is a plurality of smaller spans.
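As an aside (not from the article's code), this is why the tracing crate provides the Instrument extension trait for futures: instead of holding an enter() guard across an await, the span is attached to the future and re-entered on every poll, so nested events stay correctly attributed. A minimal sketch:

use tracing::Instrument;

async fn fetch(id: u32) -> u32 {
    tracing::info!("fetching");
    id
}

async fn run() {
    // The span is entered each time the future is polled and exited when
    // the poll returns, covering all the smaller polling intervals.
    let result = fetch(7)
        .instrument(tracing::info_span!("fetch_call", id = 7))
        .await;
    tracing::info!("got {}", result);
}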

One option for collecting and analyzing the traces (adhering to the OpenTelemetry specification) is Jaeger, a tool specifically designed to increase observability of how applications execute.

To collect and analyze traces, Docker must be installed. Run Jaeger:

docker run -d -p6831:6831/udp -p6832:6832/udp -p16686:16686 jaegertracing/all-in-one:latest

Run your application.

Open your favorite browser and go to http://localhost:16686/. What you see is quite impressive: a tool chest full of neat visualizations and analytical instruments.

I created a small script that runs the application continuously to simulate load, so that the execution-time graph at the top is somewhat populated.

Every trace can be examined structurally and temporally.

Trace composition

Every span corresponds to a function call. Notice how neatly the messages from multiple async sub-calls are grouped together, creating the impression of uninterrupted execution.

Messages produced during execution of the respective function

A span's metadata allows easy inspection of the function arguments, for example, units = 5.

The next level of (experimental) sophistication is a graph representing the causal relationship discussed above. This representation can be found under the "Trace Timeline" drop-down list.

Conclusion

The tracing and tracing-opentelemetry crates enable simple and clean tracing of Rust applications. Jaeger presents such traces in a comprehensible way and provides analytics over them. This piece barely scratches the surface of what these tools offer, but it provides a good starting point to explore further!

Source code can be found on GitHub.
