Practical Linux tracing part 4: tracing container workload

Tungdam

Published in Coccoc Engineering Blog, Oct 4, 2021

Similarities and differences compared with a normal workload

src: https://www.kizytracking.com/wp-content/uploads/2020/10/xsmart_container_connection.jpg

TL;DR

Same same:

Tracing kernel events (syscalls, kprobes, tracepoints) works in the same way
Profiling user-space programs with perf works in the same way (Linux 4.14+)

But different:

Namespaces get in the way of tracing user-space events with the kernel tracing infrastructure (the tools and the target binaries / libraries are isolated from each other by different mount namespaces)
Distributed tracing (e.g. OpenTelemetry / OpenTracing + Jaeger) to the rescue

Prerequisite

This is the 4th article in my (messy) practical tracing tutorial series. You should already be familiar with tracing / profiling and with tools like strace, perf, and bpftrace and its various tools. Otherwise, I highly recommend starting from the beginning of the series.

Obviously, we also need some experience with container workloads. Not much though: knowing buzzwords like Docker, Kubernetes, Linux cgroups / namespaces… is enough.

By “container workload” I simply mean an application running inside a container that uses the containerd runtime (directly or through the Docker engine). It doesn’t matter what the orchestration solution is (k8s like us, or OpenShift, Docker Swarm, Mesos…).

The similarity

A process on the host OS and a process inside a container are both simply processes, and they share the same kernel. Thus, tracing them through kernel-related events such as syscalls, kprobes and tracepoints works exactly the same.

For example, syncsnoop.bt still works well with a container workload:

syncsnoop.bt
Attaching 7 probes...
Tracing sync syscalls... Hit Ctrl-C to end.
TIME      PID    COMM             EVENT
10:27:04  1229   auditd          tracepoint:syscalls:sys_enter_fsync
10:27:04  1229   auditd          tracepoint:syscalls:sys_enter_fsync
10:27:05  11459  fluent-bit       tracepoint:syscalls:sys_enter_fsync
10:27:05  11459  fluent-bit       tracepoint:syscalls:sys_enter_fsync
10:27:05  11459  fluent-bit       tracepoint:syscalls:sys_enter_fsync
10:27:05  11459  fluent-bit       tracepoint:syscalls:sys_enter_fsync
10:27:05  11459  fluent-bit       tracepoint:syscalls:sys_enter_fsync
.....

In this snippet, auditd is running on the host OS while fluent-bit is a container workload, which can be verified by checking the NSpid of each process like this:

cat /proc/1229/status | grep nspid -i
NSpid: 1229
cat /proc/11459/status | grep nspid -i
NSpid: 11459 1

Here auditd has only one PID, 1229, but fluent-bit has two of them (11459 in the “main” PID namespace and 1 in its own namespace).
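The same NSpid check can be scripted. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a Linux host whose kernel exposes the NSpid line in /proc/<pid>/status.

```python
from pathlib import Path
from typing import List


def parse_nspid(status_text: str) -> List[int]:
    """Return the PIDs on the NSpid line, outermost PID namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # no NSpid line found (older kernels may not expose it)


def is_in_nested_pid_namespace(status_text: str) -> bool:
    """More than one PID means the process lives in a child PID namespace."""
    return len(parse_nspid(status_text)) > 1


def read_status(pid: int) -> str:
    """Read /proc/<pid>/status from the host (Linux only)."""
    return Path(f"/proc/{pid}/status").read_text()


# Demo on text shaped like the two outputs above, so it runs anywhere.
print(parse_nspid("Name:\tauditd\nNSpid:\t1229\n"))           # [1229]
print(parse_nspid("Name:\tfluent-bit\nNSpid:\t11459\t1\n"))   # [11459, 1]
```

On a real host you would call is_in_nested_pid_namespace(read_status(11459)) instead of feeding it sample text.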

As you can see, tracing tracepoints works in the same way for both processes.

The same goes for syscalls and strace, as a simpler example:

strace -e fsync -p 11459
strace: Process 11459 attached
strace: [ Process PID=11459 runs in x32 mode. ]
strace: [ Process PID=11459 runs in 64 bit mode. ]
fsync(22)                               = 0
fsync(22)                               = 0
fsync(22)                               = 0
fsync(22)                               = 0
fsync(22)                               = 0
.....

The difference

A container workload has its own namespaces. (See this if you’re not familiar with them. Feel free to suggest a better article as a reference.) Put simply, we can say that it has its own root filesystem, with its own software versions. Its binaries and libraries are isolated from our tracing tools.

You can quickly see that we’ll have trouble interacting with the user-space application. For example, bpftrace on the host OS can’t find the binary or .so file at the path it has inside the container in order to inject a uprobe, so we lose our superpower of looking into any running application on demand.
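To see whether a target process really lives in a different mount namespace from your tracing tool, you can compare the namespace links under /proc. This is a small sketch under stated assumptions: the helper names are hypothetical, it only works on Linux, and reading another process’s links usually needs root.

```python
import os
import re
from typing import Optional


def namespace_inode(link_target: str) -> Optional[int]:
    """Extract the inode from a link target such as 'mnt:[4026531840]'."""
    match = re.fullmatch(r"\w+:\[(\d+)\]", link_target)
    return int(match.group(1)) if match else None


def same_mount_namespace(pid: int, other: str = "self") -> bool:
    """Compare /proc/<pid>/ns/mnt with /proc/<other>/ns/mnt."""
    target = namespace_inode(os.readlink(f"/proc/{pid}/ns/mnt"))
    ours = namespace_inode(os.readlink(f"/proc/{other}/ns/mnt"))
    return target is not None and target == ours


# Parsing demo with a made-up inode number, so it runs on any OS.
print(namespace_inode("mnt:[4026531840]"))  # 4026531840
```

If same_mount_namespace(11459) returns False, the paths the process sees (for example /fluent-bit/bin/fluent-bit) are not the paths a host tool sees.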

What can we do in this case ?

Stick to the old way: perf is “namespace aware”

This has been the case since Linux v4.14-rc1, which means that perf from version 4.14 onwards can detect whether the binary it is profiling belongs to another namespace. If it does, perf automatically reads the binary’s symbols from that namespace and gives you the stack trace as usual.

So, fortunately, profiling with perf still works in the container ❤ .
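If you are not sure whether the perf on a node is recent enough, a quick version check helps. The sketch below takes the 4.14 threshold from the TL;DR and assumes that perf --version prints something like "perf version 5.4.0"; the exact output format differs between distributions, so treat the parsing as an assumption.

```python
import re
import subprocess
from typing import Optional, Tuple

MIN_NAMESPACE_AWARE = (4, 14)  # threshold taken from the TL;DR above


def parse_perf_version(output: str) -> Optional[Tuple[int, int]]:
    """Pull the first 'major.minor' pair out of the version string."""
    match = re.search(r"(\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_namespace_aware(output: str) -> bool:
    """True when the reported version is at least 4.14."""
    version = parse_perf_version(output)
    return version is not None and version >= MIN_NAMESPACE_AWARE


def installed_perf_output() -> str:
    """Run `perf --version` on this machine (requires perf to be installed)."""
    result = subprocess.run(
        ["perf", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout


print(is_namespace_aware("perf version 4.9.30"))  # False
print(is_namespace_aware("perf version 5.4.0"))   # True
```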

Back to the example with fluent-bit above, we can profile it as if it’s a normal process on a host OS:

perf record -a -g -F 99 --call-graph dwarf -p 11459 -- sleep 3
Warning:
PID/TID switch overriding SYSTEM
[ perf record: Woken up 43 times to write data ]
[ perf record: Captured and wrote 17.774 MB perf.data (2180 samples) ]

perf script shows us stack traces like this:

            ..................................
 
            55d35d9acaf5 sqlite3_step+0x169 (/fluent-bit/bin/fluent-bit)
            55d35d7fc0a8 flb_tail_db_file_offset+0x68 (/fluent-bit/bin/fluent-bit)
            55d35d7f7376 flb_tail_file_chunk+0x614 (/fluent-bit/bin/fluent-bit)
            55d35d7f1f2d in_tail_collect_static+0x6e (/fluent-bit/bin/fluent-bit)
            55d35d7bc1a3 flb_input_collector_fd+0x409 (/fluent-bit/bin/fluent-bit)
            55d35d7c838d flb_engine_handle_event+0x52f (inlined)
            55d35d7c838d flb_engine_start+0x52f (/fluent-bit/bin/fluent-bit)
            55d35d73d2b2 main+0x6c9 (/fluent-bit/bin/fluent-bit)

So no matter which namespace a process is running in, profiling with perf can still give us useful stack traces. There are still cases where you will get a broken stack like this:

7f6cdad0b803 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdaca5804 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacd84b2 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacda30b [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacda91d [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacdab95 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdaceae3b [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdad07ce3 [unknown] (/fluent-bit/bin/out_grafana_loki.so)

but it’s actually the same old problem of missing debug symbols in the binary, so don’t blame the container for it. You may want to revisit my first blog post to fix this issue.
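To find out quickly which binaries or libraries are missing symbols, you can count the [unknown] frames per file in the perf script output. This is a rough sketch: the regular expression assumes frame lines shaped like the ones shown above (address, symbol, file in parentheses), and the function name is made up.

```python
import re
from collections import Counter
from typing import Iterable

# Matches a frame line such as:
#   7f6cdad0b803 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
FRAME = re.compile(r"^\s*([0-9a-f]+)\s+(.+?)\s+\((.+)\)\s*$")


def unresolved_frames_by_dso(lines: Iterable[str]) -> Counter:
    """Count frames whose symbol is [unknown], grouped by the mapped file."""
    counts: Counter = Counter()
    for line in lines:
        match = FRAME.match(line)
        if match and match.group(2) == "[unknown]":
            counts[match.group(3)] += 1
    return counts


sample = [
    "7f6cdad0b803 [unknown] (/fluent-bit/bin/out_grafana_loki.so)",
    "7f6cdaca5804 [unknown] (/fluent-bit/bin/out_grafana_loki.so)",
    "    55d35d9acaf5 sqlite3_step+0x169 (/fluent-bit/bin/fluent-bit)",
]
for dso, count in unresolved_frames_by_dso(sample).most_common():
    print(count, dso)  # 2 /fluent-bit/bin/out_grafana_loki.so
```

For real data, pipe perf script into a small wrapper that calls unresolved_frames_by_dso(sys.stdin); the files at the top of the list are the ones that need debug symbols.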

If profiling with perf isn’t enough for you, there are other options you may want to consider.

Running tracing tools from inside the container ?

Some may suggest installing our tracing tools inside the container and running them from there, but I am against it, because:

It can increase our container image’s size significantly.
We don’t always own the image. Rebuilding a third-party image with perf / bpftrace added is usually troublesome. A sidecar container with the tracing tools avoids changing the app image, but it has another serious drawback: the two containers have to share the rootfs (mount namespace), which is not easily done to my knowledge.
Various security settings have to be adjusted to give tracing tools inside a container the same privileges / power as ones on the host OS. This is not always possible; it depends on how you set up your cluster and its security policy.
Even with the tools inside the container, we still need to merge the user-space symbols from inside the container with the kernel ones from the host OS, by mounting data from either side.

It’s possible, but it’s just not what I recommend.

Distributed tracing AKA end-to-end tracing

A distributed problem requires a distributed solution. perf and bpftrace are super cool for kernel tracing on a specific node, but I believe they are not designed to solve distributed problems.

If there’s a tracing solution that can solve our problem with instrumenting container workloads, it should be able to collect:

Application events
Across not only namespace boundaries but also host boundaries.

The distributed tracing infrastructure should support:

Tracer library instrumentation in different languages, to trace different applications.
Collecting events over the network, from different pods, on different hosts.
Processing, aggregating, storing and visualizing all that data in a central place.

Fortunately, there are already many such solutions on the market. If you’re new to this area, the following articles give more background on distributed tracing:

Introduction to distributed tracing by Lightstep.
Five years evolution of opensource distributed tracing by Pavol Loffay

Distributed tracing is definitely worth its own ( lengthy ) article but it’s out of scope here.

In our company we chose JaegerTracing as our backend solution (agent / collector, storage, visualization), in combination with OpenTelemetry as the instrumentation library, to trace container workloads.

The latency of each of our application functions is recorded and put in the context of the request it belongs to.

As usual, there are pros and cons in comparison with our “lower-level” tracing tools.

Pros:

Solves our container workload tracing problem! Tracing now works across namespace / service / host boundaries.
Provides useful context for an event: among events inside a service, and inside a request that runs across different services (hence the name end-to-end tracing).
Solves the problem with interpreted / scripting languages (e.g. PHP, Python…): we can now look into specific events instead of execute_ex for PHP or _PyEval_EvalFrameDefault for Python.

Cons:

Extra effort to set up a distributed tracing infrastructure, obviously.
Performance overhead: in the end it’s user-space instrumentation, which can’t compete with the speed of BPF-based tools, and it depends on the language-specific implementation of the OpenTelemetry SDK / API. We have to make a trade-off between performance and feature richness here, though distributed tracing solutions support different sampling mechanisms to handle this in a flexible way.
An event (span) in a distributed tracing tool is a kind of user statically-defined tracing (USDT) probe from the kernel tracing world. There’s no dynamic uprobe superpower.
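As an illustration of the sampling mentioned in the overhead point above, here is a minimal sketch of ratio-based sampling keyed on the trace id. It is loosely modelled on the general idea of trace-id-ratio sampling and is not the actual OpenTelemetry SDK implementation; should_sample is a hypothetical name.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, deciding only from the trace id."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the id as a uniformly distributed number.
    low_bits = int(trace_id_hex, 16) & (2**64 - 1)
    return low_bits < int(ratio * 2**64)


print(should_sample("00000000000000000000000000000001", 0.5))  # True
print(should_sample("0000000000000000ffffffffffffffff", 0.5))  # False
```

Because the decision depends only on the trace id, every service that sees the same request makes the same choice, so a trace is either recorded end to end or not at all.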

Conclusion

In summary, this is what I would recommend when tracing container workloads:

Understand the “namespaced process” nature of a container. Still use perf / bpftrace tools to debug kernel-related issues that may affect your container workload (CPU contention / starvation, cgroup throttling, unnecessary / abnormal syscalls, extra unwanted IO…).
Implement a distributed tracing solution to trace application-specific events. Use Jaeger and OpenTelemetry as we do if possible, so we can discuss it and grow together ;)

Hope it helps
