Practical Linux tracing part 4: tracing container workloads

The similarities and differences compared with normal workloads

src: https://www.kizytracking.com/wp-content/uploads/2020/10/xsmart_container_connection.jpg

TL;DR

Same same:

  • Tracing kernel events ( syscall, kprobe, tracepoints ) works in the same way
  • Profiling user-space program by perf works in the same way ( Linux 4.14+ )

But different:

  • Namespaces prevent the kernel tracing infrastructure from tracing user-space events ( as our tools and the binaries / libraries are isolated by different mount namespaces )
  • Distributed tracing ( Eg: OpenTelemetry / OpenTracing + Jaeger ) to the rescue

Prerequisite

This is the 4th article in my ( messy ) practical tracing tutorial series. You should be very familiar with tracing / profiling and with tools like strace, perf, bpftrace and its various tools. Otherwise, I highly recommend starting from the beginning of the series.

And obviously, we also need some experience with container workloads. Not much though: knowing buzzwords like docker, kubernetes, Linux cgroup / namespace… is enough.

By “container workload”, I simply mean an application running inside a container that uses the containerd runtime ( directly or through docker engine ). It doesn’t matter what the orchestration solution is ( k8s like us, or openshift, docker swarm, mesos … )

The similarity

A process on the host OS and a process inside a container are both simply processes, and they share the same kernel. Thus, tracing such a process with kernel-related events like syscalls, kprobes and tracepoints works exactly the same.

For example, syncsnoop.bt still works well with container workloads:

syncsnoop.bt
Attaching 7 probes...
Tracing sync syscalls... Hit Ctrl-C to end.
TIME PID COMM EVENT
10:27:04 1229 auditd tracepoint:syscalls:sys_enter_fsync
10:27:04 1229 auditd tracepoint:syscalls:sys_enter_fsync
10:27:05 11459 fluent-bit tracepoint:syscalls:sys_enter_fsync
10:27:05 11459 fluent-bit tracepoint:syscalls:sys_enter_fsync
10:27:05 11459 fluent-bit tracepoint:syscalls:sys_enter_fsync
10:27:05 11459 fluent-bit tracepoint:syscalls:sys_enter_fsync
10:27:05 11459 fluent-bit tracepoint:syscalls:sys_enter_fsync
.....

In this snippet, auditd is running on the host OS while fluent-bit is a container workload, which can be verified by checking the NSpid field of each process like this:

cat /proc/1229/status | grep nspid -i
NSpid: 1229
cat /proc/11459/status | grep nspid -i
NSpid: 11459 1

Here auditd has only one PID, 1229, but fluent-bit has two of them ( 11459 in the “main” PID namespace and 1 in its own namespace ).

As you can see, tracing tracepoints works in the same way on both processes.

Same thing with syscalls and strace, as a simpler example:
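The same check can be scripted. The following is a minimal Python sketch with hypothetical helper names ( nspid_chain and in_nested_pid_namespace are not part of any tool ). It parses the text of /proc/<pid>/status and assumes a kernel that exposes the NSpid field ( Linux 4.1+ ). Keep in mind that more than one PID only tells us the process lives in a nested PID namespace; a container started with the host's PID namespace would still show a single PID.

```python
def nspid_chain(status_text):
    """Return the PIDs listed on the NSpid line of a /proc/<pid>/status dump.

    The first entry is the PID seen from the outermost namespace we can see,
    the last one is the PID inside the process's own PID namespace.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field not present (older kernels)


def in_nested_pid_namespace(status_text):
    """More than one PID means the process runs in a child PID namespace."""
    return len(nspid_chain(status_text)) > 1


# Sample lines copied from the output above; a real status file has many more fields.
AUDITD_STATUS = "Name:\tauditd\nNSpid:\t1229\n"
FLUENT_BIT_STATUS = "Name:\tfluent-bit\nNSpid:\t11459\t1\n"

print(nspid_chain(AUDITD_STATUS), in_nested_pid_namespace(AUDITD_STATUS))
print(nspid_chain(FLUENT_BIT_STATUS), in_nested_pid_namespace(FLUENT_BIT_STATUS))
```

On a real host you would feed it the content of open("/proc/11459/status").read() instead of the sample strings.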

strace -e fsync -p 11459
strace: Process 11459 attached
strace: [ Process PID=11459 runs in x32 mode. ]
strace: [ Process PID=11459 runs in 64 bit mode. ]
fsync(22) = 0
fsync(22) = 0
fsync(22) = 0
fsync(22) = 0
fsync(22) = 0
.....

The difference

A container workload has its own namespaces. ( See this if you’re not familiar with them. Feel free to suggest a better article as a reference. ) Put simply, it has its own root filesystem, with its own software versions. Its binaries and libraries are isolated from our tracing tools.
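Whether a given process really sits in a different mount namespace than our tools can be checked through the /proc/<pid>/ns/mnt symlink, whose target looks like mnt:[inode]. Below is a small sketch of that comparison, under a few assumptions: the helper names are hypothetical, the inode numbers in FAKE_PROC are made up for illustration, and the readlink parameter exists only so the sketch can run without a real /proc. On a real host, reading another process's namespace links usually requires root.

```python
import os


def mount_namespace_id(pid, readlink=os.readlink):
    """Return the mount namespace link target, e.g. 'mnt:[4026531840]'."""
    return readlink(f"/proc/{pid}/ns/mnt")


def shares_mount_namespace(pid_a, pid_b, readlink=os.readlink):
    """True when both processes see the same set of mounts."""
    return mount_namespace_id(pid_a, readlink) == mount_namespace_id(pid_b, readlink)


# Hypothetical link targets, standing in for a real /proc so the sketch runs anywhere.
FAKE_PROC = {
    "/proc/self/ns/mnt": "mnt:[4026531840]",
    "/proc/1229/ns/mnt": "mnt:[4026531840]",   # auditd, on the host
    "/proc/11459/ns/mnt": "mnt:[4026532571]",  # fluent-bit, in a container
}

print(shares_mount_namespace("self", 1229, readlink=FAKE_PROC.__getitem__))
print(shares_mount_namespace("self", 11459, readlink=FAKE_PROC.__getitem__))
```

Dropping the readlink argument makes the functions read the real /proc of the machine they run on.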

You can quickly see that we’ll have trouble interacting with user-space applications. For example, bpftrace on the host OS can’t reach the binary or .so file inside a container to inject a uprobe, so we lose our superpower of looking into any running application on demand.

What can we do in this case?

Stick to the old way: perf is “namespace aware”

It has been since Linux v4.14-rc1. This means that perf, from version 4.14 onwards, can detect whether the binary it is profiling belongs to another namespace. If it does, perf can read the binary’s symbols from that namespace automatically and give you the stack trace as usual.

So, fortunately, profiling with perf still works for container workloads ❤ .

Back to the example with fluent-bit above, we can profile it as if it’s a normal process on a host OS:

perf record -a -g -F 99 --call-graph dwarf -p 11459 -- sleep 3
Warning:
PID/TID switch overriding SYSTEM
[ perf record: Woken up 43 times to write data ]
[ perf record: Captured and wrote 17.774 MB perf.data (2180 samples) ]

perf script shows us stack traces like this:

            ..................................

55d35d9acaf5 sqlite3_step+0x169 (/fluent-bit/bin/fluent-bit)
55d35d7fc0a8 flb_tail_db_file_offset+0x68 (/fluent-bit/bin/fluent-bit)
55d35d7f7376 flb_tail_file_chunk+0x614 (/fluent-bit/bin/fluent-bit)
55d35d7f1f2d in_tail_collect_static+0x6e (/fluent-bit/bin/fluent-bit)
55d35d7bc1a3 flb_input_collector_fd+0x409 (/fluent-bit/bin/fluent-bit)
55d35d7c838d flb_engine_handle_event+0x52f (inlined)
55d35d7c838d flb_engine_start+0x52f (/fluent-bit/bin/fluent-bit)
55d35d73d2b2 main+0x6c9 (/fluent-bit/bin/fluent-bit)

So no matter which namespace a process is running in, profiling with perf can still give us useful stack traces. You may still end up with a broken stack like this:

7f6cdad0b803 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdaca5804 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacd84b2 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacda30b [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacda91d [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdacdab95 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdaceae3b [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdad07ce3 [unknown] (/fluent-bit/bin/out_grafana_loki.so)

but that’s actually the same old problem of missing debug symbols in the binary, so don’t blame the container for it. You may want to revisit my first blog post to fix this issue.
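To see which binaries or libraries are responsible for the unresolved frames, the perf script output can be summarized with a few lines of Python. This is only a sketch: the regular expression assumes frame lines shaped like the ones shown above ( address, symbol, then the DSO path in parentheses ), and unresolved_by_dso is a hypothetical helper name, not a perf feature.

```python
import re
from collections import Counter

# address, symbol (or "[unknown]"), then the DSO path in parentheses
FRAME = re.compile(r"^\s*([0-9a-f]+)\s+(.+?)\s+\((.+)\)\s*$")


def unresolved_by_dso(perf_script_text):
    """Count frames whose symbol is '[unknown]', grouped by binary / library."""
    counts = Counter()
    for line in perf_script_text.splitlines():
        match = FRAME.match(line)
        if match and match.group(2) == "[unknown]":
            counts[match.group(3)] += 1
    return counts


# A few frames copied from the two stacks above.
SAMPLE = """\
55d35d7c838d flb_engine_start+0x52f (/fluent-bit/bin/fluent-bit)
55d35d73d2b2 main+0x6c9 (/fluent-bit/bin/fluent-bit)
7f6cdad0b803 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
7f6cdaca5804 [unknown] (/fluent-bit/bin/out_grafana_loki.so)
"""

for dso, count in unresolved_by_dso(SAMPLE).most_common():
    print(count, dso)
```

The libraries at the top of that list are the ones worth rebuilding or reinstalling with debug symbols first.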

If profiling with perf isn’t enough for you, there are other options you may want to consider.

Running tracing tools from inside the container ?

Some may suggest installing our tracing tools inside the container and running them from there, but I am against it, because:

  • It can increase our container image’s size significantly.
  • We don’t always own the image. Baking perf / bpftrace tools into a third-party image is usually troublesome. A sidecar container with tracing tools avoids changing the app image, but it has another serious drawback: the 2 containers have to share the rootfs ( mount namespace ), which is not easily done to my knowledge.
  • There are various security settings to adjust in order to give tracing tools inside a container the same privileges / power as ones on the host OS. That is not always possible; it depends on how you set up your cluster and its security policy.
  • Even if we have the tools inside the container, we still need to merge the user-space symbols from inside the container with the kernel ones from the host OS, by mounting data from either side.

It’s possible, but it’s just not what I recommend.

Distributed tracing AKA end-to-end tracing

A distributed problem requires a distributed solution. perf and bpftrace are super cool for kernel tracing on a specific node, but they’re not designed to solve distributed problems, I believe.

If there’s a tracing solution that can solve our problem with instrumenting container workloads, it should be able to collect:

  • Application events
  • Across not only namespace boundaries but also host boundaries.

The distributed tracing infrastructure should support:

  • Tracer library instrumentation in different languages, to trace different applications.
  • Collecting events over the network, from different pods, on different hosts.
  • Processing, aggregating, storing and visualizing all that data in a central place
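To make these requirements a bit more concrete, here is a toy sketch of the core idea: each service records spans, and the trace context travels with the request so that spans from different services can be stitched together later. Everything in it is an assumption for illustration. It is not the OpenTelemetry or Jaeger API, the header names x-trace-id and x-parent-span-id are made up ( real deployments typically use the W3C traceparent header ), and the COLLECTED list stands in for a collector reached over the network.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional

COLLECTED = []  # stands in for the central collector / storage


@dataclass
class Span:
    name: str
    service: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.monotonic)
    duration: float = 0.0

    def finish(self):
        """Record the latency and ship the span to the collector."""
        self.duration = time.monotonic() - self.start
        COLLECTED.append(self)


def start_span(name, service, headers=None):
    """Continue the trace found in the incoming headers, or start a new one."""
    headers = headers or {}
    trace_id = headers.get("x-trace-id") or secrets.token_hex(16)
    return Span(name, service, trace_id, parent_id=headers.get("x-parent-span-id"))


def inject(span):
    """Headers the caller attaches to an outgoing request."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def backend_handler(headers):
    span = start_span("query-db", "backend", headers)
    span.finish()


def frontend_handler():
    span = start_span("GET /orders", "frontend")
    backend_handler(inject(span))  # in reality: an HTTP / gRPC call to another pod
    span.finish()
    return span


root = frontend_handler()
for span in COLLECTED:
    print(span.service, span.name, span.trace_id, span.parent_id)
```

Both spans carry the same trace id, and the backend span points to the frontend span as its parent, which is what lets a central backend rebuild the whole request across namespaces and hosts.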

Fortunately, there are already many such solutions on the market. If you’re new to this area, read more about distributed tracing in the following articles:

Distributed tracing is definitely worth its own ( lengthy ) article, but that’s out of scope here.

In our company we chose Jaeger as our backend solution ( agent / collector, storage, visualization ), in combination with OpenTelemetry as the instrumentation library, to trace container workloads.

The latency of each of our application functions is recorded and put in the context of the same request, like this:

As usual, there are pros and cons in comparison with our “lower-level” tracing tools.

Pros:

  • Solves our container workload tracing problem ! Now tracing works across namespace / service / host boundaries.
  • Provides useful context for an event: between events inside a service, and inside a request which runs across different services ( thus the name end-to-end tracing )
  • Solves the problem with interpreted / scripting languages ( Eg: PHP, Python.. ): we can now look into specific events instead of execute_ex for PHP or _PyEval_EvalFrameDefault for Python.

Cons:

  • An extra effort to set up a distributed tracing infrastructure, obviously.
  • Performance overhead: in the end it’s user-space instrumentation, which can’t compete with the speed of BPF-based tools, and it depends on the language-specific implementation of the OpenTelemetry SDK / API. We have to make a trade-off between performance and feature richness here, though distributed tracing solutions support different sampling mechanisms to handle this in a flexible way.
  • An event ( span ) in a distributed tracing tool is kind of like user statically-defined tracing ( USDT ) in the kernel tracing world. There’s no dynamic uprobe superpower.
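As an illustration of the sampling mentioned above, one common idea is to decide from the trace id alone whether a trace is kept, so every service reaches the same decision without coordination. The function below is a simplified sketch of that ratio-based idea, not the implementation of any particular SDK. should_sample is a hypothetical name, and the sketch assumes trace ids are 32 random hex characters.

```python
import secrets


def should_sample(trace_id_hex, ratio):
    """Keep roughly `ratio` of all traces, deciding from the trace id alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Compare the lower 64 bits of the trace id against a threshold.
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


# Rough check on random ids: about 10% should be kept at ratio 0.1.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Lowering the ratio reduces the instrumentation and storage cost, at the price of missing some requests entirely.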

Conclusion

In summary, this is what I would recommend when tracing container workloads:

  • Understand the “namespaced process” nature of a container. Still use perf / bpftrace tools to debug kernel-related issues that may affect your container workload ( CPU contention / starvation, cgroup throttling, unnecessary / abnormal syscalls, extra unwanted IO … )
  • Implement a distributed tracing solution to trace application-specific events. Use Jaeger and OpenTelemetry as we do if possible, so we can discuss it and grow together ;)

Hope it helps


Tungdam


Sysadmin. Amateur Linux tracer. Performance enthusiast.
