Practical Linux tracing (part 3): Strategies for ad hoc analysis

Tungdam
Coccoc Engineering Blog
Jun 30, 2021

How to pick the right event(s) to drill down into.

Preface

If you're reading this blog, most likely you're a fan of Linux tracing and BPF-based tools. Given their power, it's very tempting to jump into your shiny new toolbox to troubleshoot problems. Sometimes it works, but mostly it doesn't. In this blog, I'll share my own strategy for getting the best out of our tracing tools: when we should use them, and how.

This is the 3rd installment of my “Practical Linux tracing” series. See the previous ones here:

Why did I trace?

Symbols, debug symbols and stack trace

Tracing vs profiling

The very first strategy: think before you start

  • You always start with an assumption about what happened (which events). Tools help you verify it.
  • Check automated alerts with metrics, simple graphs of basic resource consumption, and application logs. If those tools can't clearly verify your assumption, then start tracing deeper.
  • Don't start firing BPF tools immediately like an idiot.

An easy exception: when you know what to do

Yeah, it's simple, right? Either you did it before (and it worked), or you have a very clear, simple task to verify and it requires tracing tools to solve. Just nail it.

For example, recently I had to assess the impact of multiple libwebp vulnerabilities. Frankly speaking, I don't know much about this stuff, and most likely we don't use it. But some PHP applications on some nodes load it into their memory:

grep libwebp /proc/25760/maps
7f7bc6af3000-7f7bc6af6000 r--p 00000000 09:01 38281904 /usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2
....
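
If you don't know the PID up front, a quick scan over every process's memory maps will find the candidates, for example:

# List the map files (and hence the PIDs) of every process that has libwebp mapped
grep -l libwebp /proc/[0-9]*/maps 2>/dev/null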

Now we need to know whether any function from that lib is actually executed at any point or not. A great case for showing off: bpftrace comes to the rescue:

bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libwebp.so.6.0.2:* {time("%H:%M:%S "); printf("%s %d\n",comm,pid);}' | tee /tmp/libwebp.trace

I let it run for days. Not even a single event. We can rest assured and wait for the fix.

Sounds nice? But most of the time, that's not the case. You usually don't know exactly what to do. Your metrics and application logs are not enough. When two of your three pillars of observability have failed, the easy remaining choice is to start tracing. But trace what? That's the question.


Bombarding strategy

Start from the broad view, collect as much as you can, then choose specific event(s) to drill down into based on:

Hot code

Profile and visualize (perf / bpftrace profile plus a flame graph, for example) to get the full picture of on-CPU usage, pick the hot spot (the most frequently sampled function; a flame graph doesn't show anything related to latency) and start drilling down into that specific function.
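
As a minimal sketch of that workflow, assuming Brendan Gregg's FlameGraph scripts are checked out locally (the sampling rate, duration and script paths are my own placeholders):

# Sample on-CPU stacks system-wide at 99 Hz for 30 seconds
perf record -F 99 -a -g -- sleep 30
# Fold the stacks and render them as a flame graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg

# Or roughly the same sampling with bpftrace: count sampled kernel stacks at 99 Hz
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'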

For example:

We once got into a situation with our squid proxy when there was a sudden growth in our incoming traffic in the evening. The traffic would reach half of its capacity and then drop down like this:

[Graph: CPU usage, mostly from %sys]

[Graph: info about TCP]

Thinking something had got stuck with TCP, we started a quick profiling run and found the hot code:

It called inet_csk_bind_conflict and inet_csk_get_port much more than usual. As we've been working with squid proxy for quite a while, this had to be related to the situation where squid has to handle more than 50k connections to a remote host. The limit here is the source port (in our case ip_local_port_range was set to 15000 65000). Squid itself did return some error messages about not being able to bind a socket, but not significantly more often than usual, which is pretty strange.
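
A hedged sketch of how one might confirm those two functions are unusually hot, and check the port range (the probe names come from the flame graph; inet_csk_bind_conflict may be inlined on some kernels, in which case its kprobe won't attach):

# Count calls to the suspected kernel functions, printed once per second
bpftrace -e 'kprobe:inet_csk_get_port, kprobe:inet_csk_bind_conflict { @[probe] = count(); } interval:s:1 { print(@); clear(@); }'

# Check the ephemeral port range that bounds outgoing connections per source IP
sysctl net.ipv4.ip_local_port_range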

A simple fix is to avoid the bind conflicts altogether by adding more IPs to the interface, or even more interfaces to the box, so it can handle more proxied requests. We had already added extra IPs to our interface, but it turned out not to be enough, so we added more and the problem was solved.
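
For completeness, adding an extra address is a one-liner (the address and interface below are hypothetical, and squid still needs to be told to use it, e.g. via tcp_outgoing_address in squid.conf):

# Add an extra source IP to the outgoing interface (hypothetical address/interface)
ip addr add 192.0.2.10/24 dev eth0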

Slow code

Collect a broad group of events along with info about their latency, pick out the ones that have latency outliers, and start examining their details. This is the same technique as implemented by funcslower.py (BCC) or funcslower (the ftrace version).
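
A minimal funcslower-style sketch in bpftrace; the traced function (vfs_read) and the 10 ms threshold are placeholders I picked for illustration, not something from the tools above:

# Print any vfs_read call that takes 10 ms or longer
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  $ms = (nsecs - @start[tid]) / 1000000;
  if ($ms >= 10) { printf("%s (pid %d): vfs_read took %d ms\n", comm, pid, $ms); }
  delete(@start[tid]);
}'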

An important note on this: slow code is most likely the result of being off-CPU (voluntarily, to wait for I/O, or involuntarily, preempted by the CPU scheduler). So sometimes tracing slow code turns out to be tracing context-switch-related events.
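
As a rough illustration of that angle (not taken from any case in this post), off-CPU time per task can be summed from the sched_switch tracepoint; note that this probe fires on every context switch, so the overhead caveat below applies:

# Sum how long each task stays switched out (microseconds, keyed by PID)
bpftrace -e '
tracepoint:sched:sched_switch {
  @ts[args->prev_pid] = nsecs;               // task leaves the CPU
  if (@ts[args->next_pid]) {                 // task comes back onto a CPU
    @offcpu_us[args->next_pid] = sum((nsecs - @ts[args->next_pid]) / 1000);
    delete(@ts[args->next_pid]);
  }
}'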

The bombarding strategy's biggest problem is overhead. Profiling too many events at a high rate can affect your performance significantly, but doing it at a low rate may miss important details.

To solve that issue, we need to find a way to narrow down our possible targets.

Narrowing down the target

By examining the source code. Don't worry, it's not as complicated as it sounds. We can do it by:

  • Starting from the application log: by using an error message, or info from your application's access log, as the query to search the source code, we'll find exactly the function that produces the message and can drill down from there.
  • Searching by feature: normally, function names are quite self-descriptive. By searching the source code with some keywords of the feature we're interested in, we can find the related functions quite easily. I personally find livegrep or cscope very useful for kernel source searching, and GitHub's search feature is also good enough in general. (See the sketch right after this list.)
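
A hypothetical sketch of both approaches; the error string, source paths and keyword below are made up for illustration:

# 1) Search the application's source tree for an error string seen in its log
grep -rn "Cannot bind socket" squid-4.13/src/ | head
# 2) Search the kernel tree by a feature keyword
grep -rn "bind_conflict" linux/net/ipv4/ | head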

After narrowing down the targets, we need to reproduce the bug/issue (an isolated environment is preferred) and then start tracing those events for further investigation.

For example, we once met a weird problem with BIND DNS, where negative record caching didn't behave as expected.

A simple search query like this led me to the function definitions we may need: https://github.com/isc-projects/bind9/search?q=negative+cache

The dns_ncache_* functions should be our targets for tracing.

A quick check of named's libraries shows us more info about our tracing targets:

for x in `ldd /usr/sbin/named | awk '{print $3}'`; do  objdump -T $x | grep dns_ncache && echo $x; done
00000000000a2cc0 g DF .text 000000000000001e Base dns_ncache_add
00000000000a2ce0 g DF .text 0000000000000022 Base dns_ncache_addoptout
00000000000a2d10 g DF .text 000000000000093e Base dns_ncache_towire
00000000000a3650 g DF .text 0000000000000441 Base dns_ncache_getrdataset
00000000000a4040 g DF .text 0000000000000392 Base dns_ncache_current
00000000000a3aa0 g DF .text 0000000000000597 Base dns_ncache_getsigrdataset
/lib/x86_64-linux-gnu/libdns.so.110

With the targets in hand, we can trace those events while replaying our requests to reproduce the case.
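
For instance, a minimal bpftrace sketch, with the library path taken from the objdump output above:

# Attach uprobes to every dns_ncache_* symbol and print which process hits them
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libdns.so.110:dns_ncache_* { time("%H:%M:%S "); printf("%s %s %d\n", probe, comm, pid); }'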

I actually couldn't figure out the root cause by tracing, because my colleague was quicker to find it in a buggy patch, where min-ncache-ttl swapped minttl and maxttl the wrong way around. But we were close, and with knowledge of the events, we could verify the assumption easily on the fly.

Conclusion

It's difficult to form a formula for solving ad hoc problems. As they're ad hoc, we should just aim to accumulate experience along the way. The best strategy turned out to be improvising only :)

What I wrote above is a small summary of how I dealt with my cases along the way, in a form that is reusable. Sometimes we still need luck (and hours and hours of research) to really figure out the root cause, especially the root cause of an interesting issue. Hopefully this blog gives you some clues for the very first steps of your tracing journey. I'm looking forward to your feedback on this, as well as to hearing your own war stories.

Thank you for reading!

Tungdam
Coccoc Engineering Blog

Sysadmin. Amateur Linux tracer. Performance enthusiast.