Practical Linux tracing, part 0/X: why did I trace?
Recently I gave a presentation to my team about bpftrace, which I think is very cool. It turned out that nobody liked it, mostly because of my presentation skills, I guess. Still, the biggest question asked was: why?
Why do I need this thing?
It’s cool, but it takes effort to learn some “low-level things”, so why should I do that?
And I also asked myself: why do I want to share this with everybody?
This is not a technical post. This is a story.
“What will you do if this service is broken? You CALL ME”
A fellow developer yelled at me in our Skype group when I asked him for details about the internal implementation of his software. Maybe I asked too much. He got angry and told me so. He was right. So I shut up.
In our company, we sysadmins run the production services. We’re in charge of them. Paid for them. Responsible for them. So I want to know as much as possible about what I am running.
That’s difficult. I can’t write the code they write, obviously. I can’t even understand it. But I should recognize the symptoms when things go wrong. The most frustrating moments of my professional life are when I keep staring at my top screen, seeing all CPU cores consumed and some processes using 1xxx% CPU, and don’t know what to do. I still have moments like that sometimes.
His words hit me hard. For days afterwards I kept asking myself: what am I doing here? Do whatever developers tell me, like a robot? Live with black-box services forever and just call our developers when something’s wrong?
Yes. I had been doing exactly that for some time. Until one day, one of our critical services had an intermittent issue that nobody could explain, not even its creator.
I spent several days looking at atop output, checking every change in each 10-second sample we had around the time the problem occurred. I took a note of every difference between samples. Somehow I found that the page scan rate was significantly high around the time of the problem, which led me to this great article about NUMA and page scanning. Setting zone_reclaim_mode to 0 made the problem go away.
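For reference, this is roughly what that fix looks like on a typical Linux box (check the current value first):

```
# check the current NUMA zone reclaim setting (non-zero means zone reclaim is on)
cat /proc/sys/vm/zone_reclaim_mode

# disable it, as described above
sudo sysctl -w vm.zone_reclaim_mode=0
```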
I was very happy, and started telling myself that this was exactly what I needed to do more of. I started digging deeper into our system. I was lucky, because at that time we had a system at good scale and a very professional team that could do a lot of optimization. With the help of the developer I mentioned above (he’s one of the coolest people I’ve worked with so far), I learned a lot about filesystems, page cache, huge pages, memory allocators (tcmalloc, jemalloc), and the JVM (GC and the JIT compiler).
But my tweaks usually didn’t work well. I couldn’t measure them precisely, because most of the things in my area of responsibility live inside the kernel. I realized that I just couldn’t keep poking around randomly anymore.
I needed to understand the Linux kernel better. How? Read books. This one and this one. Okay, but the progress was super slooooow. It’s very difficult for me to understand all the concepts, not to mention the code, data structures, and algorithms inside. I needed something more… practical. Something that could give me a better view of my system. In action.
That’s when I found Brendan Gregg’s blog, and I entered the Linux tracing world.
Over the years, thanks to his great work on perf and ftrace, and then the birth of the BPF tools, we can now have a much better view of our systems. We can do things that we (as sysadmins, not kernel developers) could only dream of before.
The NUMA issue I mentioned above can be traced easily with numamove.bt.
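Even without the full tool, a rough one-liner in the same spirit counts NUMA page migrations per process (the kernel symbol varies by version; newer kernels use migrate_misplaced_folio instead):

```
sudo bpftrace -e 'kprobe:migrate_misplaced_page { @migrations[comm] = count(); }'
```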
Want to know the per-process page cache hit/miss ratio? Use cachetop.
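For example (the exact path depends on how your distro packages bcc-tools):

```
# page cache hits/misses per process, refreshed every 5 seconds
sudo /usr/share/bcc/tools/cachetop 5
```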
Need to profile a Java application? No problem, perf can help you this way.
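Roughly, that workflow looks like the sketch below, assuming the JVM runs with -XX:+PreserveFramePointer, perf-map-agent has exported a /tmp/perf-&lt;pid&gt;.map symbol file, and the FlameGraph scripts are on hand (the PID is a placeholder):

```
# sample on-CPU stacks of a JVM process for 30 seconds
sudo perf record -F 99 -g -p $JAVA_PID -- sleep 30

# with the perf-map-agent symbol file in place, perf can resolve
# JIT-compiled Java frames into a flame graph
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > java-flame.svg
```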
Need to analyze the CPU contention that may affect your service? The bpftrace version of offcputime is only about 30 lines of code.
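Here is a stripped-down sketch of that idea, not the real tool: it assumes finish_task_switch() is kprobe-able on your kernel (on some kernels the symbol carries a suffix like .isra.0), and it skips the PID filtering and user stacks that the full offcputime.bt has:

```
#!/usr/bin/env bpftrace
// Minimal off-CPU time sketch, in the spirit of offcputime.bt.
#include <linux/sched.h>

kprobe:finish_task_switch
{
	// arg0 is the task that just went off-CPU: remember when it left
	$prev = (struct task_struct *)arg0;
	@start[$prev->pid] = nsecs;

	// the current thread is coming back on-CPU: sum its off-CPU time per stack
	$last = @start[tid];
	if ($last != 0) {
		@offcpu_ns[kstack, comm] = sum(nsecs - $last);
		delete(@start[tid]);
	}
}

END
{
	clear(@start);
}
```

Run it for a while against a busy system, and the summed nanoseconds per stack tell you where threads spend their time blocked.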
With the help of tracing tools, you can now know what is on-CPU, what is off-CPU, and what put your process into the off-CPU state!
With the growth of the community, and tools for tracing both user space and kernel space, I now have more things to check instead of banging my head against the top screen. I can know which code is being executed, how it is executed, and why it is executed. If I don’t understand the code myself, I can show my fellow developers information about their code. It’s much better than: “your service is using a lot of CPU right now.”
I believe this answers the question: why do we need to trace, and why do we need to study the “low-level” stuff.
“You should go back and do your job”
Once again. I got this from another developer in my company, when I showed him the flame graph of his application and asked him to check it, because something didn’t look right. And this after he had told me before that I needed to dig deeper instead of just calling him.
That’s true. It’s not my job. I should only care about the kernel part. And once again, I shut up. He kept working on it, but to this day the problem remains a mystery.
2 years later,
Even now, I still think we should trace it :), if it can help solve our problem. If we (both sysadmins and developers) have tried everything and nothing helped, then why not?
perf_events, uprobes, or USDT probes may reveal something that application metrics and logs miss. On-the-fly event tracing can be a very helpful complement to source code analysis.
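For example, something like this shows what probes are available and hooks a function inside an application binary; the binary path and symbol below are made-up placeholders:

```
# list scheduler tracepoints available on this kernel
sudo bpftrace -l 'tracepoint:sched:*'

# count calls to a function inside an application binary
sudo bpftrace -e 'uprobe:/usr/local/bin/myapp:handle_request { @calls[comm] = count(); }'
```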
The best scenario here is to work together. Using Linux tracing tools, I can give developers internal activation records of their processes, with the ability to zoom into individual functions given proper debug symbols. Based on that information, they can check their source code and compare what’s in their head against reality. That is, if they understand the benefits and trust my Linux tracing tools.
The thing is, most of the standard tools are not mine; they were developed by professionals at giant tech companies, for many different use cases. And I understand the mechanism behind them and trust them.
What I need to do now is introduce them to people, and convince them to use them. This answers the question: why do I want to share these Linux tracing tools with everybody?
To understand our systems better, to troubleshoot quicker, to debug more precisely, together with my dear developers.
To provide run-time statistics, so that troubleshooters better than I am can help solve the problem.
I have failed at this so far. Hence this blog.
Conclusion
If you have read this far, I hope you can see why I trace, and that you have your own answers to the other questions I raised at the beginning of this post.
I wrote this draft half a year ago, and never sat down to finish it until now. Since then, other projects have come up and taken a significant amount of my tracing time, but every time I touch tracing again, I’m really in flow. And tracing, in turn, gives me back magic moments that no other work can compare to. So I’ll keep writing tracing stories, whether or not they can (yet) convince or impress others, because tracing does help me solve my problems and, more importantly, it sparks joy in my work.
So, why do I trace? To dig deeper, to understand root causes, to solve hard problems. And for fun.
If you’re interested in those things as well, please follow my other articles on tracing and give me your feedback.