Bringing Sanity to System Performance Analysis
We’re a small team, about 30-ish engineers, and we process over 250,000 writes per second into our system on an average day. At this traffic level, we routinely see mysterious performance problems, so it’s important that our engineers have a better handle of how a system performs than the average web developer.
Since we emphasize a learning culture, we run regular training sessions and have processes for pushing around information and experience. One such document that came out of this is a quick and dirty guide to system performance analysis, which may be of use (or entertainment value) to some of you poor souls wondering how Mongo could be claiming more memory than exists in the machine, and whether or not you should be worried.
So you want a rigorous somewhat sane way to debug your application’s performance? Look no further, this post is for you.
Operating Systems & Process Accounting
It’s useful to have an idea of how the operating system manages processes in order to understand the metrics that it outputs.
At a super basic level, the operating system (more precisely, the kernel) is a system for coordinating resources (disk, RAM, CPU, network) between programs. You can think of it as an accounting system for what programs want to run, and an arbiter that decides who gets what and when. A process is the basic unit of thing that can do work.
A process is the manifestation of a program, similar to in OOP an instance is the manifestation of a class. There is some data that’s static, such as the program binary and libraries that its uses, and then data that is created and operated on during runtime, such as the stack and heap.
A CPU can only execute one process at a time, including the kernel, so how does it take control of the machine and give time to another process? There are two mechanisms for doing so. Either the process relinquishes control (syscalls), or the CPU forcibly stops execution and hands control to the kernel (interrupts).
Syscalls are generally functions that perform some sort of resource allocation for the process. If a process wants to open a file, then it calls the C open function, which relinquishes control to the kernel. The kernel does its filesystem magic, and returns to the process — maybe. It is possible that the kernel may decide that the best use of the system is not to return control to the process, but rather let another process have a chance at running. Kernels are typically designed to return control back to its processes in a fair way.
Interrupts are usually triggered by hardware events. The most common peripheral interrupts would be keyboard, network, and mouse events. Anytime you hit a key, the CPU stops the current process and runs a special function that the kernel has registered with the CPU for handling keypresses.
In addition to this, a super important interrupt is the kernel scheduler interrupt. In order to prevent a process from going into an infinite loop and never making a syscall, the kernel sets up periodic interrupts at an interval called the scheduler time slice or quantum. This value is typically on the order of tens or hundreds of microseconds.
Once the kernel has control, it needs to decide what to do next. Interactions with hardware usually happen in some asynchronous fashion, so if a process makes an open call, the kernel may send a bunch of commands to the disk controller, then have to wait for the controller to respond (which it does via an interrupt). In the meantime, the kernel may as well put another process onto the CPU, or perhaps send some data out on the network, or draw something on the screen.
To do its job, the kernel then needs to keep track of the state of every process. Not all processes actually need to be running on the CPU all the time. In fact, most don’t. They’re waiting for input either from a human or from some hardware resource. The kernel models this as states that a process can be in: running, runnable, uninterruptible wait, sleep, zombie (I’m probably missing something). A process that’s running is actually on the CPU, whereas a process that’s runnable is not on a CPU, but could absolutely do work if it got CPU time. In other words, it’s not waiting on a syscall to return. A process in uninterruptible wait is usually waiting for disk I/O, so I will interchangeably call this disk wait; network I/O is considered interruptible, and thus does not count towards uninterruptible wait. Finally a process that’s sleeping is not runnable and not waiting on disk, so is typically waiting for user input, in a sleep loop, or waiting on network.
A Detour Into Memory Management
There are two major memory utilization optimizations that modern kernels do. The first is that they recognize that a lot of code is shared. For example, two programs that load in the same library don’t both have to be holding copies of their code, since that’s static data. The second thing is that the sum total of memory all processes need can exceed the amount of actual RAM in the machine if the kernel is clever about how it allocates those physical pages of memory.
The memory model that the kernel and CPU expose to programs is that they have a never ending contiguous block of bytes starting at address 0. This is a problem if address 0 is actually RAM’s first byte, because many processes would think they have access to that location. So what the kernel does is this sleight of hand. It maintains a map of (process, virtual memory location) → physical location in RAM. So it might say: process #2’s virtual addresses 0 through 4095 map to physical RAM locations 26624 to 30719. When the process tries to access byte 4000 in its own memory space, it actually ends up accessing 30624 (26624 + 4000). But how does it accomplish this? Remember that once a program is on the CPU, it stays on the CPU until it relinquishes it or there’s an interrupt. It would be a large amount of overhead if the process were to make a syscall for every single memory access.
Instead, the CPU provides some hardware support in the form of a Translation Lookaside Buffer. The idea is that the kernel tells the CPU, ok I’m going to run process #2 and I want you to map its virtual memory access for bytes 0–4095 to physical memory at 26624–30719. That set of mappings is called the TLB, and when the process accesses location 4000, the CPU does the work of mapping that to physical RAM, completely bypassing the kernel.
This only works, though, if the TLB can account for all of a processes memory. In reality, the TLB isn’t very big, so what happens is when the process accesses a location that isn’t in the TLB, called a miss or minor page fault, the CPU generates an interrupt and asks the kernel to give it a new mapping that satisfies the process’s access.
Actually, that’s a lie. Modern CPUs have the ability to hold the page mapping themselves, called a page table, in an area that’s not as fast as the TLB, but also not as slow as calling the kernel. (I think the page table is held in regular RAM?)
Ok so what’s this clever thing the kernel can do? If a process accesses memory that isn’t in the TLB and also isn’t in the page table, this generates what’s called a major page fault. It’s simply another interrupt that runs a little bit of kernel code. Now the kernel has a chance to modify the page table and then tell the CPU to retry the memory access. What the kernel can do is maintain a mapping of virtual memory locations to blocks on a disk, so that on a major page fault, it can load in a block of data from disk into memory, update the page table, then tell the CPU to retry the access. There’s nothing that says the number of blocks the kernel could potentially allocate to memory has to be constrained by the size of memory, so the kernel can in essence fake a large addressable space by backing some of it by disk. That’s what swap is.
So that’s how paging works. The reason for the detour should be clear: page faults are an important metric because accessing the page table is orders of magnitude slower than accessing the TLB, and accessing disk is in turn orders of magnitude slower than accessing the page table.
Shared Memory Regions & Memory Mapped Files
This isn’t too important from performance debugging purposes, but it’s nice to know. The kernel is smart enough to know that some programs share a bunch of data that can’t be modified, normally binary object code like the executable itself. Processes can get assigned read-only chunks of memory (called pages) that are shared with other processes. In some cases, these pages can be marked copy-on-write, which means that the kernel thinks the processes aren’t going to modify the pages, but if one tries to then it makes a copy of the page before allowing the process to start messing with it.
When you run top, you’ll see VIRT, RES, and SHR columns. VIRT is the total size of the virtual address space assigned to a process, which is not necessarily the amount of physical RAM it’s taking. RES is the amount of physical RAM it’s taking. SHR is the amount of physical RAM that’s shared between processes.
Memory mapping is a really handy technique for when a process wants to load in a really big file. The way file access naively works is that the process will “open” a file, getting a file descriptor, then make a series of “read” calls getting data one chunk at a time. The “open” and each of the “reads” is a syscall, which causes the process to lose control, and the kernel to take over and reschedule things. Then the kernel reads in the data into a buffer, and then copies the data from that buffer into a buffer that’s owned by the process. Lots of wasted work here.
A memory mapped file is a way for the process to say, “hey kernel why don’t you just map the entire file into my virtual address space.” So the kernel then uses the same mechanism that it uses for major page faults and swap to assign all the data to the process in one go. That’s a single syscall, and requires no copying of data from buffer to buffer.
Mongo (previous to WiredTiger) works this way, as does Varnish. This is why you’ll see Mongo taking enormous amounts of VIRT memory, far exceeding the physical RAM on the machine. Do not be alarmed. The kernel treats Mongo’s access to data that’s not in RAM the same way as it handles all its major page faults, so this happens seamlessly. For Mongo, page faults are a good indicator of high load. The mongostat tool, in fact, measures page faults.
Filesystem Cache & Free Memory
If the machine has available RAM, the kernel does something smart and just copies data from disk into RAM so that if a program accesses it then it might be handy. This is called the filesystem cache. If a process needs the RAM that has been allocated to the cache, the kernel simply evicts cached data.
The swap line in “top” on our systems usually says 0k total, meaning we have swap disabled on our machines. We don’t want our machines ever hitting swap since if it does then it’s probably unresponsive. The only useful number is “cached” which is the fs cache just described.
In top, the mem line shows: total, used, free, and buffers. Total is the total amount of physical RAM, used is what’s used by all the programs (note this isn’t the sum of RES and SHR, because SHR is shared), free is free RAM, and buffers is space devoted to handling socket connections and copying between processes.
Threads & Context Switching
Processes get separate address spaces, so they can not interfere with each other by design. But sometimes you want them to interfere with each other, like you want them to coordinate through some shared memory region, so the kernel provides specialization of a process called a thread. A thread, to the (Linux) kernel, is a process that shares its page tables with another process, typically a child. I say (Linux) because in other OS implementations, there may be other distinctions, like how threads get scheduled. Linux has a 1:1 scheduler, which means that it handles putting threads on a CPU in the same way as processes.
The act of moving a process or thread onto a CPU is called context switching. A bunch of memory related things happen. First, the context switch is triggered by an interrupt or syscall to the kernel. The kernel runs under a special privileged mode so that it can handle interrupts, manipulate the TLB, and mess with hardware. There’s some cost associated with the switch into this mode. Next, when the kernel puts a new thread or process onto a CPU, it has to invalidate the old TLB. The CPU also evicts instructions from its fast L1 and L2 caches. L caches are small amounts of super fast memory that sits on the CPU chip. It’s faster than regular RAM because it doesn’t have to travel down the same pathway as data in RAM does to get to the CPU. So if you have a small amount of code that sits entirely in L1 or L2 cache, then that code can potentially run blazingly fast (many data compressors such as lz4, snappy and blosc are designed to fit in L1 cache). Context switches flush these caches.
Thus there is a high cost to context switches and you want to avoid them if possible. The general strategy of avoiding context switching is to batch up work. Rather than calling “read” 1,000 times asking for one byte, it’s better to call “read” once and ask for 1k. Systems like Kafka use these techniques in a smart way.
As a side note, cache locality is a really big deal, so much so that much of what you learn about basic algorithms can turn out to be wrong. For instance, inserting into the front of an array is O(n), whereas inserting into the front of a linked list is O(1). But if the entire array fits inside of L1 cache, then it could be faster than a linked list, which disperses memory around RAM and probably won’t sit entirely inside L1. We throw away constant factors on paper, but in reality they do matter.
Now we’ll get to some actual metrics! By now, you’ve no doubt logged into a machine and ran top and seen something like:
top - 17:29:19 up 2 days, 2:40, 1 user, load average: 0.00, 0.02, 0.05
Tasks: 116 total, 3 running, 113 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 1.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2049700k total, 1213348k used, 836352k free, 105212k buffers
Swap: 0k total, 0k used, 0k free, 703284k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1241 mongodb 20 0 340m 42m 8996 R 1.3 2.1 58:02.93 mongod
8 root 20 0 0 0 0 R 0.3 0.0 2:01.02 rcuos/0
1308 www-data 20 0 52680 5888 468 S 0.3 0.3 0:54.57 nginx
2722 wes 20 0 77664 1740 864 S 0.3 0.1 0:15.14 sshd
2832 wes 20 0 30744 6132 1224 S 0.3 0.3 3:45.79 tmux
11766 ubuntu 20 0 206m 13m 3848 S 0.3 0.7 0:00.17 python
1 root 20 0 24464 2436 1336 S 0.0 0.1 0:00.56 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
What does this all mean?
The load average, on Linux, is the number of processes in the running, runnable, and disk wait state (on some Unixes, disk wait doesn’t count).
- Page faults count as waiting for disk.
- The 3 load numbers you see are 1, 5, and 15 minute averages.
- The letters in the S column are state: R for runnable, S for sleep, D for disk wait.
- A process that’s churning away computing stuff with no syscalls generates a load of 1.
- A process that is not running, but is getting its memory paged in from disk generates a load of 1. Similarly, a process that’s spending all its time writing to a log file, or unable to write to a log file and blocking (disk full and does not detect error) generates a load of 1.
- A process that is waiting on a network response is not in a runnable state and is not waiting for disk, thus is counted as interruptible and generates a load of 0.
The columns of the CPU row:
- us: user time, the amount of time that processes are spending on CPU, not counting work the kernel is doing on the process on behalf of a process.
- sy: system time, the amount of time on CPU the kernel is spending on behalf of servicing process syscalls.
- ni: nice time, amount of CPU time spent on low priority processes. I’ve never used this.
- id: idle time, 100% minus the sum of the rest of the CPU percentages.
- wa: wait time, amount of time spent waiting for disk IO
- hi: hardware interrupt time, amount of time spent servicing hardware interrupts
- si: software interrupt time, what it sounds like
- st: CPU steal. This one is funny. It doesn’t exist on bare metal, but exists in a virtualized environment. The host OS is allowed to take CPU time away from the guest machine, so the guest machine reports this as stolen. You should usually see low st numbers, like < 1%, as long as you’re using adequately sized instances in EC2. High numbers are a problem. (This is a bit tricky with ec2, they do some things that may look like steal is an issue but it’s what they use to throttle your cpu cycle to the instance assigned amount.)
So let’s think about a worker application. If a worker is spending the majority of its time computing something then it will generate a load of 1. If a worker is spending most of its time writing to disk, and none of its time computing anything, it will still generate a load of 1. This is problematic -- how do you tell between the two situations? That’s where the %CPU column comes in. A CPU bound worker will show near 100% CPU (or conversely, near 0% id on the Cpu line). A disk bound worker will show near 0% CPU. A network bound worker will show neither -- CPU will be less than 100%, and load will be less than 1, assuming no other processes running.
- A CPU bound process will generate load 1 and show ~100% CPU
- A disk bound process will generate load 1 and show ~0% CPU
- A network bound process will generate load < 1 and show CPU < 100%.
Thus, a worker that does no disk IO will be either #1 or #3. A database server would look like #2 if it’s serving requests from disk (a bad sign for performance), but #3 if it’s serving requests from memory. We saw Riak do #1, which is super unusual for a database. Usually the network or the disk is the limiting factor, not CPU.
Also note when MongoDB is getting a bunch of requests that are going to disk, its load will skyrocket. This is because it spawns multiple threads, so each one can potentially contribute a load of 1.
Tools & Tricks
top: The most generally useful tool. It gives you CPU and memory usage (hit ‘M’ to sort by resident memory), and load averages. The one thing it lacks is network usage. When you start top up, you should immediately hit ‘1’, which will show all CPU cores, and then hit ‘H’, which will show all threads. Otherwise, top will collapse CPU numbers, as well as threads into their parent process. Collapsed numbers can be misleading. For example, if a process consists of two threads that are doing 60% of CPU each on separate cores, you’ll see that the process is taking 120% of CPU unless you hit ‘H’. Consider too a process that’s two threads, one of which is CPU bound, and the other is 20%. It will also show 120% CPU in the default collapsed view. CPU boundness is really important to know about, because that’s where you should focus your efforts.
bwm-ng (or iptraf): Network monitoring. I prefer bwm-ng because it’s fewer characters to type (vs. iptraf -d eth0), and it also works in small terminals. iptraf requires something like 40 char wide terminal to run, and if you’re like me running tmux with several panes then that’s not an option. The stupid thing about bwm-ng is that it defaults to updating its numbers every half second, which is not useful since traffic is often times bursty. So I immediately hit ‘+’ a bunch of times -- each one adds 0.1 seconds to the period.
vmstat: Shows useful memory stats. Run as “vmstat 1” to get 1 second updates. vmstat will show you overall memory usage and paging behavior. It also tells you how many blocks are being read in/out of disk, and CPU use. The cs column is interesting. That’s the number of context switches per second, and elevated numbers are bad.
strace: Traces process syscalls. Run as “strace -p PID”, and you get a flurry of realtime data about what syscalls a process is taking. If a process appears to be hung, it will sometimes be hung on some particular syscall, so this is how you’d find out. It’s rare this provides any insight, but when it does it’s really useful.
telnet: Sometimes you aren’t able to connect to a database or API service. The first thing to do is to check connectivity to that machine and port, and the easiest way is with telnet. Run as “telnet HOST PORT” and if there’s a clear connection then you’ll see lines like:
Connected to localhost.
Escape character is '^]'.
If there is no network connectivity then the command will hang for some period of time. This usually means either the program you’re connecting to is not up, or the security group isn’t configured to allow access from the machine you’re on.
dig: DNS lookup. Important in our environment because we don’t manage DNS the traditional way and it’s possible for a machine (usually your vm) to not have the correct IP address for a host. You can lookup a host then check its IP address against nodelist.yaml to make sure it’s correct. If not, puppet should fix it. dig and telnet are your goto tools for when it appears like you can’t connect to a remote database or server. Make sure its IP is correct, the remote is up, and telnet works.
mongostat: Tells you what’s going on with mongo on the server. Sample output (apologies for the horrible formatting):
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn set repl time
*54 170 *2126 *0 17 15|0 0 3362g 3380g 146g 12 series_daily2015162:57.7% 0 2|0 0|1 35k 2m 184 series_dailydb SEC 14:10:14
There’s a lot to mongostat that’s more appropriate for another doc, but the one column I want to focus on is “faults”, which in this dump is 12. This is page faults, ie the number of times per second that the OS had to go to disk to lift data into memory. In the case of Mongo, this is an indicator that your working set, ie the amount of data applications need quick access to, has left RAM. Values in the hundreds or thousands are bad, and likely require either resizing the database or redesigning the database model to be more efficient. Just to be clear, this is an artifact of Mongo’s implementation. Page faults is not a good indicator of anything for other databases.
What exactly is a bad page fault number? Since a major fault by definition requires pulling data off disk, you can expect each page fault takes at least one disk seek. Rotational disks seek on the order of 3ms, so a single disk could sustain around 300 seeks per second. Striping a disk doesn’t necessarily improve this number since a disk seek takes the same amount of time, and in fact, it may be slower since you have to seek on every disk in a stripe and so you’re bounded by the slowest disk. SSDs have seek times around 0.1ms.
Mechanical Sympathy is the ability to intuitively know what’s going on with a machine that you develop after years of debugging problems. It’s a term popularized by Martin Thompson, but coined by race car driver Jackie Stewart to describe how drivers become one with their cars. I’ll try to summarize a few things I’ve learned.
- If you try to log into a machine or connect to an API and it hangs, either DNS is pointing to the wrong place, DNS isn’t working (connectivity issue), the remote process isn’t up, the port is blocked, or your route is blocked. If you can telnet to the ssh port (22) then you know the route is clear and DNS is working. If you run dig and it doesn’t work then DNS is broken, and that’s likely a connectivity issue.
- If you log into a machine and typing at the command line is slow or spastic before you hit the enter key, then you’re having network connectivity problems.
- If you log into a machine and typing shows no delay, but then you run a program like “ls” and it hangs, then it’s disk contention. You have two possibilities. Either the disk is messed up, or programs are doing a lot of disk io. If it’s a Mongo server then it’s likely Mongo since it’s hard on the disk. Check load average (hopefully top works!) — it should be high. Similarly, if your database queries are slow and running “ls” on the database server is slow, then you know your working set is too big for RAM and there’s disk contention.
- A machine running at 100% CPU with no disk or network problems generally does not exhibit any behavior you might detect, because most Unix tools or shell commands are I/O bound, not CPU bound.
- If your program is slow but it’s not using a lot of CPU and it’s not doing a lot of disk IO, then you’re getting blocked on external data, most likely a database.
- If your program is hung and you attach strace to it, and you see that it’s stuck on a syscall, then it’s probably waiting for network or disk IO (it may also be waiting on a lock if it’s a multithreaded program).
- If your program runs fine with input size N and then when you try input size 10 * N and it never completes and your system feels sluggish, your OS has likely hit swap and there is now disk contention. It will never complete. Try to redesign the program to use less memory. You can check this with top.
- If your program isn’t outputting a lot to disk, but is waiting on disk, either the disk is full or you’re reading / writing to a ton of files simultaneously (the disk is spending all its time seeking and not actually returning much data).
- Numbers every programmer should know: http://i.imgur.com/k0t1e.png
- Additional resources: http://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools, Brendan goes over a ton of useful linux cli tools for observing performance
- Netflix’s summary of performance tools: https://media.netflix.com/en/tech-blog/linux-performance-analysis-in-60-000-milliseconds
- USE Method for performance analysis: http://www.brendangregg.com/usemethod.html
- Additional resources: This is an amazing blog post on how the kernel manages memory http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/