Handling High Load on a Server

Rajat Gupta · Published in #FML · Feb 20, 2019 · 5 min read

This article is the first in a series dedicated to one of my favorite subjects: troubleshooting. I enjoy many things about my job, but it's hard to beat the fun of tracking down a complex server problem when downtime is being measured in thousands of euros per minute. Of course, there are many causes of downtime and many approaches to troubleshooting.

In this article I will start with the most common problem you will run into on Linux servers: a server under high load.

I am assuming you already know what load means in Linux.

The very first thing you want to know is the current load on the server:

$ uptime
21:00:08 up 1 day, 11:31, 1 user, load average: 8.04, 7.48, 8.73

This machine has 4 CPU cores, as you can see here:

$ cat /proc/cpuinfo | grep processor | wc -l
4
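
(A quick aside: on most distributions, nproc or lscpu will give you the same core count with less typing.)

$ nproc
4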

As you can see, the load average is pretty high, and the 5- and 15-minute figures are high as well, which means the server has been under heavy load for quite some time.
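
A rough rule of thumb (a sketch, not an exact science): divide the load averages by the core count; anything consistently above 1.0 per core is worth a closer look. With the numbers from the uptime output above:

$ awk -v cores="$(nproc)" '{printf "1m: %.2f  5m: %.2f  15m: %.2f (per core)\n", $1/cores, $2/cores, $3/cores}' /proc/loadavg
1m: 2.01  5m: 1.87  15m: 2.18 (per core)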

That said, load average alone doesn't tell you what is actually busy: on Linux it counts runnable tasks as well as tasks stuck in uninterruptible sleep (typically waiting on disk I/O), so a high number does not necessarily mean high CPU usage. So the second tool I use is top:

$ top
top - 21:09:31 up 1 day, 11:40, 1 user, load average: 8.17, 6.91, 7.70
Tasks: 134 total, 2 running, 132 sleeping, 0 stopped, 0 zombie
%Cpu(s): 69.5 us, 11.1 sy, 0.0 ni, 17.8 id, 0.0 wa, 0.0 hi, 0.1 si, 1.4 st
KiB Mem: 16468396 total, 4156744 used, 12311652 free, 361752 buffers
KiB Swap: 0 total, 0 used, 0 free. 2154932 cached Mem

As we can see, user CPU time is very high at 69.5%, which is a serious concern: it means some application running in user space is causing the issue.
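
Before going further, it can help to see whether that user time is spread across all cores or pinned to one. If the sysstat package is installed, mpstat breaks CPU usage down per core (a quick sketch; pressing 1 inside top gives a similar view):

$ mpstat -P ALL 1 3
# prints %usr, %sys, %iowait and %idle for every core, once a second, three times;
# if a single core sits near 100% %usr, you are likely looking at one busy thread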

To narrow down the cause of the high load, I will first check whether memory pressure is to blame:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 12385916 361904 2164028 0 0 2 116 34 60 27 6 33 0 34
4 0 0 12393340 361904 2164104 0 0 0 356 3424 5195 27 6 9 0 58
10 0 0 12411256 361904 2164124 0 0 0 856 3902 7649 34 2 0 0 64
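
For a friendlier view of the same memory numbers, free can cross-check what vmstat shows (the -h flag prints human-readable units):

$ free -h
# shows total, used and free memory plus buffers/cache and swap usage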

As we can see, there is no swap usage and plenty of free memory, so memory doesn't seem to be the culprit. Next, let's look at network statistics to see whether network load is high:

$ sar -n TCP,ETCP,DEV 1
Linux 3.16.0-4-amd64 (ip-10-163-128-39) 02/20/2019 _x86_64_ (4 CPU)
08:52:58 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:52:59 AM lo 2.00 2.00 0.11 0.11 0.00 0.00 0.00 0.00
08:52:59 AM eth0 2.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00
08:52:58 AM active/s passive/s iseg/s oseg/s
08:52:59 AM 0.00 0.00 4.00 2.00
08:52:58 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
08:52:59 AM 0.00 0.00 0.00 0.00 0.00
08:52:59 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:53:00 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:53:00 AM eth0 1.00 1.00 0.05 0.64 0.00 0.00 0.00 0.00
08:52:59 AM active/s passive/s iseg/s oseg/s
08:53:00 AM 0.00 0.00 1.00 1.00
08:52:59 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
08:53:00 AM 0.00 0.00 0.00 0.00 0.00
08:53:00 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:53:01 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:53:01 AM eth0 3.70 5.56 0.25 1.57 0.00 0.00 0.00 0.00
08:53:00 AM active/s passive/s iseg/s oseg/s
08:53:01 AM 0.00 0.00 3.70 5.56
08:53:00 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
08:53:01 AM 0.00 0.00 0.00 0.00 0.00
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: lo 0.79 0.79 0.04 0.04 0.00 0.00 0.00 0.00
Average: eth0 1.97 1.57 0.11 0.58 0.00 0.00 0.00 0.00
Average: active/s passive/s iseg/s oseg/s
Average: 0.00 0.00 2.76 2.36
Average: atmptf/s estres/s retrans/s isegerr/s orsts/s
Average: 0.00 0.00 0.00 0.00 0.00
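
(A note in passing: sar comes from the sysstat package and isn't always installed. As a rough fallback, ss -s gives a socket-level summary and /proc/net/dev holds per-interface counters.)

$ ss -s
# summary of sockets in each state (TCP, UDP, etc.)
$ cat /proc/net/dev
# raw per-interface byte and packet counters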

From the sar output, the network is pretty much idle and there are no TCP errors, so this is not the issue either. To dig further, let's check the disk stats:

$ iostat 1
Linux 3.16.0-4-amd64 (ip-10-163-128-39) 02/20/2019 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
61.14 0.00 4.68 0.05 26.70 7.43
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 116.66 6.15 4387.99 1050102 66260046
avg-cpu: %user %nice %system %iowait %steal %idle
70.00 0.00 0.00 0.00 0.00 30.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 124.00 0.00 4100.00 0 100
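
When disk is the suspect, iostat's extended mode is also worth a look; it adds per-device utilization and latency columns (a sketch; exact column names vary a little between sysstat versions):

$ iostat -x 1
# %util - how busy the device is (close to 100% means saturated)
# await - average time in ms each I/O spends waiting, including queue time
# wkB/s - write throughput, the number that stands out on this box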

Here I can see that disk activity is very high: something is writing heavily to disk, and that is what is driving up the load.

The methodology above is known as the USE (Utilization, Saturation, Errors) method, popularized by Brendan Gregg. The idea is to work through each resource first, so that you know where the problem actually lies before blaming a process.
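
As a rough cheat sheet, here are the tools from this walk-through mapped to the resource they cover:

# CPU:     uptime, top, mpstat -P ALL 1
# Memory:  vmstat 1, free -h
# Network: sar -n TCP,ETCP,DEV 1, ss -s
# Disk:    iostat -x 1, pidstat -d 1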

Now I will look at the processes sorted by CPU usage in top:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30064 noc 20 0 3898928 136308 6796 S 266.2 0.8 82:12.62 beam.smp
29782 noc 20 0 4287384 336820 7128 S 80.2 2.0 20:56.22 beam.smp
30095 root 20 0 31336 7836 2212 R 16.8 0.0 168:42.70 pen

I can see my culprit here: PID 30064 is burning more than two and a half cores, and combined with the iostat numbers above, it looks like this is the process writing to disk and causing the high load.
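
top only shows CPU, so it's worth confirming that this PID really is the one doing the writes. If sysstat is installed, pidstat can break disk I/O down per process (iotop is another option where available):

$ pidstat -d 1
# prints kB_rd/s and kB_wr/s per PID every second;
# the process with the big kB_wr/s figure is the one hammering the disk
$ sudo iotop -o
# alternative: interactively show only the processes currently doing I/O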

I will investigate further with strace. Remember, strace is a dangerous tool on a production system: it slows the traced process down dramatically and can, in rare cases, leave it hung or killed. Use it only when you have no other option.

I will capture only the first 100 lines; otherwise it will flood the screen and slow the system down further.

$ strace -tp 30064 2>&1 | head -100
Process 30064 attached
write(.....)
.....

I can see that something is being written in a loop (I can't show the actual content ;-)), and that is what is causing the high disk IOPS.
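
A gentler follow-up is strace's summary mode, which counts syscalls instead of printing each one; attach for only a few seconds, for the same reasons as above:

$ strace -c -p 30064
# let it run for a few seconds, then hit Ctrl-C;
# it prints a table of syscall counts and time spent per call,
# and a table dominated by write() confirms the tight write loop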

Next, I asked the development team whether anything had changed recently, and there was our answer: they had just released a feature that writes data to disk bit by bit, in a loop.

I asked them to fix the bug and release a patch. After the patch went out, the CPU load looked healthy again:

$ top
top - 21:21:19 up 1 day, 11:52, 1 user, load average: 2.62, 5.29, 6.49
Tasks: 136 total, 1 running, 135 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.0 us, 1.1 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.2 st
KiB Mem: 16468396 total, 3946628 used, 12521768 free, 361880 buffers
KiB Swap: 0 total, 0 used, 0 free. 2163620 cached Mem

The gist of the above: we first checked how long the load had been high, then worked out whether it was caused by CPU, memory, disk I/O or the network, and only then looked at why that particular resource was under pressure.

Overall, when debugging an issue, rather than jumping on a process and declaring it guilty, first find which resource is under pressure and then work out why.

I will be publishing more articles on these topics, so stay tuned. Until then, happy debugging. ;-)
