Handling High Load on a Server

Rajat Gupta · Published in #FML · Feb 20, 2019 · 5 min read

This article is the first in a series dedicated to one of my favorite subjects: troubleshooting. I enjoy many things about my job, but it's hard to beat the fun of tracking down a complex server problem when downtime is being measured in thousands of euros per minute. Of course, there are many causes of downtime and many approaches to troubleshooting.

In this article I will start with the most common problem you will run into on Linux servers: a server under high load.

I am assuming you already know what load means in Linux.

The very first thing you want to know is the current load on the server:

$ uptime
21:00:08 up 1 day, 11:31, 1 user, load average: 8.04, 7.48, 8.73

This machine has 4 CPU cores, as you can see here:

$ cat /proc/cpuinfo | grep processor | wc -l
4
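
(A quick aside: on most distributions, nproc or lscpu will give you the same core count with less typing.)

$ nproc
4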

As you can see, the load average is pretty high, and the 5- and 15-minute figures are high as well, which means the server has been under heavy load for quite some time.
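
A rough rule of thumb (a sketch, not an exact science): divide the load averages by the core count; anything consistently above 1.0 per core is worth a closer look. With the numbers from the uptime output above:

$ awk -v cores="$(nproc)" '{printf "1m: %.2f  5m: %.2f  15m: %.2f (per core)\n", $1/cores, $2/cores, $3/cores}' /proc/loadavg
1m: 2.01  5m: 1.87  15m: 2.18 (per core)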

That said, load average alone doesn't tell you what is actually busy: on Linux it counts runnable tasks as well as tasks stuck in uninterruptible sleep (typically waiting on disk I/O), so a high number does not necessarily mean high CPU usage. So the second tool I use is top:

$ top
top - 21:09:31 up 1 day, 11:40, 1 user, load average: 8.17, 6.91, 7.70
Tasks: 134 total, 2 running, 132 sleeping, 0 stopped, 0 zombie
%Cpu(s): 69.5 us, 11.1 sy, 0.0 ni, 17.8 id, 0.0 wa, 0.0 hi, 0.1 si, 1.4 st
KiB Mem: 16468396 total, 4156744 used, 12311652 free, 361752 buffers
KiB Swap: 0 total, 0 used, 0 free. 2154932 cached Mem

As we can see, user CPU time is very high at 69.5%, which is a serious concern: it means some application running in user space is causing the issue.
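
Before going further, it can help to see whether that user time is spread across all cores or pinned to one. If the sysstat package is installed, mpstat breaks CPU usage down per core (a quick sketch; pressing 1 inside top gives a similar view):

$ mpstat -P ALL 1 3
# prints %usr, %sys, %iowait and %idle for every core, once a second, three times;
# if a single core sits near 100% %usr, you are likely looking at one busy thread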

To narrow down the cause of the high load, I will first check whether memory pressure is to blame:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 12385916 361904 2164028 0 0 2 116 34 60 27 6 33 0 34
4 0 0 12393340 361904 2164104 0 0 0 356 3424 5195 27 6 9 0 58
10 0 0 12411256 361904 2164124 0 0 0 856 3902 7649 34 2 0 0 64
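
For a friendlier view of the same memory numbers, free can cross-check what vmstat shows (the -h flag prints human-readable units):

$ free -h
# shows total, used and free memory plus buffers/cache and swap usage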

As we can see, there is no swap usage and plenty of free memory, so memory doesn't seem to be the culprit. Next, let's look at network statistics to see whether network load is high:

$ sar -n TCP,ETCP,DEV 1
Linux 3.16.0-4-amd64 (ip-10-163-128-39) 02/20/2019 _x86_64_ (4 CPU)
08:52:58 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:52:59 AM lo 2.00 2.00 0.11 0.11 0.00 0.00 0.00 0.00
08:52:59 AM eth0 2.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00
08:52:58 AM active/s passive/s iseg/s oseg/s
08:52:59 AM 0.00 0.00 4.00 2.00
08:52:58 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
08:52:59 AM 0.00 0.00 0.00 0.00 0.00
08:52:59 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:53:00 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:53:00 AM eth0 1.00 1.00 0.05 0.64 0.00 0.00 0.00 0.00
08:52:59 AM active/s passive/s iseg/s oseg/s
08:53:00 AM 0.00 0.00 1.00 1.00
08:52:59 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
08:53:00 AM 0.00 0.00 0.00 0.00 0.00
08:53:00 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
08:53:01 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:53:01 AM eth0 3.70 5.56 0.25 1.57 0.00 0.00 0.00 0.00
08:53:00 AM active/s passive/s iseg/s oseg/s
08:53:01 AM 0.00 0.00 3.70 5.56
08:53:00 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
08:53:01 AM 0.00 0.00 0.00 0.00 0.00
Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: lo 0.79 0.79 0.04 0.04 0.00 0.00 0.00 0.00
Average: eth0 1.97 1.57 0.11 0.58 0.00 0.00 0.00 0.00
Average: active/s passive/s iseg/s oseg/s
Average: 0.00 0.00 2.76 2.36
Average: atmptf/s estres/s retrans/s isegerr/s orsts/s
Average: 0.00 0.00 0.00 0.00 0.00
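
(A note in passing: sar comes from the sysstat package and isn't always installed. As a rough fallback, ss -s gives a socket-level summary and /proc/net/dev holds per-interface counters.)

$ ss -s
# summary of sockets in each state (TCP, UDP, etc.)
$ cat /proc/net/dev
# raw per-interface byte and packet counters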

From the sar output, the network is pretty much idle and there are no TCP errors, so this is not the issue either. To dig further, let's check the disk stats:

$ iostat 1
Linux 3.16.0-4-amd64 (ip-10-163-128-39) 02/20/2019 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
61.14 0.00 4.68 0.05 26.70 7.43
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 116.66 6.15 4387.99 1050102 66260046
avg-cpu: %user %nice %system %iowait %steal %idle
70.00 0.00 0.00 0.00 0.00 30.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 124.00 0.00 4100.00 0 100
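
When disk is the suspect, iostat's extended mode is also worth a look; it adds per-device utilization and latency columns (a sketch; exact column names vary a little between sysstat versions):

$ iostat -x 1
# %util - how busy the device is (close to 100% means saturated)
# await - average time in ms each I/O spends waiting, including queue time
# wkB/s - write throughput, the number that stands out on this box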

Here I can see that disk activity is very high: something is writing heavily to disk, and that is what is driving up the load.

The methodology above is known as the USE (Utilization, Saturation, Errors) method, popularized by Brendan Gregg. The idea is to work through each resource first, so that you know where the problem actually lies before blaming a process.
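
As a rough cheat sheet, here are the tools from this walk-through mapped to the resource they cover:

# CPU:     uptime, top, mpstat -P ALL 1
# Memory:  vmstat 1, free -h
# Network: sar -n TCP,ETCP,DEV 1, ss -s
# Disk:    iostat -x 1, pidstat -d 1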

Now I will look at the processes sorted by CPU usage in top:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30064 noc 20 0 3898928 136308 6796 S 266.2 0.8 82:12.62 beam.smp
29782 noc 20 0 4287384 336820 7128 S 80.2 2.0 20:56.22 beam.smp
30095 root 20 0 31336 7836 2212 R 16.8 0.0 168:42.70 pen

I can see my culprit here: PID 30064 is burning more than two and a half cores, and combined with the iostat numbers above, it looks like this is the process writing to disk and causing the high load.
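
top only shows CPU, so it's worth confirming that this PID really is the one doing the writes. If sysstat is installed, pidstat can break disk I/O down per process (iotop is another option where available):

$ pidstat -d 1
# prints kB_rd/s and kB_wr/s per PID every second;
# the process with the big kB_wr/s figure is the one hammering the disk
$ sudo iotop -o
# alternative: interactively show only the processes currently doing I/O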

I will investigate further with strace. Remember, strace is a dangerous tool on a production system: it slows the traced process down dramatically and can, in rare cases, leave it hung or killed. Use it only when you have no other option.

I will capture only the first 100 lines; otherwise it will flood the screen and slow the system down further.

$ strace -tp 30064 2>&1 | head -100
Process 30064 attached
write(.....)
.....

I can see that something is being written in a loop (I can't show the actual content ;-)), and that is what is causing the high disk IOPS.
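
A gentler follow-up is strace's summary mode, which counts syscalls instead of printing each one; attach for only a few seconds, for the same reasons as above:

$ strace -c -p 30064
# let it run for a few seconds, then hit Ctrl-C;
# it prints a table of syscall counts and time spent per call,
# and a table dominated by write() confirms the tight write loop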

Next, I asked the development team whether anything had changed recently, and there was our answer: they had just released a feature that writes data to disk bit by bit, in a loop.

I asked them to fix the bug and release a patch. After the patch went out, the CPU load looked healthy again:

$ top
top - 21:21:19 up 1 day, 11:52, 1 user, load average: 2.62, 5.29, 6.49
Tasks: 136 total, 1 running, 135 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.0 us, 1.1 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.2 st
KiB Mem: 16468396 total, 3946628 used, 12521768 free, 361880 buffers
KiB Swap: 0 total, 0 used, 0 free. 2163620 cached Mem

The gist of the above: we first checked how long the load had been high, then worked out whether it was caused by CPU, memory, disk I/O or the network, and only then looked at why that particular resource was under pressure.

Overall, when debugging an issue, rather than jumping on a process and declaring it guilty, first find which resource is under pressure and then work out why.

I will be publishing more articles on these topics, so stay tuned. Until then, happy debugging. ;-)
