Shell Commands 05: Linux Server Resources

Data Analysis Skills 06

BW L.
Data Engineering Insight
5 min read · Dec 27, 2021


This is the fifth article in the shell commands series on data analysis skills.

When you are using a Linux server that is shared with other users, things can get crazy, and you might often want to know what’s going on in the system when your ssh session is extremely slow. Normally, slowness of a server means the operating system is struggling to keep up with the demand for CPU or memory. This post introduces some commands that can help you find out the resource usage of a Linux server.

Any computer, whether it’s a laptop or a powerful server, has the same basic components that make it a powerful tool: a CPU for computation, memory for fast reads and writes by programs, and disk space that stores the operating system and user files.

Principles of using a shared server

Below are some principles to keep in mind to avoid bringing down a shared server or causing trouble for yourself:

  1. Do not use an excessive amount of memory (over 10 GB). If large memory is needed, you probably want to use a distributed system like Hadoop or BigQuery.
  2. Do not use a lot of CPU threads (more than 10) for a long time (> 1 hour). This can easily happen without you being aware of it. For example, you are training a model that needs lots of parallel computing: tools like h2o can use up all available CPUs when training a tree-based model, or you might be using joblib with Python to increase parallelism. In such cases, consider running h2o on Hadoop or getting a dedicated server for your work (see the sketch after this list for ways to cap parallelism).
  3. Do not save data (a Pandas dataframe, for example) to disk if it has millions of records.
  4. Do not copy someone else’s .bashrc without knowing what it does.
  5. If the server is part of a Hadoop cluster, do not use a Spark driver with more than 5 GB of memory in yarn-client mode. It is a false assumption that “more driver memory = faster processing”.
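
To illustrate principle 2: many numerical libraries respect the OMP_NUM_THREADS environment variable, and you can run long jobs at a lower CPU priority with nice. This is only a sketch; the thread count and script name below are made-up examples:

export OMP_NUM_THREADS=4          # cap threads used by OpenMP-based libraries (NumPy, scikit-learn, etc.)
nice -n 10 python train_model.py  # hypothetical long-running script, run at a lower CPU priority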

CPU and memory usage

The top command

The top command is useful to check what processes are taking a lot of CPU or memory.

Simply type top and press Enter. The terminal becomes interactive and shows the top processes of all users on the system; we’ll call this the “top screen”. You can then sort the processes by CPU or memory usage. Different operating systems may use different subcommands for a given operation inside the top screen, so check the manual on the server with the man top command.
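
As a quick reference, on most Linux servers that ship the procps-ng version of top, you can press P to sort by CPU, M to sort by memory, c to toggle the full command line, u to filter by user, and q to quit (verify with man top, since key bindings differ between versions). A non-interactive snapshot is also handy when you want to share what you see:

top -b -n 1 | head -20    # batch mode, one iteration: a snapshot you can paste into a message to the admins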

You might find someone using 100% of a CPU or 100 GB of memory. You can then contact the user or report it to the system admins.

Use the -u option to check a specific user’s CPU and memory usage. For example, you know that your teammate John, with username john, is using the same server, and you want to check whether he is using a lot of resources.

top -u john

The free command

The free command will quickly tell you how much memory the system has, how much is used, and how much is available. The -g option prints the numbers in GB.
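
For example (on most modern versions of free, -h also works and picks the unit automatically):

free -g    # total, used and available memory in GB
free -h    # human-readable units (MB/GB chosen automatically), if your version of free supports -h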

The ps command

Every process/command on the system has a unique process ID (PID) that identifies it. You can find the PID of a specific process in the top screen, but the top screen by default does not show the full command of that process. You can then use the ps command to find out what exactly a command is.

The ps command is very useful to check details of processes running on a server.

For example, in a Hadoop environment, you find that a Java process is using lots of memory, and its PID on the top screen is 12345. The command below will show the details of PID 12345:

ps -ef | grep 12345

The output on a Red Hat Enterprise Linux server might look like:

john  12345 12344  5 Jan25 ?        03:00:11 java ... org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.executor.memory=20g --conf spark.master=yarn-client --conf spark.driver.memory=80g --num-executors 10 pyspark-shell

The output is tabulated in multiple columns. Because we piped it through grep to match a specific line, the header of the output is not printed; you can run ps -ef | head to see it. In this example, the columns are: UID PID PPID C STIME TTY TIME CMD. The last column is what we care about: the actual command that was run. You can see that this is a Spark job using 20 GB of executor memory and 80 GB of driver memory. Since it is using yarn-client mode, the driver is running on the server! This violates principles 1 and 5 above.
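
Two alternatives supported by the ps shipped with most Linux distributions avoid the grep step (and avoid grep matching its own process):

ps -fp 12345           # full-format listing for PID 12345 only
ps -o args= -p 12345   # print just the command line of PID 12345
ps -ef | head -1       # show the column header row (UID PID PPID C STIME TTY TIME CMD)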

Disk space: the du command

Every server has a limited amount of storage. A typical server on AWS EC2 or Google Compute Engine might have around 100 GB of disk space. On a shared server, user home directories are normally located at /home/user_name. Over time, /home grows, and users are usually not aware of it. When this partition is full, EVERYONE using the server will be unable to run anything. This section introduces the basics of the du command for checking disk usage of the system.

Syntax: du [options] [target folder]

If you just type du without any options, it will recursively print every subdirectory under the current folder, with sizes shown as raw block counts (1 KB blocks on most Linux systems). This is obviously problematic, since you can have many thousands of files and folders under your home directory.

The most commonly used options for the du command are -s (summarize) and -h (human-readable).

du -sh will print the total space used by the current folder in a human-readable format (GB, MB, etc.). You can append a target folder at the end (with a space after -sh, of course) to get the size of that folder. Use /* after a folder path to get the sizes of its first-level subfolders.
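
For instance, assuming user home directories live under /home as described above (sort -h requires GNU coreutils):

du -sh ~                   # total size of your own home directory
du -sh /home/*             # size of each user’s home directory (you may see “Permission denied” for folders you cannot read)
du -sh /home/* | sort -h   # same, sorted smallest to largest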

To check a folder and its subfolders down to a certain level, use the --max-depth option. For example:

du -h --max-depth=2 <folder name>
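
Piping the result through sort quickly surfaces the largest subfolders; /home/john is just an example path, and sort -h needs GNU coreutils:

du -h --max-depth=2 /home/john | sort -h | tail -20    # the 20 largest entries, largest last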

HDFS

If your server happens to be part of a Hadoop cluster, you’ll have access to the Hadoop Distributed File System (HDFS) for storing large files. HDFS has a similar command:

hadoop fs -du -s -h /hdfs/target/folder

This command will print out the total size of the target HDFS path /hdfs/target/folder.
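
Dropping the -s flag lists each child of the folder instead of only the total, which helps locate the subfolder that is actually taking up the space:

hadoop fs -du -h /hdfs/target/folder    # per-subfolder sizes in human-readable units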
