Getting started with parallelization in Bash

Robert Sandor
7 min read · Jun 9, 2019

Using parallelization in code is a tricky proposition. While it can significantly reduce the amount of time a task takes, it can cause a lot of frustration and confusion if done incorrectly. Recently, I came across a scenario in which I was manipulating hundreds of thousands of files in Bash on a Linux machine, and I found myself agonizing over how long the process was taking while other people were waiting on my results. Though I had worked with parallelization a little before (using the multiprocessing library in Python), I found myself scrambling to figure out how to do the same in Bash to cut down the time for this project. I found some StackOverflow posts with code that used parallelization, but I had to dig a bit further to understand exactly what I was reading and typing. So this guide is meant to help build an intuition for how and when to parallelize in Bash using forks.

If You Come To A Fork In The Code…

So what is forking? Forking is the creation of a new process that looks exactly like the original and executes the same code but with its own memory space and process ID. But what does that mean? Let’s take a step back and look at parallelization in general to get a better sense of that.

There are a few main ways to parallelize code, notably through multiprocessing and multithreading.

The main difference between the two is whether memory is shared (multithreading) or not (multiprocessing). Multithreading has its uses and its pitfalls (deadlock, race conditions, and so on) that I won’t delve into here; I’ll only discuss multiprocessing, especially since multithreading isn’t possible in Bash. Also, not every task benefits from parallelization, and there is overhead associated with creating threads and processes. Typically, code that benefits from parallelization doesn’t depend on the results of previous steps and doesn’t need to run in a particular order.
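As a toy illustration of that independence (sleep stands in for a unit of unrelated work; this isn’t from the real project), three independent one-second tasks take about three seconds run back to back, but only about one second when forked:

#!/usr/bin/env bash
# Toy illustration: independent tasks benefit from running in parallel.
time { sleep 1; sleep 1; sleep 1; }           # sequential: ~3 seconds
time { sleep 1 & sleep 1 & sleep 1 & wait; }  # forked: ~1 second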

So where does forking fit into this? Forking is a form of multiprocessing that only exists in Linux and Unix-based systems, so you won’t see this in Windows. It’s typically considered to be a lower-level interface in comparison to other alternatives like Python’s multiprocessing library, which is a higher-level interface that works across all supported operating systems.

So how exactly do we fork in Bash?
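In its simplest form, it’s just the ampersand: put & after a command and Bash runs it in a forked child process with its own process ID. A minimal sketch (sleep stands in for any command):

#!/usr/bin/env bash
# Minimal sketch: & runs a command in a forked child process.
echo "parent shell PID: $$"
sleep 2 &              # forked into the background
echo "child PID: $!"   # $! holds the PID of the last background process
wait                   # block until the child finishes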

Get It Working, Get It Right…

In the scenario I mentioned earlier, I had to copy hundreds of thousands of specific files that were spread across about as many directories. The file structure looked something like this.

.
+--batch_output_directory/
|  +--batch_1/
|  |  +--unwanted_file.txt
|  |  +--another_unwanted_file.txt
|  |  +--file_i_want_001.txt
|  +--batch_2/
|  |  +--unwanted_file.txt
|  |  +--another_unwanted_file.txt
|  |  +--file_i_want_002.txt
|  +--batch_3/
|     +--unwanted_file.txt
|     +--another_unwanted_file.txt
|     +--file_i_want_003.txt

Initially, I wrote the simple loop below and tested it on a small sample of directories, and it worked.

#!/usr/bin/env bash
# cp_batch_output_files.sh
input_dir=$1     # directory that holds the batch_* subdirectories
output_dir=$2    # directory to copy the matching files into
dir_prepend=$3   # prefix of the subdirectories to search, e.g. batch_
pattern=$4       # substring to look for in the filenames

for dir in "$input_dir""$dir_prepend"*
do
    for file in $(ls "$dir")
    do
        if [[ $file == *$pattern* ]]; then
            cp "$dir/$file" "$output_dir"
        fi
    done
done

This script takes four space-separated arguments on the command line, looks for files whose names contain a particular pattern within the directory structure described above, and copies them to a different directory. To call the script above, I used the command below in the Terminal:

bash cp_batch_output_files.sh input_directory/ output_directory/ batch_ file_i_want_

…Get It Fast

That worked relatively fast on the small number of directories I tested on; however, it didn’t scale particularly well to the actual number of files I was dealing with and took far too long for what I needed. Since copying each file was independent of the others, and nothing relied on the copied files arriving in a particular order, I figured my case might benefit from parallelization. After checking out StackOverflow, I eventually came up with the code below, which I’ll break down.

#!/usr/bin/env bash
# cp_batch_output_files.sh
copy_file() {
    # arguments arrive as positional parameters inside the function body
    local input_dir=$1
    local output_dir=$2
    local pattern=$3
    local dir=$4

    for file in $(ls "$input_dir/$dir")
    do
        if [[ $file == *$pattern* ]]; then
            cp "$input_dir/$dir/$file" "$output_dir"
        fi
    done
}

# this gets the maximum number of processes for the user
max_num_processes=$(ulimit -u)
# an arbitrary limiting factor so that some processes stay free
# in case I want to run something else
limiting_factor=4
num_processes=$((max_num_processes/limiting_factor))

input_dir=$1
output_dir=$2
file_prepend=$3

for dir in $(ls "$input_dir")
do
    ((i=i%num_processes)); ((i++==0)) && wait
    copy_file "$input_dir" "$output_dir" "$file_prepend" "$dir" &
done

The first noticeable aspect of this code is that I abstracted what was previously the inner for loop into a function, which gives me a single unit of work to fork. Note that Bash doesn’t declare parameters inside the parentheses like other languages; the arguments arrive as the positional parameters $1, $2, and so on inside the function body, which I assign to named variables for readability.
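For example, a tiny, hypothetical function (the names here are made up for illustration) that reads its two arguments from the positional parameters might look like this:

#!/usr/bin/env bash
# Hypothetical example: arguments show up as $1, $2, ... inside the body,
# not inside the parentheses after the function name.
describe_copy() {
    local src=$1     # first argument
    local dest=$2    # second argument
    echo "would copy from $src to $dest"
}
describe_copy "batch_output_directory" "collected_files"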

Here I’m going to temporarily skip the portion involving the number of processes to focus on this chunk here:

input_dir=$1
output_dir=$2
file_prepend=$3

for dir in $(ls "$input_dir")
do
    ((i=i%num_processes)); ((i++==0)) && wait
    copy_file "$input_dir" "$output_dir" "$file_prepend" "$dir" &
done

Like before, I take inputs from the command line, but this time I call the function I had created using the line:

copy_file "$input_dir" "$output_dir" "$file_prepend" "$dir" &

Here’s the key part: after passing in the variables that hold the values, I use the ampersand (&) to fork the call, duplicating the process and all of its variables at that moment, so the child has the same information as the original process but its own process ID. This new process runs in the background. And that’s really all you need to parallelize in Bash.
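As a quick aside, here’s a minimal, hypothetical sketch of that same pattern in isolation: fork a function call with &, grab the child’s process ID from $!, and wait on it explicitly if you need to.

#!/usr/bin/env bash
# Hypothetical sketch: fork a function call and track the child's PID.
slow_copy() {
    sleep 1
    echo "finished copying $1"
}
slow_copy "file_i_want_001.txt" &   # & forks this call into the background
pid=$!                              # PID of the forked child process
echo "forked process $pid is running in the background"
wait "$pid"                         # block until that specific child exits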

So Many Things To Do, So Few Processes

So why do I have extra lines, including the following line?

    ((i=i%num_processes)); ((i++==0)) && wait

In my particular case, I had an unusually large number of files, and if I just forked every copy, I would have created far more processes than the computer could or should handle, and the other people using the machine wouldn’t have appreciated it. The line above caps the number of forked processes at num_processes: the counter i wraps back to zero every num_processes iterations, and whenever it does, wait blocks until the current batch of background jobs has finished before the loop launches the next batch.
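To see the throttling in action, here’s a small, self-contained sketch (not from the original project; the numbers are arbitrary) that launches nine one-second jobs in batches of three, so the whole run takes roughly three seconds instead of nine:

#!/usr/bin/env bash
# Hypothetical demo: nine independent 1-second jobs, at most 3 at a time.
num_processes=3
for n in $(seq 1 9)
do
    ((i=i%num_processes)); ((i++==0)) && wait
    ( sleep 1; echo "job $n done" ) &
done
wait   # wait for the final batch before the script exits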

Now that’s fine and dandy, but I (and probably some other people) would like this to go as fast as possible, so what’s the upper limit for how many processes I can run at a time? The answer can be found in the line below:

max_num_processes=$(ulimit -u)

ulimit -u shows how many processes your user can have at a time. If you run ulimit -a, you can see all the limits associated with your login, which may look like this:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2060605
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

If you look at the line max user processes (-u) 4096, you can see the maximum number of processes a user can run at once. I chose to use only a fraction of that to be cautious, but ultimately you should be able to go up to that limit without any issues.
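Plugging in the example numbers above with the limiting factor of 4 from the script, the arithmetic works out like this:

max_num_processes=$(ulimit -u)                         # 4096 in the output above
limiting_factor=4
num_processes=$((max_num_processes/limiting_factor))   # 4096 / 4 = 1024
echo "$num_processes"                                  # up to 1024 forked copies at once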

Conclusion

Hopefully by now you have a general sense of what forking and parallelization are, when they’re worth using, what kind of parallelization is available in Bash, and how to actually do it, so that you can spend the time you save on more interesting and important things. I didn’t cover every detail of parallelization, especially multithreading and the other alternatives to forking in Bash, like xargs and GNU parallel, but this should be enough to get someone unfamiliar with parallelization started. I’ve provided links below for more details.
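For instance, a rough, untested sketch of the same copy task using xargs (assuming GNU xargs; the directory names and the file_i_want_ pattern are the same hypothetical ones used earlier) could look like:

# -P 8 runs up to eight cp processes at a time; -print0/-0 handle odd filenames
find input_directory/ -type f -name '*file_i_want_*' -print0 \
    | xargs -0 -P 8 -I {} cp {} output_directory/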

As usual, please let me know if you have any questions or suggestions to make this better.

And feel free to check out some of the projects I’m working on and learn more about me at these links:

https://robertisandor.github.io/
https://www.linkedin.com/in/robert-imre-sandor-data-science/
