Parallel. Straight from your command line

Alon Nisser
Feb 4, 2015

Using GNU Parallel: running parallel processing directly from your shell

What is Parallel?

Well, as the name might give away, Parallel is a shell parallelization tool. It is a relatively new addition to GNU and has much of xargs’ functionality built in. But its main feature is taking a job/task/program and running it in parallel. A simple (canonical) example:

parallel echo ::: A B C

would echo

A

B

C

Let’s break this down:

On the right hand side after the ::: we’ve got an “array” of inputs,

and on the left hand side we’ve got the parallel command followed by the command to run with the input from the right side.

The result is equivalent to running:

echo A

echo B

echo C

But... in parallel. Note that since the jobs run in parallel, the order of the output might differ from the order of the inputs.
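If the output order matters, GNU Parallel has a -k (--keep-order) flag. The sketch below is only an illustration: the sleep durations are made up so that the jobs deliberately finish in reverse order.

```shell
# Each job sleeps for its input, so the jobs finish in reverse order.
# -k (--keep-order) still prints the output in the order of the inputs.
ordered=$(parallel -k 'sleep {}; echo slept {}' ::: 0.3 0.2 0.1)
echo "$ordered"
# slept 0.3
# slept 0.2
# slept 0.1
```

Without -k the same command would most likely print the 0.1 line first.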

This isn’t very impressive, so let’s consider the following example (adapted from a real use case and trimmed for brevity). Here I use parallel with a Django management command, doing **long** processing on each account, with the account number as input.

processAccount (){

local account=$1

python manage.py very_long_processing_command "$account" >> "account_output_$account.txt" 2>&1

}

accounts=$(get_accounts) # Trimmed for brevity

export -f processAccount

parallel -I{} -q bash -c 'processAccount {}' ::: $accounts

Let’s break this down:

First I define a bash function that calls the actual command, so I can wrap more logic before or after the actual job if needed. Then I export this function, since parallel calls a bash sub shell that wouldn’t know about the local function without the export. Then I use the -I flag to define the replacement string (where the argument from the right side goes). Although {} is already the default replacement string, I think defining it explicitly is good practice in case some configuration changes it. The -q flag is used when the command needs quoting, and the {} is where each value from $accounts goes.

(I’ve just discovered) The replacement string comes in some baked-in variants for certain functionality, such as {#} for the job number or {/} for the input with its directory path removed (the basename).
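As a quick illustration of these replacement strings, the sketch below echoes several of them side by side. The file paths are hypothetical and do not need to exist, since echo only prints the text.

```shell
# {#}  = job sequence number     {}   = the input as given
# {/}  = basename                {.}  = input without its extension
# {/.} = basename, no extension  {//} = directory part
parallel -k echo 'seq={#} full={} base={/} noext={.} base_noext={/.} dir={//}' \
  ::: /data/in/a.csv /data/in/b.json
# seq=1 full=/data/in/a.csv base=a.csv noext=/data/in/a base_noext=a dir=/data/in
# seq=2 full=/data/in/b.json base=b.json noext=/data/in/b base_noext=b dir=/data/in
```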

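On the right hand side, as before, come the inputs: the variable $accounts holds a whitespace-separated list of account numbers, which the shell splits into separate arguments because the variable is unquoted.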

Parallel runs the ‘processAccount’ function for the accounts **in parallel**, turning the long wait of running the tasks one after another into a much shorter one: if all the jobs run at the same time, roughly as long as the longest task.
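To try the pattern without a Django project, here is a minimal, self-contained sketch. Everything specific in it is an assumption for illustration: the function body is a stand-in (a short sleep and an echo instead of the management command), the account numbers are made up, the OUT_DIR scratch directory is hypothetical, and the accounts live in a real bash array rather than a command's output.

```shell
#!/usr/bin/env bash
# Stand-in for the real job: the sleep and echo replace the Django command.
processAccount() {
  local account=$1
  sleep 0.2
  echo "processed account $account" >> "$OUT_DIR/account_output_$account.txt" 2>&1
}

# Hypothetical values: a scratch directory and three made-up account numbers.
export OUT_DIR=$(mktemp -d)
accounts=(1001 1002 1003)

# Export the function so the bash sub shells started by parallel can see it.
export -f processAccount
parallel -I{} -q bash -c 'processAccount {}' ::: "${accounts[@]}"

# One output file per account.
ls "$OUT_DIR"
```

Quoting the array as "${accounts[@]}" passes each account as its own argument without relying on word splitting.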

Handling constraints

Of course, if we run thousands of CPU/IO/memory-consuming jobs in parallel, we might starve our machine of resources, forcing it to kill processes and eventually break. Other constraints might be the number of connections our DB can take concurrently, or the limits of an API our job is using. How can we solve that? No worries, parallel can handle that.

Let’s take our previous example and add a parameter:

parallel -j 20 -I{} -q bash -c 'processAccount {}' ::: $accounts

The -j or --jobs parameter specifies how many jobs should run in parallel; in our case, no more than 20. --jobs 0 runs as many jobs as possible, and the default is one job per CPU core. --jobs also takes percentages, such as --jobs 200% for two jobs per CPU core, or --jobs 50% for as many jobs as half the number of cores.
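One way to see the job limit at work is the {%} replacement string, which expands to the job slot number. In this sketch (the inputs and the short sleep are made up for illustration) four jobs share two slots, so no slot number above 2 should ever appear.

```shell
# -j 2 allows at most two jobs at a time; {%} is the slot a job ran in.
parallel -j 2 'sleep 0.2; echo "slot {%} handled {}"' ::: a b c d

# Collect the distinct slot numbers that were used.
slots=$(parallel -j 2 'sleep 0.2; echo {%}' ::: a b c d | sort -u)
echo "slots used:" $slots
```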

More limits

Another optional limit is system load: with --load 90% (for example) you set a maximum load for the system, and parallel won’t start another job until the load is beneath this level.

Spacing out job starts is also needed sometimes (to leave system resources for other tasks, for example) and can be achieved with --delay 2 (for example), which makes parallel wait two seconds before starting the next job.
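A small sketch of --delay in action, using made-up inputs and bash's SECONDS counter to time the run. The --load variant is left as a comment, since it would simply block for as long as the machine is busy.

```shell
# Four slots are free, but --delay 1 still spaces the job starts one second apart.
SECONDS=0
parallel -j 4 --delay 1 'echo started {}' ::: a b c
elapsed=$SECONDS
echo "elapsed: ${elapsed}s"

# With a load limit as well (not run here, it waits while the load is too high):
# parallel --load 90% --delay 1 'echo started {}' ::: a b c
```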

Being cautious

Concurrent computing has rightfully earned a notorious reputation; locks and race conditions are among the reasons many of us don’t take it lightly. Also, Parallel is a CLI tool running outside of our program, so using it without reasoning about flow, dependencies, resources, etc. can be dangerous and lead to unexpected results. I believe Parallel is a good choice for running batch tasks or automating standalone jobs, but caution is always advised.
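Two flags that can help with that caution are --dry-run, which prints the commands instead of running them, and --joblog, which writes a record of every job to a file. The sketch below uses a harmless echo and made-up account numbers in place of a real job.

```shell
# Preview: --dry-run prints each command instead of running it.
parallel --dry-run -k echo processing account {} ::: 1001 1002
# echo processing account 1001
# echo processing account 1002

# Audit: --joblog records start time, runtime and exit value for every job.
joblog=$(mktemp)
parallel --joblog "$joblog" echo processing account {} ::: 1001 1002
cat "$joblog"
```

Reviewing the dry-run output before a large batch is a cheap way to catch quoting mistakes.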

Great! How do I get it?

Parallel doesn’t come preinstalled on most Linux distributions, so you need to install it.

On Debian you can simply run:

sudo apt-get install parallel

Other Linux distributions probably also have it in their repos. You can also use the following one-liner on any Linux distro:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
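After installing, it is worth checking which parallel ended up on your PATH, since some systems ship a different program with the same name (the moreutils package has one). A minimal check, assuming GNU Parallel identifies itself on the first line of its version output:

```shell
# Print the first line of the version banner, or a hint if nothing is installed.
if command -v parallel >/dev/null 2>&1; then
  version_line=$(parallel --version 2>/dev/null | head -n 1)
  echo "${version_line:-parallel is installed, but it does not look like GNU Parallel}"
else
  version_line=""
  echo "parallel is not on PATH"
fi
```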

Advanced use cases

Of course I have only scratched the surface of Parallel’s possibilities and options. It has many more features and advanced use cases: running parallel processes on remote machines (with ssh), reading input from a file, multiple arguments, showing progress reports, running interactively, etc. Here is a good source (the examples are from the field of biological research, but relevant to all use cases). The docs, the official tutorial, or the man page are always a good way to learn.
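Two of these features fit in a few lines, so here is a rough sketch with made-up inputs: several ::: input sources produce every combination of their values (addressed as {1}, {2}), and :::: reads the arguments from a file, here a temporary one created just for the example.

```shell
# Multiple arguments: every combination of the two input sources.
parallel -k echo '{1}-{2}' ::: a b ::: 1 2
# a-1
# a-2
# b-1
# b-2

# Reading input from a file: :::: takes one argument per line.
input_file=$(mktemp)
printf '%s\n' x y z > "$input_file"
parallel -k echo item :::: "$input_file"
# item x
# item y
# item z
```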

The aim of this post was to introduce parallel, a GNU command line tool that is not known well enough, and to help you start using it. I hope I succeeded in that.
