How I got tired of waiting and made my life sweeter in 15 minutes

As the headline hints, this story is all about ad-hoc shell scripts. Yes … doing something manually once is okay, but twice? Then you know it will happen again, and you really need to automate it. So, like so many others, I have a collection of shell scripts for various ad-hoc tasks.

Today was one of the days where I had to use one of these scripts. I needed to clear some data from a database. No problem. Find the script, run it, and … wait. I soon realized that this small task would take about twenty minutes. Not enough time to justify a context switch, but certainly too much time to sit idle. Fortunately there is an easy fix: fire up a bunch of terminals and run the script in each with a subset of the data. Done.

When the problem arose again a couple of hours later, I had to, for the second time, split my data up into multiple chunks, launch multiple terminals, and run the scripts again. Well, now I knew that it would certainly happen again, so I really needed to automate this.

Meet Parallelize

What I needed was a tool that, given a shell script and some user ids, could run the script in parallel with each user id, but without overloading the database. Something like

parallelize fix-users.sh 1000 1001 1002 1003 1004

I could of course spend an hour installing GNU Parallel and reading its documentation, but writing a similar tool myself from scratch would probably be faster, so I launched an editor and took this as a great opportunity to write some code in Go.

Running processes in parallel is trivial in Go. Create a single queue containing the jobs. Have a fixed number of goroutines take jobs from the queue and execute the command. Wait for the goroutines to finish. Something like

// ...

func producer(args []string) <-chan string {
    ch := make(chan string)
    go func() {
        for _, arg := range args {
            ch <- arg
        }
        close(ch)
    }()
    return ch
}

func worker(id int, wg *sync.WaitGroup, cmd string, queue <-chan string) {
    for arg := range queue {
        log.Printf("Worker %d executing %s %s", id, cmd, arg)
        if exec.Command(cmd, arg).Run() != nil {
            log.Printf("Failed to execute %s %s", cmd, arg)
        }
    }
    wg.Done()
}

func main() {
    var wg sync.WaitGroup
    cmd := os.Args[1]
    args := os.Args[2:]
    queue := producer(args)
    wg.Add(workerCount)
    for id := 1; id <= workerCount; id++ {
        go worker(id, &wg, cmd, queue)
    }
    wg.Wait()
}

Parsing command line arguments is also fairly easy, once you accept that they don't behave like getopt

var (
    workerCount int
)

// ...

func main() {
    flag.IntVar(&workerCount, "n", 8, "Number of `workers`")
    flag.Parse()
    if workerCount < 1 || flag.NArg() < 2 {
        flag.Usage()
        return
    }
    cmd := flag.Arg(0)
    args := flag.Args()[1:]
    // ...
}

Seriously … what was that? 15 minutes. Sweet.

Testing it was also pretty easy. Just run the sleep command and play around with the number of workers

parallelize -n 3 /bin/sleep 1 2 3 4 5

Once this was done I couldn't resist adding a couple of other features, like a flag to suppress output and an option to read arguments from a CSV file. Feel free to check it out at github.com/madss/parallelize, but be warned: it is probably faster to write it yourself :-)