Building datasets with Bash

How to build and structure your dataset with Snakemake

Sjoerd de haan
7 min read · Dec 5, 2023
Image by author

Organize the creation process

Did you ever think about creating a dataset? I am sure you have some ideas on what to build, but have you also thought about how to structure such a project?

While generating data does not have to be hard, it can be a challenge to keep your workflow organized.

Ideally, you design a creation process with distinct steps, and those steps should be repeatable.

In this article I show how to use Bash and Snakemake to build a dataset.

Why build a dataset?

Data creation is more popular than ever, and for good reasons.

While training models has become easy in the last ten years, data quality and availability remain an issue. Synthetic, scraped or simulated data can be a solution.

My personal motive for building datasets is that it helps me connect to the system at hand in a different way. Generating data requires a different level of understanding and a different way of thinking.

Why use Bash and Snakemake?

The example dataset that we will produce in this article could have been produced with a script in R or Python. I deliberately picked Bash and Snakemake. Why?

Figure by author

I chose Bash because the command line is a great environment for working with data. Endowed with a large set of small (GNU) tools, the shell offers unparalleled versatility. The dataset in this article is created with no more than ten lines of Bash code.

Snakemake adds to the versatility of the command line by gluing together data science workflows across languages (R, Python, Bash, Julia, Rust-script) and across environments (conda, Docker). It can combine any number of scripts and environments and execute them locally or remotely.
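As a taste of what that looks like, here is a minimal, hypothetical rule that runs an R plotting script inside a conda environment. The file names are made up, and the environment is only activated when Snakemake is run with --use-conda:

rule plot:
    input:
        "results/predictions.csv"
    output:
        "results/plot.png"
    conda:
        "envs/r-ggplot2.yaml"   # hypothetical conda environment file
    script:
        "scripts/plot.R"        # hypothetical R script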

Gluing tasks together

Suppose you want to scrape some data in Python, send it through a deep learning model in a Docker container, and make beautiful ggplot2 plots in R. You could use Make, a Bash script or a Python script to glue the steps together, but none of these provide a smooth ride.

Bash is hard to read and maintain. Python and R would need many lines of code to determine what part of the workflow to (re-)run.

Make overcomes this by sorting out which files need to be built first. However, Make is not intuitive to write and it has no support for containers or conda environments.

Snakemake glues together all the steps with little effort. That’s why it is worth using Snakemake for your projects.

Thinking about your own dataset

The latest wave of AI startups uses generative AI to generate data that could not be generated before.

There are many other ways to create data, though. Here is a list to get you thinking about creating your own dataset:

  1. Web scraping
  2. Document scraping (OCR and such)
  3. Accessing APIs
  4. Sensors (there are some good smartphone sensor apps)
  5. Electronics and robotics (I did some IMU / GPS sensor fusion on a microcontroller)
  6. Simulation (Physics, Chemistry, Biology, Statistics, Reinforcement learning)
  7. Processing / enriching existing datasets (Augmentations, joining datasets)
  8. System logging (server logs, networking, security)

In this article I build a simple system logging dataset.

Mirror ranking

The dataset that I am going to build for this article is based on the mirror ranking tool reflector.

As part of the Arch Linux ecosystem, reflector measures download speed for mirror servers across the world. It then ranks the servers accordingly.

In my article Measuring mirror throughput with bash I described a tiny script that

  1. Rates mirrors according to download speed
  2. Records network usage at the same time

This is the script:

# Kill the background nethogs process when the script exits
trap 'kill $(jobs -p)' EXIT
# Record per-process network rates every 2 seconds; reflector runs as a python process
nethogs -d 2 -bt | grep python > rates.txt &
# Ask reflector for the 5 highest-scoring mirrors
reflector --score 5 > mirrors.txt

Ranking

The script produces rankings like these:

# mirrors.txt
################################################################################
################# Arch Linux mirrorlist generated by Reflector #################
################################################################################

# With: reflector --score 5 --sort rate
# When: 2023-11-24 08:17:19 UTC
# From: https://archlinux.org/mirrors/status/json/
# Retrieved: 2023-11-24 08:16:59 UTC
# Last Check: 2023-11-24 08:07:33 UTC

Server = https://mirror.osbeck.com/archlinux/$repo/os/$arch
Server = http://mirror.ubrco.de/archlinux/$repo/os/$arch
Server = http://arch.jensgutermuth.de/$repo/os/$arch
Server = http://mirrors.qontinuum.space/archlinux/$repo/os/$arch
Server = http://mirror.moson.org/arch/$repo/os/$arch

Rates

The script produces rates like these:

# rates.txt
/usr/bin/python/703838/1000 13.59 1808.46
/usr/bin/python/703838/1000 14.5596 1717.18
/usr/bin/python/703838/1000 14.6182 1694.34
/usr/bin/python/703838/1000 13.2963 1683.07

Columns: process (program/PID/UID), KB/s (up), KB/s (down).

From a single datapoint to multiple datapoints

The dataset that I am creating differentiates between

  1. Countries: Japan, China, Australia, Indonesia, India
  2. Runs: 0–9

In the next sections we build up the rules to produce the following files (2 file types × 5 countries × 10 runs = 100 files):

ranking-australia-[0...9].txt
ranking-china-[0...9].txt
ranking-india-[0...9].txt
ranking-indonesia-[0...9].txt
ranking-japan-[0...9].txt
rate-australia-[0...9].csv
rate-china-[0...9].csv
rate-india-[0...9].csv
rate-indonesia-[0...9].csv
rate-japan-[0...9].csv

Building the workflow

Figure by author

Snakemake is like GNU Make, but with data science and bioinformatics in mind.

It works by defining rules in a text file called Snakefile. In the next sections we will build up this Snakefile step by step.

1. Constants

Snakemake accepts Python statements, so we start off by defining two constants that we can refer to later on:

N_RUNS = 10
COUNTRIES = ["japan", "china", "australia", "indonesia", "india"]

Intermezzo: On rules

Every file that we want to generate needs to be covered by a rule.

This is a minimal example:

# Snakefile
rule make_copy:
    input:
        "original.txt"
    output:
        "data.txt"
    shell:
        "cp {input} {output}"

The rule declares how to produce the file data.txt from original.txt.

We can trigger this rule from the command line with

snakemake -c1 make_copy

The flag -c1 tells snakemake to use just one core.

In general, a rule will be triggered when

  1. One of the output files is not present
  2. One of the input files has changed
  3. The code for generation has changed
  4. A (conda) environment has changed
  5. Some other dependency has changed, e.g. a parameter
  6. An explicit trigger from another rule
  7. A manual trigger from the command line
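If you want to see in advance whether a rule would be triggered, a dry run lists the jobs Snakemake would execute without actually running anything (-n is short for --dry-run):

snakemake -n make_copy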

2. A rule to define target files

Back to our dataset. We add the following rule, which tells Snakemake what files should be created. It doesn’t tell how to create them:

rule create_data:
    input:
        expand("results/data/example-dataset/{type}-{country}-{n}.txt", type=["ranking", "rate"], n=range(N_RUNS), country=COUNTRIES),

Running

snakemake -c1 create_data

will trigger this rule. Snakemake will look for other rules to generate the target files.

Note the expand function here. It is a helper function that comes in handy in many places.
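expand builds one file path per combination of the values you pass in. A small illustration with shortened lists (the order of the combinations may differ):

expand("rate-{country}-{n}.txt", country=["japan", "india"], n=range(2))
# -> ["rate-japan-0.txt", "rate-japan-1.txt",
#     "rate-india-0.txt", "rate-india-1.txt"]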

Figure by author

3. A wildcard rule

Now we add a rule telling how to create the above target files:

rule rank_country:
    output:
        "results/data/example-dataset/ranking-{country}-{n, \d+}.txt",
        "results/data/example-dataset/rate-{country}-{n, \d+}.txt",
    shell:
        """
        #!/bin/bash
        trap 'kill $(jobs -p)' EXIT
        nethogs -d 1 -bt | grep python > {output[1]} &
        reflector --country {wildcards.country} --score 20 --sort rate > {output[0]}
        """

When this rule is triggered, it spawns a bash script that creates the files.
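You can test the wildcard matching by requesting a single output file. Snakemake fills in {country} and {n} from the file name and runs the rule once:

snakemake -c1 results/data/example-dataset/ranking-japan-0.txt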

4. Further processing

We can process the rate files further, turning them into CSV format:

rule rate_to_csv:
    input:
        "results/data/example-dataset/rate-{country}-{n, \d+}.txt"
    output:
        "results/data/example-dataset/rate-{country}-{n, \d+}.csv"
    shell:
        """
        awk -F "\\t" -v OFS="," 'BEGIN {{print \"up,down\"}} {{print $2, $3}}' {input} > {output}
        """

rule rate_csv:
    input:
        expand("results/data/example-dataset/{type}-{country}-{n}.csv", type=["rate"], n=range(N_RUNS), country=COUNTRIES),

The awk script prints a header for the table and drops the first column of the nethogs output. It then prints columns 2 (up) and 3 (down) with a comma in between.
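Applied to the rates.txt sample shown earlier (assuming the nethogs fields are tab separated, as the -F "\t" separator suggests), the resulting CSV looks roughly like this:

# rate-<country>-<n>.csv
up,down
13.59,1808.46
14.5596,1717.18
14.6182,1694.34
13.2963,1683.07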

5. Rule for debugging

For debugging I used the shell command below. I have put it into a rule for completeness.

rule show_rate_lengths:
    shell:
        "wc -l results/data/example-dataset/rate*.csv | sort -n"

Figure by author

6. Building the dataset

The dataset can be built with a simple command:

snakemake -c1 rate_csv

Snakemake now builds a graph (DAG) of dependencies.

It looks for missing targets, for input files that have changed and for other dependencies that have changed. Then it decides which rules need to be executed.
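You can inspect that graph yourself: Snakemake can print it in Graphviz format, and the dot tool (from Graphviz, if installed) turns it into an image:

snakemake --dag rate_csv | dot -Tpng > dag.png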

After the dataset is built, we are ready to work with it.

Here are the recorded download rates for India for 3 runs:

Figure by author

And here you see averages and variability over 10 runs per country:

Figure by author

Conclusion

On the command line, data can often be inspected and manipulated with just a couple of one-liners.

If we glue those one-liners together with a general scripting language, we end up re-computing every step of the workflow for each change that we make.

We could use GNU Make to manage workflows, but then we would still have to write boilerplate code to manage conda environments and containers.

Snakemake handles all of these aspects of data workflows and many more, like parallelism and distributed computing.

Snakemake uses an intuitive language that is a mixture of Python and YAML. It is natural to read and easy to learn.

Perfect for building datasets!

Learn more

Measuring mirror throughput with bash

Snakemake documentation

I am going to write two more articles on this topic:

  1. A challenge on ranking statistics, based on a similar dataset
  2. A tutorial on Snakefile rules

Follow me to get notified of the next episode.

Feel free to connect with me on LinkedIn. I love it when you include a small message.
