Downloading multiple samples from an entry in GEO DataSets

“black binders on steel rack” by Samuel Zeller on Unsplash

Openly shared data is invaluable. It gives others a way to test the reproducibility of an analysis and reduces the need for repeated screening experiments. Besides, these data are also an excellent training ground for amateurs like me.

Sometimes, the dataset I want consists of multiple samples. I first clicked all the download links manually, but I soon got lost and forgot which ones I hadn’t downloaded. Thankfully, I realized repetitive tasks like this on a computer can often be automated.

For a dataset with multiple experiments, like this one, I often want to get all the raw data or the supplementary summary tables, and I am going to describe how I did it here.

Batch download of supplementary tables

At the bottom of the GSE series page, there are several files containing metadata for the series. Each of these files contains the details of the experiments, and every entry takes the form ![feature name] = feature description. Luckily, these metadata files all provide a list of download links.

Link for downloading metadata

The link for each supplementary file is provided as !Series_supplementary_file in these files, so we can use a regular expression in awk to find it. The line looks like this: !Series_supplementary_file = ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94883/suppl/GSE94883_D6pos_TM002.txt.gz.

In awk, the regular expression match operator is ~, so $0 ~ "^!Series_supplementary_file" finds the lines containing links. All these lines start with the prefix "!Series_supplementary_file = " (including the space after the equals sign), which we don’t want and which is 29 characters long. Here, substr($0, 30) drops those 29 characters and keeps only the link, which starts at character 30.

# This command shows all the links in terminal
awk '$0 ~ "^!Series_supplementary_file" {print substr($0, 30)}' GSE94883_family.soft
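If you want to convince yourself that 30 is the right offset, a small check like the one below may help. It is only a sketch: the variable names (line, prefix_length, link) are mine, and the sample line is the one quoted earlier rather than something read from the real file.

```shell
# One sample line in the format described above
line='!Series_supplementary_file = ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94883/suppl/GSE94883_D6pos_TM002.txt.gz'

# Count the characters in the prefix we want to drop (should be 29)
prefix_length=$(printf '%s\n' '!Series_supplementary_file = ' | awk '{print length($0)}')
echo "$prefix_length"

# substr($0, 30) keeps everything from character 30 onward, i.e. the link
link=$(printf '%s\n' "$line" | awk '{print substr($0, 30)}')
echo "$link"
```

If the prefix in your file ever differs (an extra space, for example), the count changes and the offset passed to substr() has to change with it.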

If your machine has awk installed (gawk, for example), the command above should print the links. The last step is to pass them to a download command. I use curl for downloading on the command line, so I need a way for awk to communicate with the system and send the links to curl.

The function that lets us run a system command from awk is system(). With it, we can pass each link to curl -O and let the computer do the rest of the downloading.

awk '$0 ~ "^!Series_supplementary_file" {system ("curl -O "  substr($0, 30))}' GSE94883_family.soft
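Before letting system() loose, it can be reassuring to see which commands would run. The sketch below is one way to do such a dry run: print_download_commands is a hypothetical helper name of mine, it reads the metadata from standard input, and it only prints the curl commands instead of executing them. The title line in the demo input is made up.

```shell
# Hypothetical helper: print one curl command per supplementary link
# instead of running it, so the list can be inspected first.
print_download_commands() {
    awk '$0 ~ "^!Series_supplementary_file" {
        printf "curl -O \"%s\"\n", substr($0, 30)
    }'
}

# Demo on two lines that mimic the metadata layout (the title is made up)
printf '%s\n' \
    '!Series_title = made-up title' \
    '!Series_supplementary_file = ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94883/suppl/GSE94883_D6pos_TM002.txt.gz' |
    print_download_commands
```

With the real file, print_download_commands < GSE94883_family.soft shows the commands, and piping that output to sh runs them. Another common route is to print only the links and hand them to xargs -n 1 curl -O, which avoids building command strings at all.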

Batch download of raw data

What if I want the raw data, so I can do the analysis from scratch? The metadata also gives you an SRA accession so you can access it. The link to start with is saved as !Series_relation. It leads to a page that lists all the SRA results, and this page contains a link to the Run Selector, where you can download an accession list. Every accession number is listed in this file.

Link to SRA run selector.

With the list (and SRA-tools), you should be able to download the raw data as .fastq (or any other file format SRA-tools supports) with a little help from awk.

awk '{system ("fastq-dump " $0)}' SRR_Acc_List.txt
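The one-liner above trusts every line of the list. If the file has blank lines or Windows line endings (an assumption on my part, I have not seen this in every download), a slightly more defensive sketch could filter the lines first. print_fastq_dump_commands is a hypothetical helper name, the accessions in the demo are made up, and the commands are only printed, not run.

```shell
# Hypothetical helper: keep only lines that look like run accessions
# (SRR, ERR or DRR followed by digits) and print the fastq-dump command
# for each one instead of running it.
print_fastq_dump_commands() {
    awk '{ sub(/\r$/, "") }   # drop a trailing carriage return, if present
         /^[SED]RR[0-9]+$/ { print "fastq-dump " $0 }'
}

# Demo with made-up accessions, a blank line, a junk line and a CRLF ending
printf 'SRR0000001\n\nnot-an-accession\nSRR0000002\r\n' |
    print_fastq_dump_commands
```

With the real list, print_fastq_dump_commands < SRR_Acc_List.txt shows what would be downloaded, and piping that output to sh starts the downloads.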

Then, we are ready for analysis.