Downloading NGS datasets using Nextflow
An easy-to-use pipeline to download FASTQ files from NCBI or EBI, which also serves as a good advertisement for my favourite workflow manager
A quick start
If you have a Linux machine with Docker and Nextflow installed, you can try the final result right away. Create a text file with a list of SRA accession numbers, one per line, like:
SRR8652866
SRR8652865
SRR8653287
Save it as list.txt, then run the command:
nextflow run telatin/getreads --list list.txt --outdir data \
-profile docker
Nextflow will automatically download the repository with the workflow (from github.com/telatin/getreads), fetch the Docker container with the dependencies, and download the requested samples in parallel, as depicted in the screenshot below.
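If you prefer to prepare the list from the terminal, here is a minimal sketch that writes list.txt and checks it before launching the workflow. The check is my own addition, not part of the pipeline: it assumes every line is a run accession made of SRR, ERR or DRR followed by digits, like the three examples above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the three example accessions, one per line.
cat > list.txt <<'EOF'
SRR8652866
SRR8652865
SRR8653287
EOF

# Collect every line that does not look like a run accession.
# Assumed pattern (not taken from the pipeline): SRR, ERR or DRR plus digits.
# grep exits non-zero when nothing matches, hence the "|| true".
bad=$(grep -v -E '^[SED]RR[0-9]+$' list.txt || true)

if [ -n "$bad" ]; then
    echo "Unexpected lines in list.txt:" >&2
    echo "$bad" >&2
    exit 1
fi

echo "list.txt OK: $(wc -l < list.txt | tr -d ' ') accessions"
# At this point the list is ready for the nextflow command shown above.
```

Catching a stray blank line or a mistyped accession here is cheaper than finding out after Nextflow has already pulled the container.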
What is Docker?
Docker is a popular ecosystem to manage, execute and distribute container images. In this context, it is a way to ensure that a set of tools will work on any machine capable of running Docker.
What is Nextflow?
Nextflow is a workflow language and a task orchestrator. It allows the creation of multistep workflows that keep logic and configuration separate, making them easy to share across different environments (local computers, high-performance computing clusters with schedulers like Slurm or PBS, cloud platforms like AWS or Azure…).
Why this workflow?
One day NCBI went mad with their APIs and broke the existing workflows I was using to retrieve raw data. With Nextflow I was able to draft a workaround in one day, and to me that was a great example of the platform’s flexibility. Exercises aside, I recommend checking out a robust, full-featured pipeline called nf-core/fetchngs. It’s a ⭐️⭐️⭐️ pipeline!
A primer on Nextflow
If this short article made you curious and you’d like to learn how to write a workflow using Nextflow, check out my tutorial, which will guide you through writing a de novo assembly pipeline for bacterial genomes.
See the full schematics of the final result below:
How to run the final example
Again, before following the tutorial and building the pipeline yourself, try running it as shown in the video. Nextflow requires some knowledge both to create pipelines and to execute them in your own environment (resource management, dependencies …), so it’s a good exercise to give it a go!
Let me know if this helped by slapping a star on the GitHub repository of the tutorial!