How to create a Nextflow pipeline with all necessary tools for a series of repetitive bioinformatics tasks
This post shows how to build and run a Nextflow pipeline to:
1. Fetch SRA IDs for each sample in an SRA dataset
2. Download the fastq files for each sample
3. Align each sample using STAR
4. Count transcripts using the Rsubread library
It took 1h to write the pipeline and 2h to process more than 40 samples (70 GB) for around $40.
We will use Nextflow, Docker Hub, GitHub, AWS, and Lifebit.
We need a Docker container to use in our Nextflow pipeline. This container ensures we always reuse the same tools, versions, OS, etc., so that our results are reproducible. Sometimes just changing the OS version will update the Boost math libraries used by R, which can affect your end results. For reproducibility, Docker is awesome.
> mkdir star-nf
> touch star-nf/Dockerfile
> touch star-nf/environment.yml
> touch star-nf/nextflow.config
> touch star-nf/main.nf
> cd star-nf
The Dockerfile is the starting point for our Docker container. It builds on the nfcore/base image.
The environment.yml file used to build the Docker image includes all the dependencies we need.
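The contents of both files are not reproduced here, so the following is only a minimal sketch of what they might contain, assuming the usual nfcore/base pattern of building a conda environment from environment.yml. The environment name star-nf, the channel list, and the unpinned package list (derived from the tools named in this post) are assumptions, not the original files.

```shell
# Sketch only: write a hypothetical environment.yml and Dockerfile.
# Run this from inside the star-nf directory. The packages are the Bioconda
# packages for the tools mentioned in this post; versions are left unpinned
# here, but pinning them is what makes the image reproducible.
cat > environment.yml <<'EOF'
name: star-nf
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - entrez-direct          # provides esearch and efetch
  - sra-tools              # provides fastq-dump
  - parallel-fastq-dump
  - star
  - bioconductor-rsubread
EOF

cat > Dockerfile <<'EOF'
FROM nfcore/base
COPY environment.yml /
RUN conda env create -f /environment.yml && conda clean -a -y
ENV PATH /opt/conda/envs/star-nf/bin:$PATH
EOF
```

The PATH entry has to match the name field in environment.yml, otherwise the tools will not be found inside the container.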
The nextflow.config file tells our pipeline to pull and use our container from Docker Hub.
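As a rough illustration, a config along these lines would be enough for that. This is a sketch rather than the original file; it assumes the image tag pprietob/star-nf:latest used in the push command in this post and relies on the standard docker.enabled and process.container settings.

```shell
# Sketch only: write a minimal nextflow.config that runs every process
# inside the container hosted on Docker Hub.
cat > nextflow.config <<'EOF'
// Enable Docker and use the same image for all processes.
docker.enabled = true
process.container = 'pprietob/star-nf:latest'
EOF
```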
Note the container was built and pushed to Docker Hub with these commands:
> docker build -t pprietob/star-nf .
> docker login
> docker push pprietob/star-nf:latest
The main.nf file, which contains the steps of the Nextflow pipeline, looks like this:
We are using the index from iGenomes, which is hosted on AWS S3, and we only have one parameter: the ID of the SRA project we want to analyze.
params.project = "SRP115256"
The pipeline has four steps:
- Get SRA IDs for each sample in the project
- Download the fastq files for each sample
- Align each sample using STAR
- Count transcripts using the Rsubread library
Step 1. Get SRA IDs for each sample in the project
For the first step of the pipeline, we will use the ‘esearch’ utility and write the output to a file called sra.txt.
Second, we will take the content of that output and pipe it into a channel called sraIDs. This channel will be used in the second process of the pipeline.
Third, we add the input parameters to receive the ID of the SRA dataset we want to analyze.
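The shell side of this step could look roughly like the sketch below. It assumes the common Entrez Direct idiom of piping esearch into efetch with the runinfo format, whose first CSV column is the run accession; the function names extract_run_ids and fetch_run_ids are hypothetical and only serve the illustration.

```shell
# Sketch: keep only the run accessions from a runinfo CSV read on stdin.
# The first column of the runinfo table is the run accession (header "Run"),
# so the header line and any blank lines are dropped by the grep filter.
extract_run_ids() {
  cut -d ',' -f 1 | grep -E '^[SED]RR[0-9]+$'
}

# Hypothetical wrapper: needs Entrez Direct (esearch, efetch) on PATH
# and network access. Writes one run ID per line to sra.txt.
fetch_run_ids() {
  esearch -db sra -query "$1" | efetch -format runinfo | extract_run_ids > sra.txt
}

# Usage: fetch_run_ids SRP115256
```

Inside a Nextflow process, the project ID would come from params.project instead of a function argument.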
To get more familiar with the concepts of Nextflow, check the Nextflow documentation.
Step 2. Download the fastq files for each sample
We use the parallel-fastq-dump command, a wrapper around fastq-dump from sra-tools.
The splitText() operator above the pipeline step splits the ID list and sends each line to a new channel called singleSRAId, so each ID is processed in parallel.
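To make the one-task-per-line idea concrete, here is a small dry-run sketch in plain shell. It prints one parallel-fastq-dump command per run ID instead of executing it. The function name, the fastq output directory, the thread count, and the choice of --split-files and --gzip are assumptions, not the options used in the original pipeline.

```shell
# Sketch: print one parallel-fastq-dump command per run ID found in a file.
# Dry run by design: commands are printed, not executed.
fastq_dump_commands() {
  ids_file="$1"
  threads="${2:-4}"
  while IFS= read -r id; do
    [ -n "$id" ] || continue   # skip blank lines
    echo "parallel-fastq-dump --sra-id $id --threads $threads --outdir fastq --split-files --gzip"
  done < "$ids_file"
}

# Usage: fastq_dump_commands sra.txt 8
```

In the pipeline, Nextflow plays the role of this loop: every value emitted on the channel becomes its own task, and the tasks can run on separate machines.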
Step 3. Align each sample using STAR
Output BAM files are sent to a channel called alignReads, which is used in the next pipeline step.
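For orientation, a STAR call for one sample might be assembled as in the sketch below. It assumes paired-end reads named as fastq-dump --split-files writes them (sample_1.fastq.gz and sample_2.fastq.gz) and a coordinate-sorted BAM as output; the function name, the index path, and the thread count are placeholders.

```shell
# Sketch: build the STAR command line for one paired-end sample (dry run).
# Arguments: sample ID, path to the STAR index directory, thread count.
star_command() {
  sample="$1"
  index_dir="$2"
  threads="${3:-8}"
  echo "STAR --runThreadN $threads --genomeDir $index_dir" \
       "--readFilesIn ${sample}_1.fastq.gz ${sample}_2.fastq.gz" \
       "--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate" \
       "--outFileNamePrefix ${sample}."
}

# Usage: star_command SRR0000001 /path/to/star/index 8
```

With this prefix, STAR names the sorted BAM file after the sample, followed by Aligned.sortedByCoord.out.bam.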
Step 4. Count transcripts using the Rsubread library
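A counting step with Rsubread could be sketched as follows. The shell snippet writes a small, hypothetical R script (count.R) that calls featureCounts on the BAM files passed to it. The script name, the counts.txt output file, and the paired-end setting are assumptions; with a GTF annotation, featureCounts summarizes reads per gene by default.

```shell
# Sketch only: write a hypothetical R script that counts reads with Rsubread.
cat > count.R <<'EOF'
# Usage: Rscript count.R genes.gtf sample1.bam sample2.bam ...
library(Rsubread)
args <- commandArgs(trailingOnly = TRUE)
gtf  <- args[1]    # annotation file in GTF format
bams <- args[-1]   # one or more BAM files
fc <- featureCounts(files = bams,
                    annot.ext = gtf,
                    isGTFAnnotationFile = TRUE,
                    isPairedEnd = TRUE)   # assumption: paired-end reads
# One row per feature, one column per BAM file.
write.table(fc$counts, file = "counts.txt", sep = "\t",
            quote = FALSE, col.names = NA)
EOF
```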
How to run this quickly and cheaply?
Select a large spot instance (AWS spot instances cost less than on-demand instances).
Click run and wait for your results. In this case, it took 2h to process more than 40 samples (70 GB) for around $40.
Behind the scenes, 42 instances are running in my AWS account.