How to run Picard tools CollectHsMetrics on an exome sequencing library

Published in

Computational Biology Core

4 min readJan 8, 2020

Picard tools are a set of tools for manipulating files (SAM, VCF, etc.) from high-throughput sequencing experiments. They include several modules for collecting and reporting various sequencing quality metrics. This blog post details how I ran Picard tools’ CollectHsMetrics on an exome sequencing library.

You can find more details on Picard tools on github. According to the documentation, CollectHsMetrics requires the following input:

Alignment file (.BAM or .SAM)
Reference sequence .fasta file
List of target intervals
List of bait intervals

While the alignment and reference .fasta file should be easy to find, the bait and target intervals can be a little more challenging.

Download baits and intervals files

In this particular case, the capture kit used was the Agilent SureSelect Human All Exon v5 kit, so I went here and signed up for an account. After signing in, I was directed to the following landing page, where I clicked on the ‘Find Designs’ tab at the top middle of the page and clicked ‘SureSelect DNA’ from the drop-down menu that appeared.

After logging in, click on the ‘Find Designs’ tab and select ‘SureSelect DNA’ from the drop-down menu.

When I arrived at the next page, I selected the tab for ‘Agilent Catalog’ and checked the box for H. sapiens (hg19) in the left menu. The results table did not populate until I also checked the box for ‘Design Category’ (also in the left menu).

Select the ‘Agilent Catalog’ tab, click the ‘Design Category’ box in the left menu. After the table populates, click the row where ‘Name’ is ‘SureSelect Human All Exon V5’.

After I clicked the link for ‘SureSelect Human All Exon V5’ (highlighted in blue in the above figure), another popup appeared, which had a pull-down menu to indicate which genome build I wanted to view. I selected hg19 and clicked ‘View Design Details’ and clicked the download link that appeared.

Select ‘hg19’ and click the ‘View Design Details’ button.

I downloaded all of the files, since at this point I wasn’t sure which file would be the baits and which would be the intervals.

Select all files using the check boxes and then click the ‘Download’ button.

Which file is the baits and which is the intervals?

To get a better sense of what was in the .bed files, I looked at the first few lines of each file using the headcommand and saw the following information in the first few lines of each file:

The second line returned from S04380110_Padded.bed was:
track name="Padded" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5 - Covered bed file extended by 100bp on either side"

The second line returned from S04380110_Covered.bed was:
track name="Covered" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5 - Genomic regions covered by probes".

The second line returned from S04380110_Regions.bed was:
track name="Target Regions" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5"

Based on these descriptions, I used S04380110_Covered.bed as the baits and S04380110_Regions.bed as the targets files.

Convert baits and intervals and running CollectHsMetrics

CollectHsMetrics expects the baits and targets files to be in interval list format, so I used Picard tools BedToIntervalList tool to convert the .bed files to the expected interval list format. BedToIntervalList requires a sequence dictionary file, which you can create with Picard tools' CreateSequenceDictionary if you don’t already have one. I converted the .bed files to intervals with the following bash script:

#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH --mem=4GB
#SBATCH -c 8
#SBATCH -J picardtools_BedToIntervalListsource /gpfs/runtime/cbc_conda/bin/activate_cbc_condajava -Xmx1000M -jar /gpfs/runtime/cbc_conda/cbc_conda_v1_root/envs/cbc_conda_v1/share/picard-2.13.2-1/picard.jar BedToIntervalList \
I=S04380110_Covered.bed \
O=S04380110_Covered.interval_list \
SD=/gpfs/data/cbc/refchef_refs/broad_bundle_hg19/primary/ucsc.hg19.dict

After the BedToIntervalList script finishes running, you can run CollectHsMetrics:

#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH --mem=10GB
#SBATCH -c 8
#SBATCH -J picardtools_CollectHsMetricssource /gpfs/runtime/cbc_conda/bin/activate_cbc_condajava -Xmx1000M -jar /gpfs/runtime/cbc_conda/cbc_conda_v1_root/envs/cbc_conda_v1/share/picard-2.13.2-1/picard.jar CollectHsMetrics \
I=alignment.bam \
O=output_hs_metrics.txt \
R=/gpfs/data/cbc/refchef_refs/broad_bundle_hg19/primary/ucsc.hg19.fasta \
BAIT_INTERVALS=S04380110_Covered.interval_list \
TARGET_INTERVALS=S04380110_Regions.interval_list

Now you will have information to access the performance of your exome sequencing experiment. For more details on how to interpret the output from CollectHsMetrics, click here.

How to run Picard tools CollectHsMetrics on an exome sequencing library

Written by Joselynn Wallace