Soft-clipping vs. hard-clipping in read alignment

Clipped alignment

Wenyu
Apr 7, 2023
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ: gggGTGTAACC-GACTAGgggg

In a nutshell, clipped bases are bases that are present in your query (a read or an assembled contig) but are not aligned to the reference. In this example, the lowercase g bases at both ends of the read are clipped because they do not match the reference. There are several reasons why this can happen, e.g. adapter sequence left on the read, large indels, and translocations.
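To make the picture above concrete, here is a minimal Python sketch that turns the READ line into a CIGAR string. The display convention is my assumption based on the example (lowercase = clipped, uppercase = aligned, `-` = a base that exists in the reference but is missing from the read), and `cigar_from_display` is a hypothetical helper, not part of any aligner.

```python
from itertools import groupby


def cigar_from_display(read: str) -> str:
    """Build a CIGAR string from a gapped read display.

    Assumed convention (taken from the example, not from any tool):
      lowercase letter -> soft-clipped base (S)
      uppercase letter -> aligned base, match or mismatch (M)
      '-'              -> deletion relative to the reference (D)
    """
    def op(ch: str) -> str:
        if ch == "-":
            return "D"
        return "S" if ch.islower() else "M"

    # Collapse runs of the same operation into "<length><op>" pairs.
    return "".join(f"{len(list(run))}{code}" for code, run in groupby(read, key=op))


print(cigar_from_display("gggGTGTAACC-GACTAGgggg"))  # 3S8M1D6M4S
```

The leading `3S` and trailing `4S` are the clipped g bases; everything between them is the part that was actually aligned.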

Why is clipping used in sequence alignment?

Clipping can help improve the quality of alignments by focusing on the high-confidence, well-aligned portions of the reads and discarding or ignoring the low-quality or non-aligning parts. This is particularly helpful when dealing with noisy long-read sequencing data, which often contain errors and low-quality base calls.

Example

Let’s assume we have a read R that is 1000 bases long, and it has been generated by a long-read sequencing technology like Oxford Nanopore. Due to the inherent error rate and noise associated with this technology, the first 200 bases and the last 100 bases of the read have low quality and do not align well to the reference genome. However, the middle 700 bases align well.

Without clipping, the alignment algorithm would have to force the entire read to align to the reference genome (an end-to-end alignment), which could lead to an inaccurate alignment and potentially incorrect downstream analysis results (e.g., false-positive variant calls).

With soft clipping enabled, the alignment algorithm would “clip” or ignore the first 200 and last 100 bases of the read and focus on aligning the middle 700 bases. This would result in a more accurate alignment and a better representation of the true underlying genomic sequence.
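In a SAM record, the alignment in this example could be written with the CIGAR string `200S700M100S`, assuming for simplicity that the middle 700 bases align without indels. The sketch below parses a CIGAR string and reports how many bases were clipped at each end; `clip_summary` is a hypothetical helper written for this post.

```python
import re

# One CIGAR element: a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_summary(cigar: str) -> dict:
    """Return the clipped and aligned read-base counts encoded in a CIGAR."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Positions of the operations that are not clipping (S or H).
    core = [i for i, (_, op) in enumerate(ops) if op not in "SH"]
    if not core:
        raise ValueError(f"no aligned operations in CIGAR: {cigar!r}")
    leading = sum(n for n, _ in ops[: core[0]])
    trailing = sum(n for n, _ in ops[core[-1] + 1:])
    # M, I, = and X consume read bases inside the aligned part.
    aligned = sum(n for n, op in ops if op in "MI=X")
    return {"leading_clip": leading, "aligned": aligned, "trailing_clip": trailing}


print(clip_summary("200S700M100S"))
# {'leading_clip': 200, 'aligned': 700, 'trailing_clip': 100}
```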

Difference between soft-clipping and hard-clipping

Sequence Alignment/Map Format Specification

Soft clipping: Soft clipping (CIGAR operation S) retains the unaligned portions of the read in the alignment file but does not use them in the actual alignment. The clipped bases do not contribute to the alignment score and are ignored by most downstream analyses, but they are still present in the output file (such as a SAM or BAM file). Soft clipping is useful for preserving the original read information while focusing on the well-aligned regions for analysis.

Hard clipping: Hard clipping (CIGAR operation H), on the other hand, removes the unaligned portions of the read from the alignment record entirely. The clipped bases are not only ignored during alignment but are also absent from the output record; only their count remains in the CIGAR string. As a result, the full read sequence cannot be recovered from a hard-clipped record alone.

Example: Assume we have a read R with 100 bases, and only bases 21 to 80 align well to the reference genome. With soft clipping, the aligner aligns only the well-aligned portion (bases 21 to 80) but still writes the full read sequence (bases 1 to 100) to the output alignment file. With hard clipping, only bases 21 to 80 are written. What do I mean? In your SAM/BAM file, the 10th field (SEQ) holds the segment sequence (I hope this word doesn’t confuse you: the segment here is not necessarily your original sequence; it can be just part of it). Whether the clipped bases appear in that field depends on whether hard-clipping or soft-clipping is used.
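The 100-base example can be sketched in a few lines of Python. The function below is a hypothetical illustration (not how any aligner is implemented) and assumes the aligned part contains no indels, so it is a single M operation.

```python
def sam_seq_and_cigar(read: str, first: int, last: int, mode: str):
    """Return (CIGAR, SEQ) for a read whose bases first..last (1-based,
    inclusive) are aligned, using soft or hard clipping for the rest."""
    if mode not in ("soft", "hard"):
        raise ValueError("mode must be 'soft' or 'hard'")
    op = "S" if mode == "soft" else "H"
    left = first - 1            # clipped bases before the aligned part
    right = len(read) - last    # clipped bases after the aligned part
    aligned = last - first + 1
    parts = ((left, op), (aligned, "M"), (right, op))
    cigar = "".join(f"{n}{o}" for n, o in parts if n > 0)
    # Soft clipping keeps every base in SEQ; hard clipping keeps only the
    # aligned slice.
    seq = read if mode == "soft" else read[left:last]
    return cigar, seq


read = "ACGT" * 25  # a made-up 100-base read
for mode in ("soft", "hard"):
    cigar, seq = sam_seq_and_cigar(read, 21, 80, mode)
    print(mode, cigar, len(seq))
# soft 20S60M20S 100
# hard 20H60M20H 60
```

In both cases the CIGAR still records that 20 bases were clipped at each end; the difference is only whether those bases are stored in SEQ.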

When to use soft-clipping and when to use hard-clipping?

Let’s start with an example:

# Check the flag of each hard-clipped alignment (H at either end of the CIGAR, field 6)
awk -F'\t' '!/^@/ && $6 ~ /[0-9]+H/ {print $2}' map.sam | while read -r flag; do samtools flags "$flag"; done | less
# Part of the result
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x810 2064 REVERSE,SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
0x800 2048 SUPPLEMENTARY
# Check the flag of each soft-clipped alignment (S at either end of the CIGAR, field 6)
awk -F'\t' '!/^@/ && $6 ~ /[0-9]+S/ {print $2}' map.sam | while read -r flag; do samtools flags "$flag"; done | less
# Part of the result
0x10 16 REVERSE
0x0 0
0x10 16 REVERSE
0x0 0
0x0 0
0x10 16 REVERSE
0x0 0
0x0 0
0x10 16 REVERSE
0x10 16 REVERSE
0x10 16 REVERSE
0x10 16 REVERSE
0x0 0
0x10 16 REVERSE
0x0 0
0x10 16 REVERSE
0x0 0
0x10 16 REVERSE
0x10 16 REVERSE
0x10 16 REVERSE
0x10 16 REVERSE
0x0 0
0x0 0
0x0 0
0x0 0
0x100 256 SECONDARY
0x0 0
0x0 0
0x0 0
0x100 256 SECONDARY
0x10 16 REVERSE
0x0 0
0x10 16 REVERSE
0x0 0
0x10 16 REVERSE
0x10 16 REVERSE
0x0 0
0x0 0
0x0 0

In this file, soft-clipping is used in primary and secondary alignments, while hard-clipping is used in supplementary alignments. My understanding is that every supplementary alignment is associated with a primary alignment of the same query, and that primary alignment uses soft-clipping, so the complete query sequence can already be found in your SAM/BAM file. Therefore, it is not necessary to store the same bases again in each supplementary record. The alignment file can be very big!
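The same check can be done in one pass by tallying alignment type against clipping type. This is a minimal Python sketch: the flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM specification, while the helper names and the demo records are made up for illustration and are not taken from map.sam.

```python
from collections import Counter


def alignment_type(flag: int) -> str:
    """Classify a record from its SAM FLAG bits."""
    if flag & 0x4:
        return "unmapped"
    if flag & 0x800:
        return "supplementary"
    if flag & 0x100:
        return "secondary"
    return "primary"


def clip_type(cigar: str) -> str:
    """'hard' if the CIGAR has any H, else 'soft' if it has any S, else 'none'."""
    if "H" in cigar:
        return "hard"
    if "S" in cigar:
        return "soft"
    return "none"


def tally(sam_lines) -> Counter:
    """Count (alignment type, clip type) pairs over SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        counts[(alignment_type(int(fields[1])), clip_type(fields[5]))] += 1
    return counts


# Made-up records with the 11 mandatory fields (SEQ and QUAL left as '*').
demo = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t20S60M20S\t*\t0\t0\t*\t*",
    "r1\t2048\tchr2\t500\t60\t80H20M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t900\t60\t5S95M\t*\t0\t0\t*\t*",
    "r2\t256\tchr3\t40\t0\t5S95M\t*\t0\t0\t*\t*",
    "r3\t2064\tchr1\t10\t60\t30M70H\t*\t0\t0\t*\t*",
]
for key, n in sorted(tally(demo).items()):
    print(key, n)
```

To run it on a real file, pass an open file handle, e.g. `tally(open("map.sam"))`. Note that the r3 record is clipped only at its right end, which a pattern anchored to the start of the CIGAR would miss.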

Last but not least, you may have noticed that minimap2 has an option -Y to use soft clipping for supplementary alignments.

Thanks for reading!
