Creating publication-ready sequence logos in R
Sequence logos are a great way of visualising sequence alignments, that have been around for decades. The visualisation it it self is quite simple and similar in nature to bar charts, where the each bar or “stack” reflect how conserved that position in the alignment is. Each stack contains a set of stretched out letters that reflect the frequency of that nucleotide or amino acid in that position of the alignment.
Despite their popularity and there being numerous tools that generate sequence logos, only a very few of them provide native sequence logos plotting in R. I personally found this quite frustrating given that the concept of the visualisation was quite simple. Most commonly used in R, is the seqLogo package, which comes with major limitations: it can only be used with DNA sequences, and lacks the flexibility of tweaking the plot. Another package is RWebLogo, a wrapper for the python WebLogo library that I recently authored. However, it’s was quite buggy at times and is only a wrapper i.e. with no option for native plots.
With ggplot2 being one of the most versatile plotting libraries in R, it made sense that there was a package that made use of its capabilities to draw sequence logos. Integrated with ggplot2, this would allow a great amount of flexibility to users when it came to tweaking the visualisation. I came accross gglogo, a package that attempts to do this by drawing letters as polygons. However I found the logos it generated just didn’t look that great (left).
I liked the the approach of using polygons - after all, it seemed like the only way to get letters to stretch vertically and horizontally in ggplot2. I adopted this into a new package of my own, called ggseqlogo.
The goal was to make something that was easy to use, highly versatile, while producing publication-ready logos. To do this, there were a few features I made sure to include:
- The ability to create DNA, RNA or amino acid logos, as well as the option to define define custom alphabets
- Custom colour schemes
- Easily allow for plots to be combined in a grid
- Allow selection of different fonts
On top of all this, users could tweak almost every aspect of the visualisation using ggplot2. Below is a short tutorial demoing ggseqlogo’s features. You can also find the package on github:
ggseqlogo - Beautiful and versatile sequence logos in Rgithub.com
I hope ggseqlogo can relieve some of the frustrations I’m sure many of you have when it comes to plotting sequence logos in R. If you have any ideas or suggestions , please send them my way!
To get started, first install the package:
Fire up and load some sample DNA sequence data. I’m using binding sites for transcription factors.
# Some example DNA sequences
f = system.file("extdata", "sample_dna.rds", package = "ggseqlogo")
seqs_list = readRDS(f)
# Get first set of sequences
seqs_dna = seqs_list[]
# Plot a sequence logo
ggplot() + geom_logo(seqs_dna) + theme_logo()
Changing the colour scheme is also easy.
ggplot() + geom_logo(seqs_dna, col_scheme='base_pairing') + theme_logo()
If you’re not a fan of the preset schemes, you apply custom discrete or continuous colour schemes
# A discrete color scheme
cs1 = make_col_scheme(chars=c('A', 'T', 'C', 'G'),
groups=c('gr1', 'gr1', 'gr2', 'gr2'),
cols=c('purple', 'purple', 'blue', 'blue'))
# A continuous color scheme
cs2 = make_col_scheme(chars=c('A', 'T', 'C', 'G'), values=1:4)
ggplot() + geom_logo(seqs_dna, col_scheme=cs1) + theme_logo()
ggplot() + geom_logo(seqs_dna, col_scheme=cs2) + theme_logo()
To combine more than one sequence logo you can use gridExtra’s grid.arrange or the ggplot2 facet option. To do this all you need to do is pass a named list of sequences, and call facet_wrap.
ggplot() + geom_logo(seqs_list) + theme_logo() +
facet_wrap(~seq_group, ncol=1, scales='free_x')
You can change the font using the font parameter:
ggplot() + geom_logo(seqs_dna, font=2) + theme_logo()
Finally, ggseqlogo also lets you use custom alphabets i.e. one that isn’t amino acids or nucleotides:
seqs_other = c('10010', '10100', '01010', '01001')
seq_type='other', namespace=0:1) +