Reverse and find complement sequence in R

Yen-Chung Chen
Dec 20, 2018 · 4 min read
Photo by Alessandra Caretto on Unsplash
  1. Convert base by base: ATCG to TAGC for DNA; AUGC to UACG for RNA
  1. Reverse it with rev()
seq_rev <- function(char) {
alphabets <- strsplit(char, split = "")[[1]]
return(rev(alphabets))
}
seq_compl <- function(seq) {
# Check if there's "T" in the sequence
RNA <- Reduce(`|`, seq == "U")
cmplvec <- sapply(seq, function(base) {
# This makes DNA the default
# As long as there's no U, the sequence is treated as DNA
if (RNA) {
switch(base, "A" = "U", "C" = "G", "G" = "C", "U" = "A")
} else {
switch(base, "A" = "T", "C" = "G", "G" = "C", "T" = "A")
}
})
return(paste(cmplvec, collapse = ""))
}
seq_compl(seq_rev("CCAAT"))
> seq_compl(seq_rev("CCAAT"))
[1] "ATTGG"
  1. Handle the issues that are foreseeable.
revcom <- function(input) {
# Make sure the input is character and in upper case
input <- as.character(input)
input <- toupper(input)
# Use regular expression to check if there's only legal bases
# present in the sequence
legal_char <- Reduce(`&`, grepl("^[A,T,C,G,U]*$", input))
if (!legal_char) {
stop("revcom() only applies to DNA/RNA sequences, and only A/T/C/G/U is allowed")
}
rev <- seq_rev(input)
return(seq_compl(rev))
}
# The function seems to work
> revcom("ATTTCAT")
[1] "ATGAAAT"
# It handles illegal characters and provides an error
> revcom("ATTTCATX")
Error in revcom("ATTTCATX") :
revcom() only applies to DNA/RNA sequences, and only A/T/C/G/U is allowed
# Non-character input is converted to character and then judged
# whether they are legit
> revcom(1)
Error in revcom(1) :
revcom() only applies to DNA/RNA sequences, and only A/T/C/G/U is allowed
# Generate a 100 kb random DNA sequence
testseq <- paste(sample(x = c("A", "T", "C", "G"), replace = TRUE, size = 1e5),
collapse = "")
# Preallocate a vector to save the running time for each repeat
elapsedtime <- vector()
# Find reverse-complement sequence for 10 times
for (i in seq(10)) {
start <- Sys.time()
invisible(revcom(testseq))
end <- Sys.time()
elapsedtime[i] <- end - start
}
# Print the averge time spent
print(mean(elapsedtime))
[1] 0.1781383

biosyntax

Notes, thoughts, and random experiments in life science.

Yen-Chung Chen

Written by

A learning developmental biologist

biosyntax

biosyntax

Notes, thoughts, and random experiments in life science.