Improving Codon Optimization with Recurrent Neural Networks

Published in lattice-automation · Apr 20, 2023 · 6 min read

Background

Recombinant DNA technology has enabled major breakthroughs in biomedical research, from cheap insulin production to hepatitis vaccines. As of 2022, the recombinant therapeutic market was valued at over 2.8 billion dollars and was expected to double by 2029 (Source: Coherent Market Insights).

The central dogma of molecular biology states that DNA is transcribed into RNA, which is then translated into proteins. This idea has been the foundation of research in recombinant technology, allowing us to design plasmids (see another Lattice tool, repp, for DNA assembly) and express proteins of our choosing.

A variety of factors influence expression levels of recombinant proteins, including the choice of expression vector, promoter, and codon bias.

Optimization without considering the codon usage or codon bias of the chassis is not ideal! Let’s take a look at codon optimization… (“Running Away Balloon Meme” format from Superlmer)

Codon Optimization

Every organism has its own codon bias: a preference for some synonymous codons over others across its genome. Since translation of mRNA into protein ultimately depends on the codons present in the mRNA, understanding codon bias can help us improve expression levels of recombinant proteins.
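To make codon bias concrete, here is a minimal sketch that measures how often each synonymous codon is used within a coding sequence. It is not part of ICOR: the function name `codon_usage` is ours, the table is the standard genetic code, and we assume the input is an in-frame sequence containing only A, C, G and T/U.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(cds):
    """Return, for each codon seen, its share among codons for the same amino acid."""
    cds = cds.upper().replace("U", "T")
    # Split into in-frame triplets, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)

    # Total number of codons observed per amino acid.
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n

    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence: Met, Arg, Arg, Arg, Lys, stop.
    for codon, share in sorted(codon_usage("ATGCGTCGTCGCAAATAA").items()):
        print(codon, CODON_TABLE[codon], round(share, 2))
```

Run over every gene in a genome rather than one toy sequence, the same counting gives the kind of per-organism usage table that codon optimization tools start from.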

This is the idea behind codon optimization, which involves changing the codon sequence of the target gene to match the optimal codon usage of the expression host. In this way, codon optimization can significantly boost expression levels of recombinant proteins. In an Escherichia coli (E. coli) chassis, increases of two to fifteen-fold have been observed!

This is pretty exciting, and it is core to the ideation process for our recent release of ICOR — a codon optimization tool that uses AI to best optimize genes for expression in E. coli.

Check out ICOR’s accompanying GitHub and Publication here!

User workflow for sequence codon optimization using the ICOR deep learning model, with an overview of model creation. (A) A user workflow towards creating a vector for heterologous expression is depicted. (B) Expanding out of “sequence codon optimization” in the user workflow, the overview of the ICOR model creation is given. In a production setting, a trained and packaged ICOR model is used for inference.

Why AI?

It’s evident that artificial intelligence (AI) technology has been shaking up everything around us. It’s no surprise that AI may offer new insights into the codon optimization problem.

Our review concluded that many industry-standard codon optimization tools rely on biological indexes that replace synonymous codons with the most abundant codon found in the host organism’s genome.

The codon CGU (which encodes arginine) is used for about 42% of arginine residues in E. coli. Many codon optimization tools would replace every arginine codon in a sequence with CGU. This is statistically unlikely to be the best choice given the number of alternatives: arginine alone has six synonymous codons, so a protein with just ten arginine residues can already be encoded in 6^10 (about 60 million) ways.
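The "most abundant codon" strategy is simple enough to sketch in a few lines. This is an illustration, not ICOR's code: the function name is hypothetical, and the usage table covers only two amino acids. Only the 0.42 figure for CGU comes from the text above; every other number is a made-up placeholder.

```python
# Illustrative usage fractions. Only CGU = 0.42 comes from the article;
# all other values are placeholders, not measured E. coli frequencies.
USAGE = {
    "R": {"CGU": 0.42, "CGC": 0.37, "CGG": 0.10, "CGA": 0.06, "AGA": 0.03, "AGG": 0.02},
    "K": {"AAA": 0.75, "AAG": 0.25},
}


def highest_frequency_choice(protein):
    """Encode each amino acid with its single most frequent codon."""
    return "".join(max(USAGE[aa], key=USAGE[aa].get) for aa in protein)


if __name__ == "__main__":
    print(highest_frequency_choice("RKR"))  # CGUAAACGU
```

Every arginine becomes CGU no matter where it sits in the sequence, which is exactly the limitation described above.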

More advanced are frequency-based tools. These aim to ensure that, after optimization, about 42% of the arginine codons in the sequence are CGU. To some extent, this can relieve the metabolic stress of expression that would be caused by a ‘one amino acid–one codon’ approach. However, even frequency-based tools don’t fully address the energetics of translation.
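A frequency-based approach can be sketched as weighted random sampling: each codon is drawn with probability equal to its usage fraction, so the target proportions are met on average rather than exactly. As before, this is our own illustration with a hypothetical function name, and only the 0.42 value for CGU is taken from the text; the other fractions are placeholders.

```python
import random

# Illustrative usage fractions. Only CGU = 0.42 comes from the article;
# all other values are placeholders, not measured E. coli frequencies.
USAGE = {
    "R": {"CGU": 0.42, "CGC": 0.37, "CGG": 0.10, "CGA": 0.06, "AGA": 0.03, "AGG": 0.02},
    "K": {"AAA": 0.75, "AAG": 0.25},
}


def background_frequency_choice(protein, rng):
    """Pick a codon for each amino acid, weighted by its usage fraction."""
    chosen = []
    for aa in protein:
        codons = list(USAGE[aa])
        weights = [USAGE[aa][codon] for codon in codons]
        chosen.append(rng.choices(codons, weights=weights, k=1)[0])
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the run is repeatable
    codons = background_frequency_choice("R" * 1000, rng)
    print("share of CGU:", codons.count("CGU") / len(codons))
```

Note that each position is still sampled independently of its neighbors, so this approach has no notion of codon context.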

The context of specific codons is also essential to the translation process. For example, the codon CGU may tend to appear next to certain other codons, or longer patterns of codons may recur. This kind of context can be difficult for traditional algorithms to identify on a genome-wide scale.
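The simplest form of codon context is the adjacent pair. The sketch below (our own illustration, with a hypothetical function name) counts how often each codon is immediately followed by another. It only looks one codon ahead and does not attempt to capture longer-range patterns.

```python
from collections import Counter


def codon_pair_counts(cds):
    """Count adjacent (codon, next codon) pairs in an in-frame sequence."""
    # Split into in-frame triplets, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return Counter(zip(codons, codons[1:]))


if __name__ == "__main__":
    counts = codon_pair_counts("CGUAAACGUAAAGGC")
    for pair, n in counts.most_common():
        print(pair, n)
```

With 64 codons there are already 4,096 possible pairs, and the table grows exponentially with each extra codon of context, which is one reason hand-written counting rules become hard to scale.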

These technical (and biological!) aspects are discussed further in our publication.

Artificial intelligence, and more specifically recurrent neural networks (RNNs), can be used to detect codon context across a long sequence.

The Recurrent Neural Network

Recurrent neural networks (RNNs) are designed to detect patterns in sequential data. As such, they are well suited to analyzing codon context. This was the method behind our ICOR optimization tool, which we trained on a dataset of thousands of genes from E. coli.

We utilized the Bidirectional Long Short-Term Memory (LSTM) architecture to train the model. BiLSTMs are a form of RNNs, which, as the name suggests, may help preserve context both forward and backward in a sequence.

In training such a model, we borrow an idea from natural language processing. Just as words make up sentences, codons make up a DNA/RNA sequence. The input to the LSTM model was therefore represented as a sequence of codons, each converted to a numeric vector via one-hot encoding.
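One-hot encoding turns each token into a vector that is all zeros except for a single 1 at that token's index. Here is a minimal sketch over the 64-codon vocabulary. The alphabetical codon ordering and the function name are our assumptions for illustration; the exact vocabulary and ordering used inside ICOR are not specified here.

```python
from itertools import product

# Assumed vocabulary: all 64 RNA codons in alphabetical (ACGU) order.
CODONS = ["".join(c) for c in product("ACGU", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def one_hot_encode(seq):
    """Return one 64-element 0/1 vector per in-frame codon of seq."""
    seq = seq.upper().replace("T", "U")
    # Split into in-frame triplets, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    vectors = []
    for codon in codons:
        vector = [0] * len(CODONS)
        vector[INDEX[codon]] = 1
        vectors.append(vector)
    return vectors


if __name__ == "__main__":
    encoded = one_hot_encode("AUGCGU")
    print(len(encoded), "codons,", len(encoded[0]), "values each")
    print("AUG ->", encoded[0].index(1), " CGU ->", encoded[1].index(1))
```

The same scheme works for any token vocabulary, for example the 20 amino acids plus a stop symbol.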

ICOR

ICOR is open-source software we created that uses deep learning (see above) to learn codon usage in E. coli. We compiled a dataset of over 7,000 non-redundant, high-expression, robust genes to use for deep learning. The model was then trained to optimize sequences for high expression and to match codon usage in E. coli.

We tested ICOR against several other codon optimization approaches and found that it performed best in our comparisons. Sequences optimized by ICOR showed an estimated 236% increase in mRNA expression compared to unoptimized genes.

Our tool is provided as an open-source Python package and is available on our accompanying GitHub repository.

We also establish a benchmark dataset for codon optimization, providing researchers a reliable way to evaluate the efficacy of their models.

Usage

ICOR (requires Python) can be installed and run with just a few shell commands:

# Clone the repository and move into it

git clone https://github.com/Lattice-Automation/icor-codon-optimization.git

cd icor-codon-optimization

# Install prereqs

pip install -r requirements.txt

# Run ICOR optimizer

python ./tool/optimizers/icor_optimizer.py

Our command-line codon optimizer accepts FASTA input containing either an amino acid sequence or a codon (nucleotide) sequence. It then returns an optimized sequence for expression in E. coli.

ICOR command-line script for optimizing sequences!
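For readers preparing input files, here is a rough sketch of what reading FASTA and telling nucleotide input from amino acid input can look like. This is not ICOR's parser: both function names are hypothetical, and the alphabet check is a simple heuristic we are assuming for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them under the current header.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def looks_like_nucleotides(seq):
    """Heuristic (an assumption): only A, C, G, T, U or N means nucleotides."""
    return set(seq.upper()) <= set("ACGTUN")


if __name__ == "__main__":
    example = ">gene1\nATGCGT\nAAATAA\n>prot1\nMRK\n"
    for name, seq in parse_fasta(example).items():
        kind = "nucleotide" if looks_like_nucleotides(seq) else "amino acid"
        print(name, seq, kind)
```

The heuristic would misclassify a peptide made up only of A, C, G, T and N residues, so a real tool would more likely ask the user which kind of input they are providing.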

Additional Tools & Scripts

Our software package also contains useful scripts for evaluating and testing out codon optimization techniques. We provide the following:

5 different optimization approaches to test (Background frequency choice, ICOR, Extended random choice, Uniform random choice, Highest frequency choice)
CDS conversion — a script that takes an input of DNA sequences and fetches their CoDing Sequences from the NCBI’s nuccore database.
Benchmarking — an interactive Jupyter notebook for benchmarking the optimization methods on FASTA sequences.
Codon map — a codon map for amino acid to codon conversion in a Pythonic framing (useful for future projects in this space!)
E. coli’s codon frequencies — the frequencies and weights of each codon/amino acid in E. coli.

Final Thoughts

By understanding and applying the context of codons, AI-powered codon optimization programs, such as ICOR, can push the boundaries of recombinant expression.

We imagine this tool being used in both research and industry! Researchers can take advantage of improved yields when designing more viable proteins, and industry can use ICOR to achieve higher yields at lower cost.

We’d love to see other scientists and engineers collaborate with us on improving the ICOR approach. That’s why it’s fully open-source. Feel free to open up an issue on our GitHub repo, submit a PR, or even leave a star :)

Written by: Rishab Jain