Translation of RNA Sequences to Protein Sequences using JFlex Lexical Analyzer Generator

Roland Hewage
May 12, 2020 · 8 min read

Lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer or scanner. A lexer is generally combined with a parser, and together they analyze the syntax of programming languages, web pages, and so forth.

What is JFlex?

JFlex Lexical Analyzer Generator

JFlex is a lexical analyzer generator (also known as scanner generator) for Java, written in Java.

  • A lexical analyzer generator takes as input a specification with a set of regular expressions and corresponding actions.
  • It generates a program (a lexer) that reads input, matches the input against the regular expressions in the spec file, and runs the corresponding action if a regular expression matched.
  • Lexers are usually the first front-end step in compilers, matching keywords, comments, operators, etc., and generating an input token stream for parsers.
  • JFlex lexers are based on deterministic finite automata (DFAs). They are fast, without expensive backtracking.
  • JFlex is designed to work together with the LALR parser generator CUP by Scott Hudson, and the Java modification of Berkeley Yacc BYacc/J by Bob Jamison. It can also be used together with other parser generators like ANTLR or as a standalone tool.
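
The idea of pairing regular expressions with actions can be sketched in plain Java. This is only an illustration, not what JFlex generates: the class name TinyLexer and its three token kinds are made up for the example, and java.util.regex is a backtracking engine rather than the DFA that JFlex builds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-written tokenizer used only to illustrate "regex + action" rules.
final class TinyLexer {
    // Each named group plays the role of one rule in a lexer specification.
    private static final Pattern RULES = Pattern.compile(
        "(?<NUMBER>[0-9]+)|(?<IDENT>[A-Za-z_][A-Za-z0-9_]*)|(?<OP>[+\\-*/=])|(?<WS>\\s+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Try to match one rule exactly at the current position.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            // The "action" of each rule: emit a token, or silently skip whitespace.
            if (m.group("NUMBER") != null) {
                tokens.add("NUMBER(" + m.group() + ")");
            } else if (m.group("IDENT") != null) {
                tokens.add("IDENT(" + m.group() + ")");
            } else if (m.group("OP") != null) {
                tokens.add("OP(" + m.group() + ")");
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y"));
    }
}
```

A generator such as JFlex produces the equivalent of this loop from the specification, so the rules and actions are all you write by hand.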

As with JLex, a lexical specification file for JFlex consists of three parts (user code, options and declarations, and lexical rules), divided by single lines starting with %%:

[Figure: Lexical specification file for JFlex]

In all parts of the specification, comments of the form /* comment text */ and Java-style end-of-line comments starting with // are permitted. JFlex comments do nest, so the number of /* and */ should be balanced.

Translation of RNA Sequences to Protein Sequences using JFlex


Take a moment to look at your hands. The bone, skin, and muscle you see are made up of cells. And each of those cells contains many millions of proteins. As a matter of fact, proteins are key molecular “building blocks” for every organism on Earth.

Basically, a gene is used to build a protein in a two-step process:

  • Step 1: transcription: Here, the DNA sequence of a gene is “rewritten” in the form of RNA. In eukaryotes like you and me, the RNA is processed (and often has a few bits snipped out of it) to make the final product, called a messenger RNA or mRNA.
  • Step 2: translation: In this stage, the mRNA is “decoded” to build a protein (or a chunk/subunit of a protein) that contains a specific series of amino acids.

[Figure: Central Dogma of Molecular Biology]

During translation, a cell “reads” the information in a messenger RNA (mRNA) and uses it to build a protein. Actually, to be a little more technical, an mRNA doesn’t always encode (provide instructions for) a whole protein. Instead, what we can confidently say is that it always encodes a polypeptide, or chain of amino acids.

[Figure: Codon Table]

In an mRNA, the instructions for building a polypeptide are RNA nucleotides (As, Us, Cs, and Gs) read in groups of three. These groups of three are called codons.

There are 61 codons for amino acids, and each of them is “read” to specify a certain amino acid out of the 20 commonly found in proteins. One codon, AUG, specifies the amino acid methionine and also acts as a start codon to signal the start of protein construction.

There are three more codons that do not specify amino acids. These stop codons, UAA, UAG, and UGA, tell the cell when a polypeptide is complete. All together, this collection of codon-amino acid relationships is called the genetic code, because it lets cells “decode” an mRNA into a chain of amino acids.
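
The role of the start and stop codons can be sketched as a simple scan: find the first AUG, then read triplets until a stop codon appears. The class and method names below are hypothetical, and the scan is a deliberate simplification of how a real ribosome selects a start site.

```java
import java.util.Set;

// Illustrative helper; not part of the JFlex specification discussed in this article.
final class OpenReadingFrame {
    private static final Set<String> STOP_CODONS = Set.of("UAA", "UAG", "UGA");

    /**
     * Returns the coding region that begins at the first AUG and ends just before
     * the first in-frame stop codon (or at the last complete codon if there is no stop).
     * Returns an empty string when the input has no AUG.
     */
    static String codingRegion(String mrna) {
        String rna = mrna.toUpperCase();
        int start = rna.indexOf("AUG");
        if (start < 0) {
            return "";
        }
        int end = start;
        while (end + 3 <= rna.length()) {
            String codon = rna.substring(end, end + 3);
            if (STOP_CODONS.contains(codon)) {
                break; // stop codons end the polypeptide and encode no amino acid
            }
            end += 3;
        }
        return rna.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(codingRegion("GGAUGCUUGAAUAAGG")); // prints AUGCUUGAA
    }
}
```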

[Figure: RNA, Codons, Amino Acids]

A sequence can represent either a nucleic acid (e.g., DNA or RNA) or a polypeptide chain of amino acids. According to the central dogma of molecular biology, a segment of DNA is transcribed into RNA, which is then translated into a protein according to the following code, in which every three letters of RNA (except the three stop codons) become one amino acid letter:

[Figure: Every three letters of RNA (except the three stop codons) become one amino acid letter]

For example, the RNA sequence cuugaaauuucu would produce the amino acid sequence leis (Leu, Glu, Ile, Ser).
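
A minimal Java sketch of this codon lookup follows. It is not the function from the JFlex specification; the class name RnaTranslator is made up, and two choices are assumptions of this sketch: stop codons are written as '*', and leftover bases that do not fill a codon are ignored. The 64-letter string is the standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, and so on.

```java
// Illustrative translator; names and edge-case behaviour are assumptions of this sketch.
final class RnaTranslator {
    private static final String BASES = "UCAG";
    // Standard genetic code, one letter per codon in U, C, A, G order; '*' marks a stop codon.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Looks up the one-letter amino acid code for a three-base RNA codon. */
    static char aminoAcid(String codon) {
        int index = 0;
        for (int i = 0; i < 3; i++) {
            int base = BASES.indexOf(Character.toUpperCase(codon.charAt(i)));
            if (base < 0) {
                throw new IllegalArgumentException("Not an RNA codon: " + codon);
            }
            index = index * 4 + base; // treat the codon as a base-4 number
        }
        return AMINO_ACIDS.charAt(index);
    }

    /** Translates an RNA string codon by codon, ignoring any incomplete trailing codon. */
    static String translate(String rna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= rna.length(); i += 3) {
            protein.append(aminoAcid(rna.substring(i, i + 3)));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("cuugaaauuucu")); // prints LEIS
    }
}
```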

The following JFlex specification generates a lexer that reads a FASTA file containing valid RNA sequences and translates them into amino acid sequences.

[Figure: Transcription & Translation]

A sequence in FASTA format consists of:

  • One line starting with a “>” sign, followed by a sequence identification code.
    It may optionally be followed by a textual description of the sequence. Since the description is not part of the official definition of the format, software can choose to ignore it when present.
  • One or more lines containing the sequence itself.

A file in FASTA format may comprise more than one sequence.
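
To make the format concrete, here is a small plain-Java reader for it. It is a sketch under simple assumptions (the whole file fits in one string, header lines are unique) and is unrelated to the JFlex-generated lexer; FastaReader and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative FASTA reader; a repeated header would overwrite the earlier record.
final class FastaReader {
    /** Maps each header line (without the leading '>') to its concatenated sequence. */
    static Map<String, String> parse(String fastaText) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : fastaText.split("\\R")) { // \R matches any line break
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A header line starts a new record.
                current = new StringBuilder();
                records.put(line.substring(1).trim(), current);
            } else if (current != null) {
                // Sequence lines are appended to the record opened by the last header.
                current.append(line);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        records.forEach((header, sequence) -> result.put(header, sequence.toString()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(">seq1 demo\nAUGC\nUUAA\n>seq2\nGGG\n"));
    }
}
```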

JFlex takes as input the specification with its set of regular expressions and corresponding actions. It generates a program (a lexer) that reads a FASTA file containing multiple RNA sequences, matches the input against the regular expressions to identify each RNA sequence separately, and, when a regular expression matches, runs the corresponding action, which translates the RNA sequence to the corresponding protein sequence. The lexer also prints the line number of each sequence in the FASTA file, counts the sequences, and reports the RNA and protein sequence lengths.

The relevant JFlex specification is illustrated below.

[Figure: JFlex Specification 1]

The options and declarations section of the JFlex specification, which follows the first %% line, is shown above. %class RNAtoProteinFasta tells JFlex to name the generated class RNAtoProteinFasta and to write the code to the file RNAtoProteinFasta.java. %line switches line counting on (the current line number can be accessed via the variable yyline). %column switches column counting on (the current column is accessed via yycolumn). Lexical states, declared with %state, are used to refine a specification; a lexical state acts like a start condition. Here we define two lexical states, PATTERN and TEMP, as shown above.

A macro definition has the form:

[Figure: macro definition, of the form macroidentifier = regular expression]

That means a macro definition is a macro identifier (a letter followed by a sequence of letters, digits or underscores) that can later be used to reference the macro, followed by optional white-space, followed by an =, followed by optional white-space, followed by a regular expression. The regular expression on the right-hand side must be well formed and must not contain the ^, / or $ operators. Here I have defined five macros, shown in the code segment above, whose regular expressions identify whitespace, newlines, line terminators, sequences, and any input characters followed by a line terminator.
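
The exact macro definitions are only visible in the figure, so the following Java constants are guesses at what five such patterns could look like, written in java.util.regex syntax rather than JFlex syntax. Every name and every pattern here is an assumption made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical Java-regex counterparts of the five macros described in the text.
final class MacroPatterns {
    static final String LINE_TERMINATOR = "\\r\\n|\\r|\\n";
    static final String WHITE_SPACE = "[ \\t\\f]";
    static final String NEW_LINE = "\\n";
    // Assumed alphabet: upper- and lower-case RNA bases only.
    static final String SEQUENCE = "[ACGUacgu]+";
    // Any run of characters that are not line breaks, followed by a line terminator.
    static final String REST_OF_LINE = "[^\\r\\n]*(?:" + LINE_TERMINATOR + ")";

    /** True when the whole line consists of RNA bases. */
    static boolean isSequenceLine(String line) {
        return Pattern.matches(SEQUENCE, line);
    }

    public static void main(String[] args) {
        System.out.println(isSequenceLine("AUGcuu"));
        System.out.println(isSequenceLine(">seq1"));
    }
}
```

Naming the patterns once and reusing them, as REST_OF_LINE does with LINE_TERMINATOR, is the same convenience that macros provide inside a JFlex specification.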

[Figure: JFlex Specification 2]

The code enclosed in %{ and %} is copied verbatim into the generated class. Here you can define your own member variables and functions in the generated scanner. Like all options, both %{ and %} must start a line in the specification. If more than one class code directive %{…%} is present, the code is concatenated in order of appearance in the specification. Here I have defined the function that converts an RNA sequence to a protein sequence by breaking the string into triplets called codons and looking up the amino acid that corresponds to each codon.

[Figure: JFlex Specification 3]
[Figure: JFlex Specification 4]

A lexical state acts like a start condition. If the scanner is in lexical state TEMP, only expressions that are preceded by the start condition <TEMP> can be matched. A start condition of a regular expression can contain more than one lexical state; the expression is then matched when the lexer is in any of these lexical states. The lexical state YYINITIAL is predefined and is also the state in which the lexer begins scanning. If a regular expression has no start conditions, it is matched in all lexical states. In the code segment above, the lexer in the predefined state looks for the ‘>’ symbol; on finding it, it switches to the TEMP state and prints the line number of the sequence in the FASTA file, the sequence counter, and the sequence details that follow the > symbol. When a line terminator is found, it switches to the PATTERN state.

[Figure: JFlex Specification 5]

The PATTERN state handles the RNA sequence section of the FASTA file. Each sequence line it finds is trimmed and concatenated, and the sequence length is tracked; the concatenated sequence is then translated to a protein sequence using the translation function. The lexer prints the RNA sequence length, the RNA sequence, the protein sequence length, and the protein sequence, then goes back to the initial state.
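
The flow through the three states can be approximated with a small hand-written state machine. This is a line-based sketch, not the generated lexer: the state names mirror the text, but the transitions, the report format, and the class name FastaStateMachine are assumptions, and the translation step is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative state machine; a JFlex lexer works on characters and regex rules instead.
final class FastaStateMachine {
    enum State { YYINITIAL, TEMP, PATTERN }

    static List<String> run(String fastaText) {
        List<String> report = new ArrayList<>();
        State state = State.YYINITIAL;
        StringBuilder rna = new StringBuilder();
        int sequenceCount = 0;
        String[] lines = fastaText.split("\\R");
        for (int lineNo = 0; lineNo < lines.length; lineNo++) {
            String line = lines[lineNo].trim();
            if (line.startsWith(">")) {
                // A new header closes any sequence that is still being collected.
                if (state == State.PATTERN) {
                    flush(rna, report);
                }
                state = State.TEMP; // '>' seen: report the header details
                sequenceCount++;
                report.add("line " + (lineNo + 1) + ", sequence " + sequenceCount
                    + ": " + line.substring(1).trim());
                state = State.PATTERN; // end of header line: expect sequence lines next
            } else if (state == State.PATTERN && !line.isEmpty()) {
                rna.append(line); // concatenate trimmed sequence lines
            }
        }
        if (state == State.PATTERN) {
            flush(rna, report);
        }
        return report;
    }

    // Reports the collected RNA sequence and clears the buffer for the next record.
    private static void flush(StringBuilder rna, List<String> report) {
        report.add("RNA length " + rna.length() + ": " + rna);
        rna.setLength(0);
    }

    public static void main(String[] args) {
        run(">s1 demo\nAUG\nCUU\n>s2\nGAA\n").forEach(System.out::println);
    }
}
```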

The program takes as input a FASTA file that contains RNA sequences, as in the following example.

[Figure: Input fasta file with RNA sequences]

The JFlex program is run using the following commands. The resulting lexer then translates each RNA sequence to the corresponding protein sequence (a token) and displays the result in the terminal.

[Figure: How to run JFlex program]
[Figure: Output tokens of protein sequences]

Stay tuned for more amazing articles. Thank you.

The Startup

Medium's largest active publication, followed by +773K people. Follow to join our community.

Written by Roland Hewage

(Born to Code) | Former Software Engineer Intern at WSO2 | Bachelor of Computer Science (Special) Degree Undergraduate at University of Ruhuna, Sri Lanka
