Common Plasmid Detection Tools

Tools that facilitate separation of plasmid contigs from chromosomal contigs

--

Plasmids play an important role in how species adapt to different environments. Most plasmids are circular, although linear plasmids also exist. Furthermore, a plasmid has its own copy number: it can be present as few times as the chromosome of its host cell, or in many more copies. Because of these and many other interesting properties, the recovery of plasmids has become an active research topic.

Researchers in the field have come up with intriguing methods to address the need for plasmid recovery. Such methods include both lab culturing of bacterial samples and in silico methods.

Photo by National Cancer Institute on Unsplash

In this article, I will present two recent tools I found that use machine learning to separate plasmids from chromosomes.

PlasFlow (2018, Nucleic Acids Research)

This tool was developed by Pawel S Krawczyk and others in 2018. It uses machine learning (a deep neural network) to classify contigs from assemblies into their corresponding class (chromosome or plasmid) along with their phylum. The interesting point is that the classifier is trained on classes defined at the phylum level, which reduces the chance of overlap between the two classes, chromosome and plasmid.

Training Workflow

  • Obtain all the plasmid and chromosome reference sequences.
  • Obtain sequence fragments of lengths 5kb, 10kb, and 50kb.
  • Compute k-mer count vectors for k=5, k=6, and k=7.
  • Perform TF-IDF transformation.
  • Train a neural network and pick the best models for each k and length.
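The feature-preparation steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not PlasFlow's actual code: the function names are my own, the TF-IDF formula is the textbook one (term frequency times log of inverse document frequency), and the fragment length and k are shrunk so the example stays readable.

```python
import math
from collections import Counter
from itertools import product

def fragments(seq, length):
    """Split a sequence into non-overlapping fragments of a fixed length."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]

def kmer_counts(seq, k):
    """Count all 4**k possible k-mers, reported in lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]

def tfidf(count_vectors):
    """Textbook TF-IDF: (count / total) * log(n_docs / doc_freq)."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    doc_freq = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n_terms)]
    weighted = []
    for v in count_vectors:
        total = sum(v) or 1
        weighted.append([
            (v[j] / total) * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted

# Toy reference sequence (hypothetical); real fragments would be 5kb to 50kb.
reference = "ACGTACGTTTGACCAGTACGGATC"
pieces = fragments(reference, 8)
vectors = [kmer_counts(p, 2) for p in pieces]   # k=2 here, k=5..7 in the tool
features = tfidf(vectors)
```

In this toy setting each fragment becomes a 16-dimensional vector (4^2 possible 2-mers). With k=7 the same function would produce 16,384 dimensions per fragment.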

Prediction Workflow

  • Use the best models to predict probabilities.
  • Use a voting classifier to obtain the final result.
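To make the voting step concrete, here is a minimal sketch of soft voting, where the per-class probabilities from several models are averaged and the top class wins. I am assuming soft voting purely for illustration; the probabilities and the helper name are made up, and the tool itself may combine its models differently.

```python
def soft_vote(model_probs):
    """Average per-class probabilities across models (soft voting)."""
    classes = model_probs[0].keys()
    n_models = len(model_probs)
    return {c: sum(p[c] for p in model_probs) / n_models for c in classes}

# Hypothetical outputs of three models (e.g. one per k) for a single contig.
per_model = [
    {"plasmid.Proteobacteria": 0.8, "chromosome.Proteobacteria": 0.2},
    {"plasmid.Proteobacteria": 0.6, "chromosome.Proteobacteria": 0.4},
    {"plasmid.Proteobacteria": 0.7, "chromosome.Proteobacteria": 0.3},
]
combined = soft_vote(per_model)
best_class = max(combined, key=combined.get)
```

The alternative, hard voting, would let each model cast one vote for its top class and take the majority.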

You can find the detailed workflow here (Not included due to possible copyright issues).

Output

PlasFlow Result opened in Spreadsheet

The most vital column is the label column. The PlasFlow output consists of 28 possible classes under different phyla (listed below). The columns following the label column give the probability that the sequence belongs to the class named in the column header. The label is assigned using the threshold parameter passed to the tool, which is 0.7 by default.

  • Output Classes of PlasFlow
    chromosome.Acidobacteria, chromosome.Actinobacteria, chromosome.Bacteroidetes, chromosome.Chlamydiae, chromosome.Chlorobi, chromosome.Chloroflexi, chromosome.Cyanobacteria, chromosome.DeinococcusThermus, chromosome.Firmicutes, chromosome.Fusobacteria, chromosome.Nitrospirae, chromosome.other, chromosome.Planctomycetes, chromosome.Proteobacteria, chromosome.Spirochaetes, chromosome.Tenericutes, chromosome.Thermotogae, chromosome.Verrucomicrobia, plasmid.Actinobacteria, plasmid.Bacteroidetes, plasmid.Chlamydiae, plasmid.Cyanobacteria, plasmid.DeinococcusThermus, plasmid.Firmicutes, plasmid.Fusobacteria, plasmid.other, plasmid.Proteobacteria, plasmid.Spirochaetes
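The way a label follows from the probability columns and the threshold can be sketched as below. This is my own reading of the output rather than the tool's code: I assume the label is the highest-probability class when it reaches the threshold and "unclassified" otherwise, and the probabilities in the example are invented.

```python
def assign_label(probs, threshold=0.7):
    """Pick the most probable class if it reaches the threshold."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "unclassified"

def replicon_type(label):
    """Collapse 'plasmid.Firmicutes' to 'plasmid', and so on."""
    return label.split(".", 1)[0]

# Hypothetical probability rows for two contigs.
confident = {"plasmid.Firmicutes": 0.85, "chromosome.Firmicutes": 0.10, "plasmid.other": 0.05}
uncertain = {"plasmid.Firmicutes": 0.45, "chromosome.Firmicutes": 0.40, "plasmid.other": 0.15}

label_a = assign_label(confident)
label_b = assign_label(uncertain)
```

Collapsing the phylum part with replicon_type is handy when you only care about the plasmid versus chromosome split.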

PlasClass (2020, PLOS Computational Biology)

This tool was developed by David Pellow and others in 2020. The tool uses logistic regression to separate the two classes, plasmid and chromosome. The model treats the task as a binary classification problem, in contrast with PlasFlow, which uses a multi-class approach.

Training Workflow

  • Obtain all the plasmid and chromosome reference sequences.
  • Obtain sequence fragments of lengths 1kb, 10kb, 100kb and 500kb.
  • Compute normalized k-mer frequency vectors for k=3, k=4, k=5, k=6, and k=7 and combine them into one vector for each sequence. Each sequence is represented by a vector of 10,952 dimensions.
  • Scale the data along the dimensions and train a logistic regression classifier.
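One way to arrive at exactly 10,952 dimensions is to count canonical k-mers, that is, to treat a k-mer and its reverse complement as the same feature. The sketch below rests on that assumption, and the function names are mine; it is meant to show where the number comes from, not to reproduce the tool's implementation.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def canonical_index(k):
    """Map each canonical k-mer to a position in the feature vector."""
    kmers = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {kmer: i for i, kmer in enumerate(kmers)}

INDEXES = {k: canonical_index(k) for k in range(3, 8)}   # k = 3..7

def feature_vector(seq):
    """Concatenate normalized canonical k-mer frequencies for k = 3..7."""
    vec = []
    for k, index in INDEXES.items():
        counts = [0] * len(index)
        for i in range(len(seq) - k + 1):
            position = index.get(canonical(seq[i:i + k]))
            if position is not None:          # skips k-mers with N etc.
                counts[position] += 1
        total = sum(counts) or 1
        vec.extend(c / total for c in counts)
    return vec

n_features = sum(len(index) for index in INDEXES.values())
example = feature_vector("ACGTACGTAGCTAGCTAGGATCCA")
```

The per-k sizes are 32, 136, 512, 2,080, and 8,192, which add up to 10,952.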

Prediction Workflow

  • For each sequence compute the k-mer frequency vectors and predict the plasmid probability.
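Prediction with a trained logistic regression model boils down to scaling the feature vector and passing a weighted sum through the logistic function. The sketch below uses made-up weights, means, and standard deviations on a three-dimensional toy vector; in the real tool these would come from training and the vector would have 10,952 entries.

```python
import math

def standardize(x, means, stds):
    """Scale each dimension to zero mean and unit variance."""
    return [(xi - m) / s if s else 0.0 for xi, m, s in zip(x, means, stds)]

def plasmid_probability(x, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical trained parameters and one toy feature vector.
means, stds = [0.2, 0.5, 0.3], [0.1, 0.2, 0.1]
weights, bias = [1.5, -0.5, 0.8], 0.1

scaled = standardize([0.3, 0.5, 0.2], means, stds)
probability = plasmid_probability(scaled, weights, bias)
```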

Output

PlasClass outputs the sequence id and the predicted plasmid probability. Users can decide on a threshold to filter the plasmids from chromosomes.
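Applying a threshold to such output takes only a few lines. In this sketch I assume the output is tab-separated with the sequence id in the first column and the probability in the second; the contig ids and the 0.5 cutoff are just examples, so adjust them to your data.

```python
import csv
import io

def split_by_threshold(lines, threshold=0.5):
    """Split sequence ids into plasmids and chromosomes by probability."""
    plasmids, chromosomes = [], []
    for seq_id, prob in csv.reader(lines, delimiter="\t"):
        if float(prob) >= threshold:
            plasmids.append(seq_id)
        else:
            chromosomes.append(seq_id)
    return plasmids, chromosomes

# Hypothetical output; with a real file, pass an open file handle instead.
sample_output = io.StringIO("contig_1\t0.93\ncontig_2\t0.12\ncontig_3\t0.58\n")
plasmids, chromosomes = split_by_threshold(sample_output, threshold=0.5)
```

Raising the threshold gives fewer but more confident plasmid calls, while lowering it recovers more plasmids at the cost of more chromosomal contigs slipping in.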

Concluding Remarks

Although there are a few more tools, I have not covered them in this article for simplicity. I will revisit those tools in a future article.

The precise separation of plasmids from chromosomes is still an unsolved problem and remains an active research area to this date. It might well be a good choice of project for someone looking to get into Bioinformatics research.

I hope this article was interesting for the readers. Cheers!
