Python, Meet Synthetic Biology (The DNA + Computing Issue)

Cell Crunch (Issue 2021.03.01)

Published in Codon, Mar 1, 2021

Reach out on Twitter with feedback and questions. Receive this free newsletter every Friday morning by clicking here.

The sight of a feather in a peacock’s tail, whenever I gaze at it, makes me sick!
— Charles Darwin (from a letter to Asa Gray)

Data Stored in Yeast Chromosome: A couple of weeks ago, researchers at Tianjin University, in China, dropped a paper that I completely missed. I was scrolling through Twitter (doomscrolling, you might say) when I saw a tweet from Tom Ellis about data stored on an artificial yeast chromosome. I thought it was intriguing, and decided to feature it in this newsletter, despite its age (two weeks old? Ugh).

In that study, published in National Science Review, the Tianjin team stored “two pictures and a video clip” on an artificial baker’s yeast chromosome about 250,000 bases in length. More than 95% of the chromosome’s length was used to encode data, according to the study authors. The yeast carrying the artificial chromosome were passaged for 100 generations, and the data could still be retrieved (via nanopore sequencing) after that time.

To make the data encoding robust, so that mutations in the chromosome would not destroy the data or make it irretrievable, the researchers used “superposition of sparsified error correction codewords and pseudo-random sequences.” I’m not even going to pretend that I understand that jargon, but the moral is this: the yeast data is not easy to destroy!
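For readers who, like me, find that jargon opaque, here is a minimal sketch of the general idea behind error-tolerant DNA storage. To be clear, this is not the authors' scheme: it swaps in a much simpler technique (storing three copies and taking a majority vote), and the two-bits-per-base mapping and every function name are my own assumptions for illustration.

```python
from collections import Counter

BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Map every byte to four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(dna: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of four."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def encode_with_copies(data: bytes, copies: int = 3) -> str:
    """Toy redundancy: write the same encoded message several times in a row."""
    return bytes_to_dna(data) * copies


def decode_with_vote(dna: str, copies: int = 3) -> bytes:
    """Recover the message by majority vote at each position across the copies."""
    length = len(dna) // copies
    blocks = [dna[i * length:(i + 1) * length] for i in range(copies)]
    consensus = "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*blocks)
    )
    return dna_to_bytes(consensus)


message = b"yeast"
stored = encode_with_copies(message)

# Simulate a single point mutation in the first copy.
mutated = ("A" if stored[0] != "A" else "C") + stored[1:]
recovered = decode_with_vote(mutated)
print(recovered == message)
```

This toy version only survives substitutions, and it spends two thirds of the sequence on redundancy. Insertions and deletions would shift every downstream base and break the vote, which is presumably one reason real schemes reach for far more sophisticated codes.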

Why It Matters: Building a 250,000 base pair chromosome is expensive. But once you’ve actually done that, this study shows that there are ways to ensure that it will remain stable, and its data retrievable, for at least 100 generations. This technique, then, might be a great way to store and copy DNA data files in the future; you could just freeze cells carrying the artificial chromosome, and grow them up when you need to duplicate the data. The authors also wrote, in the discussion section, that it is theoretically possible to have multiple different artificial chromosomes, inside of living cells, that could each be dedicated to data storage. This approach might become more popular as the cost of DNA synthesis comes down.

Credit: Tomas Brunsdon | Giphy

Whence the DNA Came?: As it becomes easier to read and write DNA, and as the number of sequences stored in databases continues to expand exponentially, biosecurity, IP infringement, and general misuse of those sequences have become concerns. That’s why, in recent years, many labs have tried to create tools to identify the lab-of-origin for engineered DNA sequences. Those tools, which typically rely on machine learning, could help biosecurity experts track down the source of bioweapons, for example. Unfortunately, lab-of-origin tools based on machine learning suffer from limited explainability (their predictions are a black box), carry a large computational cost, and rely on extensive training data to increase their predictive power.

For a new study, in Nature Communications, researchers at Rice University, in Houston, threw out machine learning entirely. The team instead used common and “signature” sequences (those known to have come from a specific lab) to develop a “pan-genome” that captures all available synthetic plasmid sequences. When a new, engineered sequence crops up, it can be compared to the pan-genome, using standard sequence alignment tools, to see where it is likely to have come from.
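To give a feel for how a sequence can be traced without machine learning, here is a rough sketch. It is not PlasmidHawk's algorithm: instead of aligning against a pan-genome, it swaps in a simpler stand-in, counting shared k-mers (short substrings) and giving extra weight to k-mers seen in only one lab. The lab names, sequences, and function names are all made up for illustration.

```python
from collections import defaultdict


def kmers(seq: str, k: int = 8) -> set:
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(lab_sequences: dict, k: int = 8) -> dict:
    """Map each k-mer to the set of labs whose known sequences contain it."""
    index = defaultdict(set)
    for lab, sequences in lab_sequences.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(lab)
    return index


def rank_labs(query: str, index: dict, k: int = 8) -> list:
    """Score labs by shared k-mers; a k-mer found in fewer labs counts for more."""
    scores = defaultdict(float)
    for kmer in kmers(query, k):
        labs = index.get(kmer, ())
        for lab in labs:
            scores[lab] += 1.0 / len(labs)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical labs and toy sequences, invented for this example.
lab_sequences = {
    "lab_A": ["ATGGCTAGCTAGGATCCGATTACAGGCTTAA"],
    "lab_B": ["ATGGCTAGCTAGCCCGGGTTTAAACGCGTAA"],
}
index = build_index(lab_sequences)
ranking = rank_labs("GGATCCGATTACAGG", index)
print(ranking[0][0])  # best-scoring lab for the query
```

Because every point on the score comes from a specific shared substring, you can always ask which stretch of DNA drove a prediction. That traceability is the interpretability argument in miniature.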

The new approach, called PlasmidHawk, “can successfully predict the lab-of-origin of an engineered DNA sequence 75.8% of the time,” according to the study authors. “Around 85.2% of the time the source lab is included in the top 10 predicted labs.” It also requires less computational power, and its results are human-interpretable. PlasmidHawk’s code and installation instructions can be found here.
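Those two numbers are top-1 and top-10 accuracy: how often the true lab is the first guess, versus anywhere among the first ten guesses. A small sketch of the calculation, using invented predictions rather than anything from the paper:

```python
def top_k_accuracy(ranked_predictions: list, true_labs: list, k: int) -> float:
    """Fraction of queries whose true lab appears among the first k guesses."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_labs)
    )
    return hits / len(true_labs)


# Hypothetical ranked guesses for four query plasmids (best guess first).
predictions = [
    ["lab_A", "lab_B", "lab_C"],
    ["lab_B", "lab_A", "lab_C"],
    ["lab_C", "lab_A", "lab_B"],
    ["lab_A", "lab_C", "lab_B"],
]
truths = ["lab_A", "lab_A", "lab_B", "lab_B"]

top1 = top_k_accuracy(predictions, truths, k=1)  # 0.25
top2 = top_k_accuracy(predictions, truths, k=2)  # 0.5
```

Top-k accuracy can only go up as k grows, which is why the top-10 figure is always at least as high as the top-1 figure.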

Why It Matters: In 2018, Alec Nielsen and Chris Voigt published a machine-learning-based approach for lab-of-origin prediction that could correctly identify source labs 48% of the time. Another recent study, in Nature Communications, used a Recurrent Neural Network method, called deteRNNt, to achieve a similar goal; that technique could predict the lab-of-origin for engineered DNA with an accuracy of up to 70%. This new method, which doesn’t use machine learning, is more accurate than both of them. This study could bolster biosecurity and help bio-forensic investigators identify likely culprits should a biological threat escape from a laboratory.

Python for Synthetic Biology: While some biologists (and billionaires) equate genetic engineering with computer programming, convincing a cell to do something is much harder than punching a few for loops into VSCode and pushing to GitHub. Biology is messy, chaotic, inconsistent; it is hard to get it right. But software can still help.

For a new study, published in Synthetic Biology, an international team of researchers unveiled SynBiopython, an open-source Python library for synthetic biologists. The package includes some useful modules (which the authors say will continue to grow and expand). The key advancement in this library is SynBiopython’s inclusion of a “universal environment,” called Genbabel, that can help convert between different file types.

DNA and protein sequences come as GenBank and FASTA files, for example, while the Synthetic Biology Open Language, or SBOL, uses its own SBOL format. SynBiopython aims to provide a link between the two; it can convert between SBOL files and GenBank/FASTA sequences, which the authors say will “improve reusability” of code. The work was led by researchers at the National University of Singapore.
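To make the file-format problem concrete, here is a bare-bones sketch of one such conversion, GenBank to FASTA, in plain Python. It does not use SynBiopython or its Genbabel module (I am not reproducing that API here); the function name and the toy record are my own, and the parser only reads the LOCUS name and the ORIGIN sequence block of a single record.

```python
def genbank_to_fasta(genbank_text: str, line_width: int = 60) -> str:
    """Pull the name and sequence out of one GenBank record and emit FASTA."""
    name = "unnamed"
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]  # second field of the LOCUS line is the name
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence lines follow this keyword
        elif line.startswith("//"):
            break  # end-of-record marker
        elif in_origin:
            # Drop the position numbers and spaces, keep only the letters.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(bases).upper()
    wrapped = [
        sequence[i:i + line_width] for i in range(0, len(sequence), line_width)
    ]
    return "\n".join([f">{name}"] + wrapped) + "\n"


# Made-up record for illustration only.
record = """LOCUS       toy_part                  24 bp    DNA     linear   SYN 01-MAR-2021
DEFINITION  Made-up record for illustration.
ORIGIN
        1 atggctagct agggatccga ttaa
//
"""
fasta = genbank_to_fasta(record)
print(fasta)
```

Notice what gets lost: FASTA keeps only a name and a sequence, so the annotations that make a GenBank (or SBOL) file useful are thrown away. For real work, a maintained parser such as Biopython's Bio.SeqIO is a safer choice than a hand-rolled one like this.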

Why It Matters: The Synthetic Biology Open Language, or SBOL, already has great libraries for Python and JavaScript, and offers a ton of tools for doing things like designing DNA sequences and genetic circuits, or for plotting genetic constructs. And this new package is, admittedly, quite limited in terms of built-in modules. But the idea of an open-source Python package that can use all the different file formats that synthetic biologists typically work with has serious value. With improvements, SynBiopython could bolster the reproducibility of DNA design, codon optimization, guide RNA design, and a slew of other things that scientists do on a regular basis.

Has your laboratory developed its own “in-house” library to automate common tasks? I’d like to check it out; tell me about it in a comment on this post.

More than 10,000 gigabytes of data can be stored in the faint pink smear of DNA at the end of this test tube. [Credit: Tara Brown Photography | University of Washington]

This thread, from @MoKhalilLab, is worth your time. It includes 15 tweets. 👇

🧫 Other Studies Published This Week

Artificial Cells

Optimized cDICE for efficient reconstitution of biological systems in giant unilamellar vesicles. bioRxiv (preprint). Link

Biomaterials

Smart biomaterials: state-of-the-art of functional scaffolds for 3D nervous tissue regeneration (Review). Frontiers in Bioengineering and Biotechnology. Link

Biomanufacturing

Development of a platform process for the production and purification of single‐domain antibodies. Biotechnology and Bioengineering. Link

Biosensors

Whole-cell microbial bioreporter for soil contaminants detection (Review). Frontiers in Bioengineering and Biotechnology. Open Access. Link

DNA Storage & Nanotechnology

Rewritable two-dimensional DNA-based data storage with machine learning reconstruction. bioRxiv (preprint). Link
Determinants of ligand-functionalized DNA nanostructure-cell interactions. bioRxiv (preprint). Link

Fundamental Discoveries

Spatiotemporal dissection of the cell cycle with single-cell proteogenomics. Nature. Link
In vivo CD8+ T cell CRISPR screening reveals control by Fli1 in infection and cancer. Cell. Link
In vivo CRISPR screening reveals nutrient signaling processes underpinning CD8+ T cell fate decisions. Cell. Link
Combinatorial CRISPR screen identifies fitness effects of gene paralogues. Nature Communications. Open Access. Link
Re-defining synthetic lethality by phenotypic profiling for precision oncology (Perspective). Cell Chemical Biology. Link
Gene editing and synthetically accessible inhibitors reveal role for TPC2 in HCC cell proliferation and tumor growth. Cell Chemical Biology. Link

Genetic Engineering & Control

Sequence-independent RNA sensing and DNA targeting by a split domain CRISPR–Cas12a gRNA switch. Nucleic Acids Research. Open Access. Link
Accelerating target deconvolution for therapeutic antibody candidates using highly parallelized genome editing. Nature Communications. Open Access. Link
Microbial single-strand annealing proteins enable CRISPR gene-editing tools with improved knock-in efficiencies and reduced off-target effects. Nucleic Acids Research. Open Access. Link
Polyvalent guide RNAs for CRISPR antivirals. bioRxiv (preprint). Link
Efficient in vivo genome editing mediated by stem cells-derived extracellular vesicles carrying designer nucleases. bioRxiv (preprint). Link
Evaluating capture sequence performance for single-cell CRISPR activation experiments. ACS Synthetic Biology. Open Access. Link

Medicine and Diagnostics

Enhancement of liver-directed transgene expression at initial and repeat doses of AAV vectors admixed with ImmTOR nanoparticles. Science Advances. Open Access. Link
Biological activity-based modeling identifies antiviral leads against SARS-CoV-2. Nature Biotechnology. Open Access. Link
Advances in bioreactors for lung bioengineering: From scalable cell culture to tissue growth monitoring. Biotechnology and Bioengineering. Link
A CRISPR response to pandemics? Exploring the ethics of genetically engineering the human immune system. EMBO Reports. Open Access. Link
Acquired cancer cell resistance to T cell bispecific antibodies and CAR T targeting HER2 through JAK2 down-modulation. Nature Communications. Open Access. Link

Metabolic Engineering

Biofuels for a sustainable future (Review). Cell. Link
Genetically engineered methanotroph as a platform for bioaugmentation of chemical pesticide contaminated soil. ACS Synthetic Biology. Link
Reductive glycine pathway: A versatile route for one-carbon biotech. Trends in Biotechnology. Link
Directed evolution of propionyl-CoA carboxylase for succinate biosynthesis. Trends in Biotechnology. Link
Optimizing a fed-batch high-density fermentation process for medium chain-length poly(3-hydroxyalkanoates) in Escherichia coli. Frontiers in Bioengineering and Biotechnology. Open Access. Link
Lactic acid production by Clostridium acetobutylicum and Clostridium beijerinckii under anaerobic conditions using a complex substrate. bioRxiv (preprint). Link
Efficient production of 1,3-propanediol from diverse carbohydrates via a non-natural pathway using 3-hydroxypropionic acid as an intermediate. ACS Synthetic Biology. Link
Engineering microorganisms for the biosynthesis of dicarboxylic acids (Review). Biotechnology Advances. Link

New Technology

A synthetic RNA-mediated evolution system in yeast. bioRxiv (preprint). Link
In situ single-cell activities of microbial populations revealed by spatial transcriptomics. bioRxiv (preprint). Link

Microbial Communities

Ecological rules for the assembly of microbiome communities. PLOS Biology. Open Access. Link

Systems Biology & Modelling

Reproducibility in systems biology modelling (Commentary). Molecular Systems Biology. Open Access. Link
BMSS2: a unified database-driven modelling tool for systematic model selection and identifiability analysis. bioRxiv (preprint). Link

Tools to Study the Genome

Programmable tools for targeted analysis of epigenetic DNA modifications. Current Opinion in Chemical Biology. Open Access. Link
In situ genome sequencing resolves DNA sequence and structure in intact biological samples. Science. Link
REMY: A platform for the rapid interrogation of epigenome modifications on yeast. bioRxiv (preprint). Link
Spatially mapped single-cell chromatin accessibility. Nature Communications. Open Access. Link

Miscellaneous Topics

Comparison of E. coli based self-inducible expression systems containing different human heat shock proteins. Scientific Reports. Open Access. Link
Genetic code expansion of Vibrio natriegens. Frontiers in Bioengineering and Biotechnology. Open Access. Link

Until next time,

— Niko

Thanks for reading Cell Crunch, part of Bioeconomy.XYZ. If you enjoy this newsletter, please share it with a friend or colleague. Reach me with tips and feedback on Twitter @NikoMcCarty or via email.
