Explained: The hard evidence why the SARS-CoV-2 genome was not engineered and unlikely leaked

Philipp Markolin
Advances in biological science
11 min read · Jun 21, 2021

The science behind ACE2-domain adaptations, furin cleavage site and ‘CGG’ codon usage among other considerations

Molecular dated Bayesian phylogeny of RBP region 5 showing the 9 closest relatives to SARS-CoV-2. SARS-CoV-2 likely shared a common ancestor with RaTG13 about 40 years ago. (Lytras et al., bioRxiv, 2021)

Introduction

The origins of SARS-CoV-2, the virus responsible for Covid-19, have been investigated since at least January 2020, when the virus was isolated and sequenced from patient samples in Wuhan, China. Once the pandemic started reaching the US, Europe and the rest of the world, many scientists put their other projects on hold to help deal with the many unknowns the disease brought with it. From spread modeling to hygiene measures, from contact tracing to antiviral agents and vaccine development, no area of the pandemic was left uninvestigated. However, nobody was put to the task more than virologists all around the world; thousands of virus experts suddenly had no more pressing job than to dive into every aspect of this novel coronavirus, with all the institutional, societal and financial support they needed to perform their experiments. This includes many studies into the potential origins of the virus.

Relatively quickly, considering genomic, historical and geographical evidence, a scientific consensus emerged that SARS-CoV-2 was most likely the product of a zoonotic jump from horseshoe bats to an intermediate mammal population (currently unknown), in which it evolved the traits needed to finally infect humans effectively.

Fast forward a year, and that consensus has found itself questioned by popular and political figures alike in favor of more agent-based scenarios, from deliberate genome engineering to accidental lab leak, despite the scientific evidence remaining largely unchanged.

So this article will dive into the scientific arguments made by proponents of human-origin hypotheses of SARS-CoV-2 (lab leak or genome engineering), explain their rationale and assess them on their merits.

Scientific arguments used by proponents:

A) The ACE2-binding domain of the SARS-CoV-2 spike protein shows an ‘optimized’ binding for human ACE2 receptors when compared with its closest relative, a bat coronavirus named RaTG13.

  • The ACE2-binding protein domain in SARS-CoV-2 has 6 amino acid modifications (compared to other related bat coronaviruses like RaTG13, see Figure 1a below) that allow it to better bind human ACE2; proponents claim that this could mean they were engineered.
a, Mutations in contact residues of the SARS-CoV-2 spike protein. The spike protein of SARS-CoV-2 (red bar at top) was aligned against the most closely related SARS-CoV-like coronaviruses and SARS-CoV itself. Key residues in the spike protein that make contact to the ACE2 receptor are marked with blue boxes in both SARS-CoV-2 and related viruses, including SARS-CoV-1 (Urbani strain). b, Acquisition of polybasic cleavage site and O-linked glycans. Both the polybasic cleavage site and the three adjacent predicted O-linked glycans are unique to SARS-CoV-2 and were not previously seen in lineage B betacoronaviruses. [Figure 1, Andersen et al., 2020]
  • However, these 6 modifications are not the ones seen in SARS-CoV-1 (from the 2002 outbreak), even though the SARS-CoV-1 modifications to spike also increase its binding to human ACE2 (albeit to a lesser extent than in SARS-CoV-2). Furthermore, the 6 modifications found in SARS-CoV-2 are not obvious, nor are they perfectly optimized, because computational models would have predicted other sites first. If SARS-CoV-2 was indeed engineered, why would scientists not take the already optimized amino acid modifications known from SARS-CoV-1 and enhance from there? Or optimize binding by following model predictions? Choosing neither seems like an unnecessary engineering risk.
  • Furthermore, the identical 6 amino acid modifications in the SARS-CoV-2 ACE2-binding domain have been found in the spike protein of coronaviruses naturally occurring in pangolins. This is evidence that selection pressures can certainly produce these modifications (and that pangolins might have an ACE2 receptor somewhat similar to that of humans). The fact that SARS-CoV-2 can infect mammals (ferrets, cats, dogs, etc.) more broadly also speaks in favor of natural selection (and against the design choices genome engineers would likely make).

B) The furin-cleavage site at the S1-S2 junction of the spike protein is not found in the closest relatives of SARS-CoV-2, but it increases viral infectivity and host range dramatically

A) Multiple sequence alignment of representative Betacoronavirus spike protein S1/S2 region, with furin recognition motifs highlighted (red colorboxes in sequence alignment). Phylogenetic tree of spike protein sequences is colored to indicate subgenera. B) Positions of furin cleavage sites in different coronavirus genera (red cartoon, furin recognition motif; red arrow, cleavage site); structures: SARS-CoV-2, PDB ID 6VYB (Walls et al., 2020), with missing loop added; feline coronavirus (FCoV) UU16, homology model based on PDB ID 5SZS (Walls et al., 2016); infectious bronchitis coronavirus (IBV), PDB ID 6CV0 (Shang et al., 2018) [Figure 6, Wu et al., Stem Cell Research, 2021]
  • Proponents of human origin theories claim that the furin cleavage site is strong evidence for either gene engineering or human selection experiments, as similar experiments have been done before in Gain-of-Function research (e.g. Follis et al., Virology, 2006 introduced a furin cleavage site in the spike protein of SARS-CoV-1, and Menachery et al., J. Virol., 2020 showed in MERS-CoV that proteolytic cleavage of the spike protein is a primary infection barrier viruses need to overcome).
  • While it is true that arginine (‘R’) motifs have been introduced in SARS-CoV-1 studies to build an artificial furin cleavage site, scientists in these experiments introduced a specific sequence (‘RR-x-RR’), a so-called double arginine motif (this gives the motif a higher chance of being recognized and cut by furin proteases). In SARS-CoV-2, we find the sequence (‘PRR-A-R’) [Figure 1b, above], which lacks the double motif. This specific sequence (‘PRR-A-R’) makes a strong case against the furin cleavage site being engineered:
  • First, this is inefficient. Why not add an additional arginine at the end (for example: ‘PRR-A-RR’) to increase your chances of cleavage? It’s really a missed opportunity from an engineering perspective. You decrease your odds of successful protease cleavage by half for no reason.
  • Second, there is a proline (‘P’) amino acid at the beginning of the cleavage site. Why introduce more amino acids than necessary for the proteolytic cleavage? Adding extra amino acids just risks the integrity of the spike protein through misfolding or unwanted space/interactions. Also, it is widely known that prolines are so-called helix-breakers: they disrupt the secondary structure (alpha helices) of the protein, increasing the likelihood of dysfunction. Prolines are about the last amino acids any scientist would introduce as a ‘spacer’.
  • Third, glycosylation. The addition of the proline before the furin cleavage site has the effect of opening up the residues S673, T678 and S686 to glycosylation (the attachment of sugars to amino acids; proteins are covered by sugars to varying degrees, especially proteins that are secreted and need to be soluble). Glycosylation patterns are hard to predict and might disrupt protein-protein interactions; an engineer would not want to risk that, because glycosylation might simply block the arginines from being recognized by proteases for cleavage. However, because of its ability to block protein-protein interactions, glycosylation can provide a shield against immune systems (Bagdonaite & Wandall, Glycobiology, 2018). So the unusual integration of the proline residue opens up the region and causes the glycosylation of the surrounding area, potentially staving off the immune system without interfering with protease cleavage (there is evidence that SARS-CoV-2 is indeed cleaved efficiently at the furin cleavage site).

It’s a classic case of: the proline residue is not needed, conventional scientific wisdom predicts it would be terrible there, but evolution still selected for it and made it work in an unpredictable and brilliant way.

Scientists wish they had the knowledge and competence to design something like that from scratch; but all aspects of the furin cleavage motif (‘PRR-A-R’) strongly point towards evolution by natural selection. This proline addition in the furin cleavage site is also evidence against SARS-CoV-2 ‘evolving’ in human cell cultures, since these could not have selected for it (cell cultures don’t have a proteolytic immune system; only animal models or humans do). More on the implications later.
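The motif comparison in this section can be made concrete with a few lines of Python. This is only a minimal sketch: the ‘RR-x-RR’ pattern and the two peptide strings come from the text, while the ‘R-x-x-R’ pattern is an added assumption (it is the commonly cited minimal furin recognition motif, not something the article states), and the function name `describe_site` is made up for illustration. Real cleavage prediction depends on structural context that a string match cannot capture.

```python
import re

# Pattern from the text: the engineered "double arginine" motif RR-x-RR.
DOUBLE_ARGININE = re.compile(r"RR.RR")
# Assumption (not from the article): the commonly cited minimal furin
# recognition motif R-x-x-R.
MINIMAL_FURIN = re.compile(r"R..R")


def describe_site(peptide: str) -> dict:
    """Report which of the two motif patterns occur in a peptide string."""
    peptide = peptide.upper()
    return {
        "minimal_motif": MINIMAL_FURIN.search(peptide) is not None,
        "double_arginine": DOUBLE_ARGININE.search(peptide) is not None,
        "starts_with_proline": peptide.startswith("P"),
    }


# The insert as written in the text, and the hypothetical variant with
# one extra arginine that the text calls a missed opportunity.
observed = describe_site("PRRAR")
hypothetical = describe_site("PRRARR")
print(observed)
print(hypothetical)
```

Under these assumptions, the observed insert satisfies only the minimal pattern, while the variant with one extra arginine would also satisfy the double arginine pattern used in the earlier SARS-CoV-1 experiments.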

C) The furin-cleavage site has unusual codon usage; arginines (R) can be encoded by multiple combinations of triplets, yet we find CGGs, which are extremely rare in bats but common in humans

  • Proponents of human origin hypotheses allege that the specific nucleotide combination CGG for both arginines (R) has a very low probability of occurring naturally, making human engineering of SARS-CoV-2 a comparatively likely scenario.
Amino acid translation table. The 3-letter-code on the left is called a ‘codon’, codons are read by specialized tRNAs to deliver their corresponding amino acids (right) to the ribosome for protein assembly.
  • It is true that bat strains like RaTG13 have a different codon preference for encoding arginines, and some previous human coronaviruses might not even use CGG at all (Hou, Virology Journal, 2021). However, about 3% of all arginines (R) in the SARS-CoV-2 genome (size ~30,000 bp) are encoded by the ‘CGG’ triplet, and these are spread out all over the genome, not only in the furin cleavage site. On top of that, SARS-CoV-1 (whose origins have been proven natural) uses ‘CGG’ to encode about 5% of the arginines (R) in its genome. So finding two CGG-arginines together might just be a low-probability coincidence (low-probability, but decidedly not impossible).
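To get a feeling for the numbers in this argument, here is a small, hedged Python sketch. It counts what fraction of arginine codons in an in-frame sequence are CGG and computes the naive chance of two given arginines both using CGG. The independence assumption, the function names and the toy sequence are all illustrative additions, not part of the cited studies; real codon choices are not independent, and the calculation ignores how many arginine pairs exist across the genome.

```python
from collections import Counter

# The six standard codons for arginine (DNA alphabet).
ARGININE_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}


def cgg_fraction(coding_sequence: str) -> float:
    """Fraction of arginine codons in an in-frame sequence that are CGG."""
    seq = coding_sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in ARGININE_CODONS)
    total = sum(counts.values())
    return counts["CGG"] / total if total else 0.0


def chance_of_cgg_run(per_codon_fraction: float, run_length: int = 2) -> float:
    """Naive chance that `run_length` given arginines all use CGG,
    assuming each codon choice is independent (a simplification)."""
    return per_codon_fraction ** run_length


# Hypothetical toy sequence, NOT real viral data:
# AGA CGG AGA GCT CGT -> four arginine codons, one of them CGG.
toy_sequence = "AGACGGAGAGCTCGT"
toy_fraction = cgg_fraction(toy_sequence)

# Using the approximate usage figures quoted in the text.
p_pair_3_percent = chance_of_cgg_run(0.03)
p_pair_5_percent = chance_of_cgg_run(0.05)
print(toy_fraction, p_pair_3_percent, p_pair_5_percent)
```

With the roughly 3% figure, the naive estimate for one specific pair is 0.03 × 0.03, or about 1 in 1,100; with the roughly 5% figure it is about 1 in 400. Both are rare for a single pair but far from impossible, which is the point the text makes.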

However, there might be an even more satisfactory explanation given by natural selection. To follow this line of argumentation, we first have to understand why there are differences in triplet usage between organisms. It mainly has to do with two factors: susceptibility to mutations and protein translation efficiency.

  • susceptibility to mutations: organisms are exposed to different environments and threats to their genome integrity. For example, exposure to UV radiation causes increased C-to-T mutations, certain chemicals might cause G-to-A mutations, and certain biological processes like faulty DNA repair might cause A-to-T mutations. Given that environmental background, and the fact that so-called wobble bases (the redundant bases at the end of a triplet) are less conserved, evolution will tend to optimize towards a triplet usage that maintains robustness for the organism (different codon preferences are one such optimization, for example by decreasing reliance on codons more susceptible to a specific type of base mutation). The clearest cases we have are thermal extremophiles, e.g. bacteria that live in volcanic hot springs. Their genomes are enriched for C- and G-containing codons, because at high temperatures too many weaker-binding As and Ts pose an existential risk of the DNA double strand melting.
  • translational efficiency: codon triplets are read by specialized tRNAs that deliver the right amino acids to the ribosome (the protein factory). Depending on the expression levels of the respective tRNAs, low-abundance tRNAs deliver more slowly than high-abundance ones, slowing the ribosome down and sometimes making it stall. Slowing the ribosome down can occasionally be useful, for example to give large proteins time to fold properly, or to give the RNA template time to finish its last splicing steps.
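The translational efficiency point can be illustrated with a deliberately simple toy model in Python. Everything here is an assumption made for illustration: the tRNA abundance numbers are invented, the two ‘hosts’ are hypothetical, and the rule that waiting time is inversely proportional to tRNA abundance is a simplification, not a measured relationship.

```python
# Hypothetical relative tRNA abundances per arginine codon (illustrative
# numbers only, not measurements from any organism).
HOST_A = {"CGG": 1.0, "AGA": 4.0, "CGT": 2.0}  # CGG-decoding tRNA is scarce
HOST_B = {"CGG": 4.0, "AGA": 1.0, "CGT": 2.0}  # CGG-decoding tRNA is abundant


def relative_decoding_time(codons: list, trna_abundance: dict) -> float:
    """Toy model: the wait at each codon is inversely proportional to the
    abundance of its matching tRNA; the total time is the sum of waits."""
    return sum(1.0 / trna_abundance[codon] for codon in codons)


# Two synonymous ways of encoding the same three arginines.
message_with_cgg = ["CGG", "CGG", "CGT"]
message_with_aga = ["AGA", "AGA", "CGT"]

times = {
    name: (
        relative_decoding_time(message_with_cgg, host),
        relative_decoding_time(message_with_aga, host),
    )
    for name, host in (("host A", HOST_A), ("host B", HOST_B))
}
print(times)
```

In this sketch the same protein is decoded faster with CGG codons in the host where the matching tRNA is abundant, and slower in the host where it is scarce, which is the kind of selection pressure on codon choice that this section describes.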

What does that have to do with viruses?

Viral proteins are usually optimized for quick turnover: the faster they can speed through the ribosome, the more units get produced and the faster the virus replicates. Viruses therefore benefit from a codon usage for which the host organism’s tRNAs are abundant (remember, viruses need the host organism’s tRNAs and ribosomes to replicate), so as not to delay viral production. Some recent protein synthesis experiments (Dasari et al., Infect. Genet. Evol., 2020) report that SARS-CoV-1, SARS-CoV-2 and MERS-CoV all have increased protein production compared to other human coronavirus strains that might not use ‘CGG’ to encode arginine (R). Again, SARS-CoV-1 and MERS-CoV both use ‘CGG’ codons and are of natural origin.

So in order to optimize viral production, it is consistent with evolutionary and natural selection pressures to ‘switch’ codon usage (or more precisely: ‘maintain certain nucleotide mutations over others’) according to the preference of the host organism in which these viruses replicate.

The fact that SARS-CoV-2 has some ‘CGG’ codons (especially in acquired ‘novel’ regions like the furin cleavage site) is therefore entirely consistent with it evolving in a host mammal (or even human) population whose codon preference and tRNA usage resembles ours more closely, similar to SARS-CoV-1 and MERS-CoV.

No need for engineering, just natural selection.

Conclusion

Most of the evidence from genome sequence or protein function strongly refutes any engineering intent or design attempt. The SARS-CoV-2 genome shows plenty of hallmarks of being evolved, within and outside of the ACE2-binding domain and furin cleavage site, and proponents of ‘engineering’ scenarios are lacking conceivable explanations to account for them. Deliberate genome engineering can therefore be ruled out with high confidence for SARS-CoV-2.

Evolution can however happen in natural or lab environments, and without further evidence, it will be difficult to completely rule out lab leak as a potential (though increasingly unlikely) start of the Covid-19 pandemic.

Ruling out genomic engineering leaves very few, low probability alternative explanations for lab leak:

  • Natural selection was ‘supercharged’ in a lab, e.g. by experimentation in human cell lines (the proline (P) selection and glycosylation argue against that) or serial passage in non-bat animal populations (some have suggested transgenic mice with humanized ACE2 receptors). In this scenario, a lot of things must have gone wrong and unnoticed in the lab before Wuhan researchers were first infected through the zoonotic jump from host animals and then unknowingly spread the virus to the outside; the wet markets were merely the first superspreader events, dense enough for the pandemic to get noticed. In response, after seeing the virus spread in the wet markets, the researchers killed the host populations and destroyed all evidence and records to evade local authorities. The internal cover-up was perfect and there were no whistle blowers.
  • the ‘ancient lab leak’: an ancestor version of SARS-CoV-2 was leaked from the lab (and all evidence for that ancestor was destroyed and nothing was ever published), and that ancestor later slowly and independently acquired the furin cleavage site in the ‘wild’, be it in humans or animals (the less infectious ancestor virus lacking the furin cleavage site would have had to fly under the radar for years in a considerably sized population for this to be true; weakly infectious viruses are not known to survive well for that long, making the whole scenario also very low probability).
  • ‘deliberate natural engineering’: critical aspects we know about evolutionary biology, selection, structural biology, protein engineering techniques of viruses are wrong and on top of that, Chinese virologists were incredibly brilliant and competent so they could use our false preconceptions of virology to fool the rest of the wider world of scientific experts by intentionally designing a virus that looks perfectly evolved. Also, the cover-up was perfect and there were no whistle blowers.

The three ‘leak’ scenarios are theoretically possible but seem an order of magnitude less likely than a zoonotic jump from bats to an intermediate host animal decades ago (and we have good arguments and models for that, see e.g. Robert F. Garry, Virological, 2021 and Lytras et al., bioRxiv, 2021 among many others).

From there, it is perfectly consistent with natural selection that a more optimized ACE2 binding and furin cleavage site and CGG codon usage evolved in that intermediate animal population before the jump to humans.

Summary

This article exposes the shallow basis of the prominent ‘science based’ supporting arguments for both genome engineering and lab leak scenarios. The article’s exclusive focus on the SARS-CoV-2 genome serves two functions: first, the full genome sequence is publicly available and undisputed. Second, it is sufficient to cut through the current noise of grief psychology, motivated reasoning, political narratives and controversial geopolitical actors, which are at best secondary or supporting considerations and cannot stand on their own.

Science doesn’t deal in absolute certainties, and is always open to new compelling evidence to update its assessment of reality. However, this doesn’t mean that each conceivable explanation is equally likely to be true, quite the opposite. The scientific method allows us to narrow down, rule out or contextualize scenarios with a high degree of confidence. There are good reasons why the scientific consensus has moved very little from a natural origin scenario, but this does not mean scientists are not doing their jobs or due diligence by following every lead and searching for more evidence.

For those interested in the science behind the discussion, this article provides a reasonable probability assessment of the various origin theories based on the hard scientific evidence we have to date. The true origins might however remain elusive for a very long time.

Lastly, even if we are lucky and definitive proof for a natural origin of SARS-CoV-2 eventually arrives, it ought not to absolve virologists and regulators of their newfound responsibility towards the public, because the next pandemic will come, and lab leaks will continue to pose a likely avenue for disaster. But that is a whole other discussion.
