Macula X — Reimagining the Future of Protein Structure Determination

Proteins do almost everything in our cells. Why don’t we know what most of them look like?

Mukundh Murthy
MaculaX Therapeutics
Apr 22, 2020

Proteins do pretty much everything within our cells. From manipulating our genetic material to carrying out metabolic functions, understanding how our proteins work is crucial to a detailed understanding of how the human body and its physiological processes work.

It goes beyond scientific curiosity, though: detailed protein structures down to the angstrom level (1/10 of a nanometer) are often required for drug development, because proteins are usually the “targets” that we’re looking to activate or inhibit with small-molecule drugs.

Though the field of protein structure determination has seen numerous developments over the past few years, most remain costly or ineffective. Take AlphaFold, DeepMind’s recent innovation on the protein folding problem, which uses deep learning (more specifically, neural networks that predict the distances between residues) to figure out the structure of proteins.

It’s undoubtedly enhanced our ability to fold proteins de novo, that is, without the aid of experimental data. Nevertheless, its resolution, as evidenced by recent protein folding competitions such as CASP, is merely 11–13 Å, which is roughly 4–5 times coarser than the resolution used for drug discovery.

Traditional structure determination methods, namely X-ray diffraction (XRD) and nuclear magnetic resonance (NMR), require expensive machinery (~$7 million plus maintenance costs) and are inherently unpredictable, often requiring weeks or months of trial and error to get one result. Especially for underfunded academic labs across the world, X-ray diffraction is a major bottleneck in the research pipeline.

XRD traditionally works by diffracting X-rays off of orderly arrays of crystallized proteins. In essence, it allows researchers to generate averaged maps of the electron density within the protein: mathematical techniques such as the Fourier transform convert the diffraction pattern into an electron density map, from which the protein structure is modeled. Some proteins have “intrinsically disordered regions”; these are the most flexible domains of a protein, and they often carry out most of the cellular manipulation and function. Because these domains look different from one copy of the protein to the next within a crystal, XRD averages their electron density into a smear, and the residues fail to appear in the final structure.
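To make the Fourier step concrete, here is a minimal one-dimensional sketch. The three "atoms", their positions, and their weights are made up for illustration, and the toy keeps the phases of the structure factors, which a real diffraction experiment does not measure directly. It only shows how summing Fourier terms turns diffraction-style coefficients back into a density map with peaks at the atom positions.

```python
import cmath
import math

# Hypothetical 1D "unit cell": (fractional position in [0, 1), scattering weight).
atoms = [(0.20, 6.0), (0.55, 8.0), (0.80, 7.0)]

def structure_factor(h, atoms):
    """F(h) = sum_j f_j * exp(2*pi*i*h*x_j), the coefficient for index h."""
    return sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)

def electron_density(x, factors):
    """Inverse Fourier sum: rho(x) = sum_h F(h) * exp(-2*pi*i*h*x)."""
    return sum(F * cmath.exp(-2j * math.pi * h * x) for h, F in factors.items()).real

# A finite set of coefficients, as in a resolution-limited data set.
h_max = 15
factors = {h: structure_factor(h, atoms) for h in range(-h_max, h_max + 1)}

# Evaluate the density on a grid across the cell.
n = 200
grid = [i / n for i in range(n)]
density = [electron_density(x, factors) for x in grid]

# Keep the three tallest local maxima (the cell is periodic, so indices wrap).
local_max = [i for i in range(n)
             if density[i] > density[i - 1] and density[i] > density[(i + 1) % n]]
top3 = sorted(local_max, key=lambda i: density[i], reverse=True)[:3]
peaks = sorted(grid[i] for i in top3)
print(peaks)  # peak positions should sit close to the atom positions
```

Lowering `h_max` broadens the peaks, which is a rough analogue of a lower-resolution map.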

The best way to avoid all of these issues would be to develop methods for single-protein imaging, that is, imaging a single protein rather than an ensemble of copies of the same molecule.

The consequences of inefficient protein structure determination extend far beyond academia and the pure biological sciences. For example, look at the most recent worldwide pandemic, the SARS-CoV-2 pandemic. Access to the structures of major protein targets such as the SARS-CoV-2 main protease (also known as 3CLpro), along with other structural proteins and members of its replicase complex, was delayed by one or more weeks, a delay that likely formed part of the bottleneck that cost the world hundreds of thousands of lives.

Here, we propose a novel method for determining protein structures, one involving a simple cocktail of nanomaterials and nanopore methods.

We see three main uses for our technology:

  • Accelerating the field of drug discovery: Traditional drug discovery takes an extremely long time (anywhere from 1–2 decades). One of the main bottlenecks is the high barrier to accessing quality protein structures. Without the structures of the protein targets of interest, we lack the key structural and chemical constraints on the molecules that we must design and synthesize as potential candidates.
  • Reducing toxicity and side effects: Without the structures of the proteins involved, we cannot obtain a complete picture of the molecular landscape of a pathology, and this incomplete understanding can lead to unnecessary toxicity and side effects. In other words, if we could reliably access protein structures and design potent small molecules with multiple cellular pathways and mechanisms in mind, would we need clinical trials? Clinical trials largely reflect scientists’ inability to understand the innate complexity of human biochemical transduction pathways and cascades. Opening the door to protein structures would be a large step towards faster approval of medicines.
  • Understanding complex biology: Ultimately, an understanding of complex biology lends itself to drug discovery, as the ultimate goal is to cure society of ailments. However, to even begin to propose drugs, we need to understand the mechanism by which a disease works. To date, the Protein Data Bank has 2227 proteins, while scientists predict that the human body has between 10,000 and 10 billion different species of proteins. How do we expect to effectively personalize medicine and cure a variety of diseases if we know at most about 22.3% of the human proteome? And beyond the proteome as a whole, diseases that have constantly garnered scientific attention, such as Alzheimer’s, involve complex protein cascades.
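The coverage estimate in the last bullet can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (2227 known structures, and a proteome of between 10,000 and 10 billion protein species). The variable names are ours.

```python
known_structures = 2227          # figure quoted in the text
proteome_low = 10_000            # lower estimate quoted in the text
proteome_high = 10_000_000_000   # upper estimate quoted in the text

# Best case: the proteome is as small as the lowest estimate.
best_case = known_structures / proteome_low
# Worst case: the proteome is as large as the highest estimate.
worst_case = known_structures / proteome_high

print(f"best case:  {best_case:.2%}")
print(f"worst case: {worst_case:.7%}")
```

Even the best case, a little over 22%, leaves more than three quarters of the proteome without a structure, and the worst case is smaller by six orders of magnitude.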

Ultimately, with our technology, we propose a Protein Structurome project which, together with advances in genomics, metabolomics, transcriptomics, and all of the other -omics fields, holds the potential to revolutionize the entire field of medicine and save millions of lives. By mapping out the pathways involved in particular disease modules, we can gain access to many more therapeutic targets, predict side effects, and gain a new understanding of diseases.

Researchers mapped (almost) the entire human proteome half a decade ago, yet we still don’t know the structure of the majority of its proteins.

Before we introduce our new approach, we’ll review traditional methods for protein structure determination and where they fall short.

Single Protein Imaging Techniques

Many single-protein imaging techniques involve high doses of radiation, which damage proteins and other biomolecules and thereby disrupt their secondary, tertiary, and quaternary structures. In addition, many still rely on single crystals, which poses a similar problem to X-ray diffraction: proteins with intrinsically disordered regions cannot be imaged.

Other methods use heavy metal atoms, but these can coordinate with functional groups on amino acids and disrupt the three-dimensional structure. Here we use gold nanoparticles that are anchored to the surface of the nanopore channels, so they will not be able to coordinate or ligate with the amino acid backbone or side chains.

Our Methods

In the past decade, we’ve seen an explosion in the use of nanopore technology for DNA sequencing and, even more recently, protein sequencing. We think we may have found a way to exploit these same methods, this time to explore and determine protein structures.

Nanopore sequencing works by threading a DNA strand through a nanopore, nucleotide by nucleotide, and determining the chemical identity of each nucleotide from the perturbations it causes in the ionic current flowing through the pore. These electrical signal perturbations are characteristic of each nucleotide and can be used to identify the exact composition of the sequence.
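As a rough illustration of the idea, the sketch below assigns each measured current to the nucleotide with the closest reference level. The current levels are placeholders we made up, and real devices sense several nucleotides at once and decode the signal with trained models, so this shows only the principle and is not a working basecaller.

```python
# Hypothetical mean blockade currents in picoamps for each base.
# Placeholder values for illustration only, not measured data.
LEVELS_PA = {"A": 52.0, "C": 47.0, "G": 58.0, "T": 42.0}

def call_base(current_pa):
    """Return the base whose reference level is closest to the measured current."""
    return min(LEVELS_PA, key=lambda base: abs(LEVELS_PA[base] - current_pa))

def call_sequence(trace_pa):
    """Call one base per current measurement in the trace."""
    return "".join(call_base(current) for current in trace_pa)

# One (noisy) current reading per nucleotide passing through the pore.
trace = [51.3, 41.8, 57.1, 47.9, 52.6, 43.0]
print(call_sequence(trace))
```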

Companies such as Oxford Nanopore Technologies have developed sequencing devices based on these methods, such as the ION series, which includes small handheld devices that allow for convenient, portable, high-throughput DNA sequencing.

And even more recently, scientists have learned how to use the same technology to sequence proteins: using properties such as the polarity, hydrophobicity, aromaticity, and size of amino acids, they’ve been able to identify protein sequences much more conveniently from their unique electrical signatures.

What if we could take traditional X-ray diffraction and cryo-EM machines and shrink them down to the size of nanopore sequencing machines that cost less than an iPhone?

How our Technology will Work

Our technology will work through a combination of nanotechnology, machine learning, and hardcore biological sequencing methods.

We propose two main mechanisms to identify both structure-based and sequence-based motifs. A nanopore-based method will allow us to infer the structural motifs of a protein from an ionic current, while gold nanoparticles will allow us to deduce the identities of the individual amino acid residues in the protein.

Corkscrew mechanism

We propose using ATP synthase to rotate the protein structure in the nanopore channel.

Combined with an in vivo random search over the possible orientations in which the protein can be inserted into the channel, the ATP synthase complex will rotate the protein, causing fluctuations in the surface area exposed to incoming ions and thereby allowing different secondary and tertiary structure motifs to be determined.

Using mathematical techniques such as the Fourier transform, together with reinforcement learning, we will interpret these signals as biologically relevant three-dimensional computational structures.
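As a first, hedged illustration of the signal-processing side, the sketch below simulates an ionic current that is modulated periodically as the protein rotates and uses a fast Fourier transform to recover the rotation frequency. The sampling rate, rotation rate, amplitudes, and noise level are all assumptions chosen for the example. Going from such spectra to a three-dimensional structure is the open problem, and this sketch does not attempt it.

```python
import numpy as np

sample_rate_hz = 1000.0   # hypothetical sampling rate of the current trace
rotation_hz = 12.0        # hypothetical rotation rate of the protein
n_samples = 2000          # two seconds of signal
t = np.arange(n_samples) / sample_rate_hz

# Simulated current: a baseline, a component at the rotation frequency,
# a weaker third harmonic (finer surface features), and Gaussian noise.
rng = np.random.default_rng(0)
current = (
    100.0
    + 5.0 * np.sin(2 * np.pi * rotation_hz * t)
    + 2.0 * np.sin(2 * np.pi * 3 * rotation_hz * t)
    + rng.normal(0.0, 0.5, n_samples)
)

# Remove the baseline, then look for the strongest frequency component.
spectrum = np.abs(np.fft.rfft(current - current.mean()))
freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate_hz)
dominant_hz = freqs[np.argmax(spectrum)]
print(dominant_hz)
```

The harmonics above the rotation frequency are where information about surface features would live in this toy picture, which is why the spectrum, and not just the dominant peak, would matter in practice.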

Directing Proteins to the Nanopore

We propose imaging proteins in vivo in addition to ex vivo, which will allow us to accurately sample biologically active conformations rather than artificially constructed ones. With the explosion in new genomics, epigenomics, proteomics, and transcriptomics data, we believe that we can direct proteins to our nanopore site using either a nucleic acid or protein beacon that has predicted or already known binding affinity to the protein in question. Here are some ways to elucidate the composition of this molecular beacon.

  • ChIP-seq data: use data showing how proteins bind to DNA to select specific DNA sequences as potential molecular beacons
  • biochemical transduction pathways: use pre-existing knowledge about how proteins interact with each other (protein-protein interactions, or PPI) to culture another protein with known high affinity to your protein of interest
  • If no ChIP-seq or PPI data exists, we recommend conducting HT-SELEX analyses, which search through randomly sampled RNA libraries to identify short, potent RNA sequences that can bind to your protein of interest. Tools like Aptasim can simulate error-prone PCR in silico, and we project that by 2030, HT-SELEX will be computationally robust and feasible.
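To give a feel for what an in silico error-prone PCR simulation involves, here is a deliberately simple sketch. It is not how Aptasim or any other named tool works. The seed sequence, error rate, and cycle count are arbitrary assumptions, and the model ignores amplification bias, insertions, and deletions.

```python
import random

BASES = "ACGU"  # RNA alphabet, matching the RNA libraries described above

def mutate(sequence, error_rate, rng):
    """Copy a sequence, substituting each base with probability error_rate."""
    copied = []
    for base in sequence:
        if rng.random() < error_rate:
            copied.append(rng.choice([b for b in BASES if b != base]))
        else:
            copied.append(base)
    return "".join(copied)

def error_prone_pcr(pool, cycles, error_rate, rng):
    """Each cycle doubles the pool by adding one error-prone copy per molecule."""
    for _ in range(cycles):
        pool = pool + [mutate(seq, error_rate, rng) for seq in pool]
    return pool

rng = random.Random(0)                  # fixed seed for reproducibility
seed_pool = ["GGGAUCCGUAACGUUAGCCA"]    # hypothetical 20-nt starting aptamer
pool = error_prone_pcr(seed_pool, cycles=5, error_rate=0.02, rng=rng)
print(len(pool), "molecules,", len(set(pool)), "distinct sequences")
```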

Questions and points to consider in the development of our technology:

  • Do existing mathematical techniques like Independent Component Analysis (ICA), Singular Value Decomposition (SVD), and the Fast Fourier Transform (FFT) allow us to decompose the electrical signal perturbations into high-resolution secondary structure motifs?
  • Cost of gold nanoparticles: based on our calculations of the average solvent accessible surface area and supporting stoichiometric calculations, the maximum amount of gold nanoparticles needed for sequencing one protein will likely be around one milligram, costing about $80.
  • Will our method be able to gain access to elements of structure outside of the solvent accessible surface area (especially the hydrophobic core)?
  • How many nanopores are required for structure determination? Current ION devices from Oxford Nanopore Technologies use anywhere from 300 to 10,000 nanopores within one sequencing flow cell. We project that anywhere from 10,000 to 1,000,000 nanopores will be required in tandem for high-quality structure determination at the single- and sub-angstrom level.

Software: We propose using machine learning algorithms (especially dimensionality reduction algorithms, such as PCA) to identify the secondary structure elements of a protein through clustering, including:

  • alpha helices
  • 3₁₀ helices
  • beta pleated sheets
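A minimal sketch of that pipeline, under heavy assumptions: the feature vectors below are synthetic stand-ins for per-segment signal features, the three groups are hypothetical motif classes, and the clustering step is plain k-means, which the text does not specify. PCA is done with an SVD, one of the techniques listed in the questions above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-segment signal features (for example mean current,
# dwell time, noise level). Three hypothetical motif classes, 30 segments each.
centers = np.array([
    [10.0, 0.0, 0.0, 5.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 10.0, 0.0, 0.0, 5.0],
])
true_labels = np.repeat(np.arange(3), 30)
features = centers[true_labels] + rng.normal(0.0, 1.0, size=(90, 6))

def pca(data, n_components):
    """Project mean-centered data onto its top principal axes using an SVD."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels

projected = pca(features, n_components=2)   # 6 features reduced to 2
labels = kmeans(projected, k=3)
print(np.bincount(labels))                  # cluster sizes
```

On real nanopore traces the classes would not be this cleanly separated, so the choice of features matters far more than the choice of clustering algorithm.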

Further, we intend to use Transformer models from natural language processing (NLP), such as RoBERTa, which have recently achieved breakthrough levels of accuracy and precision on a wide range of NLP tasks. Though primarily used in NLP, these methods have also begun to be used in a wider variety of tasks, including computational chemistry and small-molecule drug design. As opposed to older NLP methods like RNNs and LSTMs, they capture long-range dependencies in data because they use “attention mechanisms” to observe all of the data at once.
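For readers unfamiliar with attention, the following is a minimal sketch of single-head scaled dot-product self-attention, the core operation inside Transformer models. The weights are random and the input is a made-up sequence of eight signal windows, so the output is meaningless. The point is only that every position receives a weight for every other position in one step.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of every pair of positions
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4           # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))       # stand-in for 8 embedded signal windows
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The `weights` matrix has one row per position and one column per position, which is what lets the model relate distant parts of a sequence directly instead of passing information step by step as an RNN does.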

Gold Nanoparticles

While the corkscrew mechanism allows for the detection of secondary structure motifs through perturbations in ion flow, gold nanoparticles will allow for the specific detection of different amino acids.

Researchers have recently used molecular dynamics to show that nanoparticles of the same size have different binding energies to each of the twenty amino acids. Moreover, each of the twenty amino acids has preferential binding to a nanoparticle of different diameter due to the chemical composition and size of its sidechain. By translating the differential binding affinities of each of the amino acids into fluorescent signals, we can computationally identify the amino acids composing a protein.

Each amino acid has a different binding preference for each of the differently sized AuNPs (gold nanoparticles), as evidenced by different binding free energy levels.
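One simple way to turn such binding profiles into an identification step is nearest-profile matching, sketched below. The diameters and free energies are placeholder numbers invented for the example, not published values, and only four amino acids are included. The translation from fluorescence to binding energy is assumed to have happened already.

```python
import math

# Hypothetical binding free energies (kJ/mol) of a few amino acids to gold
# nanoparticles of three diameters. Placeholder numbers, not published data.
AUNP_DIAMETERS_NM = (2, 5, 10)
REFERENCE_PROFILES = {
    "Cys": (-42.0, -38.0, -35.0),
    "Trp": (-30.0, -36.0, -40.0),
    "Gly": (-12.0, -10.0, -9.0),
    "Lys": (-25.0, -22.0, -28.0),
}

def identify_residue(measured):
    """Return the amino acid whose reference profile is nearest (Euclidean distance)."""
    return min(
        REFERENCE_PROFILES,
        key=lambda aa: math.dist(REFERENCE_PROFILES[aa], measured),
    )

# One measured value per nanoparticle diameter, in the same order as above.
measured_profile = (-29.0, -35.0, -41.5)
print(identify_residue(measured_profile))
```

With all twenty amino acids, some profiles would likely sit close together, so a practical version would need more nanoparticle sizes or a probabilistic classifier. That is our assumption and not a claim from the cited work.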

We recommend using the ProtSA web server to identify the average solvent accessible surface area per residue in the unfolded conformations of the protein structure. This should provide insights into the number of gold particles needed to cover the structure and the specific angle of orientation with which to insert the protein.

Implementation

Protein structure determination is a fundamental tenet of structural biology and is becoming a key bottleneck in fields such as biochemistry, longevity, and machine learning. Again, our ability to unlock the key to therapeutics and understand the human code at a much deeper level extends beyond our genetic code (DNA and RNA) to the actual workers of the cell: the proteins, which carry out all important cellular functions.

We envision a world where drug discovery doesn’t take two decades. Where vaccine development to slow disastrous pandemics like COVID-19 doesn’t take a whole year. Where we’re able to provide treatments and cures to the millions of people around the world with chronic diseases that right now have no definitive cure.

Macula Therapeutix is a company founded by Kiran Mak and Mukundh Murthy, two innovators passionate about changing the world by rethinking one of the most traditional and unquestioned methods for acquiring structural biology data.

Mukundh Murthy
MaculaX Therapeutics

Innovator passionate about the intersection between structural biology, machine learning, and cheminformatics. Currently @ 99andbeyond.