How a Major Fault of the Drug Trial Industry can be Fixed With Machine Learning

Sep 4, 2022

Hundreds of years of studying biology have taught us many things about how the human body works and about the cells that comprise it. The drug industry has done a great job of optimizing how drugs and nutrients interact with cells using our current knowledge of biology, but as it turns out, we don’t quite know the full story yet. Our ‘textbook’ picture of the eukaryotic cell membrane has fallen apart in recent years with the discovery of thousands of new proteinaceous transporters. We used to believe, without any empirical evidence, that small molecules could simply diffuse through the phospholipid bilayer, but recent studies suggest that this kind of passive diffusion is actually negligible. We have only just stumbled upon the reality that the transport of solutes and nutrients into a cell is carried out by the approximately 42 million protein molecules associated with the phospholipid bilayer, and what we do with that information is very important for the future of the drug industry.

What we thought comprised the phospholipid bilayer:

What actually comprises the phospholipid bilayer:

Drug testing used to assume arbitrary diffusion through the phospholipid bilayer, and measured how it would happen on average by testing many times on many people. With the discovery that proteins do the transporting, however, drug uptake turns out to be more genetic than arbitrary. With a fully sequenced human genome expected within roughly the next 5 to 10 years, followed by a complete proteome, drug testing can be based on inference from an individual’s proteome rather than on guessing from a testing range. An individual’s primary sequences from a full genome, combined with proteome information about how different proteins function, can reveal how many of each type of protein sit in the phospholipid bilayer, and therefore how drugs will be transported into specific cells. Far too often, a person’s cells take in too much of a particular drug, causing toxic effects, or take in too little of a life-saving drug for it to make a difference, all because we are not tailoring drug doses to an individual’s genetics. Basing a dose on an individual’s drug intake characteristics, which are determined by genetics, can not only make taking drugs safer but also make the process of testing and prescribing doses much easier.

How it works:

All proteins made for cell transport in the human body are encoded by our genes, and because genes are the blueprint of life, we can look into the specific part of a person’s “blueprint” that is their proteome for information about which proteins will transport which drugs. Proteins are by nature unstable and will bind the specific compounds they are meant to move. We can tell from a protein’s chemistry which drugs it will affect, but the more important part is knowing how many copies of that protein an individual has.

Machine Learning for Optimizing Drug Dosage:

To infer how an individual’s cells will take in drugs and other nutrients, we first have to know what is carrying those drugs, and how many of those protein molecules there are in the phospholipid bilayer of the cell’s membrane. Because an estimated 400,000 different types of proteins make up the proteome, produced by around 20,000 protein-coding genes, it is hard for conventional methods to pinpoint exactly how many proteins that inhibit or encourage drug uptake are produced by an individual’s genes. That is where machine learning comes in. Machine learning methods such as SVMs (support vector machines) use the kernel trick to separate datasets in ways that humans do not normally explore. The kernel trick effectively projects a dataset into a higher dimension, where it can be split into separate sets for learning.

Conventionally, data can be separated by a straight line, and a new point is classified by which side of the line it falls on. But genes give rise to many different proteins that may or may not inhibit or encourage drug intake, and data like that cannot be linearly separated on a 2D plane, so the kernel trick separates it in a higher-dimensional space, for example in 3D.
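A toy example of that lift from 2D to 3D, assuming two made-up classes arranged as concentric rings (no straight line in 2D can separate a ring from the ring around it). Here the extra dimension is added explicitly, as the squared distance from the origin; a kernel would do the equivalent implicitly. The helper names, the radii, and the threshold are all hypothetical.

```python
import math

def lift(point):
    """Map a 2D point (x, y) to 3D by appending its squared distance from the origin."""
    x, y = point
    return (x, y, x * x + y * y)

def ring(radius, n=8):
    """Return n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(1.0)  # hypothetical class A
outer = ring(3.0)  # hypothetical class B, surrounding class A

# In 3D, the flat plane z = 4 sits between the two classes
THRESHOLD = 4.0
inner_below = all(lift(p)[2] < THRESHOLD for p in inner)
outer_above = all(lift(p)[2] > THRESHOLD for p in outer)

print(inner_below, outer_above)  # True True
```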

The decision surface, a hyperplane in the projected space, divides the dataset so that a new input can be classified by which side of it the input falls on, and therefore by what it is most likely to be. The kernel trick also works on data frames that already have more than two dimensions, which is why it is a very good choice for high-dimensional machine learning when good computing power is available.
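As a rough sketch of how a hyperplane classifies a new input: the hyperplane is described by a weight vector and a bias, and the sign of `weights · point + bias` says which side the point is on. The weights, bias, and sample points below are invented for illustration; a trained SVM would learn them from data.

```python
def decision_value(weights, bias, point):
    """Signed score of a point relative to the hyperplane weights . x + bias = 0."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(weights, bias, point):
    """Return +1 or -1 depending on which side of the hyperplane the point lies."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1

# Hypothetical hyperplane in a three-feature space: the plane z = 4
weights, bias = (0.0, 0.0, 1.0), -4.0

print(classify(weights, bias, (1.0, 0.0, 1.0)))  # -1, below the plane
print(classify(weights, bias, (3.0, 0.0, 9.0)))  # 1, above the plane
```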

Diagram of Defuse Lab’s SVM, a leader in the idea

The kernel trick is especially useful for a data frame of proteins indexed by the genes that produce them, because it can be used to determine their overall balance. For example, an individual’s primary sequence could show that they have gene ABC, which produces protein X; gene EFG, which produces protein Y; and gene HIJ, which produces protein Z. The proteome says that proteins X and Y both inhibit the drug while protein Z encourages it, so drug uptake may still happen, but if X, Y, and Z make up equal shares of this cell’s phospholipid bilayer, two inhibitors against one encourager leave uptake hindered by a net of about 33%. But because millions more proteins make up the cell’s proteome, the SVM would take all of the thousands of genes that produce proteins like Z, and all of the thousands of genes that produce proteins like X and Y, and use a hyperplane to look for a major outlier. In other words, the SVM can use easy-to-obtain information about the proteome, gathered through simple diagnostics, to infer the more complicated parts of the proteome. For example, aquaporin-2, a protein responsible for the transport of water, is encoded by the AQP2 gene and expressed in the kidneys, which means that if aquaporin-2 is found in the bloodstream during a total protein blood test, we can deduce that the AQP2 gene is active and that the proteins encoded by it take part in the proteome.
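The X, Y, Z arithmetic above can be written out as a small weighted sum. This is only one reading of the 33% figure: it assumes each protein contributes a simple -1 (inhibits) or +1 (encourages), weighted by its share of the membrane. The `effects` and `shares` values are placeholders, not measured data.

```python
# Hypothetical effect of each transporter on uptake: -1 inhibits, +1 encourages
effects = {"X": -1, "Y": -1, "Z": +1}

# Assumed share of the membrane's transporters held by each protein
shares = {"X": 1 / 3, "Y": 1 / 3, "Z": 1 / 3}

def net_effect(effects, shares):
    """Share-weighted sum of effects; negative means uptake is hindered overall."""
    return sum(effects[name] * shares[name] for name in effects)

net = net_effect(effects, shares)
print(f"{net:+.0%}")  # -33%
```

With real data the shares would not be equal, and there would be far more than three proteins, which is where a learned model becomes more practical than a hand-written sum.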

Why this is useful:

The drug screening and testing industry is a $14 billion (USD, 2022) market, all for testing in ways that hopefully apply to a broad range of people. It takes about 15 years for a drug to reach the market from testing, and even then only about 10% of drugs make it through the four phases of clinical trials, with 59% of the failures occurring once the drug has reached human trials. There is a disconnect between how we actually test drugs and what causes drugs to fail within the body. Those 59% are drugs that do not work the way they were expected to once within the body, because protein transport is not yet taken into account, and this only inflates the $1.27 trillion (USD, 2020) pharmaceutical industry. In the near future, it will be time to let modern-day biology dictate how drugs are tested on humans by simplifying testing down to the actual science. Not only will this dismantle the bloated drug testing industry and the over-inflated pharmaceutical industry, it will also make drug testing faster and more consistently safe for the consumer, so that these industries stop profiting off of people’s suffering.

Sources:

Written by Jacob Appleton
