3-dimensional protein structures such as this one are common subjects in bioinformatic analyses. (wikipedia.org)

We need more computer scientists in biology and biomedical research

Lior Zimmerman
5 min read · Apr 6, 2019

I have just returned from the AACR — one of the biggest cancer research conferences in the world. It attracts more than 18,000 participants from around the world and has been described as the “main forum to present and discuss cancer-related research”.

The AACR conference features talks and research posters in a plethora of domains, from ongoing clinical trials and novel drugs in various stages of development to basic research.

There were also talks that involved computational research projects. One notable example from this year is a great talk by Tommy Boucher, a computer scientist from Gritstone Oncology, who described a deep neural network for predicting T-cell neoantigens from mass spectrometry data of MHC-II peptides. Previous studies by this group were published in Nature Biotechnology.

If the title of this talk reads like gibberish to you, you are certainly not alone.

Computer scientists with a solid biological background are scarce. With no CS expertise available, biomedical and molecular biology researchers often approach computational problems with only the limited CS background at their disposal.

A survey conducted by the Society for Experimental Biology (SEB), in association with members of GOBLET (Global Organisation for Bioinformatics Learning, Education & Training), drew responses from more than 200 scientists. Of these, 76% had taught themselves various CS and bioinformatics skills or relied on colleagues to help them use bioinformatics tools and resources; 67% felt that data analysis and interpretation was the area in which training was most needed; and 74% suggested that extra training should be delivered via hands-on workshops.

Those findings are, sadly, consistent with my personal experience. At the AACR, I listened to several talks and viewed several posters where the computational part was poorly executed, from bad choices of inference models (deep learning for 20 samples?) to statistical blunders (a p-value > 0.05 doesn’t mean the null hypothesis is accepted!).
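To make the p-value point concrete, here is a minimal simulation sketch. It is an illustration under stated assumptions, not an analysis of any study mentioned here: the effect size (0.5 standard deviations), the group sizes, and the use of a two-sample z-test with a known standard deviation are all choices made for the example. The effect is real in every simulated experiment, yet with small groups most experiments still end with p > 0.05.

```python
import random
from statistics import NormalDist, fmean


def two_sample_z_pvalue(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means (sigma assumed known)."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_not_significant(n_per_group, true_effect, trials=1000, alpha=0.05, seed=0):
    """Share of simulated experiments with p > alpha although the effect is real."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Both groups are drawn from normal distributions that truly differ.
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        if two_sample_z_pvalue(control, treated) > alpha:
            misses += 1
    return misses / trials


for n in (10, 200):
    missed = fraction_not_significant(n, true_effect=0.5)
    print(f"n={n} per group: p > 0.05 in {missed:.0%} of experiments")
```

A non-significant result in the small-sample setting therefore says more about the lack of statistical power than about the absence of an effect.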

Indeed, scientists have raised the alarm that many machine learning papers published in various fields of science are not reproducible.

Often, computer scientists are involved in the research. However, in many cases they lack a proper biological understanding of the problem. For example, many won’t know that a large next-gen sequencing file can be clustered by genes and pathways, or that genes can be looked at from several perspectives: protein structure, localization in the cell, interaction network and more.

Many choose deep neural networks as a model for their scientific question, perhaps because it is a model that “works out of the box” and learns the set of predictive features by itself. However, in many cases, a much simpler model could have been chosen had the problem been put in the proper context.
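The risk of pairing a very flexible model with a tiny data set can be sketched in a few lines. The example below is a stand-in, not a deep network: it uses a 1-nearest-neighbor classifier as the simplest possible "memorizing" model, and the 20 samples, 500 features, and random labels are all invented for the illustration. The features contain no signal at all, yet accuracy on the training data is perfect, while leave-one-out evaluation shows the model has learned nothing useful.

```python
import random


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest_neighbor_label(query, samples, labels):
    """Predict by copying the label of the closest stored sample (1-NN)."""
    best = min(range(len(samples)), key=lambda i: squared_distance(query, samples[i]))
    return labels[best]


rng = random.Random(0)
n_samples, n_features = 20, 500

# Pure noise: the features carry no information about the labels.
samples = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
labels = [0] * 10 + [1] * 10
rng.shuffle(labels)

# Accuracy on the same data the model has memorized.
train_hits = sum(
    nearest_neighbor_label(s, samples, labels) == y for s, y in zip(samples, labels)
)
train_accuracy = train_hits / n_samples

# Leave-one-out accuracy: each sample is predicted from the other 19.
loo_hits = 0
for i in range(n_samples):
    rest = samples[:i] + samples[i + 1:]
    rest_labels = labels[:i] + labels[i + 1:]
    loo_hits += nearest_neighbor_label(samples[i], rest, rest_labels) == labels[i]
loo_accuracy = loo_hits / n_samples

print(f"training accuracy:      {train_accuracy:.2f}")
print(f"leave-one-out accuracy: {loo_accuracy:.2f}")
```

Reporting only the first number is the kind of mistake a held-out evaluation, or a simpler model with fewer degrees of freedom, helps avoid.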

Data in biomedical and molecular biology research is expensive to generate. While images of cats number in the trillions, images of osteosarcoma pathology slides coupled with the molecular profiles of the tumors are much less available. When data is scarce, domain expert knowledge and feature engineering capabilities are extremely valuable.

Computer scientists should be involved in the R&D process from its very beginning, starting with the design of the wet-lab experiments where the data is generated, and not just as sophisticated data analysts. This is because we, as computer scientists, understand that data is the most important resource. Your model, whether it is an incredibly sophisticated CGAN or a logistic regression, is only as good as your data.

As an example of that point, in one of my previous projects, my team was asked to build a model for predicting antibody binding to a series of proteins. We had a high-throughput system that allowed us to generate a data set for 1,000,000 different antibodies, which would be used for training a classifier. The biologists on the team compiled a list of antibodies that are known from published literature to bind the series of proteins we were interested in, and inserted 1–2 amino acid alterations in various positions (minor changes) to generate close to a million variants that would be tested in the high-throughput system.

This is similar to using a data set of images of poodles of different colors when your purpose is training a classifier that predicts whether an animal is a dog or not.

They were indeed trying to solve the problem at hand, by starting from antibodies that are known to bind the molecules we were interested in and inserting minor changes to find “novel” antibodies.

Clearly, that is not how a computer scientist would have approached the problem. For a model to be able to generalize, we need a data set that spans a much larger portion of the antibody sequence and structure space; the data set needs to be much more diverse.
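One rough way to quantify how little of sequence space such a library covers is the mean pairwise Hamming distance between its members. The sketch below is purely illustrative: the parent sequence is randomly generated rather than a real antibody, and the sequence length (30) and library size (200) are arbitrary assumptions. It compares a library of 1–2 substitution variants of a single parent with a library of unrelated random sequences of the same length.

```python
import random
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(parent, rng, max_changes=2):
    """Return a copy of parent with 1..max_changes substituted residues."""
    n_changes = rng.randint(1, max_changes)
    positions = rng.sample(range(len(parent)), n_changes)
    seq = list(parent)
    for pos in positions:
        # Always substitute a residue that differs from the original one.
        seq[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != parent[pos]])
    return "".join(seq)


def random_sequence(length, rng):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_pairwise_distance(seqs):
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


rng = random.Random(0)
length, n_seqs = 30, 200  # illustrative sizes, not taken from the project

parent = random_sequence(length, rng)  # hypothetical parent, not a real antibody
variants = [mutate(parent, rng) for _ in range(n_seqs)]
random_seqs = [random_sequence(length, rng) for _ in range(n_seqs)]

print(f"variant library: {mean_pairwise_distance(variants):.1f} differing positions on average")
print(f"random library:  {mean_pairwise_distance(random_seqs):.1f} differing positions on average")
```

Any two variants can differ in at most four positions, so the whole library sits in a tiny neighborhood around the parent. A check like this costs minutes and could be run before the wet-lab experiment rather than after it.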

Having joined the project at a late stage, I could do little to save it. The experiments were already finished and the data had already been generated, at a cost of more than $100k.

To sum up, it seems that biomedical and molecular biology research is in urgent need of computer scientists.

Certainly, the transition can also come from the opposite direction: biomedical and molecular biology researchers can and should acquire a proper computer science education, as the vast majority of them are incredibly capable.

You, as a skilled computer scientist or a CS student, should consider joining us in computational biology research. Not only will you work on incredibly interesting problems (selling ads is not one of them), you will also contribute to the health and well-being of billions of people.

There are also plenty of positions available, and the demand is increasing.

Lior Zimmerman

Computational Biologist, Head of Protein Design @ Enzymit