My First Kaggle Competition

Bethany Baumann · Published in Beth Blog · 6 min read · Sep 29, 2020

I am definitely a beginner at machine learning, but I have heard a lot about Kaggle.com, a website that hosts data science competitions. Naturally, I went there and typed in “biology” to see what kind of data sets were available to play with. A competition hosted by OpenVaccine for predicting the reactivity of bases in messenger RNA (mRNA) sequences came up, and I thought I’d give it a go. OpenVaccine is associated with Eterna, a site where anyone can ‘play video games’ with RNA to solve biological problems. I’ll have to check that out later; it sounds like a lot more fun.

The reason mRNA stability has anything to do with vaccines is that mRNA-based vaccines are an emerging technology. A few companies have mRNA vaccines for COVID-19 in late phase clinical trials, such as Moderna and Pfizer with BioNTech.

Conventional vaccines use proteins or protein fragment antigens from a virus to invoke an immune response. Producing these requires live virus (which is inactivated before the product reaches patients) and a production vector, such as mammalian cells or chicken eggs, that can produce large quantities of viral proteins. This production method is difficult to scale up, takes a lot of time, and shipping the final product is expensive. On the other hand, mRNA vaccines use a person’s own cells to produce the viral protein antigen. The mRNA contains instructions for the protein, and cells around the vaccine site will translate it and signal to the immune system that an invader is present. Unlike a virus, the mRNA can’t replicate on its own, and will degrade in the body. Producing mRNA vaccines does not require as much specialized equipment as conventional vaccines, and they can be produced quickly when needed, at sites all over the world.

RNA is very prone to degradation, but internal binding of bases to each other can help make an RNA sequence more stable. Since multiple mRNA sequences can encode the same protein (due to alternative codon usage), there is some wiggle room in designing mRNAs. I think the purpose of this OpenVaccine challenge is to create a model that identifies the bases that are the weakest links in mRNA molecules.

The competition asked for base-level predictions of 3 degradation-related continuous targets:

  1. reactivity
  2. degradation at 50C in the presence of Mg
  3. degradation at pH10 and 50C in the presence of Mg

You’re asked to predict these using features provided for each RNA:

  1. sequence (the string of bases: A, C, G, and U)
  2. length (number of bases)
  3. structure (a representation of the RNA’s secondary structure from the RNAFold web server: a string of ‘(’, ‘)’ and ‘.’ characters, where ‘.’ represents an unpaired base and matching ‘(’ and ‘)’ characters represent two bases paired with each other)
  4. type of loop structure (this is also from RNAFold, and it is a string of characters annotating each base as part of a stem, internal loop, bulge, external loop, etc.)
  5. error in the measurement of targets
An example RNA sequence and its structure generated from RNAFold
The secondary structure of the same RNA as above with the type of loop structure labeled

Once you make a Kaggle account and agree to the competition rules, you can download the data. You can also look at other people’s code notebooks if they share them, but I didn’t look at anything until I was done, because this experience was more about my learning than about getting a good score. I told myself I’d look if I got stuck. One thing I wish I had known in advance was that pandas has functionality to read .json files directly! You can learn a lot from other people’s code.
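For the curious, here is a minimal sketch of that loading step (assuming the competition’s train.json is in the working directory; the OpenVaccine files are line-delimited JSON, so lines=True is needed):

```python
import pandas as pd

# The competition files are line-delimited JSON (one record per line),
# so lines=True is required for pandas to parse them.
train = pd.read_json("train.json", lines=True)

print(train.shape)
print(train.columns.tolist())
```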

The first thing I did was try to engineer some new features I thought might be important. From my years of training in biology (which turn out to be way less important in this particular application than my month of training in data science…), I thought that GC content, loop structure length, total number of loops, base-pairing (G pairs with U as well as C in RNA), and the bases and loop structure types directly next to each base might be important.
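As a rough sketch of what some of this feature engineering can look like (the helper names are mine, and treating a hairpin loop as a run of dots directly enclosed by a bracket pair is a simplifying assumption):

```python
import re

def gc_content(seq: str) -> float:
    # Fraction of G and C bases in the RNA sequence.
    return (seq.count("G") + seq.count("C")) / len(seq)

def count_hairpin_loops(structure: str) -> int:
    # In dot-bracket notation, a hairpin loop shows up as a run of
    # unpaired bases ('.') directly enclosed by a matched '(' ... ')'.
    return len(re.findall(r"\(\.+\)", structure))

train["gc_content"] = train["sequence"].map(gc_content)
train["n_loops"] = train["structure"].map(count_hairpin_loops)
```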

Correlation heatmap matrix of the continuous features

Looking at the correlation between the variables, we can see that the target variables (deg50_* and reactivity) are strongly correlated with each other. Total GC content has a correlation with the target variable errors and with the total number of loops. Loop structure length and total number of loops were strongly negatively correlated, which makes sense: as an RNA contains more loop structures, each one must get shorter.
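The heatmap itself is easy to reproduce; here is a sketch, assuming the features and targets have been flattened into one numeric, per-base DataFrame that I’ll call per_base (a name I am inventing for illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlate every continuous feature and target against every other.
numeric = per_base.select_dtypes("number")
sns.heatmap(numeric.corr(), cmap="coolwarm", center=0)
plt.title("Correlation of continuous features")
plt.tight_layout()
plt.show()
```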

None of the other new or old features were really correlated with each other, which is good for most data science models, but… at the same time none of them were correlated with the targets. Maybe all the small correlations will add up to something useful? Probably not.

The error features were highly correlated, and I decided to remove bases with high errors (about 7% of the data) from the dataset before making any more graphs or models.
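A sketch of that filtering step, keeping roughly the best 93% of bases (the error column selection here is illustrative, not necessarily the exact column names in the files):

```python
# Columns holding the measurement error for each target (illustrative).
err_cols = [c for c in per_base.columns if "error" in c]

# Score each base by its worst error across targets,
# then drop roughly the top 7%.
worst_err = per_base[err_cols].max(axis=1)
per_base = per_base[worst_err <= worst_err.quantile(0.93)]
```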

I also plotted some of the categorical variables against the target deg50_Mg. I think that higher deg50_Mg scores mean less stable bases, but I don’t know the units or the measurement technique. It seems like sequence context has a mild relationship with degradation, with G’s in a GGA context having the worst stability on average.

Effect of base context on degradation

Unfortunately for me, given how long I spent figuring out how to write a function to extract base-pairing partners, the identity of the partner didn’t really matter as much as whether or not the base was paired at all (’N’ is unpaired in the graph below).

Effect of specific base-pairing partners on base degradation
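For what it’s worth, here is a sketch of the kind of pair-extraction function I mean; the standard trick is a stack over the dot-bracket string, pushing indices on ‘(’ and popping on ‘)’ (the function name and the ’N’ convention for unpaired bases are mine):

```python
def pairing_partners(sequence: str, structure: str) -> list:
    # For each position, record the base it is paired with,
    # or 'N' if it is unpaired.
    partners = ["N"] * len(sequence)
    stack = []  # indices of '(' brackets still waiting for a partner
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()  # index of the matching '('
            partners[i] = sequence[j]
            partners[j] = sequence[i]
    return partners

# Example: in "GGAAACC" with structure "((...))",
# the two G's pair with the two C's.
print(pairing_partners("GGAAACC", "((...))"))
# -> ['C', 'C', 'N', 'N', 'N', 'G', 'G']
```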

I thought I had made some useful features, but in the end none seemed to be at all useful for predicting degradation. I was disappointed, and I wanted to finish this project up, so I decided to just use a gradient-boosted regression tree, scikit-learn’s HistGradientBoostingRegressor, to create my predictive models. I had read that gradient-boosted regression trees usually perform pretty well because they are ensembles of many weak learners. When I split the Kaggle training set into my own training and test sets, this type of model had a good RMSE of around 0.002 for predicting deg50_Mg, and I was so excited for a minute because the top competitor on the leaderboard had a 0.2 average RMSE on the private test set on Kaggle.
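A sketch of that modeling step for one target (assuming the per_base DataFrame from before, with categorical features already numerically encoded, and with target_cols naming the target and error columns to exclude from the inputs):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = per_base.drop(columns=target_cols)  # engineered per-base features
y = per_base["deg50_Mg"]                # one target at a time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = HistGradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"held-out RMSE: {rmse:.4f}")
```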

But I think there must be something different about the private test set, because when I uploaded my predictions I only scored 0.3-something and wasn’t even in the top 1,000! Maybe we are all overfitting to the training set.

Conclusion

I was not a contender, but I put myself out there and that’s all that matters.

If you are interested in my notebook, check out https://www.kaggle.com/bethbaumann
