Utilizing Multiple Types of Features for Protein Classification!
Most standard approaches to protein classification run recurrent neural networks over protein sequences and predict the class from the sequence alone. This project set out to see whether other easily attainable features could aid prediction. The results suggest that there is plenty of room for improvement.
Data used: https://www.kaggle.com/shahir/protein-data-set
Now then, how about we jump right in?
We begin by fooling around with the data a little bit. To do this, we first need to actually, well, get the data. Kaggle is great about this, as there’s no need to download the datasets hosted on its website to our local machines in order to use them. Instead, since it serves as both an online notebook editor and a community for sharing anything about data science (including datasets), we can go to the above link and work with the data directly in a Kaggle notebook.
With the preliminaries out of the way, we can finally start coding!
Much of the code for the initial data cleaning process is taken directly from the notebook linked in the reference section of this article.
The first thing we should do is read the data into pandas DataFrames. You’ll notice that the data comes in two separate .csv files.
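Here is a minimal sketch of that loading step. The column names below are assumptions for illustration, and the two tiny in-memory CSVs are stand-ins so the snippet runs without the real files; in a Kaggle notebook you would pass the actual file paths to `pd.read_csv` instead.

```python
import io

import pandas as pd


def load_tables(structure_src, sequence_src):
    """Read the two CSV sources into separate DataFrames."""
    df_struct = pd.read_csv(structure_src)
    df_seq = pd.read_csv(sequence_src)
    return df_struct, df_seq


# Hypothetical stand-ins for the two .csv files (column names are assumed).
structure_csv = io.StringIO(
    "structureId,classification,macromoleculeType,residueCount\n"
    "101M,OXYGEN TRANSPORT,Protein,154\n"
    "102D,DNA,DNA,24\n"
)
sequence_csv = io.StringIO(
    "structureId,chainId,sequence\n"
    "101M,A,MVLSEGEWQLV\n"
    "102D,A,CGCAAATTTGCG\n"
)

df_struct, df_seq = load_tables(structure_csv, sequence_csv)
print(df_struct.shape, df_seq.shape)
```

Keeping the two tables separate at first makes it easy to inspect each one before combining them.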
Now let’s merge these two DataFrames into a single one!
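A merge like this can be sketched as follows, assuming both tables share a `structureId` key column (an assumption, as are the toy rows below). An inner join keeps only the structures that appear in both tables.

```python
import pandas as pd

# Toy tables standing in for the two DataFrames (values are illustrative).
df_struct = pd.DataFrame({
    "structureId": ["101M", "102D", "103L"],
    "classification": ["OXYGEN TRANSPORT", "DNA", "HYDROLASE"],
    "macromoleculeType": ["Protein", "DNA", "Protein"],
})
df_seq = pd.DataFrame({
    "structureId": ["101M", "102D", "104K"],
    "sequence": ["MVLSEGEWQLV", "CGCAAATTTGCG", "MNIFEMLRID"],
})

# Inner join on the shared key; rows without a match on both sides are dropped.
df = pd.merge(df_struct, df_seq, on="structureId", how="inner")
print(df)
```

Here `103L` and `104K` drop out because each appears in only one of the two tables.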
After this, we just do some basic data exploration and cleaning and that sort of thing. We’ll zoom through the process in this article, but if you want to see the details, make sure to check out the full notebook. You can find it on my GitHub, which is linked below!
First, we drop all null values. Then, we make sure to get rid of all rows that deal with macromolecules that aren’t proteins.
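These two cleaning steps might look something like the sketch below. The `macromoleculeType` column and its `"Protein"` value are assumptions about how the dataset labels its rows, and the data is a toy example.

```python
import pandas as pd

# Toy merged table; one row has a missing label and one row is DNA.
df = pd.DataFrame({
    "structureId": ["101M", "102D", "103L", "104K"],
    "classification": ["OXYGEN TRANSPORT", "DNA", "HYDROLASE", None],
    "macromoleculeType": ["Protein", "DNA", "Protein", "Protein"],
    "sequence": ["MVLSEGEWQLV", "CGCAAATTTGCG", "MNIFEMLRID", "MKT"],
})

# Step 1: drop every row that has at least one null value.
df = df.dropna()

# Step 2: keep only rows describing proteins (assumed column and value).
df = df[df["macromoleculeType"] == "Protein"].reset_index(drop=True)
print(df)
```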
Some interesting findings below:
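One common way to surface findings like these is to look at how many samples each class has. The snippet below is only a sketch on made-up labels, assuming the target column is called `classification`; it does not reproduce the actual numbers from the dataset.

```python
import pandas as pd

# Made-up labels purely for illustration.
df = pd.DataFrame({
    "classification": [
        "HYDROLASE", "HYDROLASE", "TRANSFERASE",
        "HYDROLASE", "OXYGEN TRANSPORT", "TRANSFERASE",
    ]
})

# Number of samples per class, most frequent first.
counts = df["classification"].value_counts()

# Share of the data taken up by each class.
shares = counts / counts.sum()

print(counts)
print(shares.round(2))
```

A heavily skewed class distribution is worth knowing about early, since it affects how accuracy numbers should be read later on.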
Next, let’s merge aaallll of the data into one single feature!
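The exact way the features are combined isn’t shown here, so treat the following as one possible sketch rather than the method used in the notebook: it turns a few assumed columns (`residueCount`, `phValue`, `sequence`) into strings and joins them into a single text feature called `combined` (a hypothetical name).

```python
import pandas as pd

# Toy rows; the column names are assumptions for illustration.
df = pd.DataFrame({
    "sequence": ["MVLSEGEWQLV", "MNIFEMLRID"],
    "residueCount": [154, 164],
    "phValue": [7.0, 6.5],
})

feature_cols = ["residueCount", "phValue", "sequence"]

# Cast every chosen column to text, then join each row into one string.
df["combined"] = (
    df[feature_cols]
    .astype(str)
    .apply(lambda row: " ".join(row), axis=1)
)
print(df["combined"].tolist())
```

A single text column like this can then be fed to whatever tokenizer or vectorizer the model expects.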
Next, it’s time to prepare the data for our final model!
Remember to import the tools we’ll need.
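As a rough sketch of what this preparation could involve, the snippet below encodes the class names as integers and splits the data into training and test sets with scikit-learn. The column names, the 80/20 split, and the random seed are all assumptions, and the rows are toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data: ten samples across two classes.
df = pd.DataFrame({
    "combined": [f"sample {i}" for i in range(10)],
    "classification": ["HYDROLASE", "TRANSFERASE"] * 5,
})

# Map each class name to an integer label.
encoder = LabelEncoder()
y = encoder.fit_transform(df["classification"])

# Hold out 20% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    df["combined"], y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```

Keeping the fitted `encoder` around lets you turn predicted integers back into class names with `encoder.inverse_transform`.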
And now, it’s time for what you’ve all been waiting for… the actual model!
Time to train it!
Okay, now we need to figure out how well our model performs.
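The simplest check is plain accuracy: the fraction of test samples whose predicted class matches the true class. Here is a small sketch on hypothetical label arrays (not the model’s real predictions).

```python
import numpy as np

# Hypothetical true and predicted integer labels for six test samples.
y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 1, 0, 1])

# Accuracy = number of correct predictions / total number of predictions.
accuracy = float(np.mean(y_true == y_pred))
print(f"accuracy: {accuracy:.3f}")
```

Accuracy alone can hide which classes the model confuses with each other, so it is worth pairing with a per-class view.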
The grand finale! Our confusion matrix…
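If you want to build a confusion matrix yourself, one lightweight option is `pd.crosstab`, which counts how often each true class was predicted as each class. The labels below are hypothetical and only illustrate the layout.

```python
import pandas as pd

# Hypothetical true and predicted class names for five test samples.
y_true = ["HYDROLASE", "TRANSFERASE", "HYDROLASE", "TRANSFERASE", "HYDROLASE"]
y_pred = ["HYDROLASE", "HYDROLASE", "HYDROLASE", "TRANSFERASE", "TRANSFERASE"]

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(
    pd.Series(y_true, name="actual"),
    pd.Series(y_pred, name="predicted"),
)
print(cm)
```

Entries on the diagonal are correct predictions; everything off the diagonal is a mix-up between two classes.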
Welp. As you can see, our model struggles a lot. Still, I hope you guys learned something from this!
The full notebook can be found on my GitHub: https://github.com/AAbhi256
References
https://snap.stanford.edu/snappy/doc/reference/multimodal.html
https://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_AMC_Attention_guided_CVPR_2017_paper.pdf