NLP on Cell Biology Abstracts: Comparison of NMF and Word2Vec methods

Bethany Baumann
Published in Beth Blog · Nov 16, 2020
Abstract centroid clustering into signaling pathways

For the natural language processing project in my data science bootcamp, Metis, I decided to go back to my favorite reading material from grad school: cell biology paper abstracts. An abstract is a 200–500 word summary of a research article that includes all its salient points and conclusions. Abstracts are freely available from PubMed and can be accessed in bulk using NCBI’s EUtilities. I tried to create a highly specific corpus on cell signaling pathways by collecting abstracts that matched the MeSH term “Cell Physiological Processes”, were in English, were about humans, and matched a search for ‘signaling’ or ‘pathway’. I chose this approach to avoid collecting a large number of studies on disease states, which may have dysregulated signaling pathways, as well as clinical trial studies. In total, just over 300,000 abstracts spanning 1985–2020 matched this description.
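A query like this can be assembled for EUtilities’ `esearch` endpoint. This is a minimal sketch of how the request URL might be built; the exact search term syntax here is my assumption, not the author’s actual query:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(retstart=0, retmax=10000):
    """Build a PubMed esearch URL for the signaling-pathway corpus (hypothetical term)."""
    term = (
        '"Cell Physiological Processes"[MeSH Terms] '
        "AND (signaling[All Fields] OR pathway[All Fields]) "
        "AND humans[MeSH Terms] AND english[Language] "
        'AND ("1985"[PDAT] : "2020"[PDAT])'
    )
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": retstart,  # paginate in chunks to collect ~300k IDs
        "retmax": retmax,
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)

url = build_esearch_url()
print(url[:80])
```

The returned PubMed IDs would then be passed to the `efetch` endpoint in batches to download the abstract text itself.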

My goal here was to see how the cell biology literature naturally segments into signaling pathway topics. Signaling pathways in biology are defined as networks of proteins that work together to create a specific effect on a cell. However, the proteins in a real cell form one giant network, where everything affects everything else in some small way. So how can we separate this network into discrete pathways? I think that the proteins commonly studied together in scientific papers form the foundation of the concept of pathways, and NLP may be able to help explore this.

Signaling Pathway Examples from Wikipedia

I found a great Python module called SciSpacy that integrates with a popular NLP tokenization library called spaCy. Using SciSpacy’s named-entity recognition (NER) tagger, I could tag biological terms, or tokens, in the abstracts and chunk them together. For topic modeling, I tested using all document nouns, all biological named entities, and just the tokens tagged “gene or gene product”. I found that using only the “gene or gene product” tokens from SciSpacy’s NER produced the cleanest and most recognizable topics. I also tested the commonly used topic modeling techniques on the TF-IDF matrix of token frequencies: LSA, LDA, and NMF. NMF, or non-negative matrix factorization, performed by far the best of the three for this application. For each of the 25 latent topics, the top tokens immediately struck me as members of a ‘canonical’ signaling pathway, such as p53, Wnt, integrin signaling, etc.

I used K-means clustering on the 25-dimensional NMF space to see if I could squeeze more topics out of it (there are actually hundreds of pathways listed in manually annotated databases such as Reactome). Based on an inertia plot, I was only able to get about 5 more distinct topics out of clustering, for a final count of 30. The additional K-means step allowed me to separate Wnt from TGF-beta and p53 from cell cycle, related pathways that NMF had put together in the same topic. There were also some clusters that were less obvious to me, like TNF/serotonin and AR/NFkB. The biggest cluster was ROS, or reactive oxygen species. I created token lists for each cluster and looked at their representation (the number of token mentions relative to the number of documents). For the ROS cluster, the top token was Nrf-2, a transcription factor that responds to oxidative stress. But its representation in this cluster was much lower than that of the top tokens in other clusters, suggesting that the ROS group may still be very diverse.
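The inertia plot used to pick the cluster count can be sketched as follows; random data stands in here for the actual 25-dimensional NMF document-topic matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.random((500, 25))   # stand-in for the NMF document-topic matrix

# Inertia (within-cluster sum of squares) for a range of k; the "elbow"
# where the curve stops dropping steeply suggests the usable cluster count.
inertias = {}
for k in range(5, 45, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(W)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

Plotting `inertias` against `k` gives the elbow curve; on the real data the curve flattened around 30 clusters.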

Topics from NMF followed by K-means

Next, I tried training a small Word2Vec model on my documents using the Gensim library. This is a neural network that learns word context from neighboring tokens and represents each word as a vector; words used in similar contexts end up with similar vectors. I input all the sentences from my corpus split into tokens, excluding stop words and keeping the chunking of the NER-tagged entities. The model was a CBOW model with a window size of 5 and a vector size of 30. The results were much better than I expected for such a small model. The most similar tokens to any protein I tested were its aliases and alternate spellings. I also replaced the standard Word2Vec arithmetic test, king minus man plus woman equals queen, with Smad minus TGF-beta plus STAT equals… Janus kinase! Pretty cool. It also did well on many other analogies about binding partners, receptors, and inhibitors. The word embedding plots also showed word positioning that retained information about protein relationships.

Word2Vec embeddings for kinase cascade pairs
Word2Vec embeddings for receptor ligand pairs

Documents can be represented as vectors as well, either with the Doc2Vec model or with the centroid technique, where a document’s vector is the normalized sum of the word vectors of the words it contains. I used the centroid technique on the abstracts, but only with the gene or gene product tokens from each abstract, since that had worked better before. Then I ran K-means clustering again on the abstract centroids, which lived in a 30-dimensional space, and found that I could get about 75 clusters out of the abstracts. I also trained a Word2Vec model with a larger window and a vector size of 300 that performed better on similarity tasks but wasn’t great for K-means clustering: when I tried to determine the optimal number of clusters with an elbow/inertia curve, it suggested a very small number of clusters. I suspect the dimensionality was too high.
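The centroid step can be sketched like this; the random word vectors below stand in for the trained 30-dimensional Word2Vec embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

def doc_centroid(tokens, wv):
    """Normalized sum of a document's word vectors (only tokens in vocab)."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return None
    s = np.sum(vecs, axis=0)
    return s / np.linalg.norm(s)

# Stand-in vocabulary of gene/gene-product embeddings.
rng = np.random.default_rng(0)
wv = {w: rng.normal(size=30) for w in ["p53", "mdm2", "wnt", "tcf", "stat"]}

# Each document is the gene-token list of one abstract.
docs = [["p53", "mdm2"], ["wnt", "tcf"], ["p53", "stat"], ["wnt", "mdm2"]]
centroids = np.array([doc_centroid(d, wv) for d in docs])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids)
print(labels)
```

Normalizing each centroid to unit length keeps long and short abstracts comparable before clustering.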

Labeling the 75 clusters from the abstract centroids was a daunting task, but I felt motivated when I looked at the top represented tokens from each cluster. I could still easily assign them to signaling pathways, and these pathways were more nuanced versions of those from the NMF method. Surprisingly, there were also clusters containing just dates, names, and places. These were malformed documents consisting of author lists that had somehow snuck into my corpus. Overall, I was impressed with the outcome of the abstract centroid clustering using my small Word2Vec model.
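The representation score used to rank tokens within a cluster (mentions relative to the cluster’s document count) can be computed in a few lines. The token lists here are a made-up example for one cluster:

```python
from collections import Counter

# Hypothetical cluster: one gene-token list per abstract assigned to it.
cluster_docs = [
    ["nrf2", "keap1", "ros"],
    ["nrf2", "ho-1"],
    ["sod", "catalase", "nrf2"],
]

counts = Counter(t for doc in cluster_docs for t in doc)
n_docs = len(cluster_docs)

# Representation: token mentions divided by the number of documents.
representation = {tok: c / n_docs for tok, c in counts.items()}
top = max(representation, key=representation.get)
print(top, representation[top])  # nrf2 1.0
```

A top token with representation near 1.0 appears in nearly every abstract of the cluster; the real ROS cluster’s Nrf-2 scored much lower, which is what suggested that cluster was still heterogeneous.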

Abstract centroid clusters

Most of the cell biology pathways I’m familiar with were represented using this method. There was one cluster that I couldn’t definitively label, as it didn’t have any strongly represented token. However, one of the top ten tokens was the name of the enigmatic family of proteins I studied in my biology post-doc, piwi! Of course.

Check out code and model on Github: https://github.com/Beth526/metis_project_4
