COVID-19: what can we do with NLP?

Thiago Dantas
Published in Voice Tech Podcast · Mar 25, 2020 · 7 min read

Over the last few weeks, the world has been struggling against a new menace. Since the outbreak of the new coronavirus, I have been wondering whether there was anything I could do to help in the fight against COVID-19.

Photo by Miguel Á. Padriñán from Pexels

So I went to Kaggle to see if there was a competition related to the new coronavirus, and it turns out that there is.

The competition issues a call to action to AI experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions. For this purpose, the competition provides a dataset, the COVID-19 Open Research Dataset (CORD-19), which is a resource of over 44,000 scholarly articles about COVID-19, SARS-CoV-2 and related coronaviruses.

In this post I’ll explain my humble take on this challenge. It is a really simple solution, but I’m impressed by how nice the results are. I hope you enjoy it!

If you want to play around with the web-app just go to this link and have fun!

0. The Core Idea

Imagine you are researching the novel coronavirus and trying to understand more about the range of incubation periods for the disease in humans. We know that the dataset the competition provides has over 44,000 scholarly articles. How should one select only the articles that might discuss our subject of interest?

My idea is to make a search based on keywords. What are the keywords for the phrase “range of incubation periods for the disease in humans”? Maybe “range”, “incubation”, “period” and “human” are good choices. Does it make sense to you that the articles I’m interested in might also have these keywords? So if I manage to extract keywords from the articles and return only the articles that have the keywords I’m looking for, I may have solved the problem, right? The researcher now has fewer articles to look through than before and can spend more time reading and understanding these papers.

I’ll show you a sneak peek of the final application so that you can see clearly what I’m trying to do. You can check the first article returned by the application on this link. Not by chance, it heavily discusses the incubation period.

1. How do you automatically extract keywords from the texts?

This is the most “data scientific” part of this post, so the terms I’m using may not be familiar to all readers, but I’ll try to make the core ideas accessible to everyone.

The first thing to do is to get all your texts in a suitable format so that you can vectorize them (to vectorize the texts is basically to represent them in a vectorial way that a machine learning model can understand). A common approach in Natural Language Processing (NLP) is to delete numbers, punctuation, special characters and words with no semantic meaning (e.g. the, and, that…) and to convert all words to lowercase. This is a way to reduce the amount of information and noise your model has to deal with.
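As a rough illustration, a minimal version of this cleaning step could look like the snippet below. The regular expression and NLTK’s English stop-word list are just one reasonable choice, not necessarily the exact preprocessing used in my kernel:

```python
import re

from nltk.corpus import stopwords  # assumes nltk.download("stopwords") was run beforehand

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, drop numbers/punctuation/special characters and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("John is really happy today!"))  # -> "john really happy today"
```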

With all this preprocessing done, now it’s time to vectorize your texts. A common approach to vectorize texts is called Bag of Words.

1.1. What is Bag of Words?

Imagine that these texts are your data:

1. john is really happy today
2. john is not that happy today
3. amanda is a happy person and john is not a happy person

The vocabulary, a.k.a. all the known words, is [john, is, really, happy, today, not, that, amanda, a, person]. So we can represent these texts as the number of times each word appears. I think the best way to understand this is with an example:

     john  is  really  happy  today  not  that  amanda  a  person
1. [ 1     1   1       1      1      0    0     0       0  0 ]
2. [ 1     1   0       1      1      1    1     0       0  0 ]
3. [ 1     2   0       2      0      1    0     1       2  2 ]

In the first sentence, as the word “john” appears only once, the vector representation of this sentence has a “1” in the position that refers to the word “john”. In the third sentence, as the word “person” appears twice, the vector representation of this sentence has a “2” in the position that refers to the word “person”.

You can easily see that this kind of text representation is really simplistic and cannot encode the sequential structure of the text. That’s why it’s called a Bag of Words.
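In Python, scikit-learn’s CountVectorizer builds exactly this kind of representation. Here is a small sketch with the three toy sentences; note that its output differs slightly from the hand-built example, since the default tokenizer drops one-letter tokens such as “a” and nothing here removes “and”:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "john is really happy today",
    "john is not that happy today",
    "amanda is a happy person and john is not a happy person",
]

vectorizer = CountVectorizer()          # default settings: lowercase, keep tokens of 2+ characters
bow = vectorizer.fit_transform(texts)   # sparse matrix: one row per sentence

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # word counts per sentence
```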


1.2. Getting a Better Representation with TF-IDF

As we’ve seen, the Bag of Words approach tries to represent the text using the frequency of occurrence of each word in the text. Now I want to introduce the Term Frequency-Inverse Document Frequency (TF-IDF) normalization that can lead us to better results.

We’re working with a dataset of over 44,000 articles that talk mainly about viruses. Do you agree with me that the word “virus” will probably appear many times in all 44,000 articles? Thus, do you agree that the word “virus” is maybe not so good for distinguishing one article from another?

The goal of this section of the post is to show you how to extract the keywords from the texts. One could try to do so by getting the 5 words that occur the most in each text, which is equivalent to getting the 5 words with the largest “score” in the bag-of-words vectors. But if all articles talk about viruses, the word “virus” would probably be a keyword for every article, because it appears a lot everywhere, and that doesn’t really help us. So I don’t think that the Bag of Words approach will be enough for our keyword extraction.

TF-IDF is a way to give more importance to the words a specific text uses a lot in comparison to the others. The formula is:

w(i,j) = TF(i,j) × log(N / df(i))

wherein TF(i,j) is the number of occurrences of the word “i” in the text “j”, df(i) is the number of documents containing the word “i”, and N is the total number of documents. So, if a word occurs a lot in one specific text but not so much in the other texts, it will get a high score. If a word occurs a lot in all texts, it will get a low score.

So, as one could imagine, after doing this normalization, the words with the highest score for a specific text are the words that “best” characterize that text.

Now, if we have an article that talks about risk factors hopefully the words “risk” and “factor” will have a high TF-IDF score. So after applying TF-IDF normalization we could get the 5 words with the highest score and say those are the keywords for that text.
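Putting the two previous ideas together, here is a sketch of how the keyword extraction could be done with scikit-learn. It assumes the preprocessed article texts are in a list called documents; the names are mine, not necessarily those from the kernel, and scikit-learn’s TfidfVectorizer uses a smoothed variant of the IDF formula shown above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# documents: list of preprocessed article texts (one string per article)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)   # rows = articles, columns = vocabulary words
vocab = vectorizer.get_feature_names_out()

def top_keywords(article_index, n=5):
    """Return the n words with the highest TF-IDF score for one article."""
    scores = tfidf[article_index].toarray().ravel()
    best = scores.argsort()[::-1][:n]         # indices of the n largest scores
    return [vocab[i] for i in best]

article_keywords = [set(top_keywords(i)) for i in range(len(documents))]
```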

2. How to search based on keywords

To recap, our objective is: given a set of user-defined keywords, retrieve the articles that have those keywords. What I’ve done is just a loop that checks, for each given keyword and for each text, whether that keyword is in the text’s set of keywords.

There’s really nothing fancy about the search.
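For completeness, here is roughly what that loop looks like, reusing the hypothetical article_keywords list from the previous sketch. This version requires every query keyword to be present, but the matching rule could just as well be relaxed:

```python
def search(query_keywords, article_keywords):
    """Return the indices of the articles whose keyword set contains all query keywords."""
    matches = []
    for index, keywords in enumerate(article_keywords):
        if all(keyword in keywords for keyword in query_keywords):
            matches.append(index)
    return matches

# e.g. articles that might discuss incubation periods in humans
results = search(["incubation", "period", "human"], article_keywords)
```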

3. Some search results

In this search, I want to get papers that talk about risk factors related to the SARS kind of virus. In the following picture, you can look at the titles of the papers we’ve got:

Now, I want to get articles that talk about asymptomatic shedding and transmission related to the SARS kind of virus. Take a look at the highlighted title.

Finally, I want to get articles that talk about immune response and immunity, but not only about the SARS kind of virus.

Conclusion

In this post, I talked about my approach to searching through a big set of articles and getting the articles you need quickly and easily.

If you think a little about what I wrote here, you might suspect that this approach could work well for other datasets too, right? And it does. I applied this approach to two different datasets of Brazilian legal texts and the results were very similar.

If you want to see my Kaggle kernel with all the code to get these results, you can refer to this link. You can check out more of what I’m doing by visiting my GitHub and get in touch through LinkedIn.

Thank you for reading!
