The purpose of this project is to share some ideas on how to work with the COVID-19 data provided by Kaggle. I have referred to others' kernels and taken what I think is useful for this project. So if you find some code somewhat familiar, please excuse me; I am trying to learn from you by implementing it myself. The reference links can be found below.
Let's get started! The main steps are:
- Data preprocessing
- Document embedding using FastText
- Extracting related documents from the user’s query
- Calculating a similarity matrix for the top n related documents
- Building knowledge graph using the similarity vectors
- Using degree centrality to identify the most useful documents
All the code is provided in a notebook on GitHub.
The dataset and our goal
The dataset consists of a collection of over 45,000 scholarly articles, including 33,000 full texts, about COVID-19, SARS-CoV-2, and related coronaviruses. We are looking only at metadata.csv, and we are only interested in the title and abstract columns. We could use all the provided data later, but for now this is enough to test whether my approach works.
The goal of this project is to provide key answers that can be found using the dataset provided. These are scientific questions which might give experts hints on which direction to work in when combating the COVID-19 virus.
Some of these questions are:
- New diagnostic methods and products to improve clinical processes.
- Movement control strategies to prevent secondary transmission in health care and community settings
1. Data preprocessing
As usual, there will be some dirty records that need to be removed. We extract only the title and abstract columns and merge them for easy access, since we are going to convert them into vector form for the downstream processes.
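A minimal sketch of this step, using a tiny stand-in DataFrame (in the real pipeline you would load the CORD-19 `metadata.csv` with `pd.read_csv("metadata.csv", usecols=["title", "abstract"])`):

```python
import pandas as pd

# Tiny stand-in for metadata.csv
df = pd.DataFrame({
    "title": ["Advances in diagnosis", None, "Microarray techniques"],
    "abstract": ["Rapid culture assays ...", "orphan abstract", None],
})

# Drop dirty records: rows missing a title or an abstract
df = df.dropna(subset=["title", "abstract"]).reset_index(drop=True)

# Merge title and abstract into one text field for downstream embedding
df["text"] = df["title"] + ". " + df["abstract"]
print(len(df))  # 1 clean record remains
```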
2. Document embedding
There are many ways to do this: we could use a bag of words, Word2Vec, or even more advanced models like ELMo or BERT. FastText is chosen because this is a test project; we want to start small and iterate fast. We can always scale up later, once we are sure this implementation works.
FastText builds on Word2Vec: instead of learning vectors for whole words directly, FastText learns a representation for each character n-gram of the word. This helps capture the meaning of shorter words and allows the embedding to learn suffixes and prefixes. For example, the word "coronavirus" with n-gram = 5 gives the set [coron, orona, ronav, onavi, …, virus].
FastText works well with rare words because it can break them down into smaller segments. From the above example, we would know that the word "coronavirus" is related to a virus. Word2Vec and GloVe, by contrast, have difficulty representing words that are not in the model's dictionary. For a more detailed explanation, refer to the link below.
Embedding sentences in the document
I will be using Gensim's FastText model to create my embeddings. I was inspired by their implementation of the "most_similar" function, where positive and negative texts can be supplied to tweak the direction of the vector. I added this functionality when embedding the user's query; this additional field provides context for the query.
```python
def search_doc(query, context=None, top_n=100):
    query_vector = sentence_vector(query, positive=context)
    result = model.cosine_similarities(query_vector, df_vectors.values)
    df['score'] = result
    return df.sort_values('score', ascending=False)[:top_n]
```
Looking at the following example, without providing context these two sentences will have very high similarity. The model cannot differentiate a coronavirus from a computer virus.
```python
sent1 = sentence_vector("new coronavirus affecting every household")
sent2 = sentence_vector("computer virus is affecting every household")
```
With context, we can direct the model to better understand the sentences: in the first we are talking about a biological virus, and in the second about a digital one. This greatly reduces the similarity score.
```python
sent1 = sentence_vector("new coronavirus affecting every household", positive="biology")
sent2 = sentence_vector("computer virus is affecting every household", positive="digital")
```
I use spaCy to split each document into sentences and then tokenize them into words. The word vectors are then averaged, and the result is our document vector (average word embedding). The documents from the dataset do not need added context because they are not single sentences; they contain plenty of keywords that already provide context.
3. Extracting related documents
After we have all the embeddings, we can extract the documents related to the user's query. We do this by using "cosine_similarities" to compare all our document vectors against the user's query vector. This gives us a similarity score, where 1 is the most similar and -1 the least similar. Next, we sort the results and select the top n.
We are doing this step because we want to build a knowledge graph using only the relevant documents.
4. Calculate similarity matrix for top n related documents
The idea behind this is to form the connections for our knowledge graph. Using the similarity matrix, similar documents will form an edge with each other. You might ask: aren't all the documents already similar to begin with, so that every document forms an edge with every other? Yes, that is true, which is why we apply min-max normalization to the matrix. This increases the differences between documents, so that only documents which are very closely related form edges. Finally, we filter out pairs with a similarity score lower than 0.8; this threshold can be tweaked, and it determines the number of edges you will work with later.
Take note that our final matrix will be an n × n matrix; if you find your machine is too slow when computing it, reducing n will help.
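The whole of this step can be sketched as follows, with random vectors standing in for the real top-n document vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5, 32))  # stand-in for the top-n doc vectors

# Cosine similarity matrix (n x n): normalize rows, then take dot products
unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Min-max normalization stretches the scores apart, so that only
# closely related documents survive the threshold
sim_norm = (sim - sim.min()) / (sim.max() - sim.min())

# Keep an edge (i, j) only when the normalized score exceeds 0.8
n = len(sim_norm)
edges = [(i, j, sim_norm[i, j])
         for i in range(n) for j in range(i + 1, n)
         if sim_norm[i, j] > 0.8]
```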
5. Building knowledge graph using the similarity vectors
With our edges ready, we then build the knowledge graph. For this project, I am looking for clusters in the graph. With the filter at 0.8, I obtain an average degree of 12. After a few rounds of testing, I also found that an average degree in the range of 10–20 produces better results. The following shows our result:
Number of nodes: 710
Number of edges: 4310
Average degree: 12.1408
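A sketch of how these statistics can be produced with NetworkX, assuming a hypothetical edge list from the previous step:

```python
import networkx as nx

# Hypothetical edges: (doc_i, doc_j, normalized similarity score)
edges = [(0, 1, 0.93), (0, 2, 0.88), (1, 2, 0.85), (3, 4, 0.91)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

print("Number of nodes:", G.number_of_nodes())  # 5
print("Number of edges:", G.number_of_edges())  # 4
# Average degree of an undirected graph = 2 * |E| / |V|
print("Average degree:", 2 * G.number_of_edges() / G.number_of_nodes())  # 1.6
```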
These clusters show how documents are related to one another. We can see there are some outliers, and our centre cluster contains the most important information for our query. We can use different centrality algorithms to extract the top n documents and recommend them to our users. The following is a short explanation of what each centrality measure does.
Degree centrality: measures the number of connections a node has. Can be interpreted as popularity; the top n nodes are those with the most connections.
Closeness centrality: measures how quickly (in the minimum number of steps) a node can reach the others in the network. Can be interpreted as the most central nodes.
Eigenvector centrality: measures a node's connectivity to other highly connected nodes. Can be interpreted as influential nodes, those that exercise control behind the scenes.
Betweenness centrality: measures the extent to which a node acts as a bridge between other nodes.
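With NetworkX, the four measures can be compared side by side on a toy graph (the node IDs here are hypothetical stand-ins for documents):

```python
import networkx as nx

# Small hypothetical graph standing in for the document knowledge graph
G = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)])

# Each measure ranks nodes differently; pick the top n by score
for name, scores in [
    ("degree", nx.degree_centrality(G)),
    ("closeness", nx.closeness_centrality(G)),
    ("eigenvector", nx.eigenvector_centrality(G, max_iter=1000)),
    ("betweenness", nx.betweenness_centrality(G)),
]:
    top = max(scores, key=scores.get)
    print(f"{name:12s} top node: {top}")
```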
6. Using the degree of centrality to identify the most useful documents
The top 3 results from our knowledge graph are shown below. The results look good; they answer what the user's query asked for.
User query: New diagnostic methods and products to improve clinical processes.
Context: medical, pneumonia, fast, test kit
Doc1: Advances in the diagnosis of respiratory virus infections.
Background: Advances have been made in selecting sensitive cell lines for isolation, in early detection of respiratory virus growth in cells by rapid culture assays, in production of monoclonal antibodies to improve many tests such as immunofluorescence detection of virus antigens in nasopharyngeal aspirates, in highly sensitive antigen detections by time-resolved fluoroimmunoassays (TR-FIAs) and biotin-enzyme immunoassays (BIOTH-E), and, finally, in the polymerase chain reaction (PCR) detection of respiratory virus DNA or RNA in clinical specimens...

Doc2: Diagnostic Techniques: Microarrays.

Current techniques for viral detection and discovery, which include culture and serological methods as well as polymer chain reaction (PCR)-based protocols, possess a variety of inherent limitations. In an effort to augment the capabilities of existing diagnostic methodologies, the use of virus-specific DNA microarray technology has been recently applied in both research and clinical settings with favorable results. The primary advantage of this approach is that DNA microarrays containing literally thousands of virus-specific sequences allow simultaneous testing for essentially all known viral species...

Doc3: Modernising epidemic science: enabling patient-centred research during epidemics.
BACKGROUND: Emerging and epidemic infectious disease outbreaks are a significant public health problem and global health security threat. As an outbreak begins, epidemiological investigations and traditional public health responses are generally mounted very quickly. However, patient-centred research is usually not prioritised when planning and enacting the response. Instead, the clinical research response occurs subsequent to and separate from the public health response, and is inadequate for evidence-based decision-making at the bedside or in the offices of public health policymakers...
The first two documents show us ways to do testing; interestingly, the third document highlights modernising clinical research. Which is true in some sense: if your research facilities are outdated, you cannot even begin to think about diagnostics. Our society often prioritises the wrong things and neglects what is given to us.
Next, we will look at another query to see if it works as expected. Note that this time I am using eigenvector centrality, to explore whether it also gives a good result.
And again, the top 3 results from our knowledge graph are shown below.
User query: Movement control strategies to prevent secondary transmission, health care and community settings.
Context: pneumonia, medical
Doc1: Practical recommendations for critical care and anesthesiology teams caring for novel coronavirus (2019-nCoV) patients.
This paper summarizes important considerations regarding patient screening, environmental controls, personal protective equipment, resuscitation measures (including intubation), and critical care unit operations planning as we prepare for the possibility of new imported cases or local outbreaks of 2019-nCoV. Although understanding of the 2019-nCoV virus is evolving, lessons learned from prior infectious disease challenges such as Severe Acute Respiratory Syndrome will hopefully improve our state of readiness regardless of the number of cases we eventually manage in Canada.

Doc2: Experiences and challenges in the health protection of medical teams in the Chinese Ebola treatment center, Liberia: a qualitative study.

BACKGROUND: Health care workers are at the frontline in the fight against infectious disease, and as a result are at a high risk of infection. During the 2014–2015 Ebola outbreak in West Africa, many health care workers contracted Ebola, some fatally. However, no members of the Chinese Anti-Ebola medical team, deployed to provide vital medical care in Liberia were infected. This study aims to understand how this zero infection rate was achieved. METHODS: Data was collected through 15 in-depth interviews with participants from the People’s Liberation Army of China medical team which operated the Chinese Ebola Treatment Center from October 2014 to January 2015 in Liberia. Data were analysed using systematic framework analysis.

Doc3: Timely mental health care for the 2019 novel coronavirus outbreak is urgently needed.
The emergence of the 2019-nCoV pneumonia has parallels with the 2003 outbreak of severe acute respiratory syndrome (SARS), which was caused by another coronavirus that killed 349 of 5327 patients with confirmed infection in China. Although the diseases have different clinical presentations, the infectious cause, epidemiological features, fast transmission pattern, and insufficient preparedness of health authorities to address the outbreaks are similar. So far, mental health care for the patients and health professionals directly affected by the 2019-nCoV epidemic has been under-addressed, although the National Health Commission of China released the notification of basic principles for emergency psychological crisis interventions for the 2019-nCoV pneumonia on Jan 26, 2020.
Document 1 shows that screening, environmental controls (quarantine, I believe), and protective equipment are essential to prevent transmission. What surprised me is that document 3 shows the mental health of health care professionals is also affected. Although health care professionals often see many deaths in their careers, such a high number of deaths, and constantly deciding who lives or dies, takes a huge toll on their mental health. We can imagine how this mountain of stress affects their work in hospitals. So let's salute these front-line heroes and their selfless contribution. We should also do our part in flattening the curve to reduce the load on our health care system.
Thank you for reading until the end. This is my first post, and there might be errors in my implementation. Do correct me if you find any mistakes. Let's hope all of us get through this pandemic and continue our daily lives.
P.S: Please leave your house only if you have to and remember to wash your hands. Stay safe.
COVID-19 Open Research Dataset Challenge (CORD-19)
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House
CORD : Tools and Knowledge graphs
Beyond Word Embeddings Part 2- Word Vectors & NLP Modeling from BoW to BERT
A primer on neural NLP model architectures and word representations.