Building Knowledge Graph for COVID19

Yiwen Lai
Mar 31, 2020 · 10 min read

The purpose of this project is to share some ideas on how to work with the COVID-19 dataset provided by Kaggle. I have referred to other people's kernels and taken what I think is useful for this project, so if some of the code looks familiar, please excuse me; I am trying to learn from you by implementing it myself. The reference links can be found below.

Let's get started!


Quick Summary

  1. Document embedding using FastText
  2. Extracting related documents from the user’s query
  3. Calculate similarity matrix for top n related documents
  4. Building knowledge graph using the similarity vectors
  5. Using the degree of centrality to identify the most useful documents

All the code is provided in a notebook on GitHub.

The dataset and our goal

The goal of this project is to provide key answers that can be found using the dataset provided. These are scientific questions that might give experts hints on which directions to pursue when combating the COVID-19 virus.

Some of these questions are:

  • New diagnostic methods and products to improve clinical processes.
  • Movement control strategies to prevent secondary transmission in health care and community settings

1. Data preprocessing
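The notebook contains the full preprocessing pipeline; as a rough sketch of the kind of cleaning typically applied to paper titles and abstracts (the regexes and helper name here are my own, not necessarily what the notebook uses):

```python
import re

def clean_text(text):
    """Lowercase, drop inline citation markers like [1, 2], keep
    alphanumerics only, and collapse runs of whitespace."""
    text = str(text).lower()
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)  # citation markers
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()

# Applied to each paper's title + abstract from the CORD-19 metadata, e.g.:
# df["text"] = (df["title"].fillna("") + ". " + df["abstract"].fillna("")).map(clean_text)
```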

2. Document embedding


FastText builds on Word2Vec by learning a vector representation for each word. Instead of learning vectors for whole words directly, FastText learns a representation for each word as a bag of character n-grams. This helps capture the meaning of shorter words and allows the embedding to learn suffixes and prefixes. For example, the word "coronavirus" with n-gram size 5 gives the set [coron, orona, ronav, onavi, …, virus].

FastText works well with rare words because it can break them down into smaller segments. From the example above, we can tell that the word "coronavirus" is related to a virus. Word2Vec and GloVe have difficulty producing vectors for words that are not in the model's dictionary. For a more detailed explanation, refer to the link below.

Embedding sentences in the document

I will be using Gensim's FastText model to create my embeddings. I was inspired by the implementation of the most_similar function, which allows positive and negative texts to be supplied to tweak the direction of the resulting vector. I added this functionality when embedding the user's query; this additional field provides context for the query.

def search_doc(query, context=None, top_n=100):
    query_vector = sentence_vector(query, positive=context)
    result = model.cosine_similarities(query_vector, df_vectors.values)
    df['score'] = result
    return df.sort_values('score', ascending=False)[:top_n]

Looking at the following example, without providing context these two sentences have very high similarity. The model cannot differentiate between a coronavirus and a computer virus.

sent1 = sentence_vector("new coronavirus affecting every household")
sent2 = sentence_vector("computer virus is affecting every household")
model.cosine_similarities(sent1, [sent2])
array([0.9350175], dtype=float32)

With context, we can steer the model towards a better understanding of the sentences. In the first sentence we are talking about a biological virus, and in the second about a digital one. This greatly reduces the similarity score.

sent1 = sentence_vector("new coronavirus affecting every household", positive="biology")
sent2 = sentence_vector("computer virus is affecting every household", positive="digital")
model.cosine_similarities(sent1, [sent2])
array([0.64123195], dtype=float32)

I use Spacy to split each document into sentences and then tokenize them into words. The word vectors are averaged, and the result is our document vector (average word embedding). The documents from the dataset do not need added context because they are not single sentences; they contain plenty of keywords that provide context on their own.
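As a toy sketch of that averaging (in the notebook the tokens come from Spacy and the vectors from the trained FastText model; the simplified signature here is my own):

```python
import numpy as np

def sentence_vector(tokens, word_vectors, positive=None):
    """Average the word vectors of the tokens; optional positive context
    words are averaged in as well, pulling the result in their direction."""
    words = list(tokens) + list(positive or [])
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

# Toy two-word vocabulary to illustrate:
wv = {"virus": np.array([1.0, 0.0]), "biology": np.array([0.0, 1.0])}
doc_vec = sentence_vector(["virus"], wv, positive=["biology"])  # → [0.5, 0.5]
```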

3. Extracting related documents

We do this step because we want to build the knowledge graph using only the relevant documents.
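Under the hood, the extraction in search_doc is just a ranking by cosine similarity; a minimal sketch of that ranking (the function and parameter names here are mine):

```python
import numpy as np

def top_n_indices(query_vec, doc_vecs, n=100):
    """Cosine similarity of the query against every document vector;
    return the indices of the n most similar documents."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:n]
```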

4. Calculate similarity matrix for top n related documents

Take note that our final matrix will be an n x n matrix; if you find your machine is too slow during this computation, reducing n will help.
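With normalised vectors, the whole n x n matrix is a single matrix product; a sketch, assuming the top n document vectors are stacked as rows:

```python
import numpy as np

def similarity_matrix(vectors):
    """Pairwise cosine similarity: normalise each row to unit length,
    then one matrix product yields the full n x n similarity matrix."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T
```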

5. Building knowledge graph using the similarity vectors

Number of nodes: 710
Number of edges: 4310
Average degree: 12.1408

These clusters show how documents are related to one another. We can see there are some outliers, and the centre cluster contains the most important information for our query. We can use different centrality algorithms to extract the top n documents and recommend them to our users. The following is a short explanation of what each centrality measure does.

Degree centrality: measures the number of connections a node has. It can be interpreted as popularity; the top n nodes are the ones with the most connections.

Closeness centrality: measures how quickly (in the minimum number of steps) a node can reach the others in the network. It can be interpreted as how central a node is.

Eigenvector centrality: measures a node's connectivity to nodes that are themselves highly connected. It can be interpreted as influence: nodes that exercise control behind the scenes.

Betweenness centrality: measures the extent to which a node acts as a bridge between other nodes.
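All four measures are one-liners in NetworkX; here is a sketch using its built-in karate-club graph as a stand-in for the document graph:

```python
import networkx as nx

g = nx.karate_club_graph()  # small built-in graph standing in for the document graph

# Each measure ranks the same nodes differently.
degree = nx.degree_centrality(g)
closeness = nx.closeness_centrality(g)
eigenvector = nx.eigenvector_centrality(g)
betweenness = nx.betweenness_centrality(g)

# "Most useful" nodes under degree centrality:
top3 = sorted(degree, key=degree.get, reverse=True)[:3]
```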

6. Using the degree of centrality to identify the most useful documents

The top three results from our knowledge graph are shown below. The results look good; they answer what the user's query asked for.

User query: New diagnostic methods and products to improve clinical processes.

Context: medical, pneumonia, fast, test kit

Doc1: Advances in the diagnosis of respiratory virus infections.
Background: Advances have been made in selecting sensitive cell lines for isolation, in early detection of respiratory virus growth in cells by rapid culture assays, in production of monoclonal antibodies to improve many tests such as immunofluorescence detection of virus antigens in nasopharyngeal aspirates, in highly sensitive antigen detections by time-resolved fluoroimmunoassays (TR-FIAs) and biotin-enzyme immunoassays (BIOTH-E), and, finally, in the polymerase chain reaction (PCR) detection of respiratory virus DNA or RNA in clinical specimens...
Doc2: Diagnostic Techniques: Microarrays.
Current techniques for viral detection and discovery, which include culture and serological methods as well as polymer chain reaction (PCR)-based protocols, possess a variety of inherent limitations. In an effort to augment the capabilities of existing diagnostic methodologies, the use of virus-specific DNA microarray technology has been recently applied in both research and clinical settings with favorable results. The primary advantage of this approach is that DNA microarrays containing literally thousands of virus-specific sequences allow simultaneous testing for essentially all known viral species...
Doc3: Modernising epidemic science: enabling patient-centred research during epidemics.
BACKGROUND: Emerging and epidemic infectious disease outbreaks are a significant public health problem and global health security threat. As an outbreak begins, epidemiological investigations and traditional public health responses are generally mounted very quickly. However, patient-centred research is usually not prioritised when planning and enacting the response. Instead, the clinical research response occurs subsequent to and separate from the public health response, and is inadequate for evidence-based decision-making at the bedside or in the offices of public health policymakers...

The first two documents show us ways to do testing; interestingly, the third document highlights modernising clinical research. This is true in some sense: with an outdated research facility, you cannot even begin to think about diagnosis. Our society often prioritises the wrong things and neglects what is given to us.

A picture speaks a thousand words

Next, we will look at another query to see if it works as expected. Note that this time I am using eigenvector centrality; the purpose is to explore whether it gives us a good result.

And again, the top three results from our knowledge graph are shown below.

User query: Movement control strategies to prevent secondary transmission, health care and community settings.

Context: pneumonia, medical

Doc1: Practical recommendations for critical care and anesthesiology teams caring for novel coronavirus (2019-nCoV) patients.
This paper summarizes important considerations regarding patient screening, environmental controls, personal protective equipment, resuscitation measures (including intubation), and critical care unit operations planning as we prepare for the possibility of new imported cases or local outbreaks of 2019-nCoV. Although understanding of the 2019-nCoV virus is evolving, lessons learned from prior infectious disease challenges such as Severe Acute Respiratory Syndrome will hopefully improve our state of readiness regardless of the number of cases we eventually manage in Canada.
Doc2: Experiences and challenges in the health protection of medical teams in the Chinese Ebola treatment center, Liberia: a qualitative study.
BACKGROUND: Health care workers are at the frontline in the fight against infectious disease, and as a result are at a high risk of infection. During the 2014–2015 Ebola outbreak in West Africa, many health care workers contracted Ebola, some fatally. However, no members of the Chinese Anti-Ebola medical team, deployed to provide vital medical care in Liberia were infected. This study aims to understand how this zero infection rate was achieved. METHODS: Data was collected through 15 in-depth interviews with participants from the People’s Liberation Army of China medical team which operated the Chinese Ebola Treatment Center from October 2014 to January 2015 in Liberia. Data were analysed using systematic framework analysis.
Doc3: Timely mental health care for the 2019 novel coronavirus outbreak is urgently needed.
The emergence of the 2019-nCoV pneumonia has parallels with the 2003 outbreak of severe acute respiratory syndrome (SARS), which was caused by another coronavirus that killed 349 of 5327 patients with confirmed infection in China.3 Although the diseases have different clinical presentations, the infectious cause, epidemiological features, fast transmission pattern, and insufficient preparedness of health authorities to address the outbreaks are similar. So far, mental health care for the patients and health professionals directly affected by the 2019-nCoV epidemic has been under-addressed, although the National Health Commission of China released the notification of basic principles for emergency psychological crisis interventions for the 2019-nCoV pneumonia on Jan 26, 2020.

Document 1 shows that screening, environmental controls (quarantine, I believe?) and protective equipment are essential to prevent transmission. What surprised me is that document 3 shows the mental health of health care professionals is also being affected. Although health care professionals have seen many deaths over their careers, such a high number of deaths, and having to constantly decide who lives or dies, deals a huge blow to their mental health. It is hard to imagine this mountain of stress not affecting their work in the hospital. So let's salute these front-line heroes and their selfless contribution. We should also do our part in flattening the curve to reduce the load on our health care system.

Thank you for reading until the end. This is my first post, and there might be errors in my implementation; do correct me if you find any mistakes. Let's hope all of us get through this pandemic and continue our daily lives.

P.S: Please leave your house only if you have to and remember to wash your hands. Stay safe.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Written by Yiwen Lai

🤖 AI² | NTU Computer Science Graduate | NUS M.Tech Knowledge Engineering