CAVIDOTS — CoronA VIrus DOcument Text Summarization

Ning Wang
6 min read · Apr 21, 2020


The coronavirus has severely endangered public health and damaged economies. To give medical researchers a more efficient way to find relevant and salient information in the fight against the virus, we developed a query tool, CAVIDOTS (CoronA VIrus DOcument Text Summarization), accessible here, for investigating the COVID-19 related academic articles provided in this dataset. What differentiates CAVIDOTS from related tools such as COVIDEX or the Ludwig Guru COVID-19 Search Engine is that CAVIDOTS provides the most salient abstractive summaries of the academic articles rather than simply retrieving the original documents. This saves time for medical researchers by guiding them directly to the key information.

Large-Scale Multi-Document Summarization Framework

Several multi-document summarization algorithms already exist. So you might wonder why we are developing a new approach to this seemingly well-explored problem. The issue with existing algorithms is that the documents they summarize typically describe one particular event or topic. In that setting, whether there are two or twenty articles to summarize, the length of the summary can usually be kept within a specific range. It is also easier to quickly identify salient information: if a single topic is discussed over and over again across all documents, that is a strong signal the information is relevant and should be part of the summary.

In our case, however, we are aiming at a much more generic collection of documents, with no guarantee that the documents we are trying to summarize discuss the same event or topic. For example, one article in the CORD-19 dataset may discuss the effect of a particular protein on dampening the replication of the SARS virus, while another may describe a vaccine trial on a group of people aged 30 to 50. Because different medical researchers have different focuses, directly applying existing multi-document summarization approaches to CORD-19 may not yield acceptable results: it is hard for a system to automatically determine which content matters to one researcher and which to another.

To understand how we solve this unique problem, I will guide you through our current approach with the help of the illustration below.

An overview of the framework.

The framework consists of three steps: single-document summarization, sentence merging, and content-based grouping. Each step is highly customizable and flexible, meaning the specific algorithms used can easily be replaced with newer and better ones.

Single-Document Summarization

In the illustration, each color indicates a topic a document discusses. To obtain the gist of each article, we first produce individual summaries, where each summary is depicted as a circled “i” in the picture above.

This step can be accomplished with any state-of-the-art single-document summarization method, supervised or unsupervised. Because no summarization datasets exist for medical data, we leveraged TextRank, a well-known unsupervised algorithm. You can read more about it here.
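To make this step concrete, here is a minimal sketch of single-document summarization with TextRank. The post does not tie us to a particular implementation, so this assumes the open-source summa package; the 0.03 ratio matches the value used later in the pipeline, and the input file name is hypothetical.

```python
# A minimal sketch, assuming the open-source `summa` TextRank package
# (pip install summa); any single-document summarizer could be swapped in.
from summa.summarizer import summarize

# Hypothetical input: the plain text of one CORD-19 article.
article_text = open("cord19_article.txt").read()

# Keep roughly the top 3% of sentences, the ratio used in our pipeline.
summary = summarize(article_text, ratio=0.03)
print(summary)
```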

Sentence Merging

You may wonder why we cannot use the individual summaries from the first step directly. The short answer is that the first step produces over 300,000 sentences when we apply a summary ratio of 0.03 to the TextRank algorithm. On the positive side, many of these summary sentences discuss very similar topics, so they can be merged to reduce information redundancy. We perform this reduction by embedding the 300,000 sentences and then clustering them. Standard k-means and hierarchical clustering algorithms exhibit scalability issues at this size, so we “manually” apply divisive clustering as follows.

To identify which sentences share similar meanings, we utilized BERT, a well-known language model, to obtain a vectorized representation (sentence embedding) of each summary sentence. Next, we hierarchically cluster all sentences using a distance threshold δ, so that sentences closer than δ end up in the same cluster. Many sentences end up in their own clusters. For clusters containing more than one sentence, we examine the cluster’s size. If the number of sentences in a cluster is smaller than the average number of sentences per cluster, we keep the cluster. If it is larger, we split the cluster further by halving the threshold, i.e., setting it to δ/2. We proceed in this manner until no cluster can be split further.
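Here is a toy sketch of that divisive scheme, assuming the embeddings are already computed as an (n, d) NumPy array (e.g., from a BERT sentence encoder) and using SciPy’s agglomerative clustering inside each split. At the full 300,000-sentence scale a more scalable clusterer would have to be substituted for the SciPy call; the specific parameter values are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def divisive_cluster(emb, idx, delta, min_delta=1e-3):
    """Recursively cluster sentence embeddings, halving the distance
    threshold delta whenever a cluster is larger than the average size."""
    if len(idx) < 2 or delta < min_delta:
        return [idx]
    # Hierarchical pass within the current group (fine at toy scale; the
    # real 300k-sentence pipeline needs a more scalable clusterer here).
    Z = linkage(emb[idx], method="average", metric="cosine")
    labels = fcluster(Z, t=delta, criterion="distance")
    clusters = [idx[labels == c] for c in np.unique(labels)]
    avg_size = np.mean([len(c) for c in clusters])
    final = []
    for c in clusters:
        if len(c) > avg_size:
            # Oversized cluster: split again with half the threshold.
            final.extend(divisive_cluster(emb, c, delta / 2, min_delta))
        else:
            final.append(c)
    return final

# Usage: emb is an (n, d) array of BERT sentence embeddings.
# clusters = divisive_cluster(emb, np.arange(len(emb)), delta=0.5)
```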

An illustration of a word graph.

Our next step is to merge the sentences. We adopted an unsupervised method based on word graphs and the k-shortest path algorithm. A word graph is a weighted directed graph constructed from a set of sentences. It has a source and a sink node, labeled start and end respectively. The start node is connected to the nodes representing the first word of each sentence. Then, each word node is connected to a node representing the next word in the corresponding sentences. Under certain circumstances, a node can be shared when more than one sentence contains the corresponding word. The edge weight is determined by the degree of association between the connected nodes. For the k-shortest path algorithm, inverted edge weights are used so that a higher association score results in a lower cost. In the example graph shown above, if we set k = 3, the resulting sentences are:

  • In Asia Hong Kong’s Hang Seng index fell 8.3%.
  • Elsewhere in Asia Hong Kong’s Hang Seng index fell 8.3%.
  • Elsewhere in Asia Japan’s Nikkei lost 9.6% while Hong Kong’s Hang Seng index fell 8.3%.

You can find more details here.
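To make the construction concrete, here is a simplified sketch using NetworkX. It keys nodes by the lowercased word alone (the full method also uses part-of-speech tags and context to decide when a node may be shared, which this toy version omits) and uses Yen’s algorithm to enumerate the k cheapest start-to-end paths:

```python
import itertools
import networkx as nx

def build_word_graph(sentences):
    """Simplified word graph: shared lowercased words share a node, and
    edge counts record how often two words appear adjacently."""
    G = nx.DiGraph()
    for sent in sentences:
        tokens = ["<start>"] + sent.lower().split() + ["<end>"]
        for a, b in zip(tokens, tokens[1:]):
            if G.has_edge(a, b):
                G[a][b]["count"] += 1
            else:
                G.add_edge(a, b, count=1)
    # Invert the association counts so stronger links cost less.
    for a, b, data in G.edges(data=True):
        data["weight"] = 1.0 / data["count"]
    return G

def k_shortest_sentences(G, k=3):
    # Yen's algorithm via NetworkX yields simple paths in order of cost.
    paths = nx.shortest_simple_paths(G, "<start>", "<end>", weight="weight")
    return [" ".join(p[1:-1]) for p in itertools.islice(paths, k)]
```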

While the majority of the merged sentences are grammatically connected and meaningful, some have grammatical issues that make them difficult to understand. We alleviate this by introducing a uni-directional language model into the vanilla k-shortest path algorithm: in addition to the static edge weight, we consider the probability of a word node being chosen next, where the probability depends on the partial path under consideration. In essence, we employ a constrained k-shortest path algorithm. If a node is unlikely according to the language model, the distance is increased, presumably resulting in shortest paths that correspond to more grammatical sentences. At the end of this clustering and merging phase, we end up with 55,721 sentences.
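A minimal sketch of that idea, written as a best-first search over the word graph and returning only the single cheapest sentence for brevity (extending it to k paths is straightforward). Here lm_logprob is a hypothetical hook for any uni-directional language model that returns log P(word | prefix); the weighting factor lam is an assumption.

```python
import heapq

def lm_constrained_path(G, lm_logprob, lam=1.0, max_len=40):
    """Best-first search from <start> to <end>. Each step adds the static
    edge weight plus an LM penalty: unlikely next words cost more, so the
    cheapest path tends to be the most fluent sentence."""
    heap = [(0.0, ["<start>"])]
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == "<end>":
            return " ".join(path[1:-1])
        if len(path) >= max_len:
            continue
        for nxt in G.successors(node):
            # lm_logprob(prefix, word) -> log P(word | prefix) <= 0, so the
            # penalty -lam * logprob is non-negative and grows for unlikely words.
            step = G[node][nxt]["weight"] - lam * lm_logprob(path[1:], nxt)
            heapq.heappush(heap, (cost + step, path + [nxt]))
    return None
```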

Content-Based Grouping

In step 2, we compressed the number of summary sentences from around 300,000 down to 55,721. However, that is still a large number of sentences to read through. Therefore, on top of the summaries, we group them by certain keys. In the case of the CORD-19 data, we identify various named entities using the en_ner_bionlp13cg_md model from the Python module scispacy and group the summary sentences by those entities.
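A minimal sketch of the grouping, assuming scispacy and the en_ner_bionlp13cg_md model are installed:

```python
from collections import defaultdict
import spacy

# Assumes: pip install scispacy plus the en_ner_bionlp13cg_md model.
nlp = spacy.load("en_ner_bionlp13cg_md")

def group_by_entities(sentences):
    """Index each summary sentence under every biomedical entity it mentions."""
    groups = defaultdict(list)
    for sent in sentences:
        for ent in nlp(sent).ents:
            groups[(ent.label_, ent.text.lower())].append(sent)
    return groups

# Example: groups[("CELL", "t cell")] would hold every summary sentence
# in which the model tagged "T cell" as a cell-type entity.
```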

For a more presentable and easy-to-use interface, we provide this web app, where users can simply search for summary sentences by cell names, amino acids, and other entity types.

A screenshot of the content retrieval app.

Next Steps

Our web app and the underlying summarization algorithms are still under heavy development. One major upcoming feature is summary-to-document linking, so users can quickly find the articles corresponding to the summary sentences.

Acknowledgment

This work was performed in collaboration with Prof. Diego Klabjan (dklabjan<at> northwestern<dot>edu) and Prof. Han Liu (hanliu<at>northwestern<dot>edu) at Northwestern University.
