Avalanches of Data: Text mining inspired by proton collisions

Digging through the CORD-19 dataset using NLP

Rishi Patel, PhD
Analytics Vidhya
8 min read · Sep 16, 2020


Many of us trapped in our increasingly disheveled home offices know how difficult it is to sift through information. Trying to find urgent pieces of key information in an avalanche of notes, scrap pages, and office documents can feel tedious. Much like surviving an actual avalanche, two choices are clear: move to the side, or find something sturdy to grab onto and hold yourself up.

Photo by Will Turner on Unsplash

Epidemiologists, health care workers and public health experts find themselves in a similarly extreme situation during the COVID-19 pandemic, as they face a growing wave of scientific papers. Open source, machine readable datasets like the CORD-19 dataset offer a continually expanding collection of tens of thousands of published articles and pre-prints from around the world, dating as far back as 1870. Since its creation in early January, the CORD-19 dataset initially doubled in size every day, and at present it is updated with new publications every week. How can anyone keep up?!

Strong support from the scientific community and new data science tools are essential elements of the continued fight against COVID-19. Among scientific communities around the world, CERN stands out and has already contributed a number of innovations. This particle physics center is not only a large international community of scientists, but also carries a wealth of expertise in engineering and software development that is being used to combat COVID-19 (from visors to ventilators to contact tracing apps). The greatest expertise in particle physics, though, is the management of big data, as CERN generates, processes, and builds data-driven research models every day.

Recently, I have been building a bridge from my expertise as a particle physicist to new adventures in natural language processing during the COVID-19 pandemic. I found many interesting parallels between building a natural language processing package and analyzing particle physics data.

Analyzing the explosion of a proton and the explosion of COVID-19 papers are similar tasks because both rely on the backbone of the scientific method. Both techniques filter data using simple signatures and then build a more global context from combinations of similar features. In this article, I will highlight these shared steps of the scientific method and how they are applied in both cases:

  1. We start with the problem of gaining insights from a huge wealth of information.
  2. We understand the limitations of our tools (CPU, memory, and processing time) to wrangle the data.
  3. We explore how to break up the insurmountable data into manageable pieces, so that it can be quickly filtered and analyzed.
  4. We combine the pieces based on how mathematically similar they are to build a global context.

How does analyzing data about the big bang intersect with summarizing text in CORD-19?

Particle physics data consists of patterns in measurements that can be separated into hot topics and more mundane data. This approach is analogous to looking for keywords or topics in huge text data by separating more specific words and phrases from the generalities of more mundane text.

Data scaling, transforming data so that it can be analyzed within CPU or GPU resources and fit within memory, is a key issue both at CERN and in mining datasets like CORD-19. CERN records and processes data live as proton collisions (up to 1 billion per second) take place in each experiment. Only about 1% of the data is recorded and fully processed, so that the data from all the experiments is scaled to several petabytes processed at about 25 GB/s on a worldwide computing grid (for each experiment the data flow ranges from 600 MB/s to 2 GB/s). Similarly, natural language processing (NLP) algorithms often require several matrix operations, and features need to be limited so that the program does not take huge amounts of time or exhaust memory. Processing pipelines like spaCy can reach a memory limit (2 GB for a text length of 1 million characters) for really long pieces of text, halting the parsing of sentences into keywords and phrases. Both pipelines need to be streamlined to rapidly process text or proton collisions without exhausting the available computational power.
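As a minimal sketch of this scaling issue (not the exact pipeline used for CORD-19), one way to keep spaCy within its limits is to split a very long document into chunks and stream them through the pipeline; the chunk size and model name below are assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def parse_long_text(text, chunk_size=100_000):
    """Parse a very long text in pieces that stay well below nlp.max_length."""
    # Note: slicing by character count can split a sentence in two; a real
    # pipeline would prefer to split on paragraph or section boundaries.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for doc in nlp.pipe(chunks):     # stream chunks instead of one huge Doc
        for sent in doc.sents:
            yield sent.text
```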

Finding signatures is a key part of any data science project, but it often needs to be tailored to get particular insights. The trigger system in experiments like CMS at CERN first uses simple algorithms in fast electronics to decide whether a collision has an interesting signature worth recording. Then, in a second step, the recorded data is quickly processed in software on CPUs to categorize it more specifically. This two-step approach greatly reduces the data to a more manageable amount for full processing on a computing grid without discarding potentially interesting collisions.

For an analysis of the CORD-19 dataset, the signature could be as simple as a mention of a general keyword like COVID-19 or one of its synonyms; this provides some indication that a document focuses on COVID-19 rather than West Nile virus or Ebola. A two-step selection system like the one described above can be built in software using Python NLP packages. Rapid Automatic Keyword Extraction (RAKE) is a fast first-pass tool that builds a co-occurrence matrix of words and word pairs to identify keywords. RAKE is used to score phrases in the title and abstract of each paper, providing feature inputs for the next step.
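A rough sketch of this keyword step might look like the following, using the rake_nltk package (one of several RAKE implementations; the exact one used here is not specified), with a hypothetical placeholder for the text:

```python
from rake_nltk import Rake  # requires the NLTK stopwords corpus to be downloaded

abstract_text = "..."  # hypothetical placeholder for a paper's title plus abstract

rake = Rake()  # uses NLTK English stopwords and punctuation as phrase delimiters
rake.extract_keywords_from_text(abstract_text)

# Highest-scoring phrases first; these become feature inputs for the next step
for score, phrase in rake.get_ranked_phrases_with_scores()[:15]:
    print(round(score, 2), phrase)
```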

The chart below shows the keywords found by running RAKE on the abstracts and titles of the CORD-19 dataset in cases where “COVID-19” and “intensive care unit” are found as keywords. As expected, given the selection of the data, the phrases that occur most frequently in the abstracts are “intensive care unit” along with “novel coronavirus disease”. More interestingly, looking at some of the other frequently occurring phrases (“acute respiratory failure”, “care unit admission”, “invasive mechanical ventilation”), you can begin to understand a rough context for this topic, which can be refined in the next step.

Most frequent phrases (top 15) found in CORD-19 documents that mention COVID-19 and Intensive care

As seen above, the more mundane phrases like “intensive care unit” and “novel coronavirus disease” occur frequently because of the selection; however, this information does not provide very specific insight. Searching for rarer phrases like “polymerase chain reaction” and “invasive mechanical ventilation” would provide a richer set of search results and greater insight into particular issues. To emphasize the rare but important words when searching the text, it is better to rank them by their term frequency-inverse document frequency (TF-IDF) weight, which accounts for both how frequently a phrase occurs in a single abstract (its importance for a given paper) and how frequently it occurs across all papers (how general or mundane it is).
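A minimal sketch of this weighting step with scikit-learn might look like the following; the n-gram range and feature limit are illustrative assumptions, and `abstracts` stands in for the selected titles and abstracts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = ["..."]  # hypothetical: the selected CORD-19 titles/abstracts go here

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english",
                             max_features=20_000)
tfidf = vectorizer.fit_transform(abstracts)   # rows: abstracts, columns: phrases
terms = vectorizer.get_feature_names_out()    # phrase corresponding to each column
```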

A TF-IDF matrix can then be factorized into topics using non-negative matrix factorization (NMF). This step is slower and uses more CPU time, but it results in a more refined set of keywords, e.g. “respiratory distress”, “respiratory support”, “risk factors”, “polymerase chain reaction”, “computed tomography”, and “personal protective equipment”.
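Continuing the sketch above, scikit-learn’s NMF can factorize the TF-IDF matrix into topics; the number of topics is an assumed, illustrative value rather than the one used for this analysis:

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, random_state=0)   # 10 topics is an illustrative choice
doc_topic = nmf.fit_transform(tfidf)         # documents x topics
topic_term = nmf.components_                 # topics x phrases

# Print the top phrases for each topic
for k, weights in enumerate(topic_term):
    top = weights.argsort()[::-1][:6]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```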

Building a context is the next step after exploding and separating the text data into keywords and phrases to select documents. An input set of keywords corresponding to a topic is used to collect documents and sentences about that topic.

Photo of a recorded proton-proton collision courtesy of the CMS collaboration. This display of the reconstruction of the particles resulting from the collision is analogous to building up a global context from text features.

In the same way, the measurements from a disintegrated proton are rebuilt into a set of particles that tell the story of the collision, resulting in the reconstruction pictured above. The picture shows particles that can be associated with a fragment of the proton as yellow cones. Known particles with specific features, like electrons and muons, are reassembled from their characteristic measurements. The reconstructed particles can then give a global context for the collision. The global feature of the above proton collision is that the momentum of all the particles does not balance, and the purple line could be the trajectory of a single particle escaping the detector with large momentum (very exciting data!).

Keywords can likewise still be too general to create a global context. Key phrases (analogous to the reconstructed particles), which contain keywords (analogous to measurements), are more useful and can be built by ranking important parts of the abstract. In this stage, PyTextRank is useful as a text summary tool. The plot below shows how PyTextRank scores phrases based on relevance. Stop words like “we”, “it”, “who”, and “them” have a score of 0.0. Less relevant phrases like “our university hospital” are ranked at lower values even though they appear more frequently. Longer, more specific phrases, though less frequently occurring, like “invasive and non-invasive mechanical ventilation”, are scored at larger values. A general topic phrase like “respiratory failure” is made more specific as “acute hypoxic respiratory failure” by PyTextRank.
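A minimal sketch of this phrase-ranking step uses PyTextRank’s spaCy pipeline component (pytextrank 3.x API; the model name and placeholder text are assumptions):

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract_text = "..."   # hypothetical abstract about COVID-19 and intensive care
doc = nlp(abstract_text)

# doc._.phrases is sorted by TextRank score, highest first
for phrase in doc._.phrases[:10]:
    print(round(phrase.rank, 4), phrase.count, phrase.text)
```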

A 2D scatter plot of the frequency of a phrase and its PyTextRank score for a few phrases in the abstracts from documents about COVID-19 and Intensive Care Units.

Ranking and grouping text gives a more pronounced summary of key insights. Similarly, the yellow cones in the proton collision picture emphasize that the proton violently disintegrated into just a few distinct pieces that seemingly recoil against something unmeasured. The yellow cones are a careful combination of particles based on how probable it is that they come from the same fragment. For text, cosine similarity can group documents, paragraphs, and sentences based on how alike they are. TF-IDF converts each document into a row of a matrix that emphasizes rare words through their TF-IDF scores. The matrix of TF-IDF values can be multiplied by its transpose to give a similarity array. The array can then be transformed into a graph that scores how similar each document is to all the others. The same can be done for a paragraph of sentences that contain matched key phrases like the ones in the figure above.
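A sketch of this grouping step, assuming the TF-IDF matrix from the earlier sketch: pairwise cosine similarity is turned into a graph, and a centrality measure such as PageRank (one common choice, as in TextRank/LexRank; the exact scoring used here is not specified) ranks each document against all the others:

```python
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(tfidf)     # square documents-by-documents matrix

graph = nx.from_numpy_array(similarity)   # nodes: documents, edges: similarity weights
scores = nx.pagerank(graph)               # score each document against all the others

ranked = sorted(scores, key=scores.get, reverse=True)
print("top 5 documents:", ranked[:5])
```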

The top 5 ranked TF-IDF paragraphs give information about the same key phrase and in certain cases verify the same conclusion. For example here, the summary paragraphs of the two highest-ranked documents for the topic “Respiratory and organ failure” state that most COVID-19 patients develop only mild or uncomplicated illness, but 14–20% develop more severe symptoms, and a percentage of these patients require admission to the intensive care unit. These paragraphs allow us to validate different fragments of information (i.e. the percentage of ICU admissions) and provide a coherent picture of the topic.

The most granular information is in the matched sentences, which can be ranked across all documents. Ranking sentences produces the largest matrices and graphs, so this is where data scaling is most important. Ranked sentences provide the most concise insights for an expert, e.g. “In patients with COVID-19, the severity of hypoxemia is independently associated with in-hospital mortality and can be an important predictor that the patient is at risk of requiring admission to the intensive care unit (ICU)”. This sentence, found in this paper, is ranked in the top 5 sentences for the topic ICU management and treatment here.
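Because the sentence-level matrices are the largest, one way to keep the memory footprint down is to score sentences block by block instead of materializing the full similarity matrix at once. This is only a sketch: the block size and the sum-of-similarities score are assumptions, and `matched_sentences` is a hypothetical list of sentences containing the key phrases:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

matched_sentences = ["..."]  # hypothetical: sentences containing the matched key phrases
sentence_tfidf = vectorizer.transform(matched_sentences)

scores = np.zeros(sentence_tfidf.shape[0])
block = 2000
for start in range(0, sentence_tfidf.shape[0], block):
    # Only a block x n slice of the similarity matrix exists in memory at a time
    sims = cosine_similarity(sentence_tfidf[start:start + block], sentence_tfidf)
    scores[start:start + block] = sims.sum(axis=1)

for i in np.argsort(scores)[::-1][:5]:   # top 5 ranked sentences for the topic
    print(matched_sentences[i])
```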

In the avalanche of information on COVID-19, the scientific method provides a sturdy handle for the best researchers and scientists to hold onto, keeping us on firm ground until we reach the summit. If you are feeling buried, give the CORD-19 challenge a try, or if you are interested in projects for “big science” researchers, check out ScienceResponds.

Stay tuned for more articles:

CORD Crusher: The Python code loosely described in this article is now fully explained in CORD Crusher: Slicing the COVID-19 Data into Summaries

CORD-19 Insights: A Medium article where I summarize the insights I found on COVID-19 from the CORD-19 dataset using the code linked above.
