Which research fields have emerged in recent years? A machine learning approach

Antonio Campello
Published in Wellcome Data
Jun 30, 2020

With a science portfolio of over £3.5bn (as of March 2020), the Wellcome Trust has funded research responsible for at least tens of thousands of academic publications in the last 5 years alone. This volume of publications presents a challenge for grant advisors and data analysts when tracking research outcomes. In particular, Wellcome Data Labs is often asked a recurring question:

How can we visualise the research fields/topics that have emerged from our funded grants?

This complex question has been investigated before, for example in the field of neuroscience by tracking the journeys of researchers across conferences. More recently, the Wellcome-led Research on Research Institute has approached it by analysing how researchers cite each other and visualising their networks.

In this post, we will describe a machine learning method used at Wellcome Data Labs to develop a tool that can produce re-usable research portfolio charts from research texts, including grant synopses and academic publications.

What is a research field?

The first surprising challenge we face when trying to visualise research fields is a semantic one: what is a research field? One way to answer the question is to look at existing classification systems for the academic literature. For instance, in the medical domain, two prominent systems are Medical Subject Headings (MeSH) and the International Statistical Classification of Diseases and Related Health Problems (ICD). These are very comprehensive, tree-based tagging systems, with classifications as broad as “Diagnosis” and as narrow as “Four-dimensional computed tomography”. In other domains, such as computer science, some online databases (most notably ArXiv) provide broad classes such as Information Theory and Artificial Intelligence.

In practice, these systems are challenging to work with because:

  • Some systems are either too granular or not granular enough.
  • Sub-areas of interest are often hard to define, as they can span multiple classifications, so significant domain knowledge is needed to provide the right “cut” of the data.
  • Some fields — in particular emerging ones — are hard to define and do not have a one-to-one association with existing granular classifications.
  • Not all data (in particular grant-related data) is tagged. (In fact, a companion project at Wellcome Data Labs is currently looking at how to automate tagging using machine learning.)

All of the points above make a strong case for unsupervised learning, the sub-domain of machine learning that deals with untagged/uncategorised data. Using unsupervised learning, we analyse research solely from its raw text, such as publication titles and abstracts. Next, we describe the three techniques we applied sequentially for this task: embedding, dimensionality reduction and clustering.

Unsupervised learning thrives in situations where data is not categorised or fields are hard to define from pre-existing classification systems.

Sample dataset

For the sake of a concrete example, from now on we will use the open data provided by the excellent ArXiv database to analyse the fields of research that have emerged in the machine learning academic literature over the last 5 years.
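As an illustration of how such a dataset could be assembled, here is a minimal sketch that pulls titles and abstracts from the public ArXiv API. The post does not say how the data was obtained; the third-party `arxiv` Python package and the `cs.LG` category query below are assumptions for illustration only.

```python
# Minimal sketch: fetch machine learning papers (titles + abstracts) from ArXiv.
# The "arxiv" package and the cs.LG category are illustrative assumptions,
# not necessarily what was used for the charts in this post.
import arxiv

search = arxiv.Search(
    query="cat:cs.LG",                          # ArXiv's machine learning category
    max_results=1000,                           # kept small for illustration
    sort_by=arxiv.SortCriterion.SubmittedDate,  # most recent papers first
)

# Each document is the raw text the unsupervised pipeline will work with.
documents = [f"{result.title}. {result.summary}" for result in search.results()]
```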

What do research fields look like?

The first step in applying unsupervised learning to this dataset is to transform texts into numbers that a machine learning model can interpret. This process is loosely referred to as embedding. We can obtain embeddings by counting the frequency of words in a text, or by means of more sophisticated techniques that try to preserve the semantics of the text, such as word2vec or BERT. Either way, every publication title and abstract is associated with a sequence of, say, hundreds of numbers.
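As a minimal sketch of this step (assuming scikit-learn, which the post does not name), the simplest form of embedding just counts word frequencies with TF-IDF; a semantic model such as word2vec or BERT could be swapped in at the same point in the pipeline.

```python
# Sketch of the embedding step: turn each title + abstract into a vector.
# TF-IDF word counts are used here for simplicity; word2vec or BERT embeddings
# would be drop-in alternatives that better preserve semantics.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=300,       # cap the vocabulary so each text becomes ~hundreds of numbers
    stop_words="english",   # drop very common words
)
embeddings = vectorizer.fit_transform(documents).toarray()  # shape: (n_docs, 300)
```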

After converting texts to numbers, we can employ a technique called dimensionality reduction (we used one called t-SNE) so that publications can be plotted on a chart. This technique, alongside the embedding, tries to preserve similarities between texts. The result is a chart similar to the one below, where each point represents an academic publication.
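A sketch of the dimensionality reduction step, assuming scikit-learn's t-SNE implementation; the hyperparameters below are illustrative placeholders, not the values used for the charts in this post.

```python
# Reduce the embeddings to two dimensions so each publication becomes a point
# on a chart. Perplexity and the random seed are illustrative placeholders.
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)
# coords[:, 0] and coords[:, 1] are the x/y position of each publication
```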

At this point, picking a research field out of the dataset is still daunting and can feel like looking for a needle in a haystack. To facilitate the process, we need to cluster the publications together.

Clustering

So far we have produced a chart that lets us visualise academic publications as points. The last step in making sense of this data is to apply a procedure called clustering, which essentially colours similar points together, each colour representing a field. A sketch of this step is shown below, followed by a snapshot of some of the clusters for the machine learning publications on ArXiv after applying a clustering technique called DBSCAN.
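Here is a minimal sketch, assuming scikit-learn's DBSCAN implementation and the 2-D coordinates from the previous step; the eps and min_samples values are placeholders that would need tuning on a real dataset.

```python
# Cluster the 2-D points; each resulting label is a candidate research field.
from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(coords)
# DBSCAN marks points that fall outside every cluster with the label -1 ("noise")
```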

A snapshot of some of the clusters for machine learning-related academic publications

This neighbourhood is centred around a cluster that we named “reinforcement learning” after inspecting the articles. Reinforcement learning, the area of machine learning where programs learn to take actions based on interactions with an environment, underpins many of the applications that show up in its neighbouring clusters, namely Gaming & Machine Learning and Stock Prediction. Some of its publications are listed below.

Some publications related to the “machine learning and gaming” cluster

How many fields should we have? A common question is how we define the number of research fields we should be looking at, or in our case, the number of clusters. Depending on how the clustering algorithm is tuned, it can output anything from tens to hundreds of clusters. To find the optimal number, there are machine learning metrics that can help (for example, the silhouette score, sketched after the list below). Nevertheless, the final number of clusters will probably result from a combination of machine-learning metrics and domain knowledge. In our experience, the final cluster names and the number of clusters emerge from an iterative process with stakeholders that looks like this:

  1. Present a certain cluster chart and a dataset with some representative examples for each cluster to domain-knowledge experts
  2. Receive indications on clusters that need merging or clusters that need splitting
  3. Re-run clustering to split and merge
  4. Repeat steps 1–3 until we reach a reasonable level of user acceptance.
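As a rough sketch of the metric mentioned above, the silhouette score (computed here with scikit-learn, an assumption) rewards tight, well-separated clusters and can be tracked across iterations of this process alongside stakeholder feedback.

```python
# Score a candidate clustering with the silhouette score (closer to 1 is better).
# DBSCAN's noise points (label -1) are excluded, as they belong to no field.
from sklearn.metrics import silhouette_score

mask = labels != -1
print(f"silhouette score: {silhouette_score(coords[mask], labels[mask]):.3f}")
```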

Wrapping up a visualisation and drawing insights

An interactive visualisation that combines all of the steps above for our sample machine learning dataset can be found here (it can take a few seconds to load). In our hypothetical example (with real data!) of analysing machine learning publications, in addition to the green reinforcement learning cluster in the south of the map, there are a couple of qualitative insights we can draw. For instance, we can see from the chart that traditional topics around the middle of the chart, such as “gaussian models” and “graphical models”, have attracted less interest over time, whereas there is growing interest in topics such as “fairness”, “emotion recognition” from pictures and “adversarial machine learning”.
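A minimal sketch of such an interactive chart, assuming Plotly as the charting library (the post does not name the one it used); hovering over a point reveals the publication behind it.

```python
# Interactive cluster chart: one point per publication, coloured by cluster,
# with a truncated text label shown on hover. Saved as a shareable HTML file.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "x": coords[:, 0],
    "y": coords[:, 1],
    "cluster": labels.astype(str),           # treat cluster ids as categories
    "text": [doc[:80] for doc in documents], # short hover label
})
fig = px.scatter(df, x="x", y="y", color="cluster", hover_name="text")
fig.write_html("research_fields.html")
```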

In general, we found that, besides bringing the machine learning models to life, interactive charts are a powerful way to engage stakeholders in the process, and their feedback refines the clustering in a way that is useful for subsequent analyses.

How does this help research funding decisions?

Coming back to our original use case, a research chart helps generate new qualitative insights and new cuts of the data, and serves as an entry point for further discussions. With a broad view of the research landscape, one can very quickly identify areas of emerging interest. In addition, as noted in previous academic research, a research field visualisation tool can help funders ensure appropriate coverage or appropriate focus, depending on the objective of each specific division.
