Gathering insights into early-stage research

Sebastian Rose
May 11, 2020
Kalpana, Sara and Sebastian from our data team

Introduction

  • What is early-stage research?

The results of the scientific process that drives progress in a field are usually published in academic journals. Every academic field has journals and publishers that enable this. These contributions and their reception among the members of the scientific community are what allow progress to happen, be it in physics or medicine. The availability of other people’s insights and results is an essential part of this process; without it, we would not be able to “stand on the shoulders of giants” as Isaac Newton put it in 1675.

The way research is conducted, however, has changed a lot since the Renaissance. Today we regularly access knowledge on a scale that most people in history would have found difficult to grasp. This has ushered in a period of unparalleled scientific discovery that has benefited humans enormously.

In practice, it can take a long time from the first idea until research reaches an audience through publication in a scientific journal. Most research is first presented at an early stage, at scientific conferences, in the form of scientific posters. The critical shortcoming of this method, however, is that nobody else has access to the work unless they happen to be at the same conference, in the same hall, at the very moment the poster is being presented by its authors.

Problem statement

  • Access to scientific results

This lack of access is both a problem and an opportunity: the content could be made accessible to the wider research community. However, the research in a scientific paper or on a conference poster is usually laid out for visual appeal and is not designed to be easily searchable by a machine. That raises several issues if we want to build a search engine for scientific content available only in this form.

Example of a conference poster

Even a seemingly simple task like extracting the plain text from such posters is already a challenge, because posters do not stick to a fixed layout and contain arbitrarily structured information about their authors, universities, funding institutions and references. While a human may be able to cull this information from a poster, getting a machine to do it well is no easy feat, because machines are not good at processing unstructured data. To make the content available for processing, we need to structure it.

Process

  • Analysing layout and text information

Our first approach to this problem was simply to extract the plain text from the PDF file itself. This worked sometimes, but essential information about the layout was lost. More problematic was the fact that getting the plain text from the PDF file was very unreliable because it depended on how the file was created. In some cases all we could extract was gibberish like “$(/§$%5623/@)”. This was obviously not usable, so we took the next step: optical character recognition (OCR). We used OCR analysis software that was also able to extract the layout (bounding boxes for words, sentences, and paragraphs) and distinguish between these different sections. This was an inherently imperfect process, but it was substantially more effective than the previous approach because it preserved at least some layout information and produced more reliable text.
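To make the "gibberish" problem concrete, here is a minimal sketch of how one might decide that text extracted directly from a PDF is unusable and that OCR should be tried instead. It is an illustration under assumptions, not the check used in the pipeline described here: the function names `readable_ratio` and `needs_ocr` and the thresholds `min_ratio` and `min_chars` are hypothetical and would need tuning on real posters.

```python
def readable_ratio(text):
    """Share of non-whitespace characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def needs_ocr(extracted_text, min_ratio=0.6, min_chars=20):
    """Return True if the directly extracted text looks unusable.

    Both thresholds are assumed values for illustration only.
    """
    non_space = sum(1 for c in extracted_text if not c.isspace())
    # Almost no text at all: the PDF probably contains only an image.
    if non_space < min_chars:
        return True
    # Mostly symbols and digits: the text layer is probably broken.
    return readable_ratio(extracted_text) < min_ratio


print(needs_ocr("$(/§$%5623/@)"))  # True
print(needs_ocr("Background: patients with type 2 diabetes were enrolled."))  # False
```

A heuristic like this is cheap to run on every file, so OCR only has to be paid for where direct extraction fails.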

We then mined searchable keywords from the results of the OCR process by using a controlled vocabulary tailored to the scientific field of medicine, as most of the content we were handling came from the medical domain. Armed with this process and academic vocabulary, we were able to extract relevant, recognizable scientific keywords from the conference posters and make them discoverable to other researchers.
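The idea of matching OCR output against a controlled vocabulary can be sketched as follows. This is a simplified illustration, not the actual implementation: the tiny `VOCABULARY` set, the `tokenize` helper and the greedy longest-match strategy are all assumptions made for the example, and a real medical vocabulary would contain many thousands of terms.

```python
import re

# Hypothetical miniature vocabulary, for illustration only.
VOCABULARY = {
    "myocardial infarction",
    "type 2 diabetes",
    "diabetes",
    "insulin",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(text, vocabulary=VOCABULARY):
    """Return vocabulary terms found in the text, in order of first appearance."""
    terms = {tuple(tokenize(term)): term for term in vocabulary}
    if not terms:
        return []
    max_len = max(len(key) for key in terms)
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that "type 2 diabetes"
        # wins over the shorter term "diabetes".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in terms:
                if terms[candidate] not in found:
                    found.append(terms[candidate])
                i += n
                break
        else:
            i += 1
    return found


print(extract_keywords("Insulin therapy in Type 2 Diabetes after myocardial infarction."))
# ['insulin', 'type 2 diabetes', 'myocardial infarction']
```

Restricting matches to a fixed vocabulary means that leftover OCR noise simply produces no keyword, rather than a wrong one.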

There is a lot of additional information provided by scientific posters that is useful for researchers: institutional involvement, citations, and authors’ organizations all allow scientists to get to know their colleagues and begin collaboration. Finding all of this requires a detailed understanding of how such a document is structured (for example, it is unlikely that the references section would come directly below the title, even if the word “references” appears there as part of a normal sentence).
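As a rough illustration of the kind of structural reasoning involved, the sketch below separates a "References" heading from the same word used inside a sentence. It assumes the OCR step yields one record per text line with a vertical position; the `OcrLine` class, the `HEADING_WORDS` set and the `min_relative_top` threshold are hypothetical, and a positional cut-off like this is only a crude stand-in for real layout analysis.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    text: str   # text of one OCR line
    top: float  # distance of the line's bounding box from the top of the page


# Assumed heading variants; real posters use many more.
HEADING_WORDS = {"references", "literature", "bibliography"}


def is_references_heading(line, page_height, min_relative_top=0.25):
    """True if the line consists only of a heading word and is not near the title."""
    normalized = line.text.strip().strip(":.").lower()
    if normalized not in HEADING_WORDS:
        return False
    return line.top / page_height >= min_relative_top


def find_references_heading(lines, page_height):
    """Return the index of the first references heading, or None if there is none."""
    for index, line in enumerate(lines):
        if is_references_heading(line, page_height):
            return index
    return None


lines = [
    OcrLine("Insulin therapy after myocardial infarction", top=20),
    OcrLine("We list all references in the last section.", top=300),
    OcrLine("References:", top=900),
]
print(find_references_heading(lines, page_height=1000))  # 2
```

The second line contains the word "references" but is a full sentence, so only the third line is treated as the start of the references section.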

Challenges

  • Non-standard and arbitrary layout

The major problem we had in this endeavour was the complete lack of a standardized format for content presented at a scientific conference. The format is usually at the discretion of the individual authors, although some conference organizers do provide templates for content presented at their event. Still, not all organizers offer such templates and not all authors are willing to use them. This put us in the tricky situation of needing to analyze the content like a human would, i.e. by visually inspecting the document and inferring actionable knowledge from it. This process had its own challenges, but the goal of making early-stage research available to the people who build upon it kept us motivated to overcome them.

If you want to help us fulfill our mission of furthering scientific progress by making early-stage research accessible, have a look at our job offerings here.

Morressier

Accelerating scientific breakthroughs
