- What is early stage research?
- Researchers are trying to access knowledge that was previously unavailable to them
- Knowledge discovery helps further scientific progress
- Insight into this data is crucial
The results of the scientific process that drives progress in a field are usually presented and published in an academic publication. Every academic field has journals and publishers that enable this. These contributions and their reception among the members of the scientific community are what allow progress to happen, be it in physics or medicine. The availability of other people’s insights and results is an essential part of this process; without it, we would not be able to “stand on the shoulders of giants” as Isaac Newton put it in 1675.
The way research is conducted, however, has changed a lot since the Renaissance. Today we regularly access knowledge on a scale that most people in history would have found difficult to grasp. This has ushered in a period of unparalleled scientific discovery that has benefited humans enormously.
In practice it can take a long time from the first idea until research is made available to an audience through actual publication in a scientific journal. Most research is presented at an early stage during scientific conferences in the form of scientific posters. The critical shortcoming in this method, however, is the lack of access for other people if they don’t happen to be at the same conference in the same hall at the same time, at the very moment the poster is being presented by its authors.
- Access to scientific results
- Poster information is not usable by machines
This is both a problem and an opportunity: this content could be made accessible to the wider research community. However, the research in a scientific paper or on a conference poster is usually laid out for visual appeal and is not designed to be easily searchable by a machine. That raises several issues if we want to build a search engine for scientific content available only in this form.
Even a seemingly simple task like extracting the plain text from such a poster is already a challenge, because posters do not stick to a fixed layout and contain arbitrarily structured information about their authors, universities, funding institutions and references. While a human can pick this information out of a poster, getting a machine to do it well is no easy feat, because machines are not good at processing unstructured data. To make the content available for processing, we need to structure it.
- Analysing layout and text information
- Extracting pieces of information: Authors, organizations, funding, references, medical field
Our first approach to this problem was simply to extract the plain text from the PDF file itself. This worked sometimes, but essential information about the layout was lost. More problematic was the fact that getting the plain text from the PDF file was very unreliable, because it depended on how the file had been created. In some cases all we could extract was gibberish like “$(/§$%5623/@)”. This was obviously not usable, so we took the next step: optical character recognition (OCR). We used OCR analysis software that could also extract the layout (bounding boxes for words, sentences, and paragraphs) and distinguish between these different units. This was an inherently imperfect process, but substantially more effective than the previous approach because it preserved at least some layout information and produced better, more reliable results.
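To make the fallback decision concrete, here is a minimal sketch of how one might detect unusable embedded text and switch to OCR. It is an illustration only, not the pipeline described above: the function names `looks_like_gibberish` and `choose_extraction` and the 0.6 threshold are assumptions made for this example.

```python
def looks_like_gibberish(text: str, min_letter_ratio: float = 0.6) -> bool:
    """Return True if too few of the visible characters are letters.

    The threshold is an assumed value for illustration, not a tuned one.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # An empty extraction is treated as a failed extraction.
        return True
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_letter_ratio


def choose_extraction(embedded_text: str) -> str:
    """Pick 'embedded' when the PDF text looks readable, otherwise 'ocr'."""
    return "ocr" if looks_like_gibberish(embedded_text) else "embedded"


if __name__ == "__main__":
    print(choose_extraction("$(/§$%5623/@)"))
    print(choose_extraction("Background: Patients with type 2 diabetes"))
```

A character-ratio check like this is crude. It would not catch text that consists of real letters in a scrambled order, so in practice it could only serve as a first filter.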
We then mined searchable keywords from the results of the OCR process by using a controlled vocabulary tailored to the scientific field of medicine, as most of the content we were handling came from the medical domain. Armed with this process and academic vocabulary, we were able to extract relevant, recognizable scientific keywords from the conference posters and make them discoverable to other researchers.
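As a rough illustration of keyword mining with a controlled vocabulary, the sketch below matches OCR text against a small set of terms and prefers the longest phrase at each position. The vocabulary entries, the function name `extract_keywords`, and the `max_words` limit are hypothetical; a real medical vocabulary would be far larger and would need to handle synonyms and spelling variants.

```python
import re


def extract_keywords(text: str, vocabulary: set[str], max_words: int = 3) -> list[str]:
    """Return vocabulary terms found in text, in order of first appearance.

    Vocabulary entries are expected in lowercase, with words separated by
    single spaces.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found: list[str] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                match = candidate
                i += n
                break
        if match is None:
            i += 1
        elif match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    # Hypothetical mini-vocabulary for demonstration purposes.
    vocab = {"myocardial infarction", "diabetes mellitus", "insulin"}
    poster_text = (
        "Insulin therapy after Myocardial Infarction in patients "
        "with diabetes mellitus. Insulin dosing was adjusted weekly."
    )
    print(extract_keywords(poster_text, vocab))
```

Restricting matches to a fixed vocabulary trades recall for precision: terms outside the list are never reported, but OCR noise is also far less likely to be mistaken for a keyword.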
There is a lot of additional information provided by scientific posters that is useful for researchers: institutional involvement, citations, and authors’ organizations all allow scientists to get to know their colleagues and begin collaboration. Finding all of this requires a detailed understanding of how such a document is structured (for example, it is unlikely that the references section would come directly below the title, even if the word “references” appears there as part of a normal sentence).
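The references example can be sketched as a simple layout rule: accept a block as the references heading only if it consists of the heading word alone and does not sit in the title area of the poster. Everything here is an assumption made for illustration, including the `TextBlock` structure, the heading list, and the `min_top` cutoff; real posters would need more robust rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBlock:
    """A text block from OCR, positioned relative to the page (hypothetical)."""
    text: str
    top: float     # distance from the top edge as a fraction of page height
    height: float  # block height as a fraction of page height


# Assumed heading labels; a real list would depend on the corpus.
HEADINGS = {"references", "literature", "bibliography"}


def find_references_heading(blocks: list[TextBlock],
                            min_top: float = 0.3) -> Optional[TextBlock]:
    """Return the first standalone references heading below the title area."""
    for block in sorted(blocks, key=lambda b: b.top):
        label = block.text.strip().rstrip(":").lower()
        # The whole block must be the heading word, not a sentence containing it,
        # and it must not sit in the top part of the page.
        if label in HEADINGS and block.top >= min_top:
            return block
    return None


if __name__ == "__main__":
    blocks = [
        TextBlock("Patient references to pain in primary care", 0.02, 0.05),
        TextBlock("Methods", 0.30, 0.02),
        TextBlock("References:", 0.85, 0.02),
    ]
    heading = find_references_heading(blocks)
    print(heading.text if heading else "no references section found")
```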
- Non-standard and arbitrary layout
- Vital information is hard for machines to extract
The major problem we had in this endeavour was the complete lack of a standardized format for content presented at a scientific conference. The format is usually at the discretion of the individual authors, although some conference organizers do provide templates for content presented at their event. Still, not all organizers offer such templates and not all authors are willing to use them. This put us in the tricky situation of needing to analyze the content like a human would, i.e. by visually inspecting the document and inferring actionable knowledge from it. This process had its own challenges, but the goal of making early-stage research available to the people who build upon it always motivated us to overcome them.
If you want to help us fulfill our mission of furthering scientific progress by making early-stage research accessible, have a look at our job offerings here.