AI and Natural Language Processing and Understanding for Space Applications at ESA

Part IV: Information extraction for Long-Term Data Preservation in space

Andrés García-Silva
8 min read · Nov 2, 2022

By José Manuel Gómez-Pérez, Andrés García-Silva, Rosemarie Leone, Mirko Albani, Moritz Fontaine, Charles Poncet, Leopold Summerer, Alessandro Donati, Ilaria Roma, Stefano Scaglioni

Image source: esa.int

This post is a brief overview of part of a journal paper that is currently under review (see preprint here), where we describe the joint work between ESA and expert.ai to bring recent advances in NLP to the space domain.

We have split the post into several parts:

Introduction

Ensuring and facilitating data accessibility and usability is one of the main goals of the European EO Long Term Data Preservation Framework. Achieving it requires enhancing the ability of machines to automatically find and use scientific information, in addition to supporting reuse by individuals.

Text analytics systems can contribute to this vision by automatically extracting information from relevant sources, such as publications, technical reports, mission feasibility studies, design documents or mission reports, in the form of machine-readable metadata. Such metadata is then instrumental for discovery, e.g. supporting the development of enhanced information retrieval systems such as search and recommendation engines.

Analysis

This case study focuses on information extraction in the space domain. We could not find publicly available annotated text datasets in this domain, nor models trained to detect entities in such documents. Launching an annotation campaign to produce annotated datasets for all the extraction tasks involved was beyond our budget. Thus, a machine learning approach was not feasible for this case study, beyond resorting to a general-purpose model pre-trained to detect common entities.

On the other hand, expert.ai’s text analytics services come with an out-of-the-box general-purpose knowledge model that can be extended to different domains by injecting domain terminology into its underlying knowledge graph. ESA hosts an instance of expert.ai’s text analytics services in its infrastructure, which we decided to extend.

We present the methodology we followed to extract domain-specific terminology and integrate it into expert.ai’s pre-existing knowledge graph, extending and customizing it for the main target domains of the case study, Earth and Environmental sciences, to support the development of domain-specific text analytics services for information extraction.

Approach

In this case study we use expert.ai text analytics services hosted at ESA’s facilities (Cogito Discover) to extract metadata from the documents in the space domain. The text analytics services rely on a general-purpose lexico-semantic knowledge graph where linguistic knowledge is encoded as a semantic network of concepts and relationships between them.

The nodes of this knowledge graph are concepts linked to each other through semantic and linguistic relationships in a hierarchical structure. Each concept has a main lemma, which is a canonical representation of words and collocations without conjugation, number or gender. The meaning of a word or expression in the text emerges as a combination of its main elements after disambiguation (grammar type, concept, definition/gloss, domain, and frequency relations), as well as its different types of connections with other concepts, e.g. hypernymy and hyponymy.
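To make this structure concrete, here is a minimal sketch of how such a lexico-semantic node could be modeled. The class and field names are illustrative assumptions on our part, not expert.ai’s actual (proprietary) data model.

```python
from dataclasses import dataclass, field

# Illustrative data model only; the actual expert.ai graph format is proprietary.
@dataclass
class Concept:
    concept_id: str                # unique node identifier
    main_lemma: str                # canonical form, e.g. "volcanic eruption"
    grammar_type: str              # noun, verb, adjective, ...
    gloss: str = ""                # short definition
    domains: list = field(default_factory=list)    # e.g. ["Earth sciences"]
    relations: dict = field(default_factory=dict)  # e.g. {"hypernym": [...]}

# Example: a domain concept linked to a more general one via hypernymy
eruption = Concept(
    concept_id="c_volcanic_eruption",
    main_lemma="volcanic eruption",
    grammar_type="noun",
    gloss="the ejection of lava, ash and gases from a volcano",
    domains=["Earth sciences"],
    relations={"hypernym": ["c_geological_event"]},
)
print(eruption.main_lemma, eruption.relations["hypernym"])
```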

This explicit representation results in a greater ability to understand language, and it can be adapted to a specific domain by adding new concepts and relations that enrich the metadata extraction and improve text comprehension. Expert.ai’s standard knowledge graph contains approximately 400K lemmas, 300K concepts, and 80 different types of relations, yielding 3 million links between concepts. To make the text analytics services suitable for space, and particularly for Earth and Environmental sciences, we need to extend and adapt the graph.

After a preliminary stage where we leveraged ESA’s internal terminologies and other resources like ESA Technology Tree, the ESA corporate taxonomy and the glossary of long-term preservation of earth observation space data, we focused on Springer Nature’s SciGraph, which contains a document corpus stemming directly from the scientific community in areas relevant to our target domains among many others. SciGraph is a knowledge graph of scholarly communications covering funding agencies, research projects, conferences, affiliations, and publications. SciGraph uses the Fields of Research classification (FOR) to classify the publications.

Focusing on the target domains, we conducted a survey with a group of 12 volcanologists, sea observation scientists and climatologists to determine the most relevant fields of research for their work. To extract our corpus, we applied two filters to the SciGraph Articles Dump: i) publication date after 2016 and ii) Fields of Research within the list of fields selected in the survey.
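As an illustration, the following sketch applies such a filter to a JSON Lines export of the articles dump. The field names and FOR codes here are assumptions for illustration; the actual SciGraph dump uses its own schema.

```python
import json

# Field names and FOR codes below are illustrative assumptions;
# the actual SciGraph articles dump uses its own JSON-LD schema.
SELECTED_FOR_CODES = {"0403", "0405", "0406", "0502", "0503"}

def filter_articles(path):
    """Yield articles published after 2016 whose Fields of Research
    intersect the set of fields selected in the survey."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            year = int(article.get("publication_year", 0))
            for_codes = set(article.get("fields_of_research", []))
            if year > 2016 and for_codes & SELECTED_FOR_CODES:
                yield article

# corpus = list(filter_articles("scigraph_articles.jsonl"))
```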

The resulting corpus contains 49,693 articles with 13M tokens, of which 271K are unique. 61% of the articles (30,190) are labeled as Earth Sciences papers, while the rest (19,503) belong to the Environmental Sciences field. In Earth sciences, the main subcategories are Geology (36%) and Physical Geography (25%), while the remaining subcategories are uniformly distributed over the rest of the sample. In Environmental sciences, Environmental Science Management (55%) and Soil sciences (43%) are dominant.

Terminology analysis

To analyze the corpus we use expert.ai’s text analytics services to detect candidate concepts that are not already encoded in our knowledge graph. Such candidate concepts, along with multi-word expressions also detected by the text analytics engine, are used to enrich the knowledge graph. In addition, named entities like people, organizations, and places are manually inspected to detect errors and improve the accuracy of the named-entity recognition (NER) module. Finally, we carry out a weirdness index analysis to detect words that are specific to the target scientific domains.

We process the corpus and extract metadata from the documents by feeding the text mining services with the title and abstract of each paper. We generate the following metadata (a minimal request sketch follows the list):

  • Domain: Field(s) of knowledge, based on main concepts.
  • Organizations: Organization names or aliases.
  • People: Person names or aliases.
  • Places: Place names or aliases.
  • Known Concepts: Concepts found in the text which are in the knowledge graph.
  • Concepts: Concepts in the document that are not in the graph.
  • Main Syncons: Most relevant concepts mentioned in the text that are represented in the graph.
  • Main Groups: Most relevant noun phrases and multi-word expressions in the text.
  • Main Lemmas: Most frequent lemmas found in the text.
  • Main Sentences: Most relevant sentences found in the text.
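
The sketch below shows what a call to the text analytics service could look like for one paper. The endpoint URL and the response keys are assumptions for illustration; they do not reflect the actual Cogito Discover API.

```python
import requests

# The endpoint and response keys below are assumptions for illustration;
# they do not reflect the actual Cogito Discover API.
ANALYZE_URL = "https://cogito-discover.example/api/analyze"

METADATA_KEYS = ["domains", "organizations", "people", "places", "known_concepts",
                 "concepts", "main_syncons", "main_groups", "main_lemmas", "main_sentences"]

def extract_metadata(title, abstract):
    """Send the title and abstract of a paper to the text analytics service
    and keep only the metadata types listed above."""
    response = requests.post(ANALYZE_URL,
                             json={"text": f"{title}\n\n{abstract}"},
                             timeout=30)
    response.raise_for_status()
    result = response.json()
    return {key: result.get(key, []) for key in METADATA_KEYS}
```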

Table 1 shows the top 10 most frequent lemmas that were not linked to a concept in the knowledge graph. Most of the unknown terms are chemical compounds and measures, which is not surprising since the knowledge graph was not originally conceived to cover Chemistry. The main groups in Table 2 contain phrases of nouns, verbs and prepositions. After identifying the candidates to be included in the graph, a knowledge engineer needs to determine which ones should be represented as concepts, as well as the exact location in the graph and the form of such representation.

Table 3 shows, for each metadata type, the number of words that our analysis found to be already known or unknown in the knowledge graph. Note that not all terms need to be included. For example, only unknown lemmas and groups that are ambiguous need to be integrated into the graph so that they can be disambiguated properly. Since there is a considerable number of unknown lemmas, groups and entities, we apply the Pareto principle and focus on the 20% most frequent words for each metadata type. This subset of words is handed to a team of knowledge engineers and linguists in charge of their integration into the expert.ai knowledge graph. In total, the knowledge engineers need to analyze and process 5,070 words, with an estimated total effort of 2.5 person-months.
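Selecting that subset is a simple frequency cut. A minimal sketch, with toy counts standing in for the real corpus statistics:

```python
from collections import Counter

def pareto_subset(term_counts, fraction=0.2):
    """Return the top `fraction` of terms ranked by frequency."""
    ranked = [term for term, _ in term_counts.most_common()]
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]

# Toy example with counts of unknown lemmas
unknown_lemmas = Counter({"chlorophyll-a": 120, "tephra": 95, "geomorphon": 40,
                          "paleosol": 12, "evapotranspiration": 8})
print(pareto_subset(unknown_lemmas))  # the 20% most frequent unknown lemmas
```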

Another approach to identifying domain terminology is to separate terms that are specific to the reference corpus from those that are more general. To this purpose, similarly to Berquand et al., 2020, we apply Weirdness Index filtering to rank the candidate terms. The Weirdness Index compares the relative frequency of a word in a domain-specific corpus against its relative frequency in a large corpus representing general-purpose language. In this case, we use the British National Corpus (BNC) as our general corpus. Table 4 shows some of the terms with the highest and lowest Weirdness Index in our corpus.
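For reference, the Weirdness Index of a word w is the ratio of its relative frequency in the domain corpus to its relative frequency in the general corpus. A minimal sketch, assuming simple token counts for both corpora and add-one smoothing for words absent from the BNC (the smoothing choice is our assumption, not necessarily the one used in the study):

```python
from collections import Counter

def weirdness_index(word, domain_counts, general_counts, domain_total, general_total):
    """WI(w) = (f_domain(w) / N_domain) / (f_general(w) / N_general).
    Add-one smoothing on the general frequency avoids division by zero
    for words that never occur in the general corpus (an assumption here)."""
    f_domain = domain_counts.get(word, 0)
    f_general = general_counts.get(word, 0) + 1
    return (f_domain / domain_total) / (f_general / general_total)

# Toy example: a domain term vs. a common word
domain = Counter({"tephra": 150, "data": 900})
general = Counter({"data": 50000})  # "tephra" is absent from the general corpus
print(weirdness_index("tephra", domain, general, 100_000, 100_000_000))  # high WI
print(weirdness_index("data", domain, general, 100_000, 100_000_000))    # low WI
```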

Exploiting the extracted metadata

Once the knowledge graph has been extended and adapted to the specific domain, any collection of documents can be processed to extract metadata, which can then be used to improve access to the information they contain. A live demo illustrating the semantic metadata described here, as extracted from Earth and Environmental sciences documents, can be found at: https://reliance.expertcustomers.ai/enrichment

Figure 1. RELIANCE enrichment service demo.

The metadata extracted and generated for document collections can be indexed along with the documents in an Elasticsearch index and visualized through a dashboard built with Kibana (see the image below). Kibana dashboards are interactive and allow visualizing and exploring a document collection based on the distribution of the information extracted from it as semantic metadata.

Figure 2. Kibana dashboard screenshot depicting the distribution of research objects in the NASA Scope and Subject taxonomy.
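A minimal sketch of what such indexing could look like with the Python Elasticsearch client; the index name, document fields and connection details are illustrative, and the call assumes the 8.x client API:

```python
from elasticsearch import Elasticsearch

# Illustrative connection and document; field names mirror the metadata types above.
es = Elasticsearch("http://localhost:9200")

doc = {
    "title": "Monitoring volcanic unrest from space",
    "domains": ["Earth sciences"],
    "organizations": ["ESA"],
    "places": ["Mount Etna"],
    "main_lemmas": ["volcanic unrest", "ground deformation", "remote sensing"],
}

# Index the enriched document; a Kibana dashboard can then aggregate on
# fields such as "domains" or "main_lemmas" to explore the collection.
es.index(index="enriched-articles", document=doc)
```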

These services are also available in the European Open Science Cloud (EOSC). They are currently used, among others, by ROHub, an online platform that manages, preserves, and provides access to research work, including scientific data, code, and literature. ROHub uses them to extract information from research objects in a variety of scientific communities, which currently include Astrophysics and Bioinformatics as well as Earth and Environmental sciences.

The resulting metadata can also be used to enhance search and recommendation engines, alleviating some of the limitations of keyword-based approaches, including query ambiguity and lack of semantics. Keyword-based search engines may miss documents that contain synonyms of query keywords and morphological variations such as verb conjugations or even plurals, with an impact on recall.

By leveraging semantic metadata generated as proposed above, where each concept uniquely identifies a word as well as other semantically related terms like synonyms and hyponyms, search and recommendation engines can be better equipped to deal effectively with ambiguity. Examples of this type of system include Collaboration Spheres, a search-by-example system whose evaluation showed the benefits of this approach for exploring large collections of scientific documents, reducing the cognitive load associated with this task.
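As a simple illustration of how such metadata could help, the sketch below expands a query with graph-derived synonyms before keyword matching; the synonym table is a stand-in for what would actually come from the knowledge graph.

```python
# Stand-in synonym table; in practice these terms would come from the knowledge graph.
SYNONYMS = {
    "sea ice": ["pack ice", "ice floe"],
    "eruption": ["volcanic eruption"],
}

def expand_query(query):
    """Add graph-derived synonyms so a keyword search also matches equivalent terms."""
    return [query] + SYNONYMS.get(query.lower(), [])

print(expand_query("sea ice"))  # ['sea ice', 'pack ice', 'ice floe']
```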

About expert.ai

Expert.ai is a leading company in applying artificial intelligence to text, with human-like understanding of context and intent.

We have 300+ proven deployments of natural language solutions across insurance, financial services and media, leveraging our expert.ai Platform technology. Our platform and solutions are built with out-of-the-box knowledge models to make you ‘smarter from the start’ and get to production faster. Our Hybrid AI and natural language understanding (NLU) approach accelerates the development of highly accurate, custom, and easily explainable natural language solutions.

https://www.expert.ai/
