Laying Foundations for Connecting Data Stewardship Domain Ontologies

Martij
6 min read · Dec 2, 2023


In the domain of data stewardship, numerous ontologies and vocabularies are available for semantically describing data. While this diversity offers flexibility, it also introduces the challenge of selecting the right ontology, determining when to use it, and understanding how to do so effectively.

This article shares the findings of our research on semantic data description in the field of data stewardship, published at the 22nd International Conference on New Trends in Intelligent Software Methodologies, Tools, and Techniques. We examined ontologies and vocabularies relevant to research data management and data management plans across scientific domains. Our primary focus was to analyse their correctness and to identify overlaps and relationships among these resources, in order to define potential new connections that can support the accurate representation of data management plans.

What do we mean by ontologies and vocabularies?

A vocabulary is a collection of terms, each with a clear and consistent definition. An ontology goes beyond a collection of terms by also defining relationships and other connections between them. Both are used across many domains and problems because they capture knowledge in a systematic and uniform way.

Let’s consider an ontology focused on categorizing types of fruits. Within this ontology, we find terms such as “Fruit” in a general sense, along with specific subterms like “Citrus Fruit”, “Pommes” and “Berry,” each having precise and consistent definitions. Additionally, this ontology defines properties, such as the colour of the fruit or its taste.

For instance, if we aim to describe a real strawberry using the terms outlined in this ontology, we can assert that the strawberry falls under the category of “Berry,” specify its colour as red, and characterize its taste as sweet, among other attributes.
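The fruit example can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the identifiers (`strawberry1`, `hasColour`, `hasTaste`, `type`) and the dictionary-based representation are hypothetical and do not come from any published ontology or from our paper. The sketch shows how a class hierarchy lets us infer that a strawberry described as a "Berry" is also a "Fruit".

```python
# Hypothetical mini-ontology: each class maps to its direct superclass.
SUBCLASS_OF = {
    "CitrusFruit": "Fruit",
    "Pommes": "Fruit",
    "Berry": "Fruit",
}

# Facts about one real strawberry as (subject, predicate, object) triples.
# The predicate names are illustrative assumptions.
triples = [
    ("strawberry1", "type", "Berry"),
    ("strawberry1", "hasColour", "red"),
    ("strawberry1", "hasTaste", "sweet"),
]


def ancestors(cls):
    """Return every superclass of cls by walking up the subclass chain."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def types_of(individual):
    """Return the asserted classes of an individual plus inferred superclasses."""
    result = []
    for subject, predicate, obj in triples:
        if subject == individual and predicate == "type":
            result.append(obj)
            result.extend(ancestors(obj))
    return result


print(types_of("strawberry1"))  # ['Berry', 'Fruit']
```

In practice an ontology would be written in a standard language such as OWL or RDFS rather than in Python dictionaries; the point here is only the shape of the information.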

Figure 1 — Schema of an ontology focused on categorizing fruits

Now, imagine we have a table with columns representing the name of the fruit, its colour, and its taste. The rows contain concrete examples of fruits, such as strawberries, apples, and oranges. In this scenario, we can apply the terms from the ontology to annotate the respective columns, providing a structured and standardized way to describe and categorize these fruits.

Figure 2 — Example of using the ontology for annotation

By aligning the dataset with the ontology, we establish a structured and semantically enhanced data environment. Such an alignment gives researchers and data consumers a clear understanding of each fruit’s characteristics: its classification (for example citrus fruit or berry), its colour, and its taste, as defined within the ontology.

This structured data holds significant value for tasks such as querying, analysis, and integration with other datasets, particularly when working with extensive and intricate data collections.
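As a rough sketch of what annotation and querying could look like, the snippet below attaches a term to each column of the fruit table and then filters by an annotated property. The `fruit:` prefix, the term names, and the sample rows are assumptions made for illustration only.

```python
# Hypothetical mapping from table columns to ontology terms.
COLUMN_TERMS = {
    "name": "fruit:name",
    "colour": "fruit:hasColour",
    "taste": "fruit:hasTaste",
}

# Sample table rows (illustrative values).
rows = [
    {"name": "strawberry", "colour": "red", "taste": "sweet"},
    {"name": "apple", "colour": "green", "taste": "sour"},
    {"name": "orange", "colour": "orange", "taste": "sweet"},
]


def annotate(row):
    """Re-key one table row by the ontology term attached to each column."""
    return {COLUMN_TERMS[column]: value for column, value in row.items()}


annotated = [annotate(row) for row in rows]


def query(term, value):
    """Return the names of all fruits whose annotated property equals value."""
    return [r["fruit:name"] for r in annotated if r.get(term) == value]


print(query("fruit:hasTaste", "sweet"))  # ['strawberry', 'orange']
```

Because the query refers to ontology terms rather than column headers, the same query would keep working on another table that names its columns differently but uses the same annotations.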

Research data management and data management plans

Research data is a valuable asset, and managing it effectively is essential to contemporary scientific work. Managing data covers its accurate collection, secure storage, and sharing. These steps are set out in documents known as data management plans (DMPs).

DMPs serve as comprehensive roadmaps, encapsulating the nature of data to be collected or generated, the methods and procedures for data handling, and specifics regarding its storage, security, analysis, documentation, citation, sharing, and accessibility.

For instance, in climate change research, data collection is instrumental in examining environmental trends and phenomena. The data comes from diverse sources and methods, such as satellite observations, weather stations, and climate models that generate data through computer simulations. Data from these different origins is stored in specialized databases and repositories.

The data, stored and well-organized, is made available to others, often through public repositories or scientific journals. This sharing is pivotal for fostering future collaboration and verification of the research findings. Throughout this process, data management plans (DMPs) play a crucial role, addressing not only the technical aspects but also considerations of transparency, accountability, and ethical concerns related to data handling.

What term to use?

The landscape of research data management (RDM) offers a multitude of ontologies and vocabularies, each covering different aspects of the field. This abundance makes it hard to select the most suitable one for semantic annotation. The primary objective of our study was to examine the principal ontologies and vocabularies related to data management plans and RDM, to uncover the intersections and relationships among them, and to identify potential new connections that improve how semantic information is captured.

Our approach

We began by identifying ontologies and vocabularies associated with data management. Our initial focus was on widely recognized ones in this domain, and we expanded the list with others linked to them through references or imports. We then selected the nine most commonly used ones and examined them in depth, visually mapping their concepts and relationships. Our goal was to analyse their content, identify overlaps, highlight any inconsistencies, and suggest new connections that would make semantic annotation more effective.
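Expanding a seed list by following imports amounts to a graph traversal. The sketch below shows one way to do it with a breadth-first search. It is not the procedure we actually ran: the import graph, the ontology names `A` to `E`, and the function `expand` are hypothetical placeholders.

```python
from collections import deque

# Hypothetical import graph: ontology -> ontologies it imports or links to.
IMPORTS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": [],
    "D": ["A"],
    "E": ["A"],
}


def expand(seeds, imports):
    """Breadth-first expansion of a seed list along outgoing import links."""
    seen = list(seeds)      # keeps discovery order
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for target in imports.get(current, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


print(expand(["A"], IMPORTS))  # ['A', 'B', 'C', 'D']
```

Note that `E` is not reached from `A`, because the traversal follows only outgoing links. Whether incoming links should also be followed is a design choice that depends on how broad the candidate list is meant to be.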

Our approach involved a term-by-term analysis of these nine ontologies, coupled with an extensive search for analogous or potentially equivalent terms in other established ontologies.
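A search for analogous terms can be partly supported by simple label matching. The sketch below is an assumption about how such a helper might look, not a description of our analysis: the ontology names `onto1` and `onto2`, the labels, and the 0.8 threshold are all made up for illustration. It uses `difflib.SequenceMatcher` from the Python standard library to score label similarity.

```python
from difflib import SequenceMatcher

# Hypothetical term labels per ontology.
LABELS = {
    "onto1": ["Image", "Dataset", "Person"],
    "onto2": ["Still Image", "Data Set", "Agent"],
}


def normalise(label):
    """Lower-case a label and drop spaces and underscores."""
    return label.lower().replace(" ", "").replace("_", "")


def candidates(labels, threshold=0.8):
    """Pair terms from different ontologies whose labels look alike."""
    names = sorted(labels)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            for a in labels[left]:
                for b in labels[right]:
                    score = SequenceMatcher(
                        None, normalise(a), normalise(b)
                    ).ratio()
                    if score >= threshold:
                        pairs.append((f"{left}:{a}", f"{right}:{b}"))
    return pairs


print(candidates(LABELS))
```

Label similarity only produces candidates. Pairs such as "Person" and "Agent" share a meaning without sharing a spelling, and similar spellings can hide different meanings, so the definitions still have to be compared by hand.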

Results

During the analysis of selected ontologies and vocabularies, we constructed a model that encapsulates their critical components. The model clearly shows that these chosen ontologies belong to a single domain, thereby facilitating their seamless connection through descriptive attributes.

Overlaps of ontologies

We found 210 existing connections, of diverse relation types, between the selected ontologies and vocabularies. These connections are links between terms originating from different ontologies, or references from one ontology to terms within another. This number shows that ontologies and vocabularies in the field of RDM are highly interconnected.
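To make the notion of a connection between ontologies concrete, here is a small sketch that counts links whose two ends belong to different ontologies and groups them by relation type. The links, prefixes, and relation names are hypothetical and are not taken from our data.

```python
from collections import Counter

# Hypothetical links: (source term, relation, target term).
# The part before the colon identifies the ontology a term belongs to.
links = [
    ("a:Dataset", "subClassOf", "b:Resource"),
    ("a:Creator", "equivalentClass", "c:Agent"),
    ("b:Resource", "subClassOf", "b:Thing"),
    ("c:Agent", "seeAlso", "a:Person"),
]


def prefix(term):
    """Return the ontology prefix of a prefixed term."""
    return term.split(":", 1)[0]


def cross_ontology(all_links):
    """Keep only links whose two ends come from different ontologies."""
    return [link for link in all_links if prefix(link[0]) != prefix(link[2])]


crossing = cross_ontology(links)
by_relation = Counter(relation for _, relation, _ in crossing)

print(len(crossing))  # 3
print(by_relation)
```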

Figure 3 — The visualisation of the ontologies and vocabularies related to data management and their interrelationships

Definition of new relationships

During the analysis, we found many places where the ontologies and vocabularies overlap. We identified over 30 such areas, in which related terms could be interconnected based on their definitions or semantic similarity.

In these instances of overlap, a lack of proper interconnection became apparent. For instance, we encountered four distinct terms across different ontologies, all representing the concept of an “image,” yet they remained isolated from one another. This situation presented users of these ontologies with a considerable challenge, as they faced the difficult decision of selecting the most appropriate term for their semantic annotation. The terms, while similar in meaning, lacked a clear connection.
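The situation with the four "image" terms can be sketched as a check for missing links. The term names and prefixes below are placeholders, since the concrete terms are not listed here, and the helper `missing_links` is hypothetical.

```python
from itertools import combinations

# Hypothetical terms that all denote an image, one per ontology.
image_terms = ["o1:Image", "o2:Image", "o3:StillImage", "o4:Picture"]

# Explicit mapping links already present between these terms.
# Empty here, mirroring the isolated terms described above.
existing = set()


def missing_links(terms, known):
    """List every pair of same-concept terms with no link in either direction."""
    return [
        (a, b)
        for a, b in combinations(terms, 2)
        if (a, b) not in known and (b, a) not in known
    ]


print(len(missing_links(image_terms, existing)))  # 6
```

Four terms give six unlinked pairs. If the chosen relation is transitive, as `owl:equivalentClass` is, three links forming a chain would already connect all four terms.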

Errors and inaccuracies

Throughout our analysis, we discovered numerous errors and inaccuracies. Notably, two distinct terms for a person or organization were defined in different ontologies, despite having identical meanings. Although these terms were interconnected with other related terms, no explicit relation between the two signified their equivalence, which makes the ontologies inconsistent in their definitions.

Conclusion

In this work, we identified and analysed the main ontologies and vocabularies related to data management. The result is a comprehensive model of their contents, including roughly 210 relationships between them. We pinpointed more than 30 areas where additional connections could be established, based on the meanings of the terms and the absence of proper links. Based on this investigation and ontology matching, we suggest applying the chosen vocabularies and ontologies in combination, which would make the choice of terms clearer during annotation in the field of data management. We also discovered various inaccuracies and errors within the analysed ontologies and vocabularies.

Our work provides a foundation for further research aimed at making data management plans, and data management in general, more effective and efficient.

Acknowledgements: This work was supported by the Student Summer Research Program 2022 of FIT CTU in Prague.
