What does FAIR mean for data practices in AI?
By Anelia Kurteva, Paul Groth, Christine Kirkpatrick and Elena Simperl
Recent advancements in generative AI, while promising, raise concerns about reliability and safety, necessitating responsible development and multidisciplinary collaboration. AI heavily relies on data. However, the process of preparing and using data for AI is complex and faces several socio-technical challenges: ensuring data quality (clean, accurate data, free from errors and inconsistencies), dealing with bias, protecting privacy and security (safeguarding sensitive information and complying with data privacy regulations), and the ethical implications of all of these. Experts across the world and at the Open Data Institute (ODI) have been working on various solutions and policy interventions to address these challenges.
In this piece, Dr. Anelia Kurteva, a senior postdoctoral researcher at King’s College London, and experts in the field of findable, accessible, interoperable, reusable (FAIR) [1] data and artificial intelligence (AI), Prof. Paul Groth (University of Amsterdam), Christine Kirkpatrick (University of California San Diego) and Prof. Elena Simperl (ODI and King’s College London), discuss how having more FAIR data can contribute to developing responsible AI (RAI) systems and the importance of responsible data management and stewardship.
One way to make data findable is through semantic technologies (e.g., ontologies, knowledge graphs), which naturally support knowledge discovery by bridging knowledge silos across different organisations, software agents and human experts [2][3]. Ontologies help establish a common vocabulary that harmonises the knowledge of different domain experts, and using them to semantically enrich raw data helps represent its context. By assigning URIs, each data point becomes a meaningful and uniquely identifiable thing (a “thing, not a string”) that is easily findable. The added meaning (or semantics) can guide developers in finding and filtering the most suitable data to combine into a dataset for AI, and having semantically enriched data in a findable, machine-readable format helps AI derive further inferences and learn from mistakes. Studies [4][5][6] have shown that knowledge graphs can also improve the accuracy and explainability of AI’s decision-making. When a dataset is already available but the data itself lacks added semantics (e.g., raw data), a minimum viable solution could be to provide a semantic (or metadata) description of the dataset as a whole, which not only gives an overview of the dataset’s contents but also supports its findability by both machines and humans through the use of URIs.
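As a concrete illustration, the sketch below builds such a minimal, URI-based dataset description in Python using the rdflib library and the DCAT and Dublin Core vocabularies. The dataset URI and property values are hypothetical placeholders, not a prescribed schema.

```python
# A minimal, machine-readable description of a dataset as a whole,
# using DCAT and Dublin Core terms. All URIs are illustrative.
from rdflib import Graph, Literal, RDF, URIRef
from rdflib.namespace import DCAT, DCTERMS

g = Graph()
dataset = URIRef("https://example.org/datasets/sensor-readings")  # hypothetical URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Factory sensor readings", lang="en")))
g.add((dataset, DCTERMS.description,
       Literal("Hourly temperature readings collected from factory sensors.", lang="en")))
g.add((dataset, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, DCAT.theme, URIRef("https://example.org/themes/manufacturing")))

# Serialising to Turtle yields a description that both humans and
# machines can find and interpret.
print(g.serialize(format="turtle"))
```

Because every element is identified by a URI, the description can be indexed, queried and linked to other vocabularies without ambiguity.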
In this direction, the MLCommons alliance has proposed the Croissant vocabulary [7] for making datasets ML-ready. By providing detailed metadata descriptions of datasets across four layers of granularity (i.e., dataset, resource, structure and semantic), Croissant makes it easier for the ML community to use and share datasets responsibly. Its integration with existing popular repositories like Hugging Face, Kaggle and OpenML has shown promising results (e.g., there are now over 400,000 datasets available in the Croissant format) [7]. To help users work with Croissant data, a free tool called the Croissant Editor has been developed, which supports users in validating, annotating and even creating Croissant datasets from scratch. The community has also recognised the need for more descriptive metadata to support responsible AI (RAI) principles (e.g., transparency, accountability, privacy, security and reliability) and has already commenced work on a Croissant RAI extension. A pivotal step here is the planned reuse of, and alignment with, existing standards such as the National Institute of Standards and Technology (NIST) AI Risk Management Framework and widely used vocabularies like the Data Use Ontology (DUO), the Data Privacy Vocabulary (DPV) and the Data Catalogue Vocabulary (DCAT). Others, such as the ESIP Data Readiness Cluster, have proposed checklists for the consistent evaluation of data’s readiness for AI in terms of its access, quality and documentation.
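To give a flavour of what this looks like in practice, below is a simplified, illustrative Croissant-style description, built as a plain Python dictionary and serialised as JSON-LD. The property names follow the published Croissant format [7], but the dataset itself is hypothetical and real Croissant records carry additional required properties.

```python
# A simplified, illustrative Croissant description covering the four
# layers of granularity. The dataset and its files are hypothetical.
import json

croissant_metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    # Dataset layer: high-level, human-oriented information.
    "name": "factory-sensor-readings",
    "description": "Hourly temperature readings from factory sensors.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Resource layer: the files the dataset comprises.
    "distribution": [{
        "@type": "cr:FileObject",
        "@id": "readings.csv",
        "contentUrl": "https://example.org/data/readings.csv",
        "encodingFormat": "text/csv",
    }],
    # Structure and semantic layers: record sets, their fields and types.
    "recordSet": [{
        "@type": "cr:RecordSet",
        "name": "readings",
        "field": [
            {"@type": "cr:Field", "name": "timestamp", "dataType": "DateTime"},
            {"@type": "cr:Field", "name": "temperature_c", "dataType": "Float"},
        ],
    }],
}

print(json.dumps(croissant_metadata, indent=2))
```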
However, findable data is not always easily accessible [1] to different entities (e.g., people, AI and organisations), online and offline. This can be due to the limited specification of access rights and, in some cases, the underlying lack of established responsibilities across data’s lifecycle in AI ecosystems. It is important to put guidelines for accessing the data in place (e.g., authentication details and authorisation processes), and to document and disclose them when needed. This facilitates both access to the data and its protection from misuse by unauthorised entities. Gaining access to the data before it is used for AI training (i.e., during pre-processing) allows for cleaning, normalisation and analysis (e.g., exploring how the data is distributed), which can help spot interesting patterns earlier and guide algorithm specification. This, however, is not the only point at which easily accessible data is desirable. During the training, testing and deployment of the model, errors (e.g., incorrect predictions, hallucinations) may call for a full AI audit that requires access to both the data and the underlying algorithms. From a legal perspective, in most cases individuals also have the right to request that companies share with them all their (personal) data used for AI training [8]. This, however, is technically challenging for foundation models, which learn patterns across huge amounts of data. The fact that developers of such models pay limited attention to recording data provenance does not help either. Moreover, training and fine-tuning are complex processes in which multiple datasets are filtered, processed and combined, in full or in part. Most organisations have not been transparent about their AI data practices, under justifications such as trade secrecy (i.e., protecting data as a trade secret) [9].
For many years now, ontologies such as PROV-O [10] and languages like the Open Digital Rights Language (ODRL) [11] have been used to represent provenance information and to define data access and usage rights in a way that can be understood and shared between different systems, even if those systems were created in different environments or with different tools. Having such information in a machine-readable format, accessible to trusted parties, can facilitate better data transparency and also ease regulatory compliance (as shown in [12][13]) by clearly defining the responsibilities of different entities, which ultimately leads to greater accountability and helps build trust. When personal data is a concern, techniques such as synthetic data and privacy-enhancing technologies (PETs) such as federated learning have been proposed as viable solutions. We discuss synthetic data in more detail in ODI’s previous blog post on the topic, which includes a tutorial aimed at developers and the more code-savvy data practitioners. The ODI has already made advances in this direction: it has proposed a data ecosystem mapping tool and is currently investigating the maturity and usability of different PETs.
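As a minimal sketch of what such machine-readable provenance could look like, the Python snippet below uses rdflib and the PROV-O vocabulary to record which dataset a training run used, which model it produced and who is accountable for it. The training run, model and agent URIs are hypothetical.

```python
# Minimal PROV-O provenance for an AI training run. All URIs are
# illustrative placeholders.
from rdflib import Graph, Namespace, RDF
from rdflib.namespace import PROV

EX = Namespace("https://example.org/")
g = Graph()

training_run = EX["activities/train-model-v1"]
source_data = EX["datasets/sensor-readings"]
model = EX["models/failure-predictor-v1"]
team = EX["agents/ml-team"]

g.add((training_run, RDF.type, PROV.Activity))
g.add((source_data, RDF.type, PROV.Entity))
g.add((model, RDF.type, PROV.Entity))
g.add((team, RDF.type, PROV.Agent))

g.add((training_run, PROV.used, source_data))      # the data that went in
g.add((model, PROV.wasGeneratedBy, training_run))  # the model that came out
g.add((model, PROV.wasAttributedTo, team))         # who is accountable for it

print(g.serialize(format="turtle"))
```

Even this small record answers audit questions such as “which data did this model learn from?” in a form that different systems can exchange and query.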
A prerequisite for all of this is that data is also easily interoperable: represented in a format that ensures its clear and easy interpretation when shared between AI systems, people and organisations. Adding machine-readable semantics to data is a common go-to solution for interoperability, as one can define in detail the different types of data in a dataset, their constraints and the relationships between them [14][15]. The Croissant vocabulary helps in this direction as well, as it provides the semantics (or metadata) needed to enrich existing raw datasets with context at different levels of granularity. Using Croissant, one can define not only high-level information about the dataset, such as its name and what it is about, but also concrete details about its contents, such as the files it comprises and their formats (Croissant supports metadata descriptions of multimodal datasets) and the data within those files. Having such metadata ensures that the dataset remains relevant and usable, and can be correctly interpreted in different contexts [7].
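For developers, this interoperability pays off at loading time. The sketch below assumes the mlcroissant reference library (pip install mlcroissant); the metadata URL and the record-set name are placeholders for a real Croissant-described dataset.

```python
# Loading a Croissant-described dataset with the mlcroissant library.
# The URL and record-set name below are hypothetical placeholders.
import mlcroissant as mlc

# Load and validate the JSON-LD metadata.
dataset = mlc.Dataset(jsonld="https://example.org/my-dataset/croissant.json")

# The same metadata that makes the dataset findable also drives loading:
# records are yielded with the fields and types declared in the RecordSet.
for record in dataset.records(record_set="readings"):
    print(record)
```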
Facilitating the findability, accessibility and interoperability of data also supports its wider reuse, which helps cultivate a more sustainable digital ecosystem. Existing public datasets that are easily findable and accessible online, and that have clear descriptions to aid interpretation, are much more likely to be reused for training. Some datasets (e.g., the Stanford Question Answering Dataset (SQuAD)) have already been established as benchmarks for evaluating AI performance. Reusing such benchmarks supports standardisation, comparability and fair evaluation across different AI systems. Moreover, reusing insights and results from previous dataset studies can uncover new research directions, accelerate AI development and foster new collaborations. However, it is also important to consider the possibility of inheriting the biases existing benchmark datasets carry and to devise appropriate strategies for dealing with them.
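When a benchmark dataset is findable and well described, reusing it is often close to a one-liner. The sketch below uses the Hugging Face datasets library (pip install datasets) to load SQuAD and inspect an example.

```python
# Reusing the SQuAD benchmark via the Hugging Face `datasets` library.
from datasets import load_dataset

squad = load_dataset("squad")         # dataset id on the Hugging Face Hub
print(squad)                          # splits, sizes and features
print(squad["train"][0]["question"])  # a single training example
```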
The licence attached to each dataset is another thing to consider when reusing it, as it governs how the dataset can be utilised, disseminated and adapted. Licences serve a twofold purpose: safeguarding the intellectual property of the authors while supporting open science and technological innovation. Reuse also plays a role in supporting sustainability in technology development. AI’s advancements and widespread adoption are driving a rising demand for electricity and hardware manufacturing (e.g., data centres to store data and servers to remotely train AI) [16]. This motivates the further reuse and adaptation of existing AI models and datasets, and calls for their responsible development in terms of sustainability as well. Having more FAIR data, along with clearer model and data documentation, can bring us a step closer to realising this goal.
In summary, by adopting the FAIR data principles, we can make data more discoverable through clear metadata and indexing, accessible to a wider audience through standardised access protocols, interoperable across different AI systems and people, and reusable for various purposes with appropriate licensing and documentation. This ultimately fosters more efficient, transparent and ethical AI development and cultivates an ecosystem where the community can build upon existing work, accelerate innovation, and ensure the reliability and robustness of AI’s inner workings.
References
[1] Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E. and Bouwman, J., 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), pp.1–9.
[2] Fensel, D., 2001. Ontologies. In: Ontologies: A silver bullet for knowledge management and electronic commerce. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-04396-7_2
[3] Benjamins, R., Fensel, D. and Gómez-Pérez, A., 1998. Knowledge management through ontologies. CEUR Workshop Proceedings (CEUR-WS.org).
[4] Chhetri, T.R., Kurteva, A., Adigun, J.G. and Fensel, A., 2022. Knowledge graph based hard drive failure prediction. Sensors, 22(3), p.985.
[5] Tiddi, I. and Schlobach, S., 2022. Knowledge graphs as tools for explainable machine learning: A survey. Artificial Intelligence, 302, p.103627.
[6] Gaur, M., Faldu, K. and Sheth, A., 2021. Semantics of the black-box: Can knowledge graphs help make deep learning systems more interpretable and explainable? IEEE Internet Computing, 25(1), pp.51–59.
[7] Akhtar, M., Benjelloun, O., Conforti, C., Gijsbers, P., Giner-Miguelez, J., Jain, N., Kuchnik, M., Lhoest, Q., Marcenac, P., Maskey, M. and Mattson, P., 2024, June. Croissant: A Metadata Format for ML-Ready Datasets. In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (pp. 1–6).
[8] Information Commissioner’s Office (ICO), 2023. How do we ensure individual rights in our AI systems? Available at https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/how-do-we-ensure-individual-rights-in-our-ai-systems/
[9] Attamongkol, T., 2023. Promoting the Transparency of AI-Generated Inferences. In: The Quest for AI Sovereignty, Transparency and Accountability. FGV RIO Law. Available at https://direitorio.fgv.br/en/publication/quest-ai-sovereignty-transparency-and-accountability
[10] Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S. and Zhao, J., 2013. PROV-O: The PROV Ontology. W3C Recommendation, 30 April 2013.
[11] De Vos, M., Kirrane, S., Padget, J. and Satoh, K., 2019. ODRL policy modelling and compliance checking. In Rules and Reasoning: Third International Joint Conference, RuleML+RR 2019, Bolzano, Italy, September 16–19, 2019, Proceedings 3 (pp. 36–51). Springer International Publishing.
[12] Kirrane, S., Villata, S. and d’Aquin, M., 2018. Privacy, security and policies: A review of problems and solutions with semantic web technologies. Semantic Web, 9(2), pp.153–161.
[13] Jesus, V. and Pandit, H.J., 2022. Consent receipts for a usable and auditable web of personal data. IEEE Access, 10, pp.28545–28563.
[14] Bittner, T., Donnelly, M. and Winter, S., 2005. Ontology and semantic interoperability. In Large-scale 3D data integration (pp. 139–160). CRC Press.
[15] Liyanage, H., Krause, P. and De Lusignan, S., 2015. Using ontologies to improve semantic interoperability in health data. BMJ Health & Care Informatics, 22(2).
[16] Electric Power Research Institute (EPRI), 2024. Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption, Available at https://www.wpr.org/wp-content/uploads/2024/06/3002028905_Powering-Intelligence_-Analyzing-Artificial-Intelligence-and-Data-Center-Energy-Consumption.pdf