The Intersection of the EU AI Act and AI Chain of Custody

Shawn Flaherty
Tranquil Data
Aug 17, 2024

AI chain of custody refers to the management of data throughout its lifecycle, ensuring integrity, security, traceability, and compliance with regulatory and business requirements. As AI technology becomes increasingly prevalent across industries, establishing a robust AI chain of custody is essential for deploying safe AI solutions that achieve positive business outcomes.

Given the vital role of the chain of custody in creating healthy AI systems, it’s unsurprising that new regulations are being drafted worldwide that incorporate strong chain of custody provisions. One such regulation is the EU AI Act, which aims to “ensure the safe and ethical development and deployment of artificial intelligence (AI) technologies within the European Union.”

This article discusses the EU AI Act’s chain of custody requirements, the challenges companies will face in meeting them, and how organizations can ensure that their AI efforts are ethical and aligned with business objectives.

Data Collection: See EU AI Act Article 10(2): Documentation requirements ensure transparency and traceability

Data collection measures are the first step in the AI chain of custody, and are central to the EU AI Act. Specifically, collection governance involves meticulously tracking the origins of the data and the methods of collection, and ensuring that the necessary permissions or licenses are obtained. Organizations must be able to trace and audit where data originated and the rules for its eventual downstream use.
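
As a concrete illustration, the context that needs to be captured at collection time can be represented as a structured record that travels with the data. The sketch below is ours, not from the Act; the ProvenanceRecord class and its fields are hypothetical stand-ins for whatever schema an organization adopts.

```python
# A minimal sketch of a collection-time provenance record; the class
# and field names are illustrative, not prescribed by the EU AI Act.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProvenanceRecord:
    """Captures the context that should stay attached to a dataset."""
    dataset_id: str
    source: str                           # where the data originated
    collection_method: str                # how it was collected
    license_or_consent: str               # the permission authorizing use
    permitted_purposes: tuple[str, ...]   # rules for downstream use
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# Example record for a hypothetical dataset
record = ProvenanceRecord(
    dataset_id="claims-2024-q2",
    source="partner_hospital_feed",
    collection_method="contracted data share",
    license_or_consent="DPA-1182, consent scope: care coordination",
    permitted_purposes=("care_coordination", "quality_reporting"),
)
```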

Data lakes are designed to store vast amounts of structured and unstructured data in its raw format, making them an ideal repository for the diverse datasets needed for AI and machine learning applications. They enable organizations to collect data from various sources, such as transactional databases, social media, and sensors, and ingest all this data into a single location. This centralized storage facilitates the training of AI models on large, diverse datasets, which is essential for developing robust and accurate AI systems.

Unfortunately, the AI chain of custody becomes challenging when data is ingested into a data lake. In such environments, the critical context surrounding data — such as its source, collection methods, and rules for its use — can easily be obscured or lost. Traditional approaches to documentation and access control do not scale well, leading to a broken chain of custody and risks to data quality, accuracy, and ethical use.
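
One way to keep that context from being lost at ingestion is to refuse to land data in the lake without its provenance. The sketch below reuses the hypothetical ProvenanceRecord from the previous example and an assumed lake path; a real deployment would more likely use a data catalog or metadata service than sidecar files, but the principle is the same.

```python
# Hypothetical ingestion step that will not land data in the lake
# without its provenance record; the path and helper are illustrative.
import json
from pathlib import Path

LAKE_ROOT = Path("/lake/raw")  # assumed lake location

def ingest(dataset_id: str, payload: bytes,
           provenance: ProvenanceRecord) -> Path:
    """Store the raw bytes and a sidecar provenance file together,
    so the context travels with the data instead of living in a wiki."""
    target_dir = LAKE_ROOT / dataset_id
    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / "data.bin").write_bytes(payload)
    sidecar = {
        "source": provenance.source,
        "collection_method": provenance.collection_method,
        "license_or_consent": provenance.license_or_consent,
        "permitted_purposes": list(provenance.permitted_purposes),
        "collected_at": provenance.collected_at.isoformat(),
    }
    (target_dir / "_provenance.json").write_text(
        json.dumps(sidecar, indent=2)
    )
    return target_dir
```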

Preprocessing: See EU AI Act Article 10(2): Documentation requirements ensure transparency and traceability & Article 9: The need for risk management, including the ongoing monitoring of AI systems

Preprocessing is a critical stage in the AI chain of custody, involving the transformation and preparation of raw data into a format suitable for training AI models. This stage includes various activities such as data cleaning, normalization, transformation, and augmentation.

These preprocessing activities often lead to the loss of critical context. For example, during data cleaning and normalization, important metadata about the source and collection methods of the data can be stripped away. Transformations that alter the data’s structure can obscure its original form and lineage, making it difficult to trace the data back to its origin. Data augmentation can introduce synthetic elements that, while beneficial for model performance, may complicate the understanding of what constitutes permissible use.
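
A lineage log that records every transformation is one way to keep preprocessing from silently erasing this context. The following sketch uses pandas and an apply_step helper of our own devising; it illustrates the idea rather than any mechanism prescribed by the Act.

```python
# Sketch of lineage-aware preprocessing: each step appends to a lineage
# log instead of silently replacing the data. Names are illustrative.
from typing import Callable
import pandas as pd

def apply_step(df: pd.DataFrame,
               step: Callable[[pd.DataFrame], pd.DataFrame],
               lineage: list[dict], description: str) -> pd.DataFrame:
    """Run one preprocessing step and record what it changed."""
    rows_in = len(df)
    out = step(df)
    lineage.append({
        "step": step.__name__,
        "description": description,
        "rows_in": rows_in,
        "rows_out": len(out),
    })
    return out

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

lineage: list[dict] = []
df = pd.DataFrame({"age": [34, None, 51], "zip": ["02139", "10001", None]})
df = apply_step(df, drop_incomplete_rows, lineage,
                "cleaning: removed rows with missing fields")
# lineage now documents the transformation for later audits
```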

Data Usage: See EU AI Act Article 14: Human oversight to ensure appropriate use & Article 12: Record-keeping requirements for data access and usage logs

Data usage is a pivotal aspect of the AI chain of custody, encompassing how data is accessed and used throughout its lifecycle. Common safeguards include documentation (e.g., lawyers drafting guidelines for permissible use), training (e.g., teaching engineers the rules), and access controls that restrict datasets to those who have been trained. These safeguards fail in two common ways. First, while access control systems restrict who can view or interact with data, they do nothing to ensure that data is used in accordance with its intended purpose: as explained in the collection and preprocessing stages, the data engineers use to build AI models has been stripped of its context (e.g., the rules associated with it), leaving them with no clear mapping of which rules apply to which data elements. Second, even when context is attached to the data, access control relies on human memory and the correct interpretation of usage rules. This is especially acute for AI systems, where the volume of data and the number of associated rules create a complex matrix that defines permissible use.

The most common examples of misuse include using a dataset consented to for one purpose for a different purpose, using third-party data beyond what the data-rights provisions of a contract permit, and using data in ways that violate state, federal, or international law.
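
A purpose check at the point of use is one way to catch the first of these failures mechanically. The sketch below reuses the hypothetical ProvenanceRecord and record from the collection example; check_purpose and PurposeViolation are illustrative names, not a standard API.

```python
# Illustrative purpose check at the point of use: access is granted only
# when the requested purpose appears in the data's permitted purposes.
class PurposeViolation(Exception):
    pass

def check_purpose(provenance: ProvenanceRecord,
                  requested_purpose: str) -> None:
    """Fail closed if the use falls outside the consented scope."""
    if requested_purpose not in provenance.permitted_purposes:
        raise PurposeViolation(
            f"'{requested_purpose}' is not permitted for dataset "
            f"{provenance.dataset_id}; allowed: "
            f"{', '.join(provenance.permitted_purposes)}"
        )

check_purpose(record, "care_coordination")   # passes silently
# check_purpose(record, "marketing_model")   # would raise PurposeViolation
```

Because the check reads the rules directly from the provenance attached to the data, it does not depend on an engineer remembering or correctly interpreting a usage matrix.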

Conclusion

As organizations tackle the complexities of data collection, preprocessing, and usage, prioritizing transparency and robust governance is essential to meet regulatory standards and ensure responsible AI deployment throughout the data lifecycle. Given that manual interventions fall short in maintaining the chain of custody, there is a pressing need for new infrastructure that captures data context before aggregation, guarantees correct use, and provides transparency to demonstrate compliance with all requirements. The missing infrastructure is the software we’ve built at Tranquil Data, which is becoming a core component of next-generation AI.
