Tracking Machine Learning Data

In the data-centric AI era, where will data originate and is it reliable? Here are three key considerations.

MIT IDE
MIT Initiative on the Digital Economy
7 min read · Jul 8, 2024


By Irving Wladawsky-Berger

Given the explosive growth of AI — and the advent of foundation models, including generative AI, large language models, and chatbots — demand for data keeps increasing. No matter what an organization hopes to achieve, success is impossible without ready access to high-quality data. But as data becomes ever more critical to business success, how are enterprises adjusting? Where will all the data needed to train AI models come from? And what potential legal and ethical issues do companies need to watch out for?

To explore these important questions, I moderated a panel on The Data Provenance Dilemma at the 2024 MIT Sloan CIO Symposium on May 14. The premise was that “Generative AI models are highly dependent on the quality and quantity of data used in their training. But, many of these models are being trained on vast, diverse, and inconsistently documented datasets that have been raising serious concerns about the legal and ethical risks involved.”

The panel included Mike Mason, chief AI officer of the technology consultancy Thoughtworks; Shayne Longpre, a doctoral candidate at the MIT Media Lab; and Robert Mahari, who is pursuing a joint JD-PhD degree at Harvard Law School and the MIT Media Lab.

Irving Wladawsky-Berger with Robert Mahari at the CIO Symposium

Let me summarize three key points we discussed.

1. General-purpose (GenAI) versus task-specific (deep learning) AI models

We discussed the differences between large general-purpose foundation models and task-specific deep learning (DL) models. One major difference is that, unlike the task-specific training of earlier DL models, foundation models are trained on much larger volumes of data and use transfer learning to take the knowledge learned from one task and apply it to a variety of related tasks.

Another important difference is that with domain-specific DL models you need to find the data required to train a model for the specific task you're after, whereas foundation models like OpenAI's ChatGPT and Google's Gemini are available from vendors and can then be adapted with domain-specific fine-tuning.
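To make that difference concrete, here is a minimal, hypothetical sketch of the second approach: starting from a general-purpose pretrained model and fine-tuning it on domain-specific data, rather than training a task-specific model from scratch. It assumes the Hugging Face transformers and datasets libraries; the model and dataset names are stand-ins chosen for illustration, not anything the panel discussed.

```python
# A minimal fine-tuning sketch (illustrative only): adapt a general-purpose
# pretrained model to a narrow text-classification task using a small amount
# of domain-specific labeled data.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Stand-in for an organization's own labeled, task-specific data.
dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Convert raw text into fixed-length token IDs the model expects.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Start from general-purpose pretrained weights; only the small
# classification head is new, and the whole model is then fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="finetuned-model",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))

trainer.train()  # a short fine-tuning run on the domain-specific subset
```

The same pretrained checkpoint could be fine-tuned for many different downstream tasks, which is the transfer-learning advantage described above.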

Mason noted that with the advent of the big data revolution in the 2000s, organizations realized that they needed to start saving every scrap of data to train their AI models: structured and unstructured data, as well as the clickstreams that record how users navigate websites, the so-called digital exhaust people leave behind as they move around the internet.

Mason added that there are instances where a relatively small, well-trained, task-specific model gives better results than a large, general-purpose GenAI model. “We did some work with a DL-based application for a fashion and apparel retailer that recommends articles as clients browse their online catalog. Their existing system was so well trained with many years of data, and its algorithms were so highly tuned, that the new GenAI model actually performed worse. Depending on what you’re trying to accomplish, you may be better off with a highly tuned domain-based DL model than with a very general model.”

Longpre said that while general-purpose models can be applied to just about anything, they can also be misused in many different ways, especially if people place too much confidence in the models and don’t carefully review the outputs they generate. GenAI models can be so compelling that you may be psychologically convinced that a response is really great when, in fact, the model might not have correctly interpreted your prompt. It might generate so-called hallucinations, that is, false or misleading information. We need to learn how to use these models properly, but literacy about how to use them appropriately is still lagging.

2. The economic value of data

Mason added that while everybody realizes that data is a valuable business asset, there’s so much fuzziness in the data gathered by most organizations that it’s very difficult to measure its economic value, and thus to properly estimate the return on data to the business.

We need concrete metrics to better understand the value of specific AI use cases, such as resolving customer service queries faster and more accurately, or using AI to improve the productivity of software developers. The problem is that measuring people’s productivity is actually quite difficult unless you have a group of people solving the same problem, so that you can compare whether some new tool helped them do a better job this time than last time. Even then, there’s a lot of subjectivity in such comparisons.

Enterprises are highly complex systems. You cannot think about how best to organize a company to take advantage of AI the way we organize better-understood technological or engineering systems. It’s important to take an iterative, Agile approach of conducting experiments with AI use cases: make a small, incremental change, measure how well it worked, and decide whether to continue in the same direction or do something different.

3. Legal and ethical issues

Mahari, along with Longpre and several other researchers, has been tracking the datasets used to train AI models to try to figure out their provenance. Who made them, and why? Are they reliable, or are they biased and toxic? They’re trying to understand AI as a kind of supply-chain social process that includes different actors with different incentives. “I think it’d be helpful, certainly for academics, but also in certain business contexts to try to understand how to use AI without accidentally violating regulations or otherwise exposing yourself to risk.”

Longpre and Mahari are co-founders of the Data Provenance Initiative (DPI), a multi-disciplinary volunteer group that includes legal and technology experts. The DPI conducted a large-scale audit of the datasets used to train large language models in order to improve the transparency, documentation, and informed use of AI datasets. The initial results of the audit were published in a November 2023 arXiv paper.

“The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners,” says the paper’s abstract. “To remedy these practices threatening data transparency and understanding, we convened a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We developed tools and standards to trace the lineage of these datasets, from their source, creators, license conditions, properties, and subsequent use.”

Their analysis found “frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. The results point to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. … As a result, much of this data is risky to use (or harmfully misleading) for practitioners who want to respect the data provenance of a work.”

Companies developing large language models (LLMs) have been scraping huge amounts of data from websites across the internet. This raises serious copyright issues concerning the ownership of the data used to train LLMs and other large AI models. For example, in December 2023 the New York Times sued OpenAI and Microsoft for copyright infringement. A number of artists have sued Stability AI and other companies, claiming that the Stable Diffusion image-generation tool was trained on billions of copyrighted images scraped from the internet without compensation or consent from the artists. And a number of images in the LAION-5B dataset had to be taken down because they contained problematic pornographic and racist content.

LLMs have also been trained using datasets sourced from data repositories like Hugging Face. The DPI analysis found that over 60% of the dataset licenses in those repositories were either omitted or incorrect, so the data was being used under mistaken assumptions about the original intent of the license.
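To illustrate the kind of check such an audit involves, here is a small, self-contained Python sketch (not the DPI’s actual tooling) that compares the license recorded on a hosting site against the license declared by a dataset’s original creators, flagging omissions and mismatches. The dataset names and license strings are invented for the example.

```python
# A toy license audit (illustrative only): flag datasets whose hosted license
# metadata is missing or disagrees with the license declared at the source.

REPO_METADATA = {  # license as listed on a hosting site (None = omitted)
    "example/news-corpus": "apache-2.0",
    "example/dialogue-set": None,
    "example/code-snippets": "mit",
}

SOURCE_LICENSES = {  # license declared by the original dataset creators
    "example/news-corpus": "cc-by-nc-4.0",  # non-commercial only
    "example/dialogue-set": "cc-by-4.0",
    "example/code-snippets": "mit",
}

def audit(repo_meta, source_meta):
    """Return datasets with omitted licenses and with miscategorized ones."""
    omitted, mismatched = [], []
    for name, repo_license in repo_meta.items():
        source_license = source_meta.get(name)
        if repo_license is None:
            omitted.append(name)
        elif source_license and repo_license != source_license:
            mismatched.append((name, repo_license, source_license))
    return omitted, mismatched

omitted, mismatched = audit(REPO_METADATA, SOURCE_LICENSES)
print("License omitted:", omitted)
print("License mismatch (repo vs. source):", mismatched)
```

In practice the DPI audit traces many more properties (source, creators, license conditions, and subsequent use), but the distinction between omitted and miscategorized licenses is what the figures cited above refer to.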

In order to be legally and ethically compliant, it’s essential that businesses know the source of the data used to train their AI models.

Mahari further explained that the illegal use of datasets to train generative AI models has often taken place under the guise of fair use, a principle in U.S. law that permits limited use of copyrighted material without acquiring permission from the copyright holder. Fair use covers the non-commercial use of copyrighted material for purposes such as education, scholarship, research, criticism, comment, and news reporting. However, fair use is unlikely to apply to the use of datasets for training AI models for commercial purposes.

Earlier this year, Mason’s company, Thoughtworks, published a report on “Modernizing Data with Strategic Purpose,” which nicely explained the critical value of data in the emerging AI era. And at the 2024 MIT IDE Annual Conference, which took place the week of May 20, Longpre and Mahari gave keynotes on Data Provenance for AI and Regulation by Design, respectively.

In conclusion, we don’t just need more data; we need high-quality data that is appropriate for its intended purpose as well as legally and ethically compliant.

This blog first appeared here on June 27, 2024.
