Greed, short-termism, and algorithm training: the evolving AI data landscape

Enrique Dans

IMAGE: An illustration of a greedy-looking person guarding and restricting access to a treasure trove of data that could potentially be used for algorithm training

The Data Provenance Initiative, a global collective of volunteer artificial intelligence researchers, has published a thought-provoking academic study titled “Consent in crisis: the rapid decline of the AI data commons.” The research reveals a startling trend: across the thousands of domains collected from the major web data repositories used to train generative algorithms, 5% of all data and a staggering 25% of data from high-quality sources have already been restricted through specific clauses preventing their use.

An extensive audit of over 14,000 web domains shows that consent preferences for the use of such data are becoming increasingly closed and restrictive. The study finds a strong proliferation of clauses that specifically address use for algorithm training, marked differences in the restrictions applied to different artificial intelligence developers, and general inconsistencies between the intentions websites express in their terms of service and those in their robots.txt files. This shift signals a significant change in how data owners perceive the value and potential uses of their online content.
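To make the robots.txt side of this concrete, here is a minimal sketch of how a site's file can treat different crawlers differently. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `crawler_permissions` helper, and the example URL are all hypothetical and invented for illustration; they are not taken from the study. The user agent names are only examples of crawlers a site might single out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only:
# one crawler is blocked entirely, everyone else is allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_permissions(robots_txt, user_agents, url):
    """Return a dict mapping each user agent to whether it may fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


# Example crawler names and URL; assumptions made for this sketch.
permissions = crawler_permissions(
    SAMPLE_ROBOTS_TXT,
    ["GPTBot", "CCBot", "Googlebot"],
    "https://example.com/articles/post.html",
)

for agent, allowed in permissions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample file, the first crawler is blocked while the other two fall through to the permissive wildcard rule. A check like this only reads robots.txt; it says nothing about what the site's terms of service state, which is exactly where the inconsistencies described above can arise.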

I’ve observed such restrictions affecting my own content, particularly that licensed from third parties. These licensors are now asking me to specify in my licensing terms — which in…

Enrique Dans

Professor of Innovation at IE Business School and blogger (in English here and in Spanish at enriquedans.com)