The great AI gold rush is on, but is this what we really want?

Enrique Dans
3 min read · May 13, 2024


IMAGE: A diagram of the data sources OpenAI used to train GPT-3, spanning scanned books, Wikipedia, and the so-called Common Crawl, a huge database of text collected from the internet

The best weekend reading for those passionate about innovation and technology may be this article in The New York Times, “How tech giants cut corners to harvest data for AI”, a follow-up to one it published a month ago, “Four takeaways on the race to amass data for AI”, from which I have taken the accompanying diagram.

With Big Tech locked in a gold rush to acquire as much data as possible, as quickly as possible, it’s worth asking to what extent this amounts to the privatization of data: does it really make sense to have a range of companies competing for training information, or should they all have open access to shared databases?

The accompanying diagram illustrates the question spectacularly: Wikipedia, with all its millions of articles, is the tiny rectangle in the upper right corner. The bulk of the set is the database compiled by Common Crawl, a nonprofit organization that has been archiving web content with monthly updates since 2008. Around 46% of its content is in English, followed by Russian (6.03%), German (5.4%), Japanese (5.15%), Chinese (5.07%), Spanish (4.53%), French (4.39%), with many other languages already below 3%. In fact, the criticism that developing models in languages other than English is some kind of chauvinism or parochialism is unfounded: it is very important to…


Enrique Dans

Professor of Innovation at IE Business School and blogger (in English here and in Spanish at enriquedans.com)