AI companies are desperate for data and they’ll go to any length to find it
It’s important to have an idea of the scale of the data that companies working on generative artificial intelligence algorithms need, and some recent articles I’ve come across can really help.
The Verge has this piece, “OpenAI transcribed over a million hours of YouTube videos to train GPT-4”, which gives some insight into the level of desperation involved in obtaining data now that just about anything of value on the internet has already been swept into datasets: OpenAI went ahead even though doing so broke YouTube’s rules.
Given the rush to offer more features, the companies involved are taking the “better to ask for forgiveness than permission” route: if they get caught, they’ll cut a deal or pay the fine once their generative algorithms are well trained.
Getting data is now such a priority that pretty much anything goes, as this article in The New York Times, “Four takeaways on the race to amass data for A.I.”, explains. It includes a visual representation of all the data used to train GPT-3, illustrating the magnitude of what crawlers have harvested from across the internet since 2007: about 410 billion tokens, compared with the 3 billion tokens represented by the entirety of Wikipedia. Book scanning, meanwhile, involves a pair of collections of 12 billion…