What are we going to use to train algorithms with?

Enrique Dans
4 min read · Jul 20, 2023

IMAGE: Little robots in a classroom being trained (credit: Alexandra Koch, Pixabay)

Explaining the factors involved in obtaining data to train machine learning algorithms can be intricate, but it is also highly fascinating. Before the launch of Dall·E, the first image-generative algorithm, in January 2021, the companies involved in its development basically did what they wanted, in a sort of Wild West with no apparent legislation or frontiers.

Given that web scraping is, in principle, a legal practice (anyone can copy the content of publicly accessible pages), these companies harvested huge collections of tagged images and texts that they considered reasonably correct, and fed them into the databases they needed to train their products. The legal precedents were confusing: LinkedIn had lost several cases trying to stop other companies from scraping its network data, Facebook had won against Power Ventures, and Clearview's activities had prompted widespread condemnation. Nevertheless, the prevailing idea, although subject to judges' interpretation, was that web scraping was a tool, not a crime, and as with any tool, there were reasonable and unreasonable uses.

Then companies like OpenAI broke into databases like Getty Images, getting their hands on millions of tagged images. All of them carried a “Getty Images” watermark that could only be removed by paying for the use of the photo, but it…


Enrique Dans

Professor of Innovation at IE Business School and blogger (in English here and in Spanish at enriquedans.com)