LLMs are running out of data: what BigTech is doing, synthetic data anyone?
It is no secret that large language models (like ChatGPT) consume huge amounts of data of various kinds.
To put this in perspective, the training set of GPT-3 (175B parameters) includes the whole of English Wikipedia (22 GB) as just a fraction of the total 45 TB of data: well under 0.1% by size.
At the Sohn conference in May 2023, Sam Altman (OpenAI) explicitly said that AI will run out of (publicly available) data.
Given this trend of data consumption, the natural solution was, and still is, to generate data, i.e. synthetic data.
Not to brag, but back in 2018 I already wrote about the rise of synthetic data and why it was something we would hear more about in the future.
As was my case in 2018: I had clients who did not have data of the necessary quality, yet we needed to deliver a functioning AI model to them.
So we first developed an AI model to generate the data, then used that data to develop the anomaly detection model for the clients (a minimal sketch of this pattern follows below).
Necessity really is the mother of invention.
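To make that pattern concrete, here is a minimal, hypothetical sketch in Python. It assumes a tabular use case and stands in scikit-learn's GaussianMixture for the generator and IsolationForest for the detector; the data, models, and parameters are illustrative, not the original client project's code.

```python
# Minimal sketch of the pattern above: learn the distribution of the
# scarce real data, sample a larger synthetic training set, then train
# an anomaly detector on it. All data and parameters are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in for the client's small set of "normal" observations.
real_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 5))

# Step 1: fit a generative model on the real data.
generator = GaussianMixture(n_components=3, random_state=0).fit(real_normal)

# Step 2: sample a much larger synthetic training set.
synthetic, _ = generator.sample(10_000)

# Step 3: train the anomaly detector on the synthetic data.
detector = IsolationForest(contamination=0.01, random_state=0).fit(synthetic)

# At inference time, -1 flags an anomaly, +1 a normal point.
far_off_point = rng.normal(loc=6.0, scale=1.0, size=(1, 5))
print(detector.predict(far_off_point))  # expected: [-1]
```

Any generative model can play step 1; the point is the three-step structure: learn the distribution, sample at scale, train the downstream model.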
But back to what BigTechs are doing today.
Their first move was to loosen privacy policies to allow the use of user-generated content in training their models.
For example, Google can now use the text inside users' Google Docs or YouTube videos to train its models.
Meta does not have this privilege, and it is reportedly trying to buy a large publisher to get access to long-form texts.
All these attempts are clearly temporary fixes.
If improving performance requires a greater quantity of data, not higher quality, then we need to generate more data.
But how?
What BigTechs and AI labs are doing now is simply using LLMs to generate more text, videos, or images.
In plain English, the two main techniques at play are:
- adversarial models: one model produces the text (or image) while an adversary model judges/corrects it (see the sketch after this list);
- models ‘somehow’ supervised by humans when the output is wrong (technically called Reinforcement Learning from Human Feedback, RLHF).
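As an illustration of the first technique, here is a toy generate-then-judge loop in Python. Both generate() and judge() are hypothetical stand-ins I made up for this sketch; in a real pipeline each would wrap a call to an LLM, with the judge often being a separate, stronger model.

```python
# Toy sketch of a generate-then-judge loop for producing synthetic
# training text. generate() and judge() are hypothetical stand-ins:
# in practice each would wrap an LLM API call.
import random

CANDIDATES = [
    "Water boils at 100 C at sea level.",
    "The moon is made of cheese.",
    "English Wikipedia is roughly 22 GB of text.",
]

def generate() -> str:
    """Stand-in for an LLM producing a candidate training example."""
    return random.choice(CANDIDATES)

def judge(text: str) -> float:
    """Stand-in for an adversary/judge model scoring quality in [0, 1]."""
    return 0.1 if "cheese" in text else 0.9

def build_synthetic_corpus(n_samples: int, threshold: float = 0.5) -> list[str]:
    """Keep only the generations the judge scores above the threshold."""
    corpus: list[str] = []
    while len(corpus) < n_samples:
        candidate = generate()
        if judge(candidate) >= threshold:
            corpus.append(candidate)
    return corpus

print(build_synthetic_corpus(5))
```

Note the design assumption baked into the loop: the judge must reliably tell good generations from bad ones, which is exactly the open question below.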
These techniques will definitely work… to an extent. But nobody knows exactly ‘how far’.
Can an AI judge an AI that generates data?
If yes, we are going ‘far’.
If not, we are going ‘less far’.
While I am an enthusiastic proponent of synthetic data in many, many AI applications (cybersecurity above all), I am not entirely convinced you can build generalist models just by ingesting huge amounts of unverified synthetic data.
As I said before, the ‘quantity approach’ to improving AI performance may reach its limits soon, if it has not already. The ‘quality’ of data or models may drive the next wave of performance gains.
#ai #artificialintelligence #business #technology #data #innovation