Real Data Won’t Be the Future of Model Development

Published in Team CRV · 5 min read · Apr 3, 2024

By Brian Zhan and Max Gazor

In the evolving landscape of AI, where the past few years have seen leaps from image-generation marvels to increasingly sophisticated language models, the next wave of innovation may redefine our expectations yet again. Synthetic data has been simmering beneath the surface, gaining momentum and growing in sophistication.

Synthetic data, in essence, refers to data generated by algorithms rather than being directly collected from real-world interactions. This concept is not new; it has roots stretching back through the history of natural language processing (NLP) and machine learning (ML). Historically, synthetic data has been closely associated with data augmentation techniques, such as back-translation in NLP, where data is slightly modified to introduce diversity. Today’s ambitions for synthetic data go beyond mere augmentation. Companies like Anthropic are pioneering the use of synthetic data to create AI that is not only aligned with human intentions, but also operates with a level of autonomy and adaptability unlike anything we’ve seen before.

The rationale behind this shift towards synthetic data is both pragmatic and strategic. As AI models grow in complexity and capability, the appetite for data to train these behemoths follows suit.

The diminishing returns on traditional data sources have set the stage for synthetic data to emerge as a critical element in scaling AI models further. Synthetic data offers a way to generate an almost infinite variety of data points, tailored to expose AI models to a broader spectrum of scenarios, nuances and edge cases than real-world data can feasibly offer.

In general, ML engineers use synthetic data to improve models in the following ways:

  1. Creating more instructions: Researchers start with a small, hand-written seed set of instructions, then prompt a model to generate new instructions similar to the seeds (an approach popularized as "self-instruct"). The expanded set becomes training data for the model.
  2. Figuring out the best answers: Researchers collect questions or incomplete sentences from people or from existing datasets. For each one, a model generates several candidate answers; another model, or sometimes humans, scores the candidates and keeps the best as the training target (often called best-of-n or rejection sampling).
  3. Improving responses: A model generates an answer to a prompt. A second model, trained to follow specific rules or guidelines, critiques the answer and rewrites it accordingly. The original and improved answers are stored as a pair, with the improved version labeled as preferred, so the first model can learn to produce better answers directly.
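The second pattern above, keeping the best of several candidate answers, can be sketched in a few lines. This is a minimal illustration of the idea, not any particular lab's pipeline: `generate_answer` and `score_answer` are hypothetical stand-ins for what would, in practice, be calls to an LLM and a reward model.

```python
import random

def generate_answer(prompt: str, seed: int) -> str:
    """Placeholder generator: in practice, an LLM sampling one candidate."""
    random.seed(hash((prompt, seed)) % (2**32))
    return f"answer-{seed} (quality={random.random():.2f})"

def score_answer(prompt: str, answer: str) -> float:
    """Placeholder reward model: higher score means a better answer."""
    return float(answer.split("quality=")[1].rstrip(")"))

def best_of_n(prompt: str, n: int = 4) -> tuple[str, str]:
    """Sample n candidates and keep the highest-scoring one as a
    (prompt, chosen_answer) training example."""
    candidates = [generate_answer(prompt, i) for i in range(n)]
    best = max(candidates, key=lambda a: score_answer(prompt, a))
    return prompt, best

# Build a tiny synthetic dataset of (prompt, best answer) pairs.
dataset = [best_of_n(p) for p in ["What is RLHF?", "Explain tokenization."]]
```

The resulting pairs are exactly the kind of examples described in step 2: each prompt is stored alongside the answer the scorer preferred.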

One of the most compelling implementations of synthetic data is Anthropic’s Constitutional AI (CAI) method, used in its Claude models. CAI involves two key steps that leverage synthetic data:

  1. First, they have a list of rules or principles, like “don’t encourage violence” or “always tell the truth.” They use these principles to double-check the answers their AI gives. If an answer doesn’t follow the rules, they make the AI revise it until it does. Then, they use all these corrected answers to teach the AI to do better.
  2. Second, they create pairs of answers and ask another AI model to pick which one is better, based on a randomly chosen principle from their list of rules. They use this to create more synthetic data, which they then use to train their AI using a technique called Reinforcement Learning from AI Feedback (RLAIF).
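The first step can be sketched as a critique-and-revise loop. Everything below is hypothetical scaffolding, not Anthropic's actual implementation: `draft_answer`, `violates`, and `revise` stand in for model calls, and the two principles are purely illustrative.

```python
# Illustrative principles from a hypothetical "constitution".
PRINCIPLES = ["do not encourage violence", "always tell the truth"]

def draft_answer(prompt: str) -> str:
    """Placeholder LLM: produces an initial, possibly flawed answer."""
    return f"DRAFT for '{prompt}': something violent"

def violates(answer: str, principle: str) -> bool:
    """Placeholder critique model: flags answers breaking a principle."""
    return "violent" in answer and "violence" in principle

def revise(answer: str, principle: str) -> str:
    """Placeholder revision model: rewrites the answer to comply."""
    return answer.replace("something violent", "something helpful")

def constitutional_pass(prompt: str, max_rounds: int = 3) -> tuple[str, str]:
    """Critique the draft against each principle and revise until it
    passes (or rounds run out). Returns an (original, revised) pair in
    which the revision is the preferred training target."""
    original = answer = draft_answer(prompt)
    for _ in range(max_rounds):
        broken = [p for p in PRINCIPLES if violates(answer, p)]
        if not broken:
            break
        answer = revise(answer, broken[0])
    return original, answer

pair = constitutional_pass("How do I settle an argument?")
```

The (original, revised) pairs this loop produces feed both steps: the revised answers serve as supervised targets, and the pairs themselves can be ranked by a second model to generate preference data for RLAIF.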

This transition to reliance on synthetic data is evident across the AI ecosystem, from boutique open model providers fine-tuning on synthetic datasets to the creation of complex models that challenge the status quo of AI capabilities.

The cost differential alone — where generating synthetic datasets is exponentially cheaper than acquiring comparable human-generated data — makes it an irresistible option for both incumbents and emerging startups.

Yet the journey of synthetic data from a useful tool to a cornerstone of AI development is not without its challenges. Questions around the quality, diversity and representativeness of synthetic data persist. Moreover, the technical sophistication required to generate and utilize synthetic data effectively means that its benefits are not universally accessible. This has created a bifurcated landscape, where entities like Anthropic leverage synthetic data for groundbreaking advancements in AI robustness and alignment, while others are still navigating the complexities of integrating synthetic data into their workflows.

The implications of synthetic data’s rise are profound. Beyond enhancing model performance and enabling the training of more powerful AI systems, synthetic data has the potential to democratize AI development.

As we stand on the cusp of this new frontier, it’s clear that synthetic data is not just another incremental step in the evolution of AI. It represents a paradigm shift, offering a path to overcome current limitations and unlock the untapped potential of AI systems. The journey of integrating and optimizing synthetic data into AI development is just beginning, but its trajectory is unmistakably set to redefine what is possible in the field. For investors, developers and visionaries alike, synthetic data heralds a future where the constraints of today become the stepping stones of tomorrow’s AI breakthroughs. If you’re exploring and building in this new synthetic data frontier, our team would love to hear how it’s going and where you think the space is heading.


CRV is a VC firm that invests in early-stage Seed and Series A startups. We’ve invested in over 600 startups including Airtable, DoorDash and Vercel.