Part 1 — How Many Cat Pictures? Does AI Really Need Big Data?

Freedom Preetham · Published in Autonomous Agents · Jun 13, 2024

In the realm of artificial intelligence, there has been a longstanding belief that big data is essential for effective learning and model development. This notion has permeated the discussions of applied AI engineers, investors, and data scientists, often leading to an echo chamber effect where “AI needs big data” is chanted with conviction. However, here in 2024, it’s crucial to reassess this belief and recognize that advancements in AI research and techniques have shifted this paradigm.

This is a two-part series (for now)

Starting with a Hammer in Hand?!

When mathematicians or physicists approach a problem, they don’t start by questioning whether there is enough data to learn about the underlying phenomenon. Instead, they adopt a more structured approach that involves:

  1. Strictly defining the phenomenon in terms of a desired state.
  2. Identifying the inputs and transitions that lead to this desired state.
  3. Establishing the governing function that approximates a solution for the transition from one state to another.
  4. Capturing constraints in the inputs and outputs as a Lagrangian.

You then build a model that satisfies these conditions by respecting the constraints in the available data. Missing values can be imputed by capturing the underlying dynamics and polynomial structure from a small sample of data with the desired variance. In other words, you spend time understanding the behavior and constraints of the problem and then “invent” an appropriate model to solve it.

Conversely, applied AI engineers seem to start with a model in hand and then try to satisfy the needs of the model! I am not exaggerating this. In nearly every panel, keynote, investor conversation, and meeting I am invited to, everyone from Heads of AI engineering to investors to fresh interns starts with “LLMs” in hand and mindlessly chants “We need more data, we need more data.” It is quite perplexing to see how the majority is in sync and part of this echo chamber.

We put people on the moon in 1969 without LLMs for God’s sake! Come on folks!

The Essence of Capturing the Governance Function

To better understand this shift, let’s delve into the mathematical underpinnings of the approach mathematicians take. Suppose we have a system characterized by states x ∈ R^n and inputs u ∈ R^m. The desired state can be defined by a state vector x_d. The transition from one state to another can be described by a function f:

$$\dot{x}(t) = f\big(x(t), u(t)\big)$$

The goal is to find an input u(t) that drives the system from an initial state x_0 to the desired state x_d. This can be formulated as an optimization problem where we minimize a cost function J:

$$J = \int_{0}^{T} L\big(x(t), u(t)\big)\, dt$$

subject to the system dynamics:

$$\dot{x}(t) = f\big(x(t), u(t)\big)$$

and boundary conditions:

$$x(0) = x_0, \qquad x(T) = x_d$$

The Lagrangian L encapsulates the constraints and objectives, often taking the form:

$$L(x, u) = (x - x_d)^\top Q\,(x - x_d) + u^\top R\, u$$

where Q and R are weight matrices that balance the state error and control effort.
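To make this concrete, here is a minimal sketch of how such a formulation can be solved in the linear-quadratic case, where the dynamics are linear and the Lagrangian is exactly the quadratic form above. The double-integrator system, the weights Q and R, and the initial state are illustrative assumptions, not prescriptions from the article.

```python
# A minimal sketch of the "governance function" idea: for linear dynamics
# x_dot = A x + B u and the quadratic Lagrangian above, the optimal control
# u = -K (x - x_d) has a closed form (LQR). The double-integrator system
# and weights below are toy assumptions for illustration.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])   # double integrator: state = (position, velocity)
B = np.array([[0.0],
              [1.0]])
Q = np.diag([10.0, 1.0])     # weight on state error
R = np.array([[0.1]])        # weight on control effort

P = solve_continuous_are(A, B, Q, R)   # solve the continuous Riccati equation
K = np.linalg.inv(R) @ B.T @ P         # optimal feedback gain

x0 = np.array([1.0, 0.0])              # initial state x_0
xd = np.array([0.0, 0.0])              # desired state x_d
u = -K @ (x0 - xd)                     # control input driving x_0 toward x_d
print("feedback gain K:", K, "control u:", u)
```

Note that no dataset appears anywhere in this sketch; the model follows entirely from the stated dynamics and constraints.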

Relevance to AI and Data Requirements

The nature of the desired state and the problem context should dictate the type of data required. Not the other way around.

For example, in supervised learning, the probability distribution of the input set should contain the highest variance in the minimal amount of data that captures the essence of the state (e.g., a diverse set of cat pictures). In contrast, unsupervised learning focuses on exposing the contours and topography of the data landscape, making use of whatever data is available.

Supervised Learning

In supervised learning, the variance in the input set plays a crucial role. Suppose we have a set of input data

$$X = \{x_1, x_2, \ldots, x_N\}$$

and corresponding labels

$$Y = \{y_1, y_2, \ldots, y_N\}$$

The goal is to find a mapping f such that:

$$y_i = f(x_i) + \epsilon_i$$

where ϵ_i is the noise term. The model f is typically learned by minimizing a loss function L:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big)$$

where θ are the parameters of the model and ℓ is a loss function such as mean squared error (MSE):

$$\ell\big(f(x_i; \theta),\, y_i\big) = \big(f(x_i; \theta) - y_i\big)^2$$

The effectiveness of the learning process depends on the variance and distribution of the x_i and NOT on the number of samples. For example, if the input data consists of cat pictures, the dataset should include various breeds, poses, lighting conditions, and backgrounds to ensure the model captures the essence of what constitutes a “cat.” Today, this can be just a stratified sample.

It does not have to be the tens of thousands of cat pictures that AlexNet needed in the past. Today’s models, such as OpenAI’s CLIP for zero- and few-shot recognition or meta-learning approaches like Model-Agnostic Meta-Learning (MAML), can generalize to the concept of a cat from just 5 or 10 examples. These advanced models leverage transfer learning, self-supervised learning, and meta-learning techniques to generalize from minimal data effectively.
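As a rough illustration of how little task-specific data this requires, here is a sketch of zero-shot cat recognition using a pretrained CLIP model through the Hugging Face transformers interface; the image path and the label prompts are placeholder assumptions, and no cat-specific training data is used at all.

```python
# Sketch: zero-shot image classification with a pretrained CLIP model,
# assuming the Hugging Face `transformers` CLIP interface. The image path
# and label set are placeholders for illustration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("cat_example.jpg")   # hypothetical sample image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # image-text similarity -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```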

Unsupervised Learning

In unsupervised learning, the focus is on understanding the underlying structure of the data. Given a dataset x_i, the objective is to find a representation z_i = g(x_i) that reveals the data’s structure. Simpler techniques such as clustering, principal component analysis (PCA), and autoencoders are commonly used.

For example, consider clustering with K-means, which aims to partition the data into K clusters by minimizing the within-cluster variance:

$$\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where μ_k is the centroid of cluster C_k. The algorithm iteratively updates the cluster assignments and centroids until convergence. The resulting clusters expose the data’s topography, highlighting distinct groups within the dataset.
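A minimal sketch of this objective with scikit-learn, using synthetic blob data purely for illustration:

```python
# Minimal K-means sketch with scikit-learn on synthetic 2-D data;
# the blob parameters are arbitrary and only illustrate the objective above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids (mu_k):", kmeans.cluster_centers_)
print("within-cluster sum of squares:", kmeans.inertia_)
```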

Advanced Techniques with Minimal Data

Recent advancements in AI have introduced techniques that require minimal data while still achieving high performance. Here are a few examples:

Transfer Learning

Transfer learning involves taking a pre-trained model on a large dataset and fine-tuning it on a smaller, task-specific dataset. This approach leverages the knowledge already captured in the pre-trained model, significantly reducing the amount of data needed for effective learning.

Mathematically, let θ_0 be the parameters of the pre-trained model and θ the parameters after fine-tuning. The objective is to minimize a new loss function L on the smaller dataset:

$$\min_{\theta} \; \frac{1}{M} \sum_{i=1}^{M} \ell\big(f(x_i; \theta),\, y_i\big), \qquad \theta \text{ initialized at } \theta_0$$

where M is the number of samples in the smaller dataset. The initial parameters θ_0​ serve as a starting point, allowing the model to adapt quickly with fewer data samples.

For example, a convolutional neural network (CNN) pre-trained on ImageNet can be fine-tuned to recognize cat breeds with just a few labeled images. The pre-trained model has already learned general features such as edges, textures, and shapes, which are useful for many vision tasks.
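Here is a sketch of that pattern in PyTorch: an ImageNet-pretrained ResNet-18 is frozen and only a new classification head is trained. The number of breeds and the dummy batch are assumptions standing in for a small labeled dataset, and the torchvision weights API assumes a recent release.

```python
# Sketch: fine-tuning an ImageNet-pretrained ResNet on a small, task-specific
# dataset (e.g., a handful of labeled cat-breed images). The dataset is faked
# with a random batch; only the transfer-learning pattern is the point.
import torch
import torch.nn as nn
from torchvision import models

num_breeds = 5  # assumed number of cat breeds in the small dataset

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # theta_0
for param in model.parameters():
    param.requires_grad = False                      # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, num_breeds)  # new task-specific head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# one illustrative training step on a dummy batch standing in for the small dataset
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_breeds, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("fine-tuning loss:", loss.item())
```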

Few-Shot Learning

Few-shot learning aims to train models that can generalize well from a very small number of examples. This is often achieved through meta-learning, where the model is trained on a variety of tasks so that it can quickly adapt to new tasks with minimal data.

One popular approach is the Model-Agnostic Meta-Learning (MAML) algorithm. The MAML framework seeks to find a set of model parameters θ that can be adapted to new tasks using only a few gradient updates. The meta-objective is:

$$\min_{\theta} \sum_{T_i \sim p(T)} L_{T_i}\big(\theta - \alpha \nabla_{\theta} L_{T_i}(\theta)\big)$$

where T_i​ represents a task sampled from the task distribution p(T), and α is the step size for the gradient update. The inner term represents the adaptation to a specific task, and the outer term aggregates the loss over multiple tasks.
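The following is a toy sketch of the MAML update on one-dimensional regression tasks (each task has a different slope). The model, task distribution, and step sizes are arbitrary choices made for illustration, and the support and query sets are collapsed for brevity.

```python
# Sketch of the MAML meta-update on toy 1-D regression tasks (y = slope * x,
# with a different slope per task). The model is a single learnable weight.
import torch

theta = torch.zeros(1, requires_grad=True)   # meta-parameters
alpha, meta_lr = 0.1, 0.01                   # inner and outer step sizes
meta_opt = torch.optim.SGD([theta], lr=meta_lr)

def task_loss(w, slope):
    x = torch.linspace(-1, 1, 10).unsqueeze(1)
    y = slope * x                            # few-shot data for this task
    return ((x * w - y) ** 2).mean()

for step in range(100):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for slope in torch.empty(4).uniform_(-2, 2):       # sample tasks T_i ~ p(T)
        inner_loss = task_loss(theta, slope)
        (grad,) = torch.autograd.grad(inner_loss, theta, create_graph=True)
        adapted = theta - alpha * grad                  # inner adaptation step
        meta_loss = meta_loss + task_loss(adapted, slope)  # outer objective
    meta_loss.backward()                                # backprop through adaptation
    meta_opt.step()

print("meta-learned theta:", theta.item())
```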

Self-Supervised Learning

Self-supervised learning utilizes unlabeled data to create labels through inherent data structures. This method significantly reduces the dependency on labeled data. Techniques such as contrastive learning, where the model learns to differentiate between similar and dissimilar pairs of data, have shown great promise.

For instance, in natural language processing, models like GPT-3 are trained using vast amounts of text without explicit labels. The model learns to predict the next word in a sentence, effectively learning language patterns and structures. Formally, given a sequence of words w_1, w_2, …, w_N, the objective is to maximize the likelihood:

$$\max_{\theta} \sum_{i=1}^{N} \log P\big(w_i \mid w_1, \ldots, w_{i-1}; \theta\big)$$
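As a small sketch of this objective, the snippet below uses a pretrained GPT-2 from Hugging Face transformers to compute the average next-token negative log-likelihood of a sentence; the model choice and sentence are incidental assumptions.

```python
# Sketch: the next-token (self-supervised) objective, using a pretrained GPT-2
# as a stand-in. Passing labels=input_ids makes the model compute the average
# negative log-likelihood of each next word.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "The cat sat on the mat"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)   # labels are shifted internally
print("negative log-likelihood per token:", outputs.loss.item())
```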

Data Augmentation and Synthetic Data

Data augmentation techniques artificially increase the size of the training dataset by applying transformations such as rotations, translations, and scaling to the existing data. This helps in creating more diverse training examples without the need for additional labeled data.

Mathematically, let x_i be an original data point and T a transformation. The augmented data x̃_i is generated as:

$$\tilde{x}_i = T(x_i)$$
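A short sketch of this with torchvision transforms; the specific transformations and their parameters are arbitrary examples of T:

```python
# Sketch: generating augmented samples x_tilde = T(x) with torchvision
# transforms; a random tensor stands in for an original image x_i.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                       # small rotations
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),    # crop and rescale
    transforms.RandomHorizontalFlip(p=0.5),                      # mirror
])

x = torch.rand(3, 256, 256)     # stand-in for an original image tensor x_i
x_tilde = augment(x)            # one augmented view T(x_i)
print(x_tilde.shape)
```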

Synthetic data generation involves creating artificial data that mimics real-world data. This can be particularly useful in scenarios where collecting real data is expensive or impractical. Techniques like Generative Adversarial Networks (GANs) can be used to generate realistic images, text, or even tabular data.

In GANs, a generator G learns to create data samples x̃ = G(z) from a noise distribution z ∼ p_z, while a discriminator D tries to distinguish between real data x and generated data x̃. The objective is a minimax game:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

By iteratively updating G and D, the generator learns to produce realistic data samples that can be used to augment the training dataset.
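Here is a compact sketch of one such update in PyTorch on one-dimensional toy data; the network sizes and data distribution are arbitrary, and the generator uses the common non-saturating variant of its loss.

```python
# Minimal sketch of one GAN update on 1-D toy data: D maximizes
# log D(x) + log(1 - D(G(z))); G is trained to make D label fakes as real.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(64, 1) * 0.5 + 2.0     # "real" data drawn from N(2, 0.5)
z = torch.randn(64, 8)                    # noise z ~ p_z

# discriminator step: push real -> 1, fake -> 0
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# generator step: try to make D label generated samples as real
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
print("d_loss:", d_loss.item(), "g_loss:", g_loss.item())
```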

Conclusion

The belief that AI needs vast amounts of data is evolving. Modern approaches emphasize the quality, variance, and structure of the data rather than sheer volume. By adopting more sophisticated mathematical and computational techniques, we can achieve remarkable results with less data, aligning AI research more closely with the methods used in physics and mathematics.

This is a two-part series (for now)
