Generative AI: feeling overwhelmed? Here’s a recap

Blanche Ajarrista · Published in Venture Beyond · May 9, 2023

Photo by Google DeepMind on Unsplash

I like to understand how things work. Throughout my readings of Generative AI, I’ve not found anything that explains in plain language how Generative AI works, how we got there, and what might await. So, I thought I’d write it myself (with some help of course!). Welcome to Generative AI 101.

I’ve designed this to be a two-part series.

  • Part 1, Intro to Generative AI, is a basic introduction to the technology and how it came about. I introduce certain concepts relevant to the models themselves, the different types of models out there, and how we got to where we are today.
  • Part 2, The Future of Generative AI, will go deeper into the broader implications, looking at them through the lens of a VC. What might the world look like when everyone uses AI? Which companies will pull ahead, and who will be left behind? Will we come to regret these new advances? Will anything really change?

It feels as if we are at a tipping point and the flywheel is starting to spin. This is no longer on the fringes, a tool reserved for research labs. Generative AI is now a widely usable, everyday tool. We can thank OpenAI for productising an imperfect product, putting it in the hands of millions of people, and iterating fast on both user feedback and product features.

So what is Generative AI?

Generative AI is the latest evolution in a branch of AI, deep neural networks, which leverages a new learning technique pioneered by Google: the transformer architecture. This new architecture, released in 2017, has revolutionised machine learning. Where AI used to be reserved for simple use cases with clearly labelled data, Generative AI is taking on tasks that we assumed would always be reserved for humans.

Generative models have hundreds of billions of parameters and are trained on vast amounts of data (essentially the whole of the internet). They can be used to create text, images, videos, and audio in highly creative ways across thousands of different use cases. Creativity, the ability to synthesise: these are no longer ours alone.

Generative AI: A new kind of machine learning model

To understand Generative AI, we must understand the basics and shortcomings of ‘traditional’ machine learning models and the concept of inference, i.e. the process of using a trained model to produce outputs from new, unseen inputs.

In conventional machine learning, models are trained under supervision. During a supervised training phase, models are exposed to massive sets of labelled data and learn by comparing their predictions against the correct labels and adjusting accordingly. Once a model has been trained, it can be used to make predictions on previously unseen inputs. A model trained on pictures of cats can accurately label an image of a Siamese as a cat.
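
To make this concrete, here is a minimal sketch of supervised learning in Python (using scikit-learn purely for illustration; the feature vectors and labels are invented, and a real image classifier would learn from pixels rather than two hand-picked numbers):

```python
# A minimal sketch of 'traditional' supervised learning, using scikit-learn.
# The feature vectors and labels below are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Labelled training data: each row is a (made-up) image feature vector,
# and each label says whether that image is a cat (1) or not (0).
X_train = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
y_train = [1, 1, 0, 0]

model = LogisticRegression()
model.fit(X_train, y_train)          # supervised training on labelled examples

# Inference: the trained model labels a previously unseen input.
print(model.predict([[0.85, 0.2]]))  # expected output: [1], i.e. "cat"
```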

This approach works well for many applications, such as image classification and speech recognition, where the input data can be clearly defined and labelled. However, these models come up short when it comes to tasks that involve scaling beyond labelled data and generating new content. We could not ask a conventional model to write a poem about cats or paint a cat in the style of Picasso.

Prompt: Painting of a siamese cat in picasso’s cubist style. Cat is wearing glasses and painting a picture of a dog

In contrast, Generative AI models can infer from an existing knowledge base of data and prompts to generate new content across multiple mediums, including text, images, video, and audio. They do so using a new architecture which allows models to learn differently, establish patterns and relationships in the data, and infer from unseen examples.

Faster & better: Using the transformer architecture

Despite progress in the field of computer vision, there was limited innovation in machine learning when it came to text, as models struggled with context and word order. That is, until researchers at Google, wanting to solve the problem of translating text, built a new model able to take the entire context of a document into account at all times. Google’s model used a transformer architecture, which breaks text down into tokens and relates every token to every other through a mechanism called self-attention, while parallelised computation speeds up the processing of large amounts of data.

In language processing, a token is a sequence of characters that represents a single unit of meaning, such as a word or a punctuation mark. The transformer model breaks down written text into individual tokens, which it then uses to learn patterns and connections between different words and phrases. By breaking down text into tokens, the transformer model can better understand the structure and meaning of the text. The same can be done with images.
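
As a rough illustration, here is a toy word-level tokenizer in Python. It is not how production systems tokenize (they typically use subword schemes such as byte-pair encoding, which split rare words into smaller pieces), but it shows the basic idea of turning text into a sequence of token IDs:

```python
import re

# A toy word-level tokenizer, for illustration only.
def tokenize(text: str) -> list[str]:
    # Grab runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The cat sat on the mat.")
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

# Models operate on numbers rather than strings, so each unique token is
# mapped to an integer ID via a vocabulary.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]
print(token_ids)
```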

Parallelisation here is important because processing these tokens sequentially can take up a lot of time and computational resources. Parallelisation simply means performing multiple calculations at the same time. By doing this, models can process large amounts of data faster and more efficiently.
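
A small NumPy sketch of the idea, with made-up token vectors: scoring every pair of tokens one at a time is slow, whereas a single matrix multiplication computes all the pairs at once, which is exactly the kind of operation GPUs are built to parallelise:

```python
import numpy as np

# Illustration of sequential vs parallel computation. Suppose each of 6 tokens
# has a 4-dimensional vector (random placeholders here) and we want a score
# for every pair of tokens.
embeddings = np.random.rand(6, 4)

# Sequential: visit the token pairs one at a time.
scores_loop = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        scores_loop[i, j] = embeddings[i] @ embeddings[j]

# Parallel: one matrix multiplication computes all pairs at once. This is the
# kind of operation that GPUs and TPUs execute extremely efficiently.
scores_matrix = embeddings @ embeddings.T

assert np.allclose(scores_loop, scores_matrix)
```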

Suddenly, models were able to understand the structure and meaning of written text in a more flexible and sophisticated way. This enabled models to learn new tasks and understand new words and phrases with minimal training, stepping away completely from the large amounts of labelled data required by traditional machine learning.

Training the models: How generative models learn

Once built, these models still need to be trained. Just as conventional models are exposed to labelled datasets and learn on a supervised basis, generative models need to be exposed to data. The key differentiator here is that generative models, using the transformer architecture, can learn in an entirely unsupervised way.

Given that the transformer architecture can learn to recognise patterns in written text more effectively than traditional rule-based approaches, models can be trained on unstructured (text, video, sound) and unlabelled data. These models are trained using self-supervised learning and can then generalise to new tasks through one-shot or zero-shot learning, concepts which emerged in the field of computer vision.
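
To show what ‘self-supervised’ means in practice, here is a toy sketch of next-token prediction, where the training targets come from the raw text itself rather than from human labelling:

```python
# Self-supervised learning needs no human labels: the target for each position
# is simply the next token in the raw text itself. Toy example only.
text_tokens = ["the", "cat", "sat", "on", "the", "mat"]

training_pairs = [
    (text_tokens[:i], text_tokens[i])   # (context so far, next token to predict)
    for i in range(1, len(text_tokens))
]

for context, target in training_pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... and so on
```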

With one-shot learning, a model is trained on a small number of examples and is expected to generalise to new, unseen examples. In the context of language processing, this means that the model can understand the meaning of a new word or phrase after seeing it only once or a few times. We no longer need massive datasets, but we still need some.

Zero-shot learning takes this idea a step further. It refers to the ability of a machine learning model to perform a new task without any training examples for that task. In the context of language processing, this means that the model can understand the meaning of a new word or phrase even if it has never seen that word or phrase before. Here, a model can be trained on analogous but not directly relevant data.
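
In day-to-day use of large language models, one-shot and zero-shot behaviour often shows up at prompt time. The sketch below uses a hypothetical generate function as a stand-in for any text-completion API, and the sentiment task is purely illustrative:

```python
# Sketch of zero-shot vs one-shot prompting. `generate` is a hypothetical
# stand-in for any text-completion API: it takes a prompt string and returns
# the model's continuation.

def classify_sentiment_zero_shot(review: str, generate) -> str:
    # Zero-shot: the task is described in plain language, with no examples;
    # the model relies entirely on what it learned during pre-training.
    prompt = (
        "Classify the sentiment of the following review as Positive or Negative.\n"
        f"Review: {review}\n"
        "Sentiment:"
    )
    return generate(prompt).strip()

def classify_sentiment_one_shot(review: str, generate) -> str:
    # One-shot: a single worked example is shown before the real input.
    prompt = (
        "Review: The film was a delight from start to finish.\n"
        "Sentiment: Positive\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )
    return generate(prompt).strip()
```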

Ultimately, transformer models with zero-shot capabilities can draw on any knowledge from text or images encoded into tokens and create any number of things based on this logic, including art, code, video games, poetry, medical reports, chess moves…

Data: Confidently wrong

Despite moving away from labelled data, data remains a central component of these models. Large language models have billions of parameters and are trained on enormous volumes of text: they need to learn the patterns and structures of an entire language in order to generate coherent and accurate text.

While zero-shot learning and transformers result in incredible feats (GPT-4 can do your taxes), they do have certain limitations, notably the way certain language models tend to ‘hallucinate’. From Wikipedia: “an AI hallucination is … a confident response by an AI that cannot be grounded in any of its training data”.

Prompt: picasso style oil painting of a siamese cat hallucinating about doing his taxes

Zero-shot learning models do not always perform as well as models that have been trained specifically for the task at hand. This is because the model, having seen no examples of the specific task, relies instead on its understanding of related concepts to make predictions. Or as Ben Thompson puts it, “think about what hallucination implies …: it is creation. The AI is literally making things up”.

When Google released its own model, Bard, during a very public demo, its first ever answer was factually inaccurate. Asked “What new discoveries from the James Webb Space Telescope can I tell my 9 year old about?”, Bard inaccurately claimed that JWST took the very first pictures of a planet outside of our own solar system. Viewers were quick to point out that the first such picture was taken 14 years before JWST launched.

Bard’s first answer was factually inaccurate. Image: Google

Given the limitations of zero-shot learning, the broader the dataset, the more accurate and relevant the responses in a wide range of contexts and domains. Had Bard had more exposure to information about exoplanets and telescopes, perhaps it could have avoided this. Yet there are many risks here, especially as people start to use these models in their day-to-day jobs. This has important implications for the types of models and businesses that succeed or fail in this new world, which we will touch upon in Part 2.

Generative Models: A short list

While transformers have been shown to outperform other types of architectures in many natural language processing tasks, they are not the only option for building language models. And language models are not the only generative models either. There are plenty of others with different capabilities, handily summarised below by my trusted writing assistant, ChatGPT.

  1. Variational Autoencoder (VAE): A type of neural network that is used for unsupervised learning of complex data. Commonly used to generate realistic images or to compress and generate new variations of input data.
  2. Generative Adversarial Network (GAN): A type of neural network that generates data by training two models against each other. One model generates fake data, and the other tries to distinguish between real and fake data. GANs are commonly used for generating realistic images, video, and audio.
  3. Auto-Regressive Models: A type of language model that generates text by predicting the next word in a sequence based on the previous words. The best-known examples are the GPT (Generative Pre-trained Transformer) family, widely used in natural language processing; BERT (Bidirectional Encoder Representations from Transformers), often mentioned alongside them, is a bidirectional encoder rather than an auto-regressive generator. A toy sampling loop is sketched after this list.
  4. Transformer Models: A type of neural network architecture that is particularly well-suited for natural language processing tasks. Transformer models are based on self-attention mechanisms that allow them to learn and represent complex relationships between different parts of a sentence or document. Examples of transformer models include GPT-3 and T5.
  5. Deep Convolutional Generative Adversarial Networks (DCGANs): A type of GAN that uses convolutional neural networks to generate realistic images. DCGANs are commonly used in image synthesis and style transfer.
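
As promised in item 3 above, here is a toy auto-regressive sampling loop. The ‘model’ is just a hand-written probability table rather than a trained network, but the loop structure mirrors how GPT-style models generate text one token at a time:

```python
import random

# A toy auto-regressive sampling loop (see item 3 above). The "model" is just
# a hand-written table of next-token probabilities, not a trained network.
toy_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def sample_next(token: str) -> str:
    # Pick the next token at random, weighted by the model's probabilities.
    options = toy_model[token]
    return random.choices(list(options), weights=list(options.values()))[0]

sequence = ["the"]
while sequence[-1] != "<end>":
    sequence.append(sample_next(sequence[-1]))   # one token at a time

print(" ".join(sequence[:-1]))  # e.g. "the cat sat"
```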

How did we get here?

We’ve reached a crucial moment in history. We’ve all seen it and tried it. These models seemingly understand us and, guided by our prompts, can generate creative poems, useful articles, and artistic images, and can even help analyse a patient’s chart and identify the correct diagnosis. The variety of use cases possible with this new technology is mind-expanding. Hundreds of companies have popped up within months, capitalising on these new models.


Getting there: Computing Power + New Models + Exabytes of Data

Why is this finally happening? The story here is that advances in cloud computing, machine learning research, and the decreasing cost of compute resources have all converged to produce this AI Renaissance.

At the infrastructure level, Generative AI wouldn’t be possible without the advances in cloud computing and hardware over the last decade. Cloud computing providers have spent billions to build and run the fastest and most scalable on-demand compute resources, competing for market share.

Meanwhile, companies such as Nvidia and Google developed new hardware capable of parallel computing, including graphics processing units (GPUs) and tensor processing units (TPUs). While GPUs were initially designed for gaming, they quickly gained adoption in AI, displacing central processing units (CPUs), which are far less suited to these parallel workloads. With more efficient hardware, the cost of training and running models has shrunk significantly.

At the same time, the decreasing cost of compute resources made it possible for more researchers and developers to experiment with Generative AI. The release of open-source models fuelled research, driving the development of more sophisticated neural network architectures, such as the transformer.

As more people trained and experimented with large-scale models, general understanding of how these models work and how to improve their performance grew, further reducing the cost of training and running inference.

Finally, there has been a significant increase in the availability of large, high-quality datasets, particularly in natural language processing and image recognition. This has enabled researchers to train Generative AI models on more diverse and complex data, which has led to significant improvements in the quality and diversity of the output generated by these models.

Where do we go from here?

The speed of innovation and new product launches is increasing. OpenAI helped the technology reach mass adoption in just a month. Every day, companies launch new features using Generative AI. It’s getting hard to keep up.

What comes next is anyone’s guess but in Part 2, I take a closer look at the value chain, using it as a guide to help understand what awaits. Stay tuned!
