AI and the dawn of motherships

The marginal cost of intelligence is trending to zero

Sebastian Jorna
8 min read · Dec 2, 2022

Introducing the motherships

A new AI era has begun, one in which ML algorithms go from narrow subject-matter specialists to jacks-of-all-trades, ushering in the metamorphosis of ML model into AI platform. While the buzzwords of AI have stayed the same, a lot has changed under the hood. We’ll explore what changed and the potential implications.

Motherships, or foundation models as the industry calls them, have been getting a lot of attention lately, especially OpenAI’s GPT-3. By experimenting with scale and universal loss functions, these models started to show impressive and somewhat unexpected emergent general knowledge, becoming “jacks-of-all-trades”. The wave of media attention came somewhat later, driven largely by the fact that these “motherships” let smaller models “dock” into them (via APIs) to form a platform-like approach. Downstream applications can, in essence, leverage the general knowledge of the mothership and easily spawn narrow subject-matter specialists tailored toward countless tasks. And we have seen a ton of those spawn over the past couple of months.
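
To make the “docking” concrete, here is a minimal sketch of spawning a narrow specialist on top of a hosted foundation model purely through its API. It assumes the openai Python package as it looked in late 2022 and an API key in the OPENAI_API_KEY environment variable; the model name and the prompt are illustrative choices, not a prescription.

```python
# A hedged sketch of "docking into the mothership": a narrow sentiment specialist
# built with nothing more than a task-specific prompt against a hosted foundation model.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = (
    "Classify the sentiment of the customer review as Positive or Negative.\n"
    "Review: The checkout flow kept timing out and support never replied.\n"
    "Sentiment:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # assumed model name; any GPT-3-family completion model works
    prompt=prompt,
    max_tokens=3,
    temperature=0,
)
print(response["choices"][0]["text"].strip())  # e.g. "Negative"
```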

By analogy, foundation models can also be seen as majestic knowledge trees that try to map and understand our complex world. This is similar to how we humans learn to understand our world: everything takes place in the same “universe”. We don’t need to remember all the words in a book; we can weave the words into a coherent story and remember it that way. Similarly, having experienced snow in Boston, you can imagine how snow feels in other places without needing to have been there. It is all part of the same universe.
It is easier to attach a leaf of new knowledge to an existing branch. This also explains the power of foundation models as platforms: it is much easier to attach/train a “subject matter specialist” branch onto the majestic knowledge tree than to train that capability from scratch. Prior ML approaches looked more like conjuring a broomstick out of thin air; today we grow sticks from an existing tree. Elon Musk’s tip for humans to retain more knowledge is equally true for machines:

it is important to view knowledge as sort of a semantic tree — make sure you understand the fundamental principles, ie the trunk and big branches, before you get into the leaves/details or there is nothing for them to hang on to.

DALL·E 2: Metamorphosis of broomstick to knowledge tree, Max Ernst painting

Key innovations that enabled this step-change:

Among a host of different innovations, four were key for the boom of foundation models we see today:

  1. Transformer architecture
  2. Next-word prediction
  3. Data availability
  4. Compute power

1) Transformer architecture

Before Google’s seminal 2017 paper “Attention Is All You Need”, ML architecture was fairly siloed across domains. A variety of architectures, from RNNs to CNNs, were successfully driving performance across the fields of NLP and computer vision, but they provided limited use beyond their own domains. The transformer architecture, however, seems to be breaking those silos, unifying AI architecture across domains. This in itself can already be taken as a hint that the transformer architecture unlocked something more fundamental.
Here is how transformers are different. Architectures such as CNNs and LSTM RNNs can be seen as humans trying to point ML models to where in the training data they should initially pay attention. The transformer architecture, by contrast, allows the model to learn the dependencies across both the input and output sequences jointly, independently of those human biases.
This substantial architectural difference led to a paradigm shift within the AI community. You can now give the model a ton of data and it will figure out by itself what parts best relate to each other. On top of that, parallelization allows those models to much more effectively utilize the available compute.
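
To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the transformer. The toy inputs and shapes are illustrative only; a real model derives Q, K, and V from learned projections and stacks many such layers in parallel heads.

```python
# A minimal sketch of scaled dot-product self-attention: every token attends to every
# other token, and the weighting is learned from data rather than hard-wired.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: who attends to whom
    return weights @ V                                # mix value vectors by attention weight

# 4 tokens with 8-dimensional embeddings; Q, K, V would normally be learned projections of x
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```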

2) Next-word prediction

Having an architecture that can find the right dependencies in a vast amount of training data is only half of the solution. We still need an efficient, hands-off optimization objective that allows for unsupervised training (no need for humans to label data). Up until this point, most AI models were trained using supervised learning, where humans need to label data for the algorithm to learn from, a step that is costly and slow.
To solve this, we had to draw inspiration from neuroscience and philosophy. The way we humans learn our world models is by continuously predicting our environment, and automatically updating our knowledge tree if prediction ≠ reality. For example, you pick up a stone and your brain already calibrates the force you need to pinch the stone and lift it up. These prediction parameters have been fine-tuned throughout your life of interacting with similar objects. Now imagine that this particular type of stone is hollow, and it is the first time you pick one up. You will exert way too much force and be surprised at how light the stone is. Implicitly, your own knowledge tree will be updated, and the next time you see a similar stone, your brain will calibrate your muscles to use less initial force. The beauty is that this happens subconsciously after enough repetitions or a big enough surprise. Namely, the bigger the surprise, the stronger the impulse your brain gets to adjust its world model.
So how do we apply this to neural nets? Simply predicting the next word in a large training corpus is an elegant way to achieve the same effect: the model’s “surprise” (its prediction error) becomes the training signal.
With the combination of transformer architecture, unsupervised learning, and next-word prediction, you can now throw a ton of data at the model and it will learn a “general knowledge tree” by itself!
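
As a rough illustration, here is a minimal PyTorch sketch of that objective: shift the token sequence by one position and train the model to predict token t+1 from the tokens before it. The tiny embedding-plus-linear model is a stand-in for a real transformer; only the construction of the loss is the point.

```python
# A minimal sketch of the next-word-prediction objective (causal language modeling).
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 16))    # one toy sequence of 16 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position predicts the *next* token

logits = model(inputs)                            # (1, 15, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # the "surprise" flows back and updates the weights
```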

3) Data availability

In order for these new AI models to work, they need a ton of data about all sorts of topics. Luckily, we have been adding data to the internet for almost 30 years, since the source code of Tim Berners-Lee’s WorldWideWeb, the world’s first web browser and editor, was released to the public in 1993. In 2022 alone, we are expected to produce 94 zettabytes of data. Put differently, that is about 47 iPhones (256 GB) that ran out of storage, per person on this earth!
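
For those who like to check the arithmetic, here is the back-of-envelope calculation behind that claim, using the article’s own estimates (roughly 94 ZB of data in 2022, around eight billion people, 256 GB per phone).

```python
# Back-of-envelope check of the "full 256 GB iPhones per person" figure.
zettabytes = 94
bytes_total = zettabytes * 10**21
people = 8_000_000_000          # rough 2022 world population
iphone_bytes = 256 * 10**9

per_person_tb = bytes_total / people / 10**12
phones_per_person = bytes_total / people / iphone_bytes
print(f"{per_person_tb:.1f} TB per person ≈ {phones_per_person:.0f} full 256 GB iPhones")
# ≈ 11.8 TB per person ≈ 46 phones (closer to 47 with a slightly smaller population estimate)
```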

4) Computing power

Thanks to Moore’s law, computing power has gotten a lot cheaper. Moreover, with the rise of cloud services, it is now more readily available than ever before. As a consequence, we have been able to train these new models at unprecedented speed and size.
With these four breakthroughs, we now have the tools available to attempt training foundation models that capture a powerful understanding of our world.

The proof is in the pudding

OpenAI, with its GPT-N series, has set out to find the boundaries of foundation model performance and scale. Yet despite the gigantic size of these models, we have not yet found the limit of how much scale improves performance.

The bigger the model (#Params), the better the performance (% Accuracy)

This success spurred the dawn of the motherships, with companies racing to train ever bigger models, both on the centralized side (Google, Meta, OpenAI, and others) and on the open-source side (Stability AI, among others).

How good are these models and their downstream applications, actually? Instead of losing ourselves in the avalanche of interesting demos and controversial use cases, let’s focus on the actual adoption of these models as a proxy for their perceived value.

OpenAI, the company behind GPT-3 and DALL·E 2, was recently valued at nearly $20bn following a secondary sale to investors like Sequoia and a16z.

GitHub announced Copilot in June 2021; instead of normal text, it outputs code. Copilot is powered by Codex, a version of OpenAI’s famous GPT-3 model fine-tuned for programming tasks. A year later, GitHub reported that:

Over 1.2 million developers used Copilot in the last twelve months, with 40% of the actual code written by Copilot in files where it was enabled.

One of the most spectacular signs of product-market fit (PMF) comes from Stable Diffusion, the open-source counterpart to OpenAI’s DALL·E 2. It received 35k+ GitHub stars in its first 90 days. Yes, that light blue line in the chart is not the Y-axis!

Who wins in this new era of AI?

As seen above, there is a huge opportunity for value creation for companies that tap into this growing ecosystem. The conundrum, however, is that as the marginal cost of intelligence decreases, the barriers to entry diminish significantly, making it difficult to build a sustainable moat.

There are four types of businesses that will be particularly successful in this new era of AI:

  1. Data companies: With the ever-growing need for more data, these companies will play a pivotal role in the success of AI models. Be it by providing the data, processing it, or labeling it. This holds true both for training the foundation models directly as well as fine-tuning them for downstream applications.
    Knowledge tree analogy: Preparing the nutrients to grow the knowledge tree or branch
    Mothership analogy: Preparing the raw building material for the mothership and the smaller theme-specific spaceships
  2. Distributed training: With these massive models come equally large training costs; a single GPT-3 training run cost about $12m. Finding efficient ways to train these models and create feedback loops is another problem, and a huge opportunity, for businesses to tap into.
    Knowledge tree analogy: The energy for the knowledge tree to absorb the nutrients and grow
    Mothership analogy: Assembling the raw materials into the mothership
  3. Applied AI / Vertical SaaS: Applied AI startups that focus on specific use cases and plug into the foundation models. They will be able to quickly take these new foundation models and fine-tune them to achieve amazing performance in their narrower domains (see the sketch after this list). More importantly, they will streamline the model to unlock targeted value creation through easy UX/UI. While these approaches can quickly lead to impressive results, the barriers to entry are very low. It will be critical for these businesses to find a moat and flywheel around their fine-tuning ability, on top of a stellar UI/UX.
    Knowledge tree analogy: Building theme-specific branches in the knowledge tree
    Mothership analogy: Docking a smaller, theme-specific spaceship into the Mothership
  4. Dark Kitchen Software: As if these motherships were not impressive enough, the next step will be agent-enabled motherships. This means that beyond general knowledge, these models will be able to interact with third-party software (i.e. use a computer to the fullest extent). Fast-forwarding this trend, a lot of software users will become AIs. One way to play this trend would be to create software that scraps the normal human UX/UI and focuses purely on API-based interaction with AI models. This will create lean and mean software machines, similar to the more cost-efficient nature of dark kitchens versus classical restaurants.
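
As a concrete illustration of category 3, here is a minimal sketch of “growing a branch on the knowledge tree”: fine-tuning a small pretrained transformer for a narrow sentiment task instead of training that capability from scratch. It assumes the Hugging Face transformers library and PyTorch; the model name, the two toy examples, and the handful of gradient steps are illustrative only.

```python
# A minimal sketch of attaching a "subject matter specialist" branch: fine-tune a small
# pretrained backbone on a tiny downstream task rather than training from scratch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"  # assumed small foundation-style backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["The onboarding flow is delightful", "The app crashes on launch"]
labels = torch.tensor([1, 0])           # 1 = positive, 0 = negative (toy labels)

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):                      # a few gradient steps on the tiny toy batch
    outputs = model(**batch, labels=labels)  # the pretrained "trunk" does most of the work
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```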

Let’s connect:

Let’s connect if you are building for this new age of motherships, the age where the marginal cost of intelligence is trending to zero, especially if you have found a creative way to build a sustainable moat!

twitter.com/sebastian_rtj

When you are living through history, it feels like just another day — Sam Altman
