Necessity is the mother of (GenAI) invention

Duncan Anderson
Barnacle Labs
Aug 8, 2023

Contrary to the perception that Generative AI is all about massive models and vast GPU compute farms, a lot of innovation in the space is occurring at a simpler and more accessible level. Perhaps out of necessity, technical communities are diligently working to create models that are smaller, faster, and more efficient.

Today, you can run large language models (LLMs) on your desktop, in your browser, and even on your smartphone. The era of behemoth-scale operations is not over, but a more nimble approach is emerging and worth watching. This shift is being driven by the unwieldy and expensive nature of relying on the extensive compute resources demanded by large models — as Plato said, “necessity is the mother of invention”.

Actually, to be accurate, Plato didn’t say that. What he did say is “our need will be the real creator” — a phrase that has been morphed over the years into “necessity is the mother of invention”.

Let’s take a look at some examples.

Tiny generative image algorithms

Nvidia Perfusion is a generative image algorithm with a difference.

The difference is that it's only around 100KB in size and was trained in just 4 minutes.

Its results are claimed to be comparable to those of Stable Diffusion.

Of course it's not quite Stable Diffusion: it doesn't generate images from scratch, but rather variants of an existing image. Still, the results are impressive and that 4-minute training time is huge! Well, small actually 😉

Smaller (Large) Language Models

Quantisation is a method of shrinking a model to a fraction of its original size, whilst largely retaining its performance.

Large models can have excessive memory requirements. For example, a large model might need hundreds of gigabytes of memory, excluding it from being run on anything other than the largest and most sophisticated cloud infrastructures. However, if instead of the default 32-bit floating-point precision we can get away with 8-bit or even 4-bit precision, those memory sizes start tumbling dramatically.

This paper neatly demonstrates that model performance actually slightly increased as the models were moved from 32-bit to 16-bit, to 8-bit and eventually to 4-bit precision. Counterintuitively, lower resource requirements did not equate to lower performance.
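To make the memory arithmetic concrete, here's a minimal sketch of loading a model in 4-bit precision using the Hugging Face transformers and bitsandbytes libraries. The model name and settings below are illustrative assumptions, not a prescription; swap in whatever model you like.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative model choice; any causal LM on the Hugging Face Hub will do.
model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit quantisation: weights are stored in 4 bits, compute happens in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Rough arithmetic: 7B parameters x 4 bytes (fp32) is ~28GB of weights,
# versus ~3.5GB at 4-bit precision, before runtime overheads.
```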

TheBloke's Hugging Face page is full of open-source models that have been quantised to drastically reduce their size and memory requirements. It's quite incredible that the community has reacted so rapidly: the QLoRA paper on which this work is based was only published a few months ago.

Optimised inference in C++

llama.cpp is a reimplementation of the LLaMA inference engine in C++. Why do this? C++ is often described as "bare metal" programming because it's much closer to the hardware, with fewer performance-sapping abstractions.

llama.cpp also includes optimisations for ARM processors and for Apple's Accelerate and Metal frameworks, ensuring it takes full advantage of the hardware and software optimisations that Apple provides.

As a result llama.cpp, together with quantised models, makes it possible to operate LLMs quite effectively on a MacBook laptop. No cloud, no servers, no data going outside the local laptop.
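If you'd rather drive llama.cpp from code than from the command line, the llama-cpp-python bindings expose the same engine. A minimal sketch, assuming you've already downloaded a quantised model file (the path below is just an example):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The path is an example only; point it at any quantised model file you've
# downloaded, e.g. one of TheBloke's quantised releases.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm("Q: Why run an LLM locally? A:", max_tokens=128)
print(output["choices"][0]["text"])
```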

LLMs on your desktop

Llama.cpp makes desktop installation of LLMs possible, but it can still be a complex task for those not familiar with the tools and technologies involved. Never typed pip install into a terminal before? You might struggle.

Luckily there’s an active community of hackers working to solve this problem. There’s now a profusion of projects that aim to make LLM installation easy and quick. Two of my favorites are LocalAI and Ollama.

Ollama is probably the most seamless experience — install the app and type ollama run llama2 into your terminal. Hey presto, you’re chatting to a locally installed LLM on your MacBook!
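Ollama also exposes a local HTTP API (on port 11434 by default), so you can script against it rather than typing into the terminal. A rough sketch, assuming a recent Ollama version; the exact response format may vary between releases:

```python
import json
import requests

# Ollama's local HTTP API; nothing here leaves your machine.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
)

# Responses stream back as newline-delimited JSON chunks.
for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
```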

LocalAI isn’t quite as seamless, but is much more ambitious. It provides a single OpenAI-compatible API across all the different models, even going as far as to support OpenAI functions.
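Because the API is OpenAI-compatible, existing OpenAI client code can be pointed at a LocalAI server with almost no changes. A minimal sketch; the port and model name are assumptions that depend on how your LocalAI instance is configured:

```python
import openai

# Point the standard OpenAI client at the local server instead of api.openai.com.
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "not-needed-for-a-local-server"

response = openai.ChatCompletion.create(
    model="ggml-gpt4all-j",  # whichever model your LocalAI instance serves
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["choices"][0]["message"]["content"])
```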

A native Mac chat app

Perhaps you've installed LocalAI or Ollama, but don't fancy chatting with their rather bare-bones interfaces. No problem, yours truly has created a native Mac app that supports chatting via either. It also includes support for OpenAI, so you can register for a key and get the benefits of ChatGPT without paying that fixed monthly fee.

LLMs on your smartphone

llama.cpp is cool, but some wanted to go even further. MLC LLM is a solution for running LLMs on iPhones and Android devices.

You can grab the MLC LLM iPhone app on the Apple App Store — that’s an LLM running 100% locally on your iPhone — no cloud, everything running on the device.

LLMs in your browser

Continuing the theme of running LLMs in unusual places, Web LLM is a companion project to MLC LLM that runs LLMs in your browser.

That’s the full LLM running within the browser on your device, not just the interface. Again: zero cloud and zero data going outside of your device.

Hint: you need a browser that supports WebGPU, and those are a little thin on the ground at the moment. The latest release of Chrome worked for me.

Summary

Optimisation follows invention and it seems that we’re entering the optimisation phase of Generative AI.

Open source is driving a lot of this innovation, probably out of necessity — few in the open source community have the money to pay for giant GPU clusters, so making things faster, smaller and cheaper is essential.

However, we’re also in a climate emergency — the optimisations and innovations that are emerging through necessity are setting a new direction, one much more consistent with the context we find ourselves in.

Large organisations with hundreds of millions in funding behind them don't have the same focus on making things smaller, cheaper, faster and more energy efficient. They can throw money at the problem, precisely because they have money. They don't have the need.

In contrast, those without money have to find other ways. Necessity is the mother of (GenAI) invention and the open source community is highly motivated to find ways to reduce energy consumption, because high energy consumption brings with it high costs. This is a big part of how and why the open source community adds value.
