The Problems that Attract the Smartest People
The roller-coaster that is generative AI keeps chugging on, and for those of us sitting in the audience, it is more than entertaining.
For a brief moment this week I was determined to get in on the action and contribute a little something to the remarkable community that has grown around Meta AI’s LLaMA by writing Python bindings for its most popular project.
Georgi Gerganov’s llama.cpp looks to be the most significant success in the sea of activity surrounding the infamous weights, pressing on where forks and other inspired projects seem to have fallen by the wayside. I promise you, daily reading of issues, discussions, PRs and commits is worth the effort.
Over the last two days, a most significant PR by Justine Tunney allowed llama.cpp to load ggml weights using roughly fifty percent less RAM. The changes were not without controversy, but the response was generally positive. I had been following the discussion since mmapping the weights was first brought up about three weeks ago, because the change would let me run larger, more capable models on my humble machine.
Here’s a quick summary of the change. To run inference on (i.e. to “use”) an ML model, the model’s weights have to be loaded into RAM so the running process can access them quickly. What this means is that to run inference on a 4-gig model, a device needs at least that much RAM to spare just to hold the model in memory.
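To make that concrete, here is a rough sketch of the conventional approach, the kind of loading the mmap change does away with. This is illustrative C, not llama.cpp’s actual loader: it simply allocates a buffer the size of the file and copies every byte of the weights into it before any inference can start.

```c
#include <stdio.h>
#include <stdlib.h>

/* Naive loader (illustrative only): the entire weights file is copied into
 * heap memory, so a 4 GB model needs at least 4 GB of free RAM up front. */
void *load_weights(const char *path, size_t *out_size) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);               /* total size of the model file */
    fseek(f, 0, SEEK_SET);

    void *buf = malloc((size_t)size);   /* whole model resident in RAM */
    if (buf && fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);

    if (buf) *out_size = (size_t)size;
    return buf;
}
```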
Memory mapping lets the process pull in just the parts of the large weight file it needs to run inference, delegating the actual I/O to the operating system. What this means in practice is that, at least on initial load, an inference job (you asking llama.cpp to do something) no longer has to load n gigabytes of weights before it even begins the task.
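By contrast, here is a minimal sketch of the memory-mapped approach, assuming a POSIX system. The function name and shape are mine for illustration and the actual PR is far more involved, but the core idea is the same: ask the kernel to map the file into the process’s address space and let it page data in only as the weights are touched.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap-based loader (illustrative only): the file is mapped into the
 * process's address space and the OS pages in only the regions that
 * inference actually touches. */
void *map_weights(const char *path, size_t *out_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping remains valid after the descriptor is closed */

    if (addr == MAP_FAILED) return NULL;
    *out_size = (size_t)st.st_size;
    return addr; /* no copy happens here; pages load lazily on first access */
}
```

A nice side effect is that the kernel keeps those read-only pages in its page cache, so they can be shared across processes and a second run against the same file starts noticeably faster.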
Before news of the upgrade hit Hacker News, I knew of jart only as one of the many contributors who had flocked to Gerganov’s inspiring project to make LLM accessibility a thing. Last night, I learned who she was, and that’s when it struck me.
Paraphrasing Paul Graham, “the most ambitious challenges attract the smartest people”. You see that sentiment in some of his essays, but it is quite the experience witnessing it unfold before you.
(The same goes for all sorts of problem spaces: if it’s scammy, it will attract the kind of people who’d play in moral grey areas for personal profit; if it’s bureaucratic-authoritarian, yup, you know who you’ll find there!)
Of course, my reference to Graham in this context uses “smart” in a rather limited sense, and complex challenges are multi-faceted, inviting all sorts of people to play within the many niches they provide. However, we can agree that the kinds of challenges/opportunities/problems people are drawn to tell us a lot about who they are.
Making inference cheaper and more efficient is perhaps the most important reaction to the AI spring we’re currently enduring. In a follow-up text, I will outline what I think is at stake, but to summarise here, democratising the means of digital production is the difference between a dystopian authoritarian future and an empowered (and admittedly chaotic) humanity.
Such a mission is not for slouches. Artificial Intelligence is hard. Large Language Models are more than glorified autocorrect. They are hard. Making them efficient is hard. It is also not in the immediate interest of large corporations and well-endowed research labs that can afford to spend thousands of dollars on compute, storage and mass deployment.
Incentives guide action. More efficient LLMs will do OpenAI, Meta AI, Google and the like a great deal of good, saving them millions and allowing them to be even more ambitious. But we’ve reached a point in the journey towards ever more general AI where these corporations have a significant lead over their competition and the rest of the open community. They are also rightly concerned about the costs of this innovation to humanity. That will not make them stop. Only winning will.
What this means is, once training and inference are cheap enough for their budgets, they may not be inclined to go further, especially if resources need to be pulled into actually making their technology profitable. How many Google engineers do you think will be asked to put in the work to make an LLM run on an old iPhone when, you know, you could just consume a Google Cloud API?
It’s up to the little ones to figure this out. And with stakes this high, titans such as jart, Kevin Kwok and Gerganov, along with hundreds of clever enthusiasts, have stepped up to make it happen.
Keeping up can be hard in a space so exciting. With this pace of development, following the news is probably harder than actually contributing code, since the best minds are actively building out the foundations of a more open AI future.
Following the progress, however, is its own thrill.
If you enjoyed this, let me know.