Chapter 14: The World’s 1st On-Chain LLM

Modulus Labs
5 min read · Mar 14, 2024


“I don’t think proving large AI models in zero-knowledge is possible”
— Anonymous prehistoric text. Unknown origin and date.

Ethereum has spoken its first word.

We’re excited to announce that on March 13th, we finally completed the ZK proving of the full 1.5 billion parameter GPT2-XL.

Then we verified it on-chain. Which means that the record of the first ever LLM output with blockchain security is now forever and immutably inscribed in Ethereum block 19427725.

And here’s what was said:

“magic”

We’re well on our way ;)

This is the story of how we got here.

“Generics vs Specialized”

Towards the end of 2023, we began to wrap up our initial work on Remainder, our custom GKR prover built from the ground up for AI inference. This was an exciting moment for a couple of reasons.

For one, it meant that for the first time in Modulus’ history, we’d finally be able to shift our work onto our own tech stack, itself built on our insights from a year prior.

And as initial benchmarks started to come back, already hinting at a 1,000x improvement in prover efficiency, we were ecstatic that our hypothesis around specialized provers appeared to be finally coming true.

More cautiously, however, this also meant that for the first time, we’d likely be moving away from an era of using generic proving systems, i.e. ZK schemes designed for flexibility and for supporting complex VM operations. After all, our time with these systems gave us:

But we were fed up with the sheer enormity of generic prover overhead for ZKML. And true to our instincts, using Remainder’s unparalleled performance, in just the past few months we’ve already crafted:

Despite the progress, however, there remained two nagging questions:

  1. Can we actually ZK prove the largest AI models dominating the AI zeitgeist today — LLMs?
  2. And, just as key, would a specialized prover approach actually be needed in those regimes?

“fk it why not [llms]”

Large Language Models are, among other things, large. Even the most hardware-friendly modern language models easily top billions in parameter counts.

Which makes them particularly painful in the ZK context. After all, ZK proving is itself an enormously expensive operation.

But as advanced AI — and especially LLMs — gets introduced to our legal, financial, medical, and security sectors, the need for tamper-resistant AI queries appears to only grow larger. So we started by finding a great LLM foundation for ZK benchmarking.

I miss Tony. Can we exchange Elon back for him?

We chose OpenAI’s GPT2-XL. Although outdated compared to modern LLMs like GPT-4, it delivers on two features critical for the purposes of zkLLM benchmarking:

  1. GPT2-XL exceeds the one billion parameter threshold — the general complexity regime where LLMs begin to be useful
  2. GPT2-XL is built with a relatively straightforward architecture, consisting of just 48 uniformly-sized decoder blocks which feed sequentially into one another (see this wonderful visualization!), making it easy to circuitize; both properties are quick to confirm, as sketched below
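
As a quick sanity check of both properties, here is a minimal sketch using Hugging Face’s transformers library with the public "gpt2-xl" checkpoint. The library and checkpoint name are our assumptions for illustration; the post doesn’t name its weights source, and the download is roughly 6 GB.

```python
# Minimal sanity check of GPT2-XL's size and block structure
# (assumes the Hugging Face "gpt2-xl" checkpoint; ~6 GB download).
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters:     {n_params / 1e9:.2f}B")        # ~1.56B, past the 1B threshold
print(f"decoder blocks: {len(model.transformer.h)}")   # 48 uniformly-sized blocks
print(f"hidden size:    {model.config.n_embd}")        # 1600
```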

Furthermore, unlike previous attempts, which targeted the ZK proving of far smaller LLMs (weakening the relevance of their benchmark results) or skipped on-chain settlement altogether (LLM results had never actually been settled to Ethereum, until now), this test would take no shortcuts.

As such, we proceeded to perform the proving in Halo2, the poster child of generic ZK provers. We then recursively aggregated the individual SNARK verifications, compressing the results for final on-chain settlement.
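
To make the shape of that pipeline concrete, here is a toy sketch in plain Python (not Halo2 or Modulus code, and not necessarily the exact aggregation topology we used) of the recursive aggregation pattern: many per-piece proofs get folded together, stage by stage, until a single compact proof remains to verify on-chain.

```python
# Toy illustration of recursive proof aggregation (placeholder strings stand
# in for real SNARKs): fold pairs of proofs repeatedly until one remains.
def aggregate_all(proofs):
    while len(proofs) > 1:
        folded = [f"agg({a},{b})" for a, b in zip(proofs[0::2], proofs[1::2])]
        if len(proofs) % 2 == 1:          # odd count: carry the last proof forward
            folded.append(proofs[-1])
        proofs = folded
    return proofs[0]

sub_block_proofs = [f"pi_{i}" for i in range(144)]   # one proof per sub-block
final_proof = aggregate_all(sub_block_proofs)        # the single proof settled on-chain
print(final_proof[:60], "...")
```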

Even with the maturity of generic prover tooling and the relative simplicity of GPT2-XL, however, this was still a hugely difficult task:

  • Firstly, we attempted to prove each decoder block separately, creating 48 proofs which could then be aggregated in a multi-stage manner, chaining the outputs of one with the inputs of its successor block
  • However, just one of these ~30M-parameter blocks already overwhelmed the Halo2 prover: even on our cloud machines with 1TB of RAM and 128-core CPUs, memory exploded
  • Instead, we further broke up each of the decoder blocks into three sub-blocks (see this reference for a graphical breakdown, and the sketch after this list): the Layer Normalization + Self-Attention block, the 1st Linear Layer + GELU block, and the 2nd Linear Layer + Dropout + Residual Connection block
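
To make the split concrete, here is a small PyTorch sketch of a single GPT2-XL decoder block organized along those three boundaries. This is our own illustration, not Modulus’ circuit code: the dimensions match GPT2-XL (hidden size 1600, 25 heads, 6400-wide MLP), but details such as where the second LayerNorm and the attention residual land, the dropout rate, and the omitted causal mask are simplifying assumptions.

```python
# Illustrative PyTorch sketch of one GPT2-XL decoder block, split into the
# three sub-blocks that were proven separately. Each sub-block holds roughly
# 10M parameters (48 blocks x 3 sub-blocks = 144 proving targets in total).
import torch
import torch.nn as nn
import torch.nn.functional as F

D, HEADS, D_MLP = 1600, 25, 6400        # GPT2-XL dimensions

ln_1, ln_2 = nn.LayerNorm(D), nn.LayerNorm(D)
attn = nn.MultiheadAttention(D, HEADS, batch_first=True)
fc, proj = nn.Linear(D, D_MLP), nn.Linear(D_MLP, D)
drop = nn.Dropout(0.1)                  # illustrative dropout rate

def sub_block_1(x):
    # Sub-block 1: LayerNorm + self-attention (~10M params; the attention
    # residual is folded in here; causal masking omitted for brevity)
    h = ln_1(x)
    out, _ = attn(h, h, h, need_weights=False)
    return x + out

def sub_block_2(x):
    # Sub-block 2: 1st linear layer + GELU (~10M params; we assume the
    # second LayerNorm sits at the front of this sub-block)
    return F.gelu(fc(ln_2(x))), x       # also pass the residual forward

def sub_block_3(h, residual):
    # Sub-block 3: 2nd linear layer + dropout + residual connection (~10M params)
    return residual + drop(proj(h))

x = torch.randn(1, 8, D)                          # (batch, sequence, hidden)
y = sub_block_3(*sub_block_2(sub_block_1(x)))     # chain outputs into inputs
print(y.shape)                                    # torch.Size([1, 8, 1600])
```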

Each of these sub-blocks represented roughly 10M parameters’ worth of computation, for a total of 48 * 3 = 144 sub-blocks. And after aggregating performance across all sub-blocks, we arrived at the real-world cost of bringing useful LLM outputs on-chain using the generic approach:

  • The total time for generating the proving keys was 327,916 seconds, over 91 hours when run on a single machine with a 128-core CPU and 1TB of RAM
  • These 144 proving keys occupied over 10TB of disk space
  • The total proving time across the 144 sub-blocks was 322,774 seconds, just shy of 90 hours when run on the same single machine

And we did it! 200+ hours later, on a machine with a 128-core CPU and 1TB of RAM, we completed the world’s 1st full ZK proving of the inference pass of a billion+ parameter LLM!

“Swan Song for the Generics”

Alright. That’s awesome. We have successfully answered the first question at the top of this blog, while adding another world record to our collection — nice!

But for those keeping an eye on absolute performance, the picture isn’t pretty… In fact, compared to inference alone, generic ZK proving incurs a 1,125,761x overhead (yes, you’re reading that right: over a million times!). And compared to the most recent benchmarks from Remainder, the generic approach is 1,000 to 10,000x more computationally taxing…
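
As a rough consistency check on that figure, and assuming the overhead is measured as total proving time divided by the latency of a single unproven inference pass (the post doesn’t spell out the exact methodology), the implied native inference time works out to a plausible fraction of a second:

```python
# Back-of-the-envelope check, assuming overhead = proving time / native
# inference time for one pass (our assumption about the methodology).
proving_time_s = 322_774            # total proving time across 144 sub-blocks
reported_overhead = 1_125_761       # generic-prover overhead vs. inference alone

implied_inference_s = proving_time_s / reported_overhead
print(f"implied native inference time: {implied_inference_s:.3f} s")   # ~0.29 s
```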

Yikes. Just imagine what this means for deploying verifiable LLMs at scale.

Slide from our BASS presentation at EthDenver

The lesson, to us, is clear: to truly scale verifiable AI, we need specialization. In fact, as we look towards the complex landscape emerging around AI safety, explainability, and compliance, endgame technologies for AI accountability seem more essential than ever.

And we’re much closer than you’d think ;)

“Come Sail Away”

This is… a really cool moment for us. When we first started Modulus, the feasibility of proving any AI model in ZK at all was itself unclear. And here we are, a little over a year later, settling the world’s 1st ZK proof of a 1B+ parameter LLM query on-chain.

Better yet, we also know what we need to do next to actually make zkLLM queries practical at scale. And did we mention the applications already giddy to put this accountable superpower to work?

More to say on both soon. Stay tuned…

That’s all for today! Make sure to stay up-to-date with Modulus via our Twitter, subscribe to our blogs, and for all things ZKML, you’re already in the right place.

“Welcome to a world of pure imagination”
