Building GPT2o — Part 1: Audio
Code, pretrained model, Colab notebook
Having been inspired by Andrej’s “Let’s Reproduce GPT2”, I wanted to try my hand at building an LLM from scratch. But what if we could build a GPT2 that generates audio in an autoregressive fashion? Much like OpenAI’s GPT4o, or the earlier (but not released, sigh) AudioPaLM. And preferably one that can be trained on consumer hardware. Not everyone has a bunch of A100s lying around :(
Tokenisation
Since we are dealing with audio, we need a way to tokenise it, similar to how we tokenise text.
Enter SNAC. I saw this paper/model on LAION’s call to build open multimodal models. It converts audio into discrete tokens via a hierarchical structure, and it works really well: I couldn’t tell the difference between the original and the reconstructed audio. Using their model, and a tutorial on how to turn its output into a flat list of tokens, I made a tokeniser that takes audio and converts it into discrete tokens for LM training.
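Roughly, the idea looks like this. This is only a sketch: the file name, the exact interleaving and the per-level vocab offsets here are illustrative assumptions, not copied verbatim from my script.

```python
# Sketch: audio -> flat LM tokens with SNAC (assumes the `snac` package and its 24 kHz checkpoint).
import torch
import torchaudio
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

wav, sr = torchaudio.load("chapter_01.mp3")            # hypothetical file; (channels, samples)
wav = torchaudio.functional.resample(wav, sr, 24000)   # SNAC expects 24 kHz
wav = wav.mean(dim=0, keepdim=True).unsqueeze(0)       # mono, shape (1, 1, T)

with torch.inference_mode():
    codes = snac_model.encode(wav)   # list of code tensors, coarse -> fine

# Flatten the hierarchy into one interleaved stream so a vanilla GPT can model it.
# The 24 kHz model emits 1 coarse, 2 mid and 4 fine codes per frame; offsetting each
# level by 4096 (an assumption about the vocab layout) keeps the codebooks distinct.
tokens = []
for i in range(codes[0].shape[-1]):
    tokens.append(codes[0][0, i].item())
    tokens += [4096 + codes[1][0, 2 * i + j].item() for j in range(2)]
    tokens += [8192 + codes[2][0, 4 * i + j].item() for j in range(4)]
```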
Dataset
I conveniently missed LAION’s pretokenised dataset, which I only just noticed on their call-for-multimodality page. Instead, I decided to use a public domain recording of The Adventures of Sherlock Holmes from LibriVox, which I have also uploaded to Hugging Face here.
It has roughly 12 hours of clean audio, totalling about 1.5 million SNAC tokens.
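The prep step is then just running every chapter through the tokeniser and dumping one flat binary stream, nanoGPT-style. Something like this (the `tokenise_file` helper, paths and split ratio are illustrative, not the actual script):

```python
# Sketch: tokenise every chapter and write flat train/val binaries for LM training.
import numpy as np
from pathlib import Path

all_tokens = []
for path in sorted(Path("sherlock_holmes").glob("*.mp3")):
    all_tokens.extend(tokenise_file(path))   # hypothetical wrapper around the SNAC encode + flatten step

tokens = np.array(all_tokens, dtype=np.uint16)  # 3 levels x 4096 codes fits comfortably in uint16
split = int(0.9 * len(tokens))
tokens[:split].tofile("train.bin")
tokens[split:].tofile("val.bin")
```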
Training & Inference
There are very minimal changes to the original code from Andrej. In fact, the only real change was a new data-processing script in place of FineWeb, plus minor tweaks like context length, batch size and vocab size. Even then, I have somehow managed to break the DDP version… to be fixed later. It’s quite simple to train: see my code above, and in an hour or two on Colab you can get yourself a model that tells you this. I am not joking, it wasn’t scripted. That was the 3rd generation in my first test inference.
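At inference time, the sampled tokens just need to be un-interleaved back into SNAC’s three levels and decoded into a waveform. Again a sketch, assuming the same 7-tokens-per-frame layout and offsets as above; `generated` would come from the GPT’s usual sampling loop:

```python
# Sketch: generated token stream -> waveform via SNAC's decoder.
import torch
import torchaudio

frames = len(generated) // 7          # assumed: 7 interleaved tokens per SNAC frame
l0, l1, l2 = [], [], []
for i in range(frames):
    chunk = generated[7 * i: 7 * (i + 1)]
    l0.append(chunk[0])
    l1 += [t - 4096 for t in chunk[1:3]]   # undo the assumed per-level offsets
    l2 += [t - 8192 for t in chunk[3:7]]

codes = [torch.tensor(level).unsqueeze(0) for level in (l0, l1, l2)]
with torch.inference_mode():
    audio = snac_model.decode(codes)       # (1, 1, samples) at 24 kHz

torchaudio.save("sample.wav", audio.squeeze(0), 24000)
```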
By the way, yes, the model is overfit and produces garbled output 99% of the time. This was simply a proof of concept, and I'm sure that with more data, more compute and a bigger model it would be much better.
Scale is all you need
What next:
- more data
- more compute
- bigger model
- audio instruction tuning?
- mix text & audio
- images? ByteDance have a similar discrete tokenizer for images that might be worth looking at: “An Image is Worth 32 Tokens”