SILO: A New LLM Exclusively Trained on Open Text

A small one, but augmented with a non-parametric datastore

Benjamin Marie
2 min readAug 14, 2023

SILO is a new LLM with the particularity that it is trained on a curated corpus of 228 billion tokens of public domain and permissively licensed text.

Animation by Min et al., 2023.

In this work, Min et al. (2023) released the training data as a new corpus they call the Open License Corpus (OLC). OLC is already available on the Hugging Face Hub and is distributed under an Apache 2.0 license (commercial use allowed).
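If you want to have a quick look at OLC, the Hugging Face datasets library is enough. Note that the repository id below is only a placeholder; check the OLC page on the Hub for the exact identifier and subset names.

from datasets import load_dataset

# Stream the corpus rather than downloading it (228B tokens is a lot of text).
# "kernelmachine/open-license-corpus" is a placeholder repository id.
olc = load_dataset(
    "kernelmachine/open-license-corpus",
    split="train",
    streaming=True,
)

# Print a few examples to inspect the data.
for example in olc.take(3):
    print(example)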

To the best of my knowledge, this is the largest corpus of this kind.

The model itself is rather small, with 1.3B parameters. It uses the LLaMA architecture and the GPT-NeoX tokenizer. SILO 1.3B is also available on the Hugging Face Hub.
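Loading it should work like any other causal LM on the Hub. The repository id below is again a placeholder, since SILO comes in several variants depending on the license mix of the training data; check the Hub for the exact checkpoint names.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kernelmachine/silo-1.3b"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Public domain text is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))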

Another particularity of this model is that it can retrieve information from a non-parametric datastore during inference. That may explain why the authors chose a relatively small number of parameters for the model.
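The article doesn't detail the retrieval mechanism, but the paper pairs the parametric LM with a kNN-LM-style datastore: at each step, the model's next-token distribution is interpolated with a distribution computed from the nearest neighbors retrieved in the datastore. Here is a toy sketch of that interpolation, with random tensors standing in for the real model and datastore:

import torch

vocab_size, hidden_size, datastore_size, k = 50_000, 2_048, 10_000, 8
lmbda = 0.25  # weight given to the retrieved distribution

# Next-token distribution of the parametric LM (random stand-in).
p_lm = torch.softmax(torch.randn(vocab_size), dim=-1)

# Datastore: context vectors (keys) and the tokens that followed them (values).
keys = torch.randn(datastore_size, hidden_size)
values = torch.randint(0, vocab_size, (datastore_size,))
query = torch.randn(hidden_size)  # hidden state of the current context

# Retrieve the k nearest neighbors of the query by L2 distance.
distances = torch.cdist(query.unsqueeze(0), keys).squeeze(0)
neighbor_dist, neighbor_idx = distances.topk(k, largest=False)

# Turn the neighbor distances into a distribution over the vocabulary.
weights = torch.softmax(-neighbor_dist, dim=-1)
p_knn = torch.zeros(vocab_size)
p_knn.scatter_add_(0, values[neighbor_idx], weights)

# Interpolate the two distributions to get the final prediction.
p_final = (1 - lmbda) * p_lm + lmbda * p_knn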

SILO is a work by the University of Washington, UC Berkeley, and the Allen Institute for AI (Min et al., 2023).

This short article was originally published in The Weekly Kaitchup. To receive exclusive articles and all my AI notebooks in your mailbox, subscribe here:
