SILO: A New LLM Exclusively Trained on Open Text

A small one, but augmented with a non-parametric datastore

Benjamin Marie
2 min readAug 14, 2023

SILO is a new LLM with the particularity that it is trained on a curated corpus of 228 billion tokens of public domain and permissively licensed text.

Animation by Min et al., 2023.

In this work, Min et al. (2023) released the training data as a new corpus they call the Open License Corpus (OLC). OLC is already available on the Hugging Face Hub and is distributed under an Apache 2.0 license (commercial use allowed).
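If you want to have a quick look at OLC, the Hugging Face datasets library is enough. Note that the repository id below is only a placeholder; check the OLC page on the Hub for the exact identifier and subset names.

from datasets import load_dataset

# Stream the corpus rather than downloading it (228B tokens is a lot of text).
# "kernelmachine/open-license-corpus" is a placeholder repository id.
olc = load_dataset(
    "kernelmachine/open-license-corpus",
    split="train",
    streaming=True,
)

# Print a few examples to inspect the data.
for example in olc.take(3):
    print(example)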

To the best of my knowledge, this is the largest corpus of this kind.

The model itself is rather small, with 1.3B parameters. It uses the LLaMA architecture and the GPT-NeoX tokenizer. SILO 1.3B is also available on the Hugging Face Hub.
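Loading it should work like any other causal LM on the Hub. The repository id below is again a placeholder, since SILO comes in several variants depending on the license mix of the training data; check the Hub for the exact checkpoint names.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kernelmachine/silo-1.3b"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Public domain text is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))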

Another particularity of this model is that it can retrieve information from a non-parametric datastore during inference. That may explain why the authors chose a relatively small number of parameters for the model.
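The article doesn't detail the retrieval mechanism, but the paper pairs the parametric LM with a kNN-LM-style datastore: at each step, the model's next-token distribution is interpolated with a distribution computed from the nearest neighbors retrieved in the datastore. Here is a toy sketch of that interpolation, with random tensors standing in for the real model and datastore:

import torch

vocab_size, hidden_size, datastore_size, k = 50_000, 2_048, 10_000, 8
lmbda = 0.25  # weight given to the retrieved distribution

# Next-token distribution of the parametric LM (random stand-in).
p_lm = torch.softmax(torch.randn(vocab_size), dim=-1)

# Datastore: context vectors (keys) and the tokens that followed them (values).
keys = torch.randn(datastore_size, hidden_size)
values = torch.randint(0, vocab_size, (datastore_size,))
query = torch.randn(hidden_size)  # hidden state of the current context

# Retrieve the k nearest neighbors of the query by L2 distance.
distances = torch.cdist(query.unsqueeze(0), keys).squeeze(0)
neighbor_dist, neighbor_idx = distances.topk(k, largest=False)

# Turn the neighbor distances into a distribution over the vocabulary.
weights = torch.softmax(-neighbor_dist, dim=-1)
p_knn = torch.zeros(vocab_size)
p_knn.scatter_add_(0, values[neighbor_idx], weights)

# Interpolate the two distributions to get the final prediction.
p_final = (1 - lmbda) * p_lm + lmbda * p_knn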

SILO is a work by the University of Washington, UC Berkeley, and the Allen Institute for AI (Min et al., 2023).

This short article was originally published in The Weekly Kaitchup. To receive exclusive articles and all my AI notebooks in your mailbox, subscribe here:
