Running Llama2 Locally on the CPU with the Cheshire Cat
Example minimal setup for running a quantized version of Llama2 locally on the CPU with the Cheshire Cat.
In this article, I want to show you how you can integrate a local Large Language Model (LLM) with the Cheshire Cat. A guide for serving an LLM behind a REST API server already exists:
Yet, the aforementioned guide is meant to be agnostic with respect to the Operating System (OS), the LLM and the inference framework. Therefore, in this tutorial, I would like to show you a specific, practical example of chatting with a quantized version of Llama2–7B using the Cheshire Cat.
The general idea is to develop a very simple REST API server and serve the LLM behind an inference endpoint. The Cheshire Cat will send an HTTP POST request to this endpoint, triggering the model to answer the user’s message. The endpoint will then return the LLM’s answer to the Cat.
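To make the idea concrete, here is a minimal sketch of such a server, assuming FastAPI for the REST API and llama-cpp-python as the inference framework; the endpoint name, request schema, and model path are illustrative placeholders and not necessarily what the rest of this article uses.

```python
# Minimal sketch: a FastAPI server exposing one POST endpoint that runs a
# quantized Llama2 model on the CPU via llama-cpp-python.
# The model path, endpoint name, and request schema are illustrative choices.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load a quantized Llama2-7B checkpoint in GGUF format (path is a placeholder)
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

class Message(BaseModel):
    text: str

@app.post("/generate")
def generate(message: Message):
    # Run inference on the incoming prompt and return the completion text
    output = llm(message.text, max_tokens=256)
    return {"text": output["choices"][0]["text"]}
```

Started with something like `uvicorn server:app`, a server along these lines would receive the Cat’s POST requests and reply with the model’s completion.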
System Specifications
In this example, I’ll be using an Ubuntu 22.04 machine with an Intel® Core™ i7–8565U CPU @ 1.80GHz × 8…