Running Llama2 Locally on the CPU with the Cheshire Cat

Nicola Corbellini
Mad Chatter Tea Party
5 min read · Aug 21, 2023


A minimal example setup for running a quantized version of Llama2 locally on the CPU with the Cheshire Cat.

Cover image artificially generated with Freepik

In this article, I want to show you how to integrate a local Large Language Model (LLM) with the Cheshire Cat. A guide for serving an LLM behind a REST API server already exists:

Yet, the aforementioned guide is deliberately agnostic with respect to the Operating System (OS), the LLM, and the inference framework. In this tutorial, therefore, I would like to walk you through a specific, practical example: chatting with a quantized version of Llama2-7B through the Cheshire Cat.

The general idea is to develop a very simple REST API server and serve the LLM behind an inference endpoint. The Cheshire Cat will send an HTTP POST request to this endpoint, triggering the model to answer the user's message; the endpoint will then return the LLM's answer to the Cat.
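To make the idea concrete, here is a minimal sketch of such a server, assuming a llama-cpp-python backend and FastAPI. The model path, the `/generate` endpoint name, and the request schema are my own placeholders for illustration, not something the Cat prescribes:

```python
# Minimal inference server sketch.
# Assumes: pip install llama-cpp-python fastapi uvicorn
# The model file and endpoint name below are illustrative placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load the quantized Llama2-7B weights once, at startup.
llm = Llama(model_path="models/llama-2-7b-chat.q4_0.bin", n_ctx=2048)

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate(prompt: Prompt):
    # Run CPU inference and return the generated text to the Cat.
    output = llm(prompt.text, max_tokens=256)
    return {"text": output["choices"][0]["text"]}
```

You would then start it with something like `uvicorn server:app --port 8000` and point the Cheshire Cat's language model configuration at that endpoint.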

System Specifications

In this example, I’ll be using an Ubuntu 22.04 machine with an Intel® Core™ i7-8565U CPU @ 1.80GHz × 8…
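To give an idea of why this is feasible on such a machine: with 4-bit quantization, the 7B model's weights take roughly 7 × 10⁹ parameters × 0.5 bytes ≈ 3.5 GB, so the model fits comfortably in the RAM of an ordinary laptop, with inference running entirely on the CPU cores.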


Nicola Corbellini
Mad Chatter Tea Party

I'm a PhD student in Computer Science and a Machine Learning enthusiast who believes AI can have a positive impact on human behavior understanding and social good.