Running Llama2 Locally on the CPU with the Cheshire Cat
Example minimal setup for running a quantized version of Llama2 locally on the CPU with the Cheshire Cat.
In this article, I want to show you how you can integrate a local Large Language Model (LLM) with the Cheshire Cat. A guide for serving an LLM behind a REST API server already exists:
Yet, the aforementioned guide is meant to be agnostic with respect to the Operating System (OS), the LLM and the inference framework. Therefore, in this tutorial, I would like to show you a specific, practical example of chatting with a quantized version of Llama2–7B using the Cheshire Cat.
The general idea is to develop a very simple REST API server and serve the LLM behind an inference endpoint. The Cheshire Cat will send an HTTP POST request to this endpoint, triggering the model to answer the user’s message. The endpoint will then return the LLM’s answer to the Cat.
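To make the idea concrete, here is a minimal sketch of such a server, assuming FastAPI for the REST API and llama-cpp-python as the inference framework; the endpoint name, request schema, and model path are illustrative placeholders and not necessarily what the rest of this article uses.

```python
# Minimal sketch: a FastAPI server exposing one POST endpoint that runs a
# quantized Llama2 model on the CPU via llama-cpp-python.
# The model path, endpoint name, and request schema are illustrative choices.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load a quantized Llama2-7B checkpoint in GGUF format (path is a placeholder)
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

class Message(BaseModel):
    text: str

@app.post("/generate")
def generate(message: Message):
    # Run inference on the incoming prompt and return the completion text
    output = llm(message.text, max_tokens=256)
    return {"text": output["choices"][0]["text"]}
```

Started with something like `uvicorn server:app`, a server along these lines would receive the Cat’s POST requests and reply with the model’s completion.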
System Specifications
In this example, I’ll be using an Ubuntu 22.04 machine with an Intel® Core™ i7–8565U CPU @ 1.80GHz × 8…