Running an LLM (Large Language Model) Locally with KoboldCPP

Ahmet
3 min read · Jul 21, 2023


Introduction

In this tutorial, we will demonstrate how to run a Large Language Model (LLM) on your local machine using KoboldCPP. Even if you have little to no prior experience with LLMs, you will be able to get one running successfully.

KoboldCPP is a self-contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint. What does that mean? You get llama.cpp with a fancy UI, persistent stories, editing tools, save formats, memory, world info, author’s note, characters, scenarios, and everything Kobold and Kobold Lite have to offer. In short, KoboldCPP lets you run a GGML model locally without having to set up llama.cpp yourself.

A large language model (LLM) is a type of machine learning model that can perform a variety of natural language processing (NLP) tasks, including generating and classifying text, answering questions in a conversational manner, and translating text from one language to another. Examples of LLMs include BERT, Falcon 40B, GPT-3.5, and Llama.

Let’s get started.

Setting up the Environment

First, we need to download KoboldCPP.

You can download the latest version of it from the following link: https://github.com/LostRuins/koboldcpp/releases.

Once the download finishes, move KoboldCPP into a new folder; we will place our LLM model there as well.

Next, we need to pick a model and download a GGML version of it into that same folder. You can find GGML models on https://huggingface.co.

Here are my recommendations:

Guanaco-7B-GGML: Requires 7 GB of RAM to run on your computer. Direct download link: https://huggingface.co/TheBloke/guanaco-7B-GGML/resolve/main/guanaco-7B.ggmlv3.q5_1.bin

Nous-Hermes-13B-GGML: Requires 12.26 GB of RAM to run on your computer. Direct download link: https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/resolve/main/nous-hermes-13b.ggmlv3.q5_1.bin

WizardLM-30B-GGML: Requires 27 GB of RAM to run on your computer. Direct download link: https://huggingface.co/TheBloke/WizardLM-30B-GGML/resolve/main/wizardlm-30b.ggmlv3.q5_1.bin

Don’t forget to place the downloaded models into the same folder as koboldcpp.exe.
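
If you prefer to script the download, here is a minimal Python sketch (standard library only) that fetches the Guanaco-7B file from the direct link above; swap in the URL and filename of whichever model you picked:

    import os
    from urllib.request import urlretrieve

    # Direct download link for the Guanaco-7B GGML file recommended above.
    url = ("https://huggingface.co/TheBloke/guanaco-7B-GGML/"
           "resolve/main/guanaco-7B.ggmlv3.q5_1.bin")

    # Save it next to koboldcpp.exe so it is easy to find later.
    dest = os.path.join(os.getcwd(), "guanaco-7B.ggmlv3.q5_1.bin")

    print(f"Downloading {url} ...")
    urlretrieve(url, dest)
    print(f"Saved to {dest}")

Run the script from the folder that contains koboldcpp.exe, or adjust the destination path accordingly.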

Running the LLM Model with KoboldCPP

  1. First, launch koboldcpp.exe. This will open a settings window.
  2. In the settings window, check the boxes for “Streaming Mode” and “Use SmartContext”.
  3. Point to the downloaded model .bin file.
  4. In the “Threads” field, enter the number of CPU cores your machine has. On Windows you can check this by opening dxdiag, or use the short Python snippet below. If your CPU usage runs too high while generating, use fewer threads.

By following these steps, you’ll be able to run the LLM model with KoboldCPP efficiently.
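
If you don’t want to dig through dxdiag, Python can report the core count directly. This is a tiny sketch; note that os.cpu_count() returns logical cores, so on a CPU with hyper-threading the number of physical cores may be half of it:

    import os

    # Logical CPU cores visible to the OS; use this (or a slightly
    # lower number) for the “Threads” field in KoboldCPP.
    print(os.cpu_count())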

Once the model has loaded, KoboldCPP’s web UI will be available at http://localhost:5001/.

Click on “Scenarios,” select “New Instruct,” and click “Confirm”.

Then, write whatever you want and submit it.

Responses will be slow, but it works.

To adjust generation settings, open the Settings panel in the web UI at http://localhost:5001/ and make the changes you need. The most important options are:

Temperature: Controls randomness (higher for diversity, lower for coherence).

Max Tokens: Sets the maximum number of tokens (prompt plus generated text) kept in the model’s context window.

Top-p Sampling: Dynamically selects tokens based on cumulative probability (balances creativity and control).

Repetition Penalty: Discourages excessive word or phrase repetition in the output. Higher values (e.g., 1.2) minimize repetition, while lower values (e.g., 0.8) allow some repetition.

Amount to Generate: Controls the length of the model’s response in tokens. Specify the desired length (e.g., 100 tokens) for the generated text.
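
The same settings can also be sent programmatically through KoboldCPP’s Kobold-compatible API. The snippet below is a minimal sketch: it assumes the /api/v1/generate endpoint and the usual Kobold field names (max_length, temperature, top_p, rep_pen), which may differ slightly between versions, so check your build’s API documentation if the request fails:

    import json
    from urllib.request import Request, urlopen

    # Generation settings mirroring the UI options described above.
    payload = {
        "prompt": "Write a haiku about running LLMs locally.",
        "max_length": 100,   # Amount to Generate
        "temperature": 0.7,  # randomness
        "top_p": 0.9,        # nucleus (top-p) sampling
        "rep_pen": 1.2,      # repetition penalty
    }

    req = Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    # The Kobold API returns {"results": [{"text": "..."}]}.
    with urlopen(req) as resp:
        result = json.load(resp)

    print(result["results"][0]["text"])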

Conclusion

This tutorial has provided a step-by-step guide to running a Large Language Model (LLM) locally with KoboldCPP. Even with limited prior knowledge of LLMs, you can follow the steps outlined here to get a model running and experiment with its settings.

In summary, with KoboldCPP’s ease of use and the capabilities of LLM models, you have a powerful tool at your disposal to explore and leverage the potential of natural language processing for various tasks and applications.

The response may be slow, but we can still run LLMs locally on low-end PCs.
