Llama 3: Running locally in just 2 steps

Renjith R Krishnan
6 min read · Apr 25, 2024


Llama-3 meets Windows!

In my previous article, I covered Llama-3’s highlights and prompting examples, using a hosted platform (IBM watsonx). In this article we will see how to quickly set up and run a Llama-3 model locally on a Windows machine, without needing WSL (Windows Subsystem for Linux) or GPUs. Similar instructions are available for Linux/Mac systems too.

To set up Llama-3 locally, we will use Ollama, an open-source framework that enables open-source Large Language Models (LLMs) to run locally on your computer.

Hardware prerequisites: The recommended system configuration for installing Ollama is given below.

CPU: Any modern CPU with at least 4 cores is recommended for running smaller models. For running 13B models, a CPU with at least 8 cores is recommended. A GPU is optional for Ollama, but if available it can improve performance drastically.

RAM: At least 8 GB of RAM to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Disk capacity: At least 12 GB of free disk space is recommended to install Ollama and the base models. Additional space will be required if you plan to install more models.

If your machine meets the recommended hardware configuration, proceed with the two-step setup process.

Step-1:

Head to Ollama’s download page to download the Ollama installation file. From here onwards, I will focus on Windows-based installation, but similar steps are available for Linux/macOS too. Please refer to the download page for OS-specific install instructions.

Once the Ollama Windows executable file (~212 MB in size) is downloaded, run it and follow the installation wizard to complete the installation.

Installation Wizard (Windows)

Once the installation is successful, you will see the Ollama icon in the taskbar. The Ollama server is now up and running, listening on HTTP port 11434 of your localhost to serve inference API requests. You can confirm the Ollama server status by opening the local URL http://localhost:11434/, which should show the status as below.

Ollama Server — Status
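If you prefer the terminal, the same check can be done with a quick curl call (assuming curl is available on your system); the server should reply with a short plain-text status message such as “Ollama is running”.

curl http://localhost:11434/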

Step-2:

Open a Windows terminal (Command Prompt) and execute the following Ollama command to run the Llama-3 model locally.

> ollama run llama3

Note that “llama3” in the above command is shorthand for the Llama-3 8B instruct model, which is well suited for running locally. If you need the larger 70B version of this model, you must specify llama3:70b in the run command, but note that it requires a considerably higher hardware configuration for acceptable performance. For the pre-trained base variants of the Llama-3 models, use llama3:text and llama3:70b-text respectively, as shown below.
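For reference, the run commands for those variants look like this (only attempt them if your hardware can handle the larger downloads):

> ollama run llama3:70b
> ollama run llama3:text
> ollama run llama3:70b-text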

When a new model is run for the first time, Ollama will start downloading that model to your local system, as shown below. This takes a few minutes, depending on the model size and your network bandwidth. The Llama-3 8B Instruct model is about a 4.7 GB download.

Ollama Downloading Model (Llama3)

Once the model is downloaded, Ollama is ready to serve it by taking prompt messages, as shown above.

Note that for any subsequent “run” commands, Ollama will use the local model. Windows users can find the downloaded model files at the following location.

“C:\Users\<username>\.ollama\models”.

Use the “list” command to find the available local models on your machine.

> ollama list

NAME             ID              SIZE      MODIFIED
llama3:latest    a6990ed6be41    4.7 GB    36 hours ago

If an upgraded version of the model is released in Ollama and you would like to refresh your local copy to the latest, use the “pull” command. This will download only the changed (delta) layers.

> ollama pull llama3

At any point in time, if you would like to remove the model from your local installation, use the “rm” command.

> ollama rm llama3

Prompting the local Llama-3

Now that we have completed the Llama-3 local setup, let us see how to execute our prompts. There are three ways to execute prompts with Ollama. Let us look at them one by one.

Command line: This is the simplest of all the options. As we saw in Step-2, after the run command, the Ollama command line is ready to accept prompt messages. We can type the prompt message there to get Llama-3’s responses, as shown below. To exit the conversation, type the command /bye.

Ollama command-line — Request & Response
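For illustration, a typical command-line session looks roughly like this (the example prompt and the response placeholder are mine, not actual model output):

> ollama run llama3
>>> Tell me a fact about llamas.
(Llama-3’s response streams here)
>>> /bye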

REST API (HTTP request): As we saw in Step-1, Ollama is ready to serve inference API requests on local HTTP port 11434 (the default). You can hit the inference API endpoint with an HTTP POST request containing the prompt message payload. Here is an example of a curl request for the prompt “Tell me a fact about Llama?”. The response taught me something new about the llama’s communication system and vast vocabulary. The llama must be a logophile :-)

curl -X POST http://localhost:11434/api/generate -d "{\"model\": \"llama3\",  \"prompt\":\"Tell me a fact about Llama?\", \"stream\": false}"

{"model":"llama3","created_at":"2024-04-24T21:22:07.5071017Z","response":"Here's one:\n\nLlamas have a unique communication system that involves over 30 different vocalizations, including soft humming sounds, loud screaming calls, and even a \"banana-like\" sound to alert others of potential threats. They are also known for their ability to produce a range of facial expressions, which can help them convey emotions and intentions to other llamas!","done":true,"context":[128006,882,128007,198,198,41551,757,264,2144,922,445,81101,30,128009,128006,78191,128007,198,198,8586,596,832,512,198,43,24705,300,617,264,5016,10758,1887,430,18065,927,220,966,2204,26480,8200,11,2737,8579,87427,10578,11,17813,35101,6880,11,323,1524,264,330,88847,12970,1,5222,311,5225,3885,315,4754,18208,13,2435,527,1101,3967,369,872,5845,311,8356,264,2134,315,28900,24282,11,902,649,1520,1124,20599,21958,323,34334,311,1023,9507,29189,0,128009],"total_duration":42119164300,"load_duration":6928089900,"prompt_eval_count":18,"prompt_eval_duration":4440385000,"eval_count":74,"eval_duration":30746057000}

Note that I set the “stream” flag to “false” in the curl request to get the whole response at once. The default value of “stream” is true, in which case you will receive a stream of multiple responses, each carrying a few tokens of the result. In the final response of the stream, the “done” attribute is returned as “true”.
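To give an idea of how the streaming mode can be consumed programmatically, here is a minimal sketch in Python using the requests package (an assumption on my part; any HTTP client that can read the response line by line would work). Each streamed line is a JSON object carrying a piece of the text, and the final object has “done” set to true.

import json
import requests

payload = {"model": "llama3", "prompt": "Tell me a fact about Llama?"}

# stream=True lets us read the newline-delimited JSON objects as they arrive
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # print each piece of text
        if chunk.get("done"):
            print()  # the final object carries "done": true
            break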

Programmatic execution using LangChain: This is the most powerful way of running inference against Ollama models, as it provides flexibility and integration with data sources. I assume you are already familiar with setting up a virtual environment for installing and executing Python packages, using either venv or conda. If not, you can follow the instructions for venv setup or conda setup to create a Python virtual environment.
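For example, a minimal venv setup on Windows could look like this (the environment name llama-env is an arbitrary choice of mine):

python -m venv llama-env
llama-env\Scripts\activate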

Once you have the virtual environment created, install the langchain dependencies in that environment using pip, as shown below.

pip install langchain-community

Now you can run Ollama LLMs using the Python code below.

from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
prompt = "Tell me a joke about llama"
result = llm.invoke(prompt)
print(result)
# 'Why did the llama go to the party?\n\nBecause it was a hair-raising experience!'

This will execute the prompt and display the results (as shown in comments). This time I wanted Llama3 to crack a joke about itself. Looking at the results, it wasn’t a bad attempt :-)

Note that the llm.invoke() method returns the result all at once, when token generation is complete. In my case, it took ~25 s to get the response. A more responsive experience can be achieved using the streaming method, llm.stream(). Let us look at that.

from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
prompt = "Tell me a joke about llama"

for chunk in llm.stream(prompt):
    print(chunk, end="")

# Here's one:
# Why did the llama go to the party?
# Because it was a hair-raising experience!
# Hope that made you smile!

In this approach, the results (shown in the comments) started streaming at about 5 s into the execution, which is certainly a better user experience. Note that I have set the “end” parameter of the print() function to an empty string. By default, print() adds a newline character at the end, which would cause each token in the response to appear on a new line instead of as a continuous string. Overriding the trailing newline with an empty string prints the response as one continuous string.

With this local Llama-3 setup, you can now create personalized assistants or automation agents that run locally without external dependencies. In the next article we will develop a simple personalized assistant that takes advantage of this local Llama-3 setup. See you there. Thanks for reading!
