Ollama & Ollama in Windows

DhanushKumar
7 min read · Mar 3, 2024


Ollama primarily refers to a framework and library for working with large language models (LLMs) locally. It is a lightweight and extensible framework that lets you easily run models such as Llama 2, Mistral, and Gemma on your own computer. This is useful for developers who want to experiment with LLMs and for researchers who want to study their behavior in a controlled environment.

What is Ollama?

Ollama lets you download open-source models and use them locally. It fetches models from the Ollama model library and, if your computer has a dedicated GPU, it uses GPU acceleration automatically without requiring manual configuration. Customizing a model is as simple as modifying its prompt, and LangChain is not a prerequisite for this. Additionally, Ollama is available as a Docker image, allowing you to deploy your personalized model as a Docker container.
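If you prefer containers, the official Docker image can be used instead of the native installer. A minimal sketch, assuming Docker is installed and using the ollama/ollama image from Docker Hub:

# Start the Ollama server in a container, persisting models in a volume
# and exposing the default API port.
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the running container.
docker exec -it ollama ollama run llama2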

Framework

Ollama provides a lightweight and user-friendly way to set up and run various open-source LLMs on your own computer. This eliminates the need for complex configuration or reliance on external servers, making it ideal for various purposes:

  • Development: It allows developers to experiment and iterate quickly on LLM projects without needing to deploy them to the cloud.
  • Research: Researchers can use Ollama to study LLM behavior in a controlled environment, facilitating in-depth analysis.
  • Privacy: Running LLMs locally ensures that your data never leaves your machine, which is crucial for sensitive information.

Ollama Framework:

  • Simple Setup: Ollama eliminates the need for complex configuration files or deployments. Its Modelfiles define the necessary components like model weights, configurations, and data, simplifying the setup process.
  • Customization: Ollama allows you to customize the LLM experience. You can adjust runtime parameters such as temperature, context window size, and other sampling settings to tune models for your specific needs (a short sketch follows this list).
  • Multi-GPU Support: Ollama can leverage multiple GPUs on your machine, resulting in faster inference and improved performance for resource-intensive tasks.
  • Extensible Architecture: The framework is designed to be modular and extensible. You can easily integrate your own custom modules or explore community-developed plugins for extending functionalities.
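As a rough sketch of what such customization looks like in practice, the local REST API accepts an options object on generate requests (the exact set of supported options depends on the model and Ollama version):

import requests

# Ask a local model for a completion with custom sampling options.
# Assumes the Ollama server is running on the default port and "llama2" has been pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Summarize what a context window is in one sentence.",
        "stream": False,
        "options": {
            "temperature": 0.2,  # lower temperature -> more deterministic output
            "num_ctx": 4096,     # context window size in tokens
            "top_p": 0.9,        # nucleus sampling cutoff
        },
    },
)
print(response.json()["response"])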

Library:

Ollama gives you easy access to a library of pre-trained language models, such as:

  • Llama 2: Meta’s open large language model, capable of tasks like text generation, translation, and question answering.
  • Mistral: A 7B-parameter general-purpose model from Mistral AI, trained on a large dataset of text and code.
  • Gemma: A family of lightweight open models from Google, with instruction-tuned variants designed for dialogue.
  • LLaVA: A multimodal model that accepts both images and text, suited to visual chat and instruction-following use cases.

This library allows you to integrate pre-trained models into your applications without training them from scratch, saving time and resources.
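Once models have been pulled, you can check what is available locally with ollama list, or programmatically. A small sketch against the local API, which exposes installed models on the /api/tags endpoint:

import requests

# List the models currently downloaded on this machine via the local Ollama API.
tags = requests.get("http://localhost:11434/api/tags").json()
for model in tags.get("models", []):
    print(model["name"])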

Features Of Ollama

Ease of Use:

  • Simple Installation: Ollama utilizes pre-defined “Modelfiles” that eliminate complex configurations, making installation and setup accessible even for users with limited technical expertise.
  • User-Friendly API: Ollama interacts with pre-trained models through a straightforward API, allowing developers to easily integrate LLMs into their Python applications.
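For instance, here is a minimal sketch using the official ollama Python client (an assumption: it is installed with pip install ollama, and llama2 has already been pulled):

import ollama  # official Ollama Python client

# Send a single chat message to a locally running model and print the reply.
reply = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Explain what a Modelfile is in one sentence."}],
)
print(reply["message"]["content"])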

Extensibility:

  • Customizable Models: Ollama allows adjustments to various parameters, enabling users to fine-tune models for specific tasks and preferences.
  • Modular Architecture: The framework supports custom modules and community-developed plugins, facilitating extensibility and customization based on individual needs.

Powerful Functionality:

  • Pre-trained Models: Ollama provides a library of pre-trained LLMs capable of numerous tasks, like text generation, translation, question answering, and code generation.
  • Local Execution: LLMs run entirely on your machine, eliminating the need for cloud deployments and ensuring data privacy.
  • Multi-GPU Support: Ollama leverages multiple GPUs for faster inference and improved performance on resource-intensive tasks.

Open-source and Collaborative:

  • Freely Available: Ollama’s open-source nature allows anyone to contribute to its development and benefit from community-driven improvements.
  • Continuously Evolving: Ollama is actively maintained, with ongoing updates and enhancements released regularly.

Additional features:

  • Lightweight: Ollama operates efficiently, making it suitable for computers with limited hardware resources.
  • Offline Capabilities: Once a model has been downloaded, it can be used without an internet connection, providing flexibility and accessibility.

Ollama in Windows:

Ollama is now available on Windows in preview, making it possible to pull, run and create large language models in a new native Windows experience. Ollama on Windows includes built-in GPU acceleration, access to the full model library, and the Ollama API including OpenAI compatibility.

Hardware acceleration

Ollama accelerates running models using NVIDIA GPUs as well as modern CPU instruction sets such as AVX and AVX2 if available. No configuration or virtualization required!

Full access to the model library

The full Ollama model library is available to run on Windows, including vision models. When running vision models such as LLaVA 1.6, images can be dragged and dropped into ollama run to add them to a message.
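Outside the interactive prompt, images can also be supplied through the API as base64-encoded strings. A rough sketch, assuming a vision model such as llava has been pulled and a local file named photo.jpg exists (the file name is just an example):

import base64
import requests

# Read a local image and send it, base64-encoded, to a vision model for description.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "Describe this image.",
        "images": [image_b64],
        "stream": False,
    },
)
print(response.json()["response"])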

Always-on Ollama API

Ollama’s API automatically runs in the background, serving on http://localhost:11434. Tools and applications can connect to it without any additional setup.

For example, here’s how to invoke Ollama’s API using PowerShell:

(Invoke-WebRequest -Method POST -Body '{"model":"llama2", "prompt":"Why is the sky blue?", "stream": false}' -Uri http://localhost:11434/api/generate).Content | ConvertFrom-Json

Ollama on Windows also supports the same OpenAI compatibility as on other platforms, making it possible to use existing tooling built for OpenAI with local models via Ollama.
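For example, a sketch using the openai Python package pointed at the local server (an API key is required by the client but is not checked by Ollama):

from openai import OpenAI

# Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(completion.choices[0].message.content)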

To get started with the Ollama on Windows Preview:

  • Download Ollama on Windows
  • Double-click the installer, OllamaSetup.exe
  • After installing, open your favorite terminal and run ollama run llama2 to run a model

Ollama will prompt for updates as new releases become available.

ollama serve:

This command starts the Ollama server, making the downloaded models accessible through an API. This allows you to interact with the models from various applications like web browsers, mobile apps, or custom scripts.

Here’s an analogy: think of Ollama as a library holding your books (LLMs). When you run ollama serve, it's like opening the library, making the books accessible for anyone to read (interact with) through the library's system (the API).
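A quick way to confirm the library is "open" is to hit the server's root URL, which replies with a short status message. A minimal sketch:

import requests

# The Ollama server answers with a short plain-text status on its root URL.
print(requests.get("http://localhost:11434").text)  # e.g. "Ollama is running"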

ollama run phi:

This command downloads (if necessary) and runs the “phi” model on your local machine. “phi” refers to Phi-2, a small pre-trained LLM from Microsoft that is available in the Ollama library and performs well for its size.

Here’s the analogy extension: If ollama serve opens the library, ollama run phi is like requesting a specific book (phi) from the librarian (Ollama) and then reading it (running the model) within the library (your local machine).

  • ollama serve only starts the API server; it expects models to have been downloaded beforehand. Use ollama pull <model_name> to download specific models.
  • ollama run phi downloads and runs the “phi” model specifically.
  • ollama serve is for providing access to downloaded models through an API, while ollama run phi focuses on running a single model locally.

General Commands:

  • ollama list: Lists all downloaded models on your system.
  • ollama rm <model_name>: Removes a downloaded model from your system.
  • ollama cp <model_name1> <model_name2>: Creates a copy of a downloaded model with a new name.
  • ollama show <model_name>: Displays information about a downloaded model, including its Modelfile and parameters.
  • ollama help: Provides help documentation for all available commands.

Model Management:

  • ollama pull <model_name>: Downloads a model from the Ollama model hub.

Running Models:

  • ollama run <model_name>: Runs a downloaded model locally.
  • ollama serve: Starts the Ollama server, making downloaded models accessible through an API.

Additional Commands:

  • ollama create <model_name> -f <Modelfile>: Creates a new model from a Modelfile.
  • ollama push <model_name>: Pushes a model you have created to a model registry.

Ollama via Langchain:

from langchain_community.llms import Ollama

llm = Ollama(model="llama2")

llm.invoke("Tell me a joke")
"Sure! Here's a quick one:\n\nWhy don't scientists trust atoms?\nBecause they make up everything!\n\nI hope that brought a smile to your face!"

To stream tokens, use the .stream(...) method:

query = "Tell me a joke"

for chunks in llm.stream(query):
    print(chunks)


Each streamed chunk is printed on its own line, so the response arrives a token or two at a time. Joined together, the streamed output reads:

Sure, here's one:

Why don't scientists trust atoms?

Because they make up everything!

I hope you found that amusing! Do you want to hear another one?

Multi-modal

Ollama has support for multi-modal LLMs, such as bakllava and llava.

ollama pull bakllava

Be sure to update Ollama to the most recent version, which includes multi-modal support.

from langchain_community.llms import Ollama

bakllava = Ollama(model="bakllava")

import base64
from io import BytesIO

from IPython.display import HTML, display
from PIL import Image


def convert_to_base64(pil_image):
    """
    Convert PIL images to Base64 encoded strings

    :param pil_image: PIL image
    :return: Re-sized Base64 string
    """
    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")  # You can change the format if needed
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


def plt_img_base64(img_base64):
    """
    Display base64 encoded string as image

    :param img_base64: Base64 string
    """
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


file_path = "../../../static/img/ollama_example_img.jpg"
pil_image = Image.open(file_path)
image_b64 = convert_to_base64(pil_image)
plt_img_base64(image_b64)

llm_with_image_context = bakllava.bind(images=[image_b64])
llm_with_image_context.invoke("What is the dollar based gross retention rate:")
'90%'


DhanushKumar

Data Science || Machine Learning ||Deep Learning|| Language Models || GenAI contact: danushidk507@gmail.com