Run a Multimodal Model Locally

Ingrid Stevens
6 min read · Dec 13, 2023


Running local multimodal models opens up a whole host of possibilities, especially because you avoid the privacy concerns and extra costs that come with running GPT-4V.

VS Code | Using LM Studio Inference Server to run a 100% Local Vision Model (llava) to describe an image

Running local multimodal (vision & text) models is very promising since it enables you to develop privately, without exposing images or information to OpenAI.

LM Studio is an application that lets you easily chat with local models in a beautiful GUI, and it takes things up a notch by letting you run a local inference server. This means you can mimic the OpenAI API (and, since v0.2.9, the vision API) using any local vision model!

LM Studio: Local Inference Server (behaves like OpenAI’s API but lets you connect to a local model)

In this Story: How To Set Up LM Studio to Run Multimodal Models Locally

In this guide, we’ll use the PsiPi/liuhaotian_llava-v1.5-13b-GGUF Q5_K_M model to show you how. However, there are other vision models you can try out as well!

Join me as we go step by step through how to install LM Studio, download a vision model, start the local inference server, and get your code executing completion requests for both local and web-based images!

Step 1: Install LM Studio

Make sure you have installed the LM Studio app.

Step 2: Download any Vision Model in LM Studio

Download a vision model in LM Studio. A list of vision models supported by LM Studio is available on HuggingFace.
On HF, they say to “Download the ‘mmproj’ model file as well as one of the primary model files.”

For details, see the screenshot (below) from LM Studio, where I search for the BakLLaVA files:

  • The “Vision Adapter” is the ‘mmproj’ file
  • I’ve chosen the Q5_K_M model, as it is described as having a “slight loss of quality” and is recommended
LM Studio: Vision

Step 3: Start the Local Inference Server in LM Studio

In the Server tab, press “Start Server”.
You can copy the code from the vision (Python) tab, or use the code from this article if you want to try it out with local images.
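
Before running the full example, you can optionally confirm the server is reachable. This is a minimal sketch, assuming LM Studio’s default address and port (http://localhost:1234) and that your version exposes the OpenAI-compatible /v1/models endpoint:

# Optional: quick check that the local inference server is running
import requests

response = requests.get("http://localhost:1234/v1/models")
print(response.status_code)  # expect 200 if the server is up
print(response.json())       # lists the model(s) currently loaded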

Step 4: Run the Code (I used VS Code for this step)

Thank you to LM Studio for providing the code!

Note: I’ve broken the code up for readability, but the entire code chunk is available in one cell at the very end of this article.

I run the code in a Jupyter Notebook in VS Code. You can also run it as a single Python file.

1: Import Libraries:

First, you’ll need to pip install openai

# Adapted from OpenAI's Vision example

# Import necessary libraries
from openai import OpenAI
import base64
import requests

2: Initialize the client:

# Initialize OpenAI client with a custom local server and a placeholder API key (not needed for local server)
# Note: The API key is not required for a local server, as indicated by "not-needed"
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

3: Link to some images (I used kitteis.jpeg for mine):

kitteis.jpg
# Set up the image URLs

# URL of an image hosted on the web
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Adelie_penguins_in_the_South_Shetland_Islands.jpg/640px-Adelie_penguins_in_the_South_Shetland_Islands.jpg"

# Local file path of an image
image_local = "kitteis.jpeg"

4: Set up a function that takes in either an image file path or URL:

The OpenAI API specifies that images must be base64 encoded. This function encodes the image irrespective of whether it comes from the web or from a local file on your computer.

def get_base_64_img(image):
    """
    Converts an image from either a local file or a URL to base64 encoding.

    Parameters:
    - image (str): The image source, which can be a local file path or a URL.

    Returns:
    str: Base64-encoded representation of the image.
    """

    # Check if the image is a local file or a URL
    if "http" not in image:
        # Local file: read the binary content, encode it in base64, and decode as UTF-8
        base64_image = base64.b64encode(open(image, "rb").read()).decode('utf-8')
    else:
        # File on the web: fetch the image content from the URL, encode it in base64, and decode as UTF-8
        response = requests.get(image)
        base64_image = base64.b64encode(response.content).decode('utf-8')

    # Return the base64-encoded image
    return base64_image
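
As a quick sanity check, you can call the function on either source and confirm you get back a plain base64 string. This is just an illustrative snippet reusing the image_url and image_local variables defined above (the local file needs to exist on your machine):

# Both sources should yield a base64 string that can be embedded in a data URL
encoded_local = get_base_64_img(image_local)  # from the local file
encoded_web = get_base_64_img(image_url)      # fetched from the web

print(len(encoded_local), encoded_local[:30])           # length and a short prefix
print(f"data:image/jpeg;base64,{encoded_web[:30]}...")  # the data-URL shape the API expects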

5: Create and run the completion request & pass in the image variable:

This is where you can set the temperature (currently at 0) and modify the prompt.

# Create completion request (replace the variable image_local with your image)

completion = client.chat.completions.create(
    model="local-model",  # not used by the local server, but required by the client
    temperature=0,  # set the temperature
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        # swap image_local for whichever variable points to your image
                        "url": f"data:image/jpeg;base64,{get_base_64_img(image_local)}"
                    },
                },
            ],
        }
    ],
    max_tokens=1000,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
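
If you’d rather receive the whole answer at once instead of streaming tokens, the same request works with streaming disabled. Here’s a minimal variant using the same client, model placeholder, and image variables as above:

# Non-streaming variant: wait for the full response, then print it once
completion = client.chat.completions.create(
    model="local-model",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{get_base_64_img(image_local)}"
                    },
                },
            ],
        }
    ],
    max_tokens=1000,
    stream=False,
)

print(completion.choices[0].message.content)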

Results

Web Image Result:

Counting penguins can be challenging 🧐🐧

This image features a group of four penguins standing on the snow, possibly in Antarctica. They are positioned close to each other and appear to be looking at something or someone. The penguins are surrounded by a beautiful landscape with mountains in the background, creating an icy and picturesque scene.

Local Image Result:

We can count cats! 🐱🎉

This image is a cartoon drawing of three cat-like characters sitting at a counter in a space-themed cafe. The cats are wearing astronaut outfits and appear to be enjoying their time together.
The cafe has a dining area with two chairs, one on the left side and another on the right side of the scene. There is also a bench in the background. Various items can be seen on the counter, such as bottles, cups, and bowls. A couch is located near the left edge of the image, adding to the cozy atmosphere of the space-themed cafe.

All the Code in one Cell

Here’s all the code in one cell, for your copying convenience:

# Thanks to LM Studio for most of this code
# Adapted from OpenAI's Vision example

!pip install openai  # the leading "!" is for notebooks; in a plain script, install openai from your shell instead

# Import necessary libraries
from openai import OpenAI
import base64
import requests

# Initialize OpenAI client with a custom local server and a placeholder API key (not needed for local server)
# Note: The API key is not required for a local server, as indicated by "not-needed"
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Set up the image URLs

# URL of an image hosted on the web
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Adelie_penguins_in_the_South_Shetland_Islands.jpg/640px-Adelie_penguins_in_the_South_Shetland_Islands.jpg"

# Local file path of an image
image_local = "kitteis.jpeg"


# define the get_base_64_img function (this function is called in the completion)
def get_base_64_img(image):
    """
    Converts an image from either a local file or a URL to base64 encoding.

    Parameters:
    - image (str): The image source, which can be a local file path or a URL.

    Returns:
    str: Base64-encoded representation of the image.
    """

    # Check if the image is a local file or a URL
    if "http" not in image:
        # Local file: read the binary content, encode it in base64, and decode as UTF-8
        base64_image = base64.b64encode(open(image, "rb").read()).decode('utf-8')
    else:
        # File on the web: fetch the image content from the URL, encode it in base64, and decode as UTF-8
        response = requests.get(image)
        base64_image = base64.b64encode(response.content).decode('utf-8')

    # Return the base64-encoded image
    return base64_image

# Create completion request (swap in image_url, image_local, or your own image variable)

completion = client.chat.completions.create(
    model="local-model",  # not used by the local server, but required by the client
    temperature=0,  # set the temperature
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        # swap image_url for whichever variable points to your image
                        "url": f"data:image/jpeg;base64,{get_base_64_img(image_url)}"
                    },
                },
            ],
        }
    ],
    max_tokens=1000,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

With LM Studio’s Local Inference Server and Vision Models, you can seamlessly run vision tasks locally. This powerful combination not only simplifies interaction with local models through a user-friendly GUI but also enables the emulation of the OpenAI API and vision API using your own preferred local vision model.
