Taking a look at the OpenAI Assistants API

lc
7 min read · Dec 27, 2023


Today I took the time to analyse the OpenAI Assistants API to better understand its offering and how it could help in future projects.


Creating an Assistant

An Assistant is an instance that wraps several details together, the most important being the tools. The available tools are Function Calling, Code Interpreter and Knowledge Retrieval (RAG).

Function calls are triggers for external features that live outside the assistant and the LLM itself, API calls being the most obvious example.

A brief detour on function calling

Function calling in the OpenAI SDK is also available in other open source initiatives (such as LocalAI) and, if I understood it right, it is a specialization of the LLM in identifying which function to call, which parameters to pass and, perhaps, which response to expect. Anyway, the docs show an example like this:

// from the OpenAI function call docs
// https://platform.openai.com/docs/guides/function-calling

import OpenAI from "openai";
const openai = new OpenAI();

// Example dummy function hard coded to return the same weather
// In production, this could be your backend API or an external API
function getCurrentWeather(location, unit = "fahrenheit") {
  if (location.toLowerCase().includes("tokyo")) {
    return JSON.stringify({ location: "Tokyo", temperature: "10", unit: "celsius" });
  } else if (location.toLowerCase().includes("san francisco")) {
    return JSON.stringify({ location: "San Francisco", temperature: "72", unit: "fahrenheit" });
  } else if (location.toLowerCase().includes("paris")) {
    return JSON.stringify({ location: "Paris", temperature: "22", unit: "fahrenheit" });
  } else {
    return JSON.stringify({ location, temperature: "unknown" });
  }
}

async function runConversation() {
  // Step 1: send the conversation and available functions to the model
  const messages = [
    { role: "user", content: "What's the weather like in San Francisco, Tokyo, and Paris?" },
  ];
  const tools = [
    {
      type: "function",
      function: {
        name: "get_current_weather",
        description: "Get the current weather in a given location",
        parameters: {
          type: "object",
          properties: {
            location: {
              type: "string",
              description: "The city and state, e.g. San Francisco, CA",
            },
            unit: { type: "string", enum: ["celsius", "fahrenheit"] },
          },
          required: ["location"],
        },
      },
    },
  ];

  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo-1106",
    messages: messages,
    tools: tools,
    tool_choice: "auto", // auto is default, but we'll be explicit
  });
  const responseMessage = response.choices[0].message;

  // The docs example goes on to execute each requested tool call locally,
  // append the results as "tool" messages and send them back to the model.
  return responseMessage;
}

The interesting part is the tools definition. A function entry is basically a reference to a “local” function in the code, together with a schema describing its parameters.

— Since I am used to the LangChain Tools wrapper, most of this stuff looks new to me, and that’s why we are indulging a bit on the topic. We’ll get back to the Assistants API in a moment —

Looking at some examples in the cookbook, it’s also a matter of refining the prompt:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_n_day_weather_forecast",
            "description": "Get an N-day weather forecast",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the users location.",
                    },
                    "num_days": {
                        "type": "integer",
                        "description": "The number of days to forecast",
                    },
                },
                "required": ["location", "format", "num_days"],
            },
        },
    },
]

messages = []
# Here the example provides the constraints to the model
messages.append({"role": "system", "content": "Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous."})
messages.append({"role": "user", "content": "What's the weather like today"})

# chat_completion_request is a small helper defined in the cookbook notebook
chat_response = chat_completion_request(messages, tools=tools)
assistant_message = chat_response.json()["choices"][0]["message"]
messages.append(assistant_message)

print(assistant_message)

The Code Interpreter

Credits https://www.reddit.com/r/ProgrammerHumor/comments/135iokd/ai_is_taking_over/

As the name suggests, this is the tool taking jobs away from developers. It is triggered automatically based on the context and returns relevant content such as code in the reply or even downloadable files. A sample from the docs:

const assistant = await openai.beta.assistants.create({
  instructions: "You are a personal math tutor. When asked a math question, write and run code to answer the question.",
  model: "gpt-4-1106-preview",
  tools: [{"type": "code_interpreter"}]
});

Retrieval or “RAG”

As mentioned in the docs, enabling RAG is just a matter of adding it to the tools list:

const assistant = await openai.beta.assistants.create({
  instructions: "You are a customer support chatbot. Use your knowledge base to best respond to customer queries.",
  model: "gpt-4-1106-preview",
  tools: [{"type": "retrieval"}]
});

To use the retrieval feature, we have to upload the relevant files (max 512 MB and 2M tokens each) in one of the supported formats. As of today, Retrieval costs $0.20/GB per assistant per day, counting both Assistant and Thread attachments.
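For completeness, here is a minimal sketch of that upload flow using the Python SDK (the docs snippets above use Node). The file name support_faq.pdf is hypothetical, and at the time of writing files are attached through the file_ids parameter of the v1 beta API:

from openai import OpenAI

client = OpenAI()

# Upload the knowledge base file; purpose "assistants" makes it available to tools
file = client.files.create(
    file=open("support_faq.pdf", "rb"),  # hypothetical file name
    purpose="assistants",
)

# Attach the file when creating the assistant (v1 beta API uses `file_ids`)
assistant = client.beta.assistants.create(
    instructions="You are a customer support chatbot. Use your knowledge base to best respond to customer queries.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)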

Now that we have taken an in-depth look at the “tools” part, we are ready to create a new Assistant. Again, taking from the docs:

const assistant = await openai.beta.assistants.create({
  name: "Math Tutor",
  instructions: "You are a personal math tutor. Write and run code to answer math questions.",
  tools: [{ type: "code_interpreter" }],
  model: "gpt-4-1106-preview"
});

Adding a Thread

Good, we have the Assistant; now a new user wants to interact. What I would have called a session is called a Thread here. A Thread represents an interaction between a user and the Assistant and contains the list of messages they exchange.
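To make the flow concrete, here is a minimal sketch with the Python SDK, assuming the “Math Tutor” assistant created above is available as assistant (the message text is the one from the official quickstart):

import time

from openai import OpenAI

client = OpenAI()

# A Thread is the "session": it collects the messages exchanged between
# one user and the Assistant.
thread = client.beta.threads.create()

# Add the user's message to the Thread
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="I need to solve the equation 3x + 11 = 14. Can you help me?",
)

# A Run asks the Assistant (the Math Tutor created above) to process the Thread
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

# Runs are asynchronous: poll until completion, then read back the messages
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # the latest (assistant) message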

There is an interesting detail in the Threads part, one that has bitten me many times: fitting a long conversation into the context window.

One just pushes the whole exchange while it fits the context and then cries out loud when its size is exceeded. The simplest solution may seem to be a larger context size, but it’s not that simple. Here is a nice write-up (for later) on context size and performance.

The documentation clearly states that OpenAI will apply “optimization strategies” that have been extensively tested, such as truncation.
— I like to imagine that hard-coded truncation call somewhere in OpenAI’s code, ahah — In their words:

The Assistant will ensure that requests to the model fit within the maximum context window, using relevant optimization techniques such as truncation

How to handle exceeding the token limit?

— Ok, this is another detour from the topic, but since I am not so keen on locking myself into a single company’s offering, I like to see how people handle this out there. —

Let’s start with understanding what a token is in OpenAI’s words:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words

To summarize and save tokens: if you have a paragraph (and are not Italian), its length may be around 100 tokens. Also, do not miss the tokenizer:

https://platform.openai.com/tokenizer
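To count tokens programmatically instead of eyeballing the rules of thumb above, OpenAI’s tiktoken library can be used; a small sketch:

import tiktoken

# Pick the byte-pair encoding used by the model we are targeting
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "A paragraph of roughly seventy-five English words is about one hundred tokens."
tokens = enc.encode(text)
print(len(tokens))  # how many tokens this text will consume from the context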

Now that this is clear, I found a few different approaches to handle context length:

  • Truncation, chunking and similar “hard cut-off” approaches. I would also add some sort of “sliding window” over the chat history. This still seems to me the least effective approach at avoiding the loss of meaningful information.
  • Summarization, which involves an LLM in the process of extracting the meaningful information from the ongoing conversation, e.g. “Summarize schematically the ongoing discussion: {prompt}” (a sketch combining this with a sliding window follows after the list).
  • Removing redundant terms, leveraging NLP techniques to keep just the essential words needed to understand the concept.
  • RAG-ging, which would allow picking up just the relevant information from a conversation without altering it too much.
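As a rough illustration of the first two points combined, here is a sketch of a hypothetical compress_history helper that keeps a sliding window of recent messages and asks the model to summarize the older ones; this is just one possible implementation, not an official recipe:

from openai import OpenAI

client = OpenAI()

def compress_history(messages, keep_last=6, model="gpt-3.5-turbo"):
    """Keep the last `keep_last` messages verbatim (sliding window) and
    replace the older ones with a single LLM-generated summary."""
    if len(messages) <= keep_last:
        return messages

    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)

    summary = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Summarize schematically the ongoing discussion:\n{transcript}",
        }],
    ).choices[0].message.content

    # The summary replaces the old turns as a system note; recent turns stay intact
    return [{"role": "system", "content": f"Summary of the earlier conversation: {summary}"}] + recent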

Some other readings on handling long contexts and on the context length problem in general.

Adding messages to threads

The last relevant part is about sending messages. The most interesting bit is the media types supported by OpenAI’s multi-modal models, such as images. Unfortunately, as of writing, uploading images to Assistants messages is not supported, but images can be handled through the GPT-4 vision model (more on that below). Most of the features requiring file-based interactions rely on the File API.
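As a rough sketch of that File API flow with the Python SDK (hypothetical file name, the Thread created in the earlier example, and assuming the v1 beta messages endpoint, which accepts file_ids):

from openai import OpenAI

client = OpenAI()

# Upload the file first through the File API
file = client.files.create(
    file=open("quarterly_report.csv", "rb"),  # hypothetical file
    purpose="assistants",
)

# Then reference it when adding a message to the Thread (v1 beta API)
client.beta.threads.messages.create(
    thread_id=thread.id,  # the Thread created earlier
    role="user",
    content="Plot the revenue trend from the attached CSV.",
    file_ids=[file.id],
)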

About LLM multi-modality

ChatGPT 4 supports image explanation and contextualization; similar features are also available in open source models such as LLaVA, OpenFlamingo and others (with different capabilities).

a LLava response from https://llava-vl.github.io/

To send an image there are two options: encoding the image as a base64 string (really?) or providing a URL.

Below is a condensed example sending an image both as base64 and as a URL.

Notice the use of “detail” set to high (low is the cheaper alternative) to enable a more detailed (and more token-consuming) analysis of the image. In high mode the model first receives a low-resolution 512px version of the image, and then detailed crops of 512px squares, each passed to the model as well.

import base64

from openai import OpenAI

client = OpenAI()

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?",
                },
                # as base64
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                        # enable high res mode
                        "detail": "high",
                    },
                },
                # as URL
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])

Having this overview from the API perspective, let’s not miss the limitations and costs sections. Worth noticing: images are not persisted, so they need to be sent with each request, and the cost differs based on resolution and detail level. The limitations seem to relate to the low specialization of the model, and costs vary from 85 tokens for a low-res image up to 765 tokens for a square 1024px image in high detail (those costs may change in the future, check out their website).
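As a back-of-the-envelope helper, here is a sketch estimating the token cost of a single image following the tiling rules described in the vision guide (scale to fit 2048px, scale the shortest side to 768px, then count 512px tiles); treat it as an approximation, since the exact rules and prices may change:

import math

def vision_token_cost(width, height, detail="high"):
    """Estimate the token cost of one image for gpt-4-vision-preview."""
    if detail == "low":
        return 85  # low detail has a flat cost
    # 1. scale the image down to fit within a 2048x2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. scale down so that the shortest side is at most 768px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3. count the 512px tiles: 170 tokens each, plus a fixed 85 tokens
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(vision_token_cost(1024, 1024))                # 765, the figure quoted above
print(vision_token_cost(1024, 1024, detail="low"))  # 85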

Conclusions

Hopefully we now have a clearer understanding of the Assistants API. It is good to have conversations with users wrapped for us, without wasting too much time implementing session (or rather, Thread) handling and tools (such as RAG or API calls). As of today, vision is still not supported within Assistants, but it seems they have plans to add it.
