Tolkien: Clojure library for accurate* token counting for OpenAI APIs

Łukasz Korecki
Sep 6, 2023



ChatGPT and OpenAI need no introduction: Large Language Models took parts of the tech world by storm. It feels like a lifetime ago, but this tech is relatively new and still exciting.

I found LLMs to be a force multiplier when it comes to adding features that automate common workflows, remove the tedium, and let users focus on the actually hard things. Things that were near impossible without a lot of resources, people, and time are now trivially implementable and relatively affordable. Want to figure out who needs to follow up after a meeting, based on a call transcript? No sweat.

Well, kinda… At some point, if the things you're working on get complicated enough, you're going to run into the context window size problem:

InvalidRequestError: This model’s maximum context length is 16385 tokens. 
However, your messages resulted in 19000 tokens.
Please reduce the length of the messages.

The limit determines how much of the input text the model can see and use to make predictions or generate responses.

So you start searching around trying to figure out what’s going on and how to fix this. And things start to click:

Large Language Models can only process a fixed-size amount of data, and the limit covers not only the input but also the output. So you're now in the business of counting tokens, and you quickly realize that you're better off always using the gpt-3.5-turbo-16k model. Maybe.
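
To make that concrete: the prompt and the response share the same context window, so whatever your prompt uses up is taken away from the output. A minimal sketch of that arithmetic, reusing the numbers from the error message above (response-budget is just an illustration, not part of any library):

;; A minimal sketch of the budget math: prompt and response share one
;; fixed context window. 16385 is gpt-3.5-turbo-16k's limit.
(defn response-budget
  "Tokens left over for the model's output, given the prompt size."
  [context-limit prompt-tokens]
  (- context-limit prompt-tokens))

(response-budget 16385 19000)
;; => -2615, i.e. the request from the error above simply can't fit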

I’m counting on you

OpenAI published some documentation about how to do this, and also made available a Python library called tiktoken.

It’s really useful — pick a model, pass some text — get the count. But… You’re not done yet…

If you poke around the repository — you’ll find this guide on how to actually count tokens for use with OpenAI’s Chat Completion API.

It's not as straightforward as you might think, and the guide doesn't account for all features of the API! You have to make sure to include all of these in your final count:

  • the tokens in the user input
  • the tokens in your prompt
  • message structure (role, content and other attributes)
  • function calls, including the JSON schema for the response
  • and a bunch of other things like the temperature parameter

So you hack some code to count all of these and you’re still way off. Why?

This thread posted on the OpenAI forums is quite enlightening: it turns out that if you use the function_call functionality, the calculations get very tricky.

Under the hood, it seems that the JSON schema and function call data gets turned into… TypeScript. At least that's what brave people who tried to reverse engineer and prompt-hack their way through this found. As you can guess, that's completely implementation-specific, so you're at the mercy of OpenAI and you can't guarantee accurate counts.

The good enough approach™

I'm using Clojure and can't use tiktoken without jumping through some hoops like setting up clj-python. Even then, accurate token counting is still left as an exercise for the reader, because you have to account for all of the bits above that go into the chat completion request.

First, I found the JTokKit Java library, which implements the same token counting algorithm as tiktoken.
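
Calling JTokKit from Clojure is plain Java interop. A minimal sketch, assuming the com.knuddels/jtokkit dependency is on the classpath:

(import '(com.knuddels.jtokkit Encodings)
        '(com.knuddels.jtokkit.api EncodingType))

;; Build the encoding registry once and grab cl100k_base, the encoding
;; used by the GPT-3.5 and GPT-4 chat models.
(def registry (Encodings/newDefaultEncodingRegistry))
(def cl100k (.getEncoding registry EncodingType/CL100K_BASE))

;; Counting tokens for a plain string is a single interop call.
(.countTokens cl100k "bananas, apples, raspberries")
;; => an integer token count

That covers a raw string; the rest of the request structure is what needs the extra accounting.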

From there, I gathered all of the information I could, and after some trial and error arrived at a solution with a pretty low margin of error when calculating the token count for the requests we're making against the Chat Completion API.
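
The sketch below is not Tolkien's actual implementation, just the general shape of that accounting for the :messages part of the payload; the per-message overhead constants come from OpenAI's cookbook guidance for gpt-3.5-turbo, and the function/schema overhead (the genuinely tricky part) is deliberately left out:

;; Cookbook-style counting for :messages, using a JTokKit encoding such
;; as the `cl100k` value above. Function/JSON schema overhead is not
;; handled here; that's the part that needed trial and error.
(def tokens-per-message 3) ;; fixed overhead for every message
(def tokens-per-name 1)    ;; an optional :name field costs one extra
(def reply-priming 3)      ;; every reply is primed with <|start|>assistant<|message|>

(defn count-message-tokens [encoding messages]
  (+ reply-priming
     (reduce (fn [total {:keys [role content] :as msg}]
               (+ total
                  tokens-per-message
                  (.countTokens encoding (name role))
                  (.countTokens encoding (str content))
                  (if-let [n (:name msg)]
                    (+ tokens-per-name (.countTokens encoding n))
                    0)))
             0
             messages)))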

Token count growing like grass with the mass appeal

This is where Tolkien comes in. It uses JTokKit under the hood for token counting, and implements its own layer for estimating the token count of the whole request payload.

Usage is pretty straightforward:

(require '[tolkien.core :as tokenizer])

(tokenizer/count-chat-completion-tokens
  {:model "gpt-3.5-turbo"
   :messages [{:role "system"
               :content "You're a helpful assistant. You can count things"}
              {:role "user"
               :content "bananas, apples, raspberries"}]
   :functions [{:name :count_items
                :parameters {:type "object"
                             :properties {:explanation {:type "string"}
                                          :result_type {:type "string"
                                                        :enum ["number" "list"]}
                                          :result {:type "integer"}}
                             :required ["explanation" :result_type :result]}}]
   :function_call {:name :count_items}})
;; => 79

Note that it handles both keywords and strings, and, as it happens, the count is exactly what the OpenAI GPT-3.5 model reports for this request.

What’s next?

This is only the first step. Now that you're armed with a tool that can quite reliably estimate your whole prompt size, you can dive deep into the wonderful world of prompt engineering and play what I call "prompting golf": be as prescriptive as possible, but with the smallest input possible. Brex's guide is a great resource, but treat it as a jumping off point.

A real life example

One of the features I implemented recently is call transcript summarization. Transcripts can be long, so with the help of Tolkien and the Clojure REPL, I was able to go through the following process:

1. Just use all of the data

Sending the prompt and the transcript as a plain-text table, as per Brex's recommendation.

It worked in simple cases, but it blew up the context size for longer call transcripts, with the complete request size hovering around 20k tokens. One way around this would be to get access to the gpt-4-32k model and throw money at the problem, at least until the 32k limit is exceeded. Plus, the jump in cost is pretty considerable.

2. Shrink the data without changing the format

Then I "compressed" the transcript by stitching consecutive parts said by the same participant into a single item, and the combined prompt shrank to 15k tokens. That's cutting it close: it leaves only about 1k tokens for the response, so I wasn't comfortable with that.

The next step was to remove parts of the transcript that didn't contain anything useful, like "hello" or "mhms", and then apply the "compression". In some cases that reduced the context to 14k tokens, but it wasn't reliable enough.
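
The transcript's data model isn't shown here, so assume a vector of {:speaker ..., :text ...} maps; a sketch of both tricks, dropping the filler and stitching consecutive utterances, could look like this:

(require '[clojure.string :as str])

;; Assumes a transcript shaped like [{:speaker "user-123" :text "..."} ...];
;; the real data model is an assumption here.
(def filler? #{"hello" "hi" "mhm" "mhms" "yeah" "ok"})

(defn compress-transcript [transcript]
  (->> transcript
       ;; drop parts that don't contain anything useful
       (remove #(filler? (str/lower-case (str/trim (:text %)))))
       ;; stitch consecutive parts by the same participant into a single item
       (partition-by :speaker)
       (map (fn [parts]
              {:speaker (:speaker (first parts))
               :text    (str/join " " (map :text parts))}))))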

3. Keep compressed data, but change the format

Writing prompts is a weird art: it kinda looks like programming, but it also requires you to assume the role of the world's biggest micromanager. I ended up explaining that the transcript is formatted as a list of:

<user ID alias, integer>: <text, possibly multiline>

and ditched the Markdown-formatted tables.

The ID alias is a simple technique: rather than sending "raw" user IDs, I aliased them as integers, then un-aliased them back to real user IDs based on the data in the AI's response.
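
A sketch of that technique, again assuming the {:speaker ..., :text ...} shape from above, with hypothetical alias-speakers and format-transcript helpers:

(require '[clojure.string :as str])

;; Hypothetical helpers: alias raw user IDs as small integers before
;; building the prompt, and keep the reverse map to decode the response.
(defn alias-speakers [transcript]
  (let [ids     (distinct (map :speaker transcript))
        id->int (zipmap ids (rest (range)))] ;; 1, 2, 3, ...
    {:id->int id->int
     :int->id (zipmap (vals id->int) (keys id->int))}))

;; Render each item as "<alias>: <text>", the format described above.
(defn format-transcript [transcript id->int]
  (str/join "\n"
            (map (fn [{:keys [speaker text]}]
                   (str (id->int speaker) ": " text))
                 transcript)))

When the response comes back referring to, say, participant 2, the :int->id map turns that back into the real user ID.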

A couple more tricks later, I was able to fit the longest transcripts at hand into 8k tokens (prompt + function call + JSON schema included), use the GPT-3.5 model with a high level of accuracy, and leave GPT-4 for bigger and more demanding tasks.

More approaches?

If you’re still struggling, having the token counter is essential — just so that you can find your way through implementing various chunking algorithms and recursive summarization.
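
As a jumping off point, here is a sketch of greedy chunking by token budget; count-fn stands for any string-to-token-count function (JTokKit's countTokens wrapped in a fn, for example), not a specific Tolkien API:

;; Greedily pack pieces (transcript lines, paragraphs, ...) into chunks
;; that each fit within `budget` tokens, according to `count-fn`.
;; A single oversized piece still becomes its own chunk; splitting it
;; further is left out of this sketch.
(defn chunk-by-token-budget [count-fn budget pieces]
  (reduce (fn [chunks piece]
            (let [current   (peek chunks)
                  candidate (if (empty? current) piece (str current "\n" piece))]
              (if (<= (count-fn candidate) budget)
                (conj (pop chunks) candidate)
                (conj chunks piece))))
          [""]
          pieces))

Each chunk can then be summarized on its own, and the summaries summarized again: recursive summarization.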

Have fun!

And if you find any issues, or want to improve Tolkien — head to GitHub.

If you found this article helpful, you may also like what we do at Collie.

Collie helps engineering managers have better conversations and make the most out of every meeting.
