Token efficiency with structured output from language models

Data Science at Microsoft · Jul 30, 2024

By Bryce Williams and Brendan Vande Kieft

In the ever-evolving field of Artificial Intelligence, the efficient utilization of tokens in generating structured outputs from Large Language Models (LLMs) is a key area of research. This article presents an analysis of methods to optimize token usage, particularly in the creation of JSON and YAML formats. It also introduces innovative constrained generation techniques that promise to revolutionize the way we approach structured data generation. By exploring the balance between token efficiency and output accuracy, this work lays the groundwork for more resourceful and effective use of language models in various applications.

Photo by Mika Baumeister on Unsplash.

Structured outputs from LLMs are critical for integrating generative AI into existing programmatic business applications. For example, AI shopping assistants must produce valid queries against search APIs and must structure customer selections for items, quantities, and shipping information to be processed by downstream ordering processes. Typically, the use of these existing APIs and processes requires the AI assistants to generate strictly formatted and fully specified structured objects. Some of these interactions are also under significant time pressure, either because data changes quickly (like stock information) or because response time is critical (like a customer-facing chatbot). Minimizing the size of the structured outputs reduces the number of generated tokens and the overall response time, improving the effectiveness of AI assistants.

Here, we explore four methods for creating structured outputs with GPT-4o in the Azure OpenAI Service: JSON message, JSON mode message, YAML message, and function calling. Each method uses the same base prompts to create a structured order object according to an order schema for a fictional restaurant based on a customer request. The schema adherence, order accuracy, and token usage for each response are evaluated and compared. In addition, we present an alternative approach using constrained generation through the guidance Python package and a local small language model (Phi-3-mini-4k-instruct).

Menu and schema

To enable the response demonstrations, we created a menu and order schema for the fictional restaurant “Contoso Burger.” Both the menu and the order schema are simplified when compared to most real-world applications but serve as examples for the concepts being explored.

Menu

Contoso Burger Menu

Burgers
Name: Hamburger or Cheeseburger
Size: 1/4 lb, 1/2 lb
Bun: Sesame, Pretzel
Cook: Normal, Well Done
Toppings: Lettuce, Tomato, Onion, Pickle, Bacon, Mayo, Ketchup, Mustard, Relish

Fries
Size: Small, Medium, Large

Drinks
Name: Cola, Diet Cola, Lemon-Lime, Root Beer
Size: Small, Medium, Large

Order schema

{
  "type": "object",
  "parameters": {
    "type": "object",
    "properties": {
      "order_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "Name of item to order."
            },
            "size": {
              "type": "string",
              "description": "Size of item to order."
            },
            "toppings": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {
                    "type": "string",
                    "description": "Name of topping to add to item."
                  },
                  "amount": {
                    "enum": ["none", "half", "normal", "double"],
                    "description": "Amount of topping to add to item."
                  }
                }
              }
            },
            "bun": {
              "type": "string",
              "description": "Type of bun to use for item, required for burgers"
            },
            "cook": {
              "type": "string",
              "description": "Cooking preference for item, required for burgers"
            },
            "quantity": {
              "type": "integer",
              "description": "Number of items to order"
            }
          }
        }
      }
    }
  }
}
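
For reference, the prompts that follow assume the menu and schema above are available in Python as menu_text (a plain string) and menu_json (a parsed dictionary). The short sketch below shows one way this setup might look; the file names are hypothetical placeholders, but the variable names match how they are used later.

>> import json
>>
>> # Assumed setup: the menu text and order schema above are stored in local files
>> # (file names here are placeholders, not from the original notebook)
>> with open("contoso_menu.txt") as f:
>>     menu_text = f.read()
>>
>> with open("order_schema.json") as f:
>>     menu_json = json.load(f)  # menu_json["parameters"] is referenced by the prompts below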

The system prompt and first user message are shown below. These prompts are utilized throughout the experiments to create the structured order outputs.

>> system_prompt = \
>> f"""
>> You are an order generation assistant for Contoso Burger. The user will
>> provide a specific order and you will generate a structured order
>> according to the provided schema.
>> Ensure that the order adheres to the schema in all cases.
>>
>> Use only the menu below to fulfill the user's requests with special
>> attention paid to the allowed sizes and options for each item. Ignore any
>> items not on the menu that the user may request.
>> {menu_text}
>> """
>>
>> user_prompt = \
>> """
>> Can I get a cheeseburger with pickles, ketchup, and mustard, 2 hamburgers
>> with ketchup and pickles, one with pretzel bun and one with sesame,
>> two small fries, two larger fries, a large cola, and 3 small diet colas?
>> """

Structured LLM responses

The system and user prompts above were used (with slight modifications in some cases) to generate a structured order object from a GPT-4o model deployed in an Azure OpenAI Service endpoint. Each returned object is loaded into a Python dictionary and compared to the others. The deepdiff Python package is used to compare the parsed orders from the different response modes; it checks the dictionaries for equality with more configurability than the default equality operator, in this case allowing list ordering to be ignored.
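
The calls below also assume an Azure OpenAI client and two constants, MODEL and TEMP, that are not shown explicitly in the article. A minimal sketch of that setup, with placeholder endpoint, key, and deployment name and an assumed temperature, might look like this:

>> import json
>> from openai import AzureOpenAI
>> from deepdiff import DeepDiff
>>
>> # Placeholder values; substitute your own endpoint, key, and GPT-4o deployment name
>> client = AzureOpenAI(
>>     azure_endpoint="https://<your-resource>.openai.azure.com/",
>>     api_key="<your-api-key>",
>>     api_version="2024-06-01",
>> )
>> MODEL = "gpt-4o"  # deployment name of the GPT-4o model
>> TEMP = 0          # assumed temperature for reproducible outputs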

JSON message

JSON output can be produced within the standard “message” response of the model by adding instructions to use the provided JSON schema for the response. To elicit output that follows the required JSON schema, the system prompt is modified with the addition of response requirements and the expected order schema.

>> json_system_prompt = system_prompt + \
>> f"""
>> Use the JSON schema below to generate a structured order. It is critical
>> to adhere to this schema in all cases.
>> DO NOT include ```json``` in your response.
>>
>> # Order Schema
>> {menu_json["parameters"]}
>> """
>>
>> json_response = client.chat.completions.create(
>> model=MODEL,
>> messages=[
>> {"role": "system", "content": json_system_prompt},
>> {"role": "user", "content": user_prompt}
>> ],
>> temperature=TEMP,
>> )
>> json_order = json.loads(json_response.choices[0].message.content)
>> json_order

{'order_items': [{'name': 'Cheeseburger',
'size': '1/4 lb',
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Pickle', 'amount': 'normal'},
{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Mustard', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'bun': 'Pretzel',
'cook': 'Normal',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Fries', 'size': 'Small', 'quantity': 2},
{'name': 'Fries', 'size': 'Large', 'quantity': 2},
{'name': 'Cola', 'size': 'Large', 'quantity': 1},
{'name': 'Diet Cola', 'size': 'Small', 'quantity': 3}]}

Upon inspection, the order produced by the JSON message response appears to meet the expectations set by the user prompt. For all responses, completion tokens are of most interest, and the JSON message response required 370 completion tokens.

>> json_response.usage

CompletionUsage(completion_tokens=370, prompt_tokens=516, total_tokens=886)

JSON message with JSON mode

With some OpenAI models, structured JSON output can also be enforced using the response_format argument. This enables JSON mode, which constrains the model to generate strings that parse into valid JSON objects. The prompt is the same as in the JSON message case. The resulting object from the JSON mode message is shown below.

>> json_mode_response = client.chat.completions.create(
>> model=MODEL,
>> messages=[
>> {"role": "system", "content": json_system_prompt},
>> {"role": "user", "content": user_prompt}
>> ],
>> temperature=TEMP,
>> response_format={"type": "json_object"},
>> )
>> json_mode_order = json.loads(json_mode_response.choices[0].message.content)
>> json_mode_order

{'order_items': [{'name': 'Cheeseburger',
'size': '1/4 lb',
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Pickle', 'amount': 'normal'},
{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Mustard', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'bun': 'Pretzel',
'cook': 'Normal',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Fries', 'size': 'Small', 'quantity': 2},
{'name': 'Fries', 'size': 'Large', 'quantity': 2},
{'name': 'Cola', 'size': 'Large', 'quantity': 1},
{'name': 'Diet Cola', 'size': 'Small', 'quantity': 3}]}

With deepdiff, json_order and json_mode_order are found to represent the same order items, with no differences detected when item ordering is ignored.

>> DeepDiff(json_order, json_mode_order, ignore_order=True)

{}

Token usage with the JSON mode message, including completion tokens, is exactly the same as with the JSON message, indicating that JSON mode is implemented outside of prompting or generation and has no effect on token usage in this case.

>> json_mode_response.usage

CompletionUsage(completion_tokens=370, prompt_tokens=516, total_tokens=886)

YAML message

Generating LLM output as YAML is a popular alternative to JSON due to its minimal syntax, which may reduce token usage, and its applicability to object streaming (YAML vs JSON: Which is more efficient, LLM Reliability: JSON vs YAML). To test YAML output, we modified the system prompt to include instructions to respond with valid YAML while keeping the order schema description in JSON.

>> import yaml
>>
>> yaml_system_prompt = system_prompt + \
>> f"""
>> Use the schema below to generate a structured order in properly formatted YAML.
>> It is critical to adhere to this schema with the YAML response in all cases. DO NOT
>> include ```yaml``` in your response.
>>
>> # Order Schema
>> {menu_json["parameters"]["properties"]}
>> """

The YAML message response is parsed into a Python dictionary using PyYAML without error.

>> yaml_response = client.chat.completions.create(
>> model=MODEL,
>> messages=[
>> {"role": "system", "content": yaml_system_prompt},
>> {"role": "user", "content": user_prompt}
>> ],
>> temperature=TEMP,
>> )
>> yaml_order = yaml.safe_load(yaml_response.choices[0].message.content)
>> yaml_order

{'order_items': [{'name': 'Cheeseburger',
'size': '1/4 lb',
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Pickle', 'amount': 'normal'},
{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Mustard', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'bun': 'Pretzel',
'cook': 'Normal',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'quantity': 1},
{'name': 'Fries', 'size': 'Small', 'quantity': 2},
{'name': 'Fries', 'size': 'Large', 'quantity': 2},
{'name': 'Cola', 'size': 'Large', 'quantity': 1},
{'name': 'Diet Cola', 'size': 'Small', 'quantity': 3}]}

With deepdiff, the order is found to be the same as the previous JSON responses.

>> DeepDiff(json_order, yaml_order, ignore_order=True)

{}

The YAML message completion token usage was 260, about 100 tokens fewer than both of the JSON responses.

>> yaml_response.usage

CompletionUsage(completion_tokens=260, prompt_tokens=516, total_tokens=776)
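
The savings are consistent with YAML's lighter syntax. As a rough illustration (exact counts depend on formatting choices and the tokenizer version), the parsed order can be re-serialized both ways and tokenized with tiktoken; the encoding name below is an assumption based on GPT-4o's public tokenizer.

>> import tiktoken
>>
>> enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's token vocabulary
>>
>> # Serialize the same parsed order as pretty-printed JSON and as YAML
>> json_text = json.dumps(yaml_order, indent=2)
>> yaml_text = yaml.dump(yaml_order, sort_keys=False)
>>
>> print(len(enc.encode(json_text)), len(enc.encode(yaml_text)))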

Function calling

The final response mode leverages OpenAI’s function calling feature in the Chat Completions API. Here, the order schema is passed as the parameters of a function named create_order within the tools argument, and the tool_choice argument is used to require a call to create_order as the model response.

>> function_response = client.chat.completions.create(
>> model=MODEL,
>> messages=[
>> {"role": "system", "content": system_prompt},
>> {"role": "user", "content": user_prompt}
>> ],
>> temperature=TEMP,
>> tool_choice = {"type": "function", "function": {"name": "create_order"}},
>> tools=[
>> {
>> "type": "function",
>> "function": {
>> "name": "create_order",
>> "description": "Creates a structured order from a user's request",
>> "parameters": menu_json["parameters"]
>> }
>> }
>> ]
>> )
>> function_order = json.loads(function_response.choices[0].message.tool_calls[0].function.arguments)
>> function_order

{'order_items': [{'name': 'Cheeseburger',
'size': '1/4 lb',
'toppings': [{'name': 'Pickle', 'amount': 'normal'},
{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Mustard', 'amount': 'normal'}],
'bun': 'Sesame',
'cook': 'Normal',
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'bun': 'Pretzel',
'cook': 'Normal',
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'bun': 'Sesame',
'cook': 'Normal',
'quantity': 1},
{'name': 'Fries', 'size': 'Small', 'quantity': 2},
{'name': 'Fries', 'size': 'Large', 'quantity': 2},
{'name': 'Cola', 'size': 'Large', 'quantity': 1},
{'name': 'Diet Cola', 'size': 'Small', 'quantity': 3}]}

The resulting order from function calling is the same as the first json_order, with no differences detected.

>> DeepDiff(json_order, function_order, ignore_order=True)

{}

Token usage with function calling is the lowest observed, at 213 completion tokens.

>> function_response.usage

CompletionUsage(completion_tokens=213, prompt_tokens=418, total_tokens=631)

Analysis

This analysis focuses on the practical application of the methods discussed above. It details the experiments conducted with the GPT-4o model in the Azure OpenAI Service to generate structured outputs for the Contoso Burger menu, comparing JSON messages, YAML messages, and function calling to determine the most token-efficient approach. It also examines the use of the guidance Python package with a local small language model for constrained generation, highlighting the importance of schema adherence and order accuracy in the generation process. The results demonstrate the effectiveness of function calling in reducing the number of completion tokens required, thereby optimizing the model’s performance for real-world applications.

Token usage by response

The token usage for the four response modes is summarized in the table below. A percentage difference was calculated for each response mode using the JSON message as a baseline. The JSON message and JSON mode message required the most completion tokens, at 370 each. The YAML message required 30 percent fewer completion tokens than the baseline. Lastly, function calling used the fewest completion tokens, a 42 percent reduction compared to the JSON message.

Response token usage

Response mode        Prompt tokens   Completion tokens   Completion tokens vs. JSON message
JSON message         516             370                 baseline
JSON mode message    516             370                 0%
YAML message         516             260                 -30%
Function calling     418             213                 -42%

Function calling efficiency

It is worth a closer look to see why function calling produces a JSON response that is much more token efficient than the JSON message approaches. Using tiktoken with the GPT-4o tokenizer, the function calling and JSON message responses were broken down into their constituent token IDs and text representations. From these breakdowns, the tokens unique to each response are obtained by taking the set difference between the two response modes.
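
A sketch of that breakdown is shown below. It assumes the raw response strings from the earlier calls are still in scope and builds the unique_json_tokens and unique_func_tokens tables (with a TokenId column) that the logit-bias step further down references; the exact code used in the original analysis may differ.

>> import tiktoken
>> import pandas as pd
>>
>> enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's token vocabulary
>>
>> json_text = json_response.choices[0].message.content
>> func_text = function_response.choices[0].message.tool_calls[0].function.arguments
>>
>> json_token_ids = set(enc.encode(json_text))
>> func_token_ids = set(enc.encode(func_text))
>>
>> def token_table(token_ids):
>>     # Small table of token IDs and their text representations
>>     return pd.DataFrame({
>>         "TokenId": sorted(token_ids),
>>         "Text": [enc.decode([tid]) for tid in sorted(token_ids)],
>>     })
>>
>> # Tokens that appear in one response mode but not the other
>> unique_json_tokens = token_table(json_token_ids - func_token_ids)
>> unique_func_tokens = token_table(func_token_ids - json_token_ids)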

Unique JSON message tokens

The unique tokens present in the JSON message response are shown below. Many of the tokens are white-space characters, control characters, and various combinations of brackets and other characters. These results show the wide variety of tokens available for creating valid JSON. The “inefficient” JSON created in this response mode is likely due to the abundance of “pretty-printed” JSON in the training corpus: in written and automated documentation, JSON objects are often structured with generous white space and indentation to improve human readability, which imparts no benefit for machine interpretation. The model’s default output reproduces these readily human-readable structures.
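
The effect of pretty-printing is easy to check directly. The short sketch below, reusing the tiktoken encoding (enc) from the sketch above, re-serializes the parsed order with and without indentation and counts tokens for each; the exact counts will vary, but the indented form consistently requires more tokens.

>> pretty = json.dumps(json_order, indent=2)                # human-readable, many whitespace tokens
>> compact = json.dumps(json_order, separators=(",", ":"))  # minimized, no whitespace
>>
>> print(len(enc.encode(pretty)), len(enc.encode(compact)))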

Unique function calling tokens

The token IDs unique to the function calling response are shown below. Many of them are compounds of JSON syntax (e.g., characters that close and open key-values, “:”). These compound tokens reduce the number of tokens required to specify the JSON syntax by eliminating white space and control characters, resulting in a minimized JSON structure.

Making JSON responses more efficient

As an experiment, the token differences observed above can be used to improve the efficiency of JSON responses outside of function calling. The logit_bias parameter of the OpenAI API allows the selection likelihood of specific tokens to be modified. Here, we apply a strongly negative bias to the inefficient white-space and control-character tokens to suppress them, and a positive bias to the optimized compound tokens to encourage them.

>> logit_bias = {tokenId: -100 for tokenId in unique_json_tokens["TokenId"]} 
>> logit_bias.update({tokenId: 10 for tokenId in unique_func_tokens["TokenId"]})

The logit-bias values are then added to the API call with the previously used JSON response prompts.

>> json_logit_response = client.chat.completions.create(
>> model=MODEL,
>> messages=[
>> {"role": "system", "content": json_system_prompt},
>> {"role": "user", "content": user_prompt}
>> ],
>> logit_bias=logit_bias,
>> temperature=0,
>> )
>> json_logit_order = json.loads(json_logit_response.choices[0].message.content)
>> json_logit_order

{'order_items': [{'name': 'Cheeseburger',
'size': '1/4 lb',
'toppings': [{'name': 'Pickle', 'amount': 'normal'},
{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Mustard', 'amount': 'normal'}],
'bun': 'Sesame',
'cook': 'Normal',
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'bun': 'Pretzel',
'cook': 'Normal',
'quantity': 1},
{'name': 'Hamburger',
'size': '1/4 lb',
'toppings': [{'name': 'Ketchup', 'amount': 'normal'},
{'name': 'Pickle', 'amount': 'normal'}],
'bun': 'Sesame',
'cook': 'Normal',
'quantity': 1},
{'name': 'Fries', 'size': 'Small', 'quantity': 2},
{'name': 'Fries', 'size': 'Large', 'quantity': 2},
{'name': 'Cola', 'size': 'Large', 'quantity': 1},
{'name': 'Diet Cola', 'size': 'Small', 'quantity': 3}]}

The accuracy of the order object is unaffected by the change in token logit-bias.

>> DeepDiff(function_order, json_logit_order, ignore_order=True)

{}

At the same time, the logit-bias shaping reduced the completion tokens required to produce the order, from 370 in the standard JSON message response to 213 with the logit-bias modifications.

>> json_logit_response.usage

CompletionUsage(completion_tokens=213, prompt_tokens=516, total_tokens=729)

Adding the JSON logit-bias results to the usage data shows that they match the completion token count of function calling, though with a higher number of prompt tokens.

Response token usage

Response mode                  Prompt tokens   Completion tokens   Completion tokens vs. JSON message
JSON message                   516             370                 baseline
JSON mode message              516             370                 0%
YAML message                   516             260                 -30%
Function calling               418             213                 -42%
JSON message with logit bias   516             213                 -42%

Constrained generation with guidance Python package

The LLM response approaches above rely on the LLM to fully specify the structured object. Often, many aspects of the schema are fixed and deterministic (e.g., property names, enumerated types, and so on). These elements do not benefit from LLM generation and can even pose challenges for the probabilistic nature of language models. Constrained generation, as exemplified by the guidance Python package, is an emerging approach to creating structured LLM responses. Here, token selection is programmatically shaped, and in some cases fully determined, by constraints provided by the developer. This approach requires strict monitoring of token-by-token selection, necessitating tight integration with the deployed model; for the most part, this currently limits guidance to locally deployed Small Language Models (SLMs).

Below is a simple example of using guidance with a quantized version of Phi-3-mini-4k-instruct to create a structured order response. A Pydantic model is used to define the required output and is passed to guidance to constrain the response.
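
The Order model imported from the local model module is not shown in the article; a plausible reconstruction based on the order schema above might look like the following (the field names and types are our assumption).

>> # model.py (assumed contents, reconstructed from the order schema above)
>> from enum import Enum
>> from typing import List, Optional
>> from pydantic import BaseModel, Field
>>
>> class ToppingAmount(str, Enum):
>>     none = "none"
>>     half = "half"
>>     normal = "normal"
>>     double = "double"
>>
>> class Topping(BaseModel):
>>     name: str = Field(description="Name of topping to add to item.")
>>     amount: ToppingAmount = Field(description="Amount of topping to add to item.")
>>
>> class OrderItem(BaseModel):
>>     name: str = Field(description="Name of item to order.")
>>     size: str = Field(description="Size of item to order.")
>>     toppings: Optional[List[Topping]] = None
>>     bun: Optional[str] = Field(default=None, description="Type of bun to use for item, required for burgers")
>>     cook: Optional[str] = Field(default=None, description="Cooking preference for item, required for burgers")
>>     quantity: int = Field(description="Number of items to order")
>>
>> class Order(BaseModel):
>>     order_items: List[OrderItem]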

>> import json
>> import guidance
>> from guidance import models, capture
>> from llama_cpp import Llama
>> from model import Order
>>
>> phi3_cpp = Llama.from_pretrained(
>>     repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
>>     filename="Phi-3-mini-4k-instruct-q4.gguf",
>>     n_ctx=4096,
>> )
>> phi3 = models.LlamaCpp(phi3_cpp)
>>
>> system_msg = f"<|system|>Use this menu\n{Order.model_json_schema()} to construct the user's order in JSON format.\n"
>> user_msg = "<|user|>\n" + user_prompt + "\n<|assistant|>\n"
>>
>> # Constrained generation: token selection is restricted by the Order schema
>> response = phi3 + system_msg + user_msg + capture(guidance.json(name="order", schema=Order), "order")
>>
>> # Parse the captured order text into a Python dictionary for comparison
>> guidance_order = json.loads(response["order"])

With guidance, only a small portion of the characters (highlighted in green in the original output) were generated by the SLM, with all other text programmatically inferred from the constraints of the order schema. Model generation is used only when the next token is non-deterministic (e.g., the first token of a topping name); once the unique token is selected by the model, the remaining tokens are programmatically inserted using the schema.

The order produced using guidance with Phi-3-mini-4k-instruct is less accurate than the previous orders generated with GPT-4o. However, it is quite close considering the significant difference in model sizes. Additionally, guidance greatly improved the order accuracy compared to using the model directly to produce the order without constraints.

>> DeepDiff(json_order, guidance_order, ignore_order=True)

{'values_changed': {"root['order_items'][0]['size']": {'new_value': '1/2 lb',
'old_value': '1/4 lb'},
"root['order_items'][0]['bun']": {'new_value': 'Pretzel',
'old_value': 'Sesame'},
"root['order_items'][2]['name']": {'new_value': 'Cheeseburger',
'old_value': 'Hamburger'},
"root['order_items'][2]['size']": {'new_value': '1/2 lb',
'old_value': '1/4 lb'}},
'iterable_item_added': {"root['order_items'][2]": {'name': 'Hamburger',
'size': '1/4 lb',
'quantity': 2,
'bun': 'Sesame',
'cook': 'Normal',
'toppings': [{'name': 'Pickle', 'amount': 'normal'}]}},
'iterable_item_removed': {"root['order_items'][1]['toppings'][1]": {'name': 'Pickle',
'amount': 'normal'}}}

The guidance order required only 137 completion tokens generated by the model, while the full structured response contains 292 tokens. Programmatic token selection informed by the provided schema reduced the generated tokens by more than half.

The guidance results have been added to the token usage table below. It is important to note that Phi-3 and GPT-4o have different tokenizers and token vocabularies. Because of this, the comparisons are not exact but are representative of the number of inferences required to specify the structured orders.
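
To see how much the vocabularies differ, the same structured snippet can be tokenized with both tokenizers. This is a hedged sketch: it assumes GPT-4o's o200k_base encoding in tiktoken and loads the Phi-3 tokenizer from Hugging Face, which requires the transformers package.

>> import tiktoken
>> from transformers import AutoTokenizer
>>
>> sample = '{"name": "Cheeseburger", "size": "1/4 lb", "quantity": 1}'
>>
>> gpt4o_enc = tiktoken.get_encoding("o200k_base")
>> phi3_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
>>
>> # The same string generally maps to different token counts under the two vocabularies
>> print(len(gpt4o_enc.encode(sample)), len(phi3_tok.encode(sample, add_special_tokens=False)))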

Response token usage (guidance with Phi-3 uses a different tokenizer, so its count is not directly comparable)

Response mode                  Prompt tokens   Completion tokens
JSON message                   516             370
JSON mode message              516             370
YAML message                   516             260
Function calling               418             213
JSON message with logit bias   516             213
guidance + Phi-3 (generated)   not reported    137 (292 in the full structured response)

Conclusion

In this article, we have shown that function calling offers the best out-of-the-box token efficiency for creating structured objects because of its inherently efficient token selection for JSON output. The token efficiency of function calling can be replicated by adjusting the token logit-bias to reduce the presence of white space and control characters, but this may not be amenable to all applications. While less token efficient than function calling, YAML outputs reduced token usage compared to unadjusted JSON responses and could be a good choice for applications that benefit from the streaming properties of YAML. In the future, we believe constrained generation approaches, like the guidance example shown here, offer immense promise to further reduce token usage for structured object output with LLMs.
