When invoking APIs hosted by Azure API Management with Azure OpenAI Service configured as a backend service, requests where stream is set to true might return HTTP 500

Akihiro Nishikawa · Published in Microsoft Azure · Sep 12, 2023

This entry is current as of September 12, 2023; the original article was published in Japanese.

Query

We use Azure API Management (APIM) to host APIs as a façade for Azure OpenAI Service (AOAI). Application Gateway is also used for load balancing of the backend services. In this environment, we tried sending requests where stream was set to true, but the APIs returned HTTP 500. Could you please share your thoughts and workarounds with us?

AOAI's stream mode is the same as OpenAI's: it leverages Server-Sent Events (SSE) to return responses from AOAI.
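
For illustration, here is a minimal sketch of consuming such a stream with the openai Python SDK (v0.x, current as of this writing). The endpoint, API key, API version, and deployment name are placeholders, not values from the customer's environment:

import openai

# Minimal sketch: consume a streamed (SSE) chat completion from AOAI.
# Endpoint, key, API version, and deployment name are placeholders.
openai.api_type = "azure"
openai.api_base = "https://<your-endpoint>"   # e.g. the APIM gateway URL
openai.api_version = "2023-07-01-preview"
openai.api_key = "<api-key>"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo-16k",                # Azure deployment name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,                              # responses arrive as SSE chunks
)

for chunk in response:
    # Some chunks (e.g. prompt annotations) may have an empty choices array.
    if chunk["choices"]:
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="")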

The policy configuration of their APIs is listed below.

<policies>
    <inbound>
        <base />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com"
            output-token-variable-name="msi-access-token" ignore-error="false" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
        </set-header>
        <set-backend-service base-url="{Endpoint for Application Gateway}" />
    </inbound>
    <backend>
        <retry condition="@(context.Response.StatusCode >= 500)" count="3" interval="0" first-fast-retry="true">
            <forward-request buffer-request-body="true" timeout="120" buffer-response="false"
                fail-on-error-status-code="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error />
</policies>

They also enabled diagnostic settings to collect the request/response bodies.

Cause

This issue happened because they enabled diagnostic settings to collect logs with Azure Monitor. APIM experts might spot the root cause easily.

As the documentation says, buffering can lead to problems for APIs that implement SSE.

Avoid logging request/response body for Azure Monitor and Application Insights — You can configure API request logging for Azure Monitor or Application Insights using diagnostic settings. The diagnostic settings allow you to log the request/response body at various stages of the request execution. For APIs that implement SSE, this can cause unexpected buffering which can lead to problems. Diagnostic settings for Azure Monitor and Application Insights configured at the global/All APIs scope apply to all APIs in the service. You can override the settings for individual APIs as needed. For APIs that implement SSE, ensure you have disabled request/response body logging for Azure Monitor and Application Insights.

When APIM establishes a connection to AOAI with stream enabled, responses from AOAI flow through APIM. In this case, we must not perform any operations that might lead to buffering in the APIM outbound section. Such operations include:

  • Transferring logs to Azure Monitor using diagnostic settings.
  • Modifying responses with set-body and/or set-header policies that use policy expressions.

After they disabled the diagnostic log settings, their APIs worked.

Additional questions

They asked me an additional question.

How do we count the number of used tokens when stream is set to true? If stream is set to false, consumed token information is found in the response body...

Indeed, we can confirm the number of consumed tokens in the response body when stream is set to false.

{
    "id": "chatcmpl-7xqbvCK1E147BjRaOfg1LDlr8bFYY",
    "object": "chat.completion",
    "created": 1694497943,
    "model": "gpt-35-turbo-16k",
    "prompt_annotations": [...],
    "choices": [...],
    "usage": {
        "completion_tokens": 509,
        "prompt_tokens": 19,
        "total_tokens": 528
    }
}

However, we cannot find the number of consumed tokens in the response body when stream is set to true; in that case, usage is always null.

{
    "id": "chatcmpl-7xr2XYKrXSel2qiK2opU43Qf02OND",
    "object": "chat.completion.chunk",
    "created": 1694499593,
    "model": "gpt-35-turbo-16k",
    "choices": [
        {
            "index": 0,
            "finish_reason": null,
            "delta": {
                "content": "。"
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "usage": null
}

When streaming is enabled, the only way to count the number of consumed tokens, as of now, is to use a tokenizer. Tokenizers exist for several programming languages, such as tiktoken (Python), JTokkit (Java), and Tokenizer (C# and TypeScript).
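
As a rough sketch of this approach with tiktoken (assuming that gpt-35-turbo deployments use the same cl100k_base encoding as OpenAI's gpt-3.5-turbo), reassemble the completion from the streamed delta contents and encode it; the streamed_parts list below is purely illustrative:

import tiktoken

# A minimal sketch of client-side token counting with tiktoken.
# Assumption: gpt-35-turbo(-16k) deployments use the same encoding
# (cl100k_base) as OpenAI's gpt-3.5-turbo, so encoding_for_model applies.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # Return the number of tokens tiktoken assigns to the given text.
    return len(enc.encode(text))

# Reassemble the completion from the streamed delta contents, then count.
streamed_parts = ["Hello", ", how can I help", " you today?"]  # illustrative
print(count_tokens("".join(streamed_parts)))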

Workaround

As of now, we cannot collect prompt logs at APIM because of the buffering limitation when SSE-enabled requests come to APIM. As a workaround, place Azure Functions (App Service, etc. also works) behind APIM to handle the SSE request/response and count the number of tokens. Since APIM can judge whether each request is SSE-enabled or not, prompt logs can still be collected when the set-backend-service policy is used to route streaming requests to this backend.
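
To make the idea concrete, here is a rough Python sketch of what such a backend could do (not a complete Azure Functions app); the AOAI URL, deployment name, API version, and key are hypothetical placeholders:

import json
import requests
import tiktoken

# Rough sketch of the relay idea: call AOAI with stream=True, forward each
# SSE line to the caller, and count completion tokens when the stream ends.
# URL, deployment name, API version, and key below are placeholders.
AOAI_URL = ("https://<aoai-resource>.openai.azure.com/openai/deployments/"
            "gpt-35-turbo-16k/chat/completions?api-version=2023-07-01-preview")
HEADERS = {"api-key": "<api-key>", "Content-Type": "application/json"}
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def relay_and_count(payload: dict):
    # Yield SSE lines to the client while accumulating the delta contents.
    completion_text = ""
    with requests.post(AOAI_URL, headers=HEADERS, json=payload, stream=True) as r:
        for line in r.iter_lines(decode_unicode=True):
            if not line:
                continue
            yield line + "\n\n"  # forward the line, restoring the SSE separator
            if line.startswith("data: ") and line != "data: [DONE]":
                chunk = json.loads(line[len("data: "):])
                if chunk.get("choices"):
                    delta = chunk["choices"][0].get("delta", {})
                    completion_text += delta.get("content", "")
    # After the stream ends, record the count out of band (e.g. App Insights).
    print("completion_tokens (estimated):", len(enc.encode(completion_text)))

The prompt side can be counted the same way by encoding the request messages before forwarding them.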

Akihiro Nishikawa
Cloud Solution Architect @ Microsoft, and JJUG (Japan Java Users Group) board member. ♥Java (JVM/GraalVM) and open-source technologies. All views are my own.