When invoking APIs hosted by Azure API Management with Azure OpenAI Service configured as a backend service, requests where stream is set to true might return HTTP 500

Akihiro Nishikawa · Published in Microsoft Azure · Sep 12, 2023

This entry is current as of September 12, 2023; the original article was published in Japanese.

Query

We use Azure API Management (APIM) to host APIs as a façade for Azure OpenAI Service (AOAI). Application Gateway is also used for load balancing of the backend services. In this environment, we tried sending requests where stream was set to true, but the APIs returned HTTP 500. Could you please share your thoughts and workarounds with us?

AOAI's stream mode is the same as OpenAI's: it leverages Server-Sent Events (SSE) to return responses from AOAI.
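
For illustration, here is a minimal sketch of consuming such a stream with the openai Python SDK (v0.x, current as of this writing). The endpoint, API key, API version, and deployment name are placeholders, not values from the customer's environment:

import openai

# Minimal sketch: consume a streamed (SSE) chat completion from AOAI.
# Endpoint, key, API version, and deployment name are placeholders.
openai.api_type = "azure"
openai.api_base = "https://<your-endpoint>"   # e.g. the APIM gateway URL
openai.api_version = "2023-07-01-preview"
openai.api_key = "<api-key>"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo-16k",                # Azure deployment name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,                              # responses arrive as SSE chunks
)

for chunk in response:
    # Some chunks (e.g. prompt annotations) may have an empty choices array.
    if chunk["choices"]:
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="")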

The policy configuration of their APIs is listed below.

<policies>
    <inbound>
        <base />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com"
            output-token-variable-name="msi-access-token" ignore-error="false" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
        </set-header>
        <set-backend-service base-url="{Endpoint for Application Gateway}" />
    </inbound>
    <backend>
        <retry condition="@(context.Response.StatusCode >= 500)" count="3" interval="0" first-fast-retry="true">
            <forward-request buffer-request-body="true" timeout="120" buffer-response="false"
                fail-on-error-status-code="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error />
</policies>

They also enabled diagnostic settings to collect the request/response bodies.

Cause

This issue happened because they enabled diagnostic settings to collect logs with Azure Monitor. APIM experts might spot the root cause easily.

As the documentation says, buffering can lead to problems for APIs that implement SSE.

Avoid logging request/response body for Azure Monitor and Application Insights — You can configure API request logging for Azure Monitor or Application Insights using diagnostic settings. The diagnostic settings allow you to log the request/response body at various stages of the request execution. For APIs that implement SSE, this can cause unexpected buffering which can lead to problems. Diagnostic settings for Azure Monitor and Application Insights configured at the global/All APIs scope apply to all APIs in the service. You can override the settings for individual APIs as needed. For APIs that implement SSE, ensure you have disabled request/response body logging for Azure Monitor and Application Insights.

When APIM establishes a connection to AOAI with stream enabled, responses from AOAI flow through APIM. In this case, we must not perform any operations that might lead to buffering in the APIM outbound section. Such operations include:

  • Transferring logs to Azure Monitor using diagnostic settings.
  • Modifying responses with set-body and/or set-header policies that use policy expressions.

After they disabled the diagnostic log settings, their APIs worked.

Additional questions

They asked me an additional question.

How do we count the number of used tokens when stream is set to true? If stream is set to false, consumed token information is found in the response body...

Indeed, we can confirm the number of consumed tokens in the response body when stream is set to false.

{
    "id": "chatcmpl-7xqbvCK1E147BjRaOfg1LDlr8bFYY",
    "object": "chat.completion",
    "created": 1694497943,
    "model": "gpt-35-turbo-16k",
    "prompt_annotations": [...],
    "choices": [...],
    "usage": {
        "completion_tokens": 509,
        "prompt_tokens": 19,
        "total_tokens": 528
    }
}

However, we cannot find the number of consumed tokens in the response body when stream is set to true; in that case, usage is always null.

{
    "id": "chatcmpl-7xr2XYKrXSel2qiK2opU43Qf02OND",
    "object": "chat.completion.chunk",
    "created": 1694499593,
    "model": "gpt-35-turbo-16k",
    "choices": [
        {
            "index": 0,
            "finish_reason": null,
            "delta": {
                "content": "。"
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "usage": null
}

When streaming is enabled, the only way to count the number of consumed tokens, as of now, is to use a tokenizer. Tokenizers exist for several programming languages, such as tiktoken (Python), JTokkit (Java), and Tokenizer (C# and TypeScript).
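
As a rough sketch of this approach with tiktoken (assuming that gpt-35-turbo deployments use the same cl100k_base encoding as OpenAI's gpt-3.5-turbo), reassemble the completion from the streamed delta contents and encode it; the streamed_parts list below is purely illustrative:

import tiktoken

# A minimal sketch of client-side token counting with tiktoken.
# Assumption: gpt-35-turbo(-16k) deployments use the same encoding
# (cl100k_base) as OpenAI's gpt-3.5-turbo, so encoding_for_model applies.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # Return the number of tokens tiktoken assigns to the given text.
    return len(enc.encode(text))

# Reassemble the completion from the streamed delta contents, then count.
streamed_parts = ["Hello", ", how can I help", " you today?"]  # illustrative
print(count_tokens("".join(streamed_parts)))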

Workaround

As of now, we cannot collect prompt logs at APIM because of the buffering limitation when SSE-enabled requests come to APIM. As a workaround, place Azure Functions (App Service, etc. also works) behind APIM to handle the SSE request/response and count the number of tokens. Since APIM can judge whether each request is SSE-enabled or not, prompt logs can still be collected when the set-backend-service policy is used to route streaming requests to this backend.
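
To make the idea concrete, here is a rough Python sketch of what such a backend could do (not a complete Azure Functions app); the AOAI URL, deployment name, API version, and key are hypothetical placeholders:

import json
import requests
import tiktoken

# Rough sketch of the relay idea: call AOAI with stream=True, forward each
# SSE line to the caller, and count completion tokens when the stream ends.
# URL, deployment name, API version, and key below are placeholders.
AOAI_URL = ("https://<aoai-resource>.openai.azure.com/openai/deployments/"
            "gpt-35-turbo-16k/chat/completions?api-version=2023-07-01-preview")
HEADERS = {"api-key": "<api-key>", "Content-Type": "application/json"}
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def relay_and_count(payload: dict):
    # Yield SSE lines to the client while accumulating the delta contents.
    completion_text = ""
    with requests.post(AOAI_URL, headers=HEADERS, json=payload, stream=True) as r:
        for line in r.iter_lines(decode_unicode=True):
            if not line:
                continue
            yield line + "\n\n"  # forward the line, restoring the SSE separator
            if line.startswith("data: ") and line != "data: [DONE]":
                chunk = json.loads(line[len("data: "):])
                if chunk.get("choices"):
                    delta = chunk["choices"][0].get("delta", {})
                    completion_text += delta.get("content", "")
    # After the stream ends, record the count out of band (e.g. App Insights).
    print("completion_tokens (estimated):", len(enc.encode(completion_text)))

The prompt side can be counted the same way by encoding the request messages before forwarding them.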

Akihiro Nishikawa
Cloud Solution Architect @ Microsoft, and JJUG (Japan Java Users Group) board member. ♥Java (JVM/GraalVM) and open-source technologies. All views are my own.