Use Azure API Management policies to configure retry and fall back to another instance based on HTTP status!

Akihiro Nishikawa
Published in Microsoft Azure
Aug 2, 2023

This entry is current as of July 31, 2023; the original article was published in Japanese.

Query

We are now designing and implementing an internal system based on Azure OpenAI Service (AOAI). As the number of regions where AOAI is available has increased, we plan to distribute requests to AOAI for load balancing. We also want to retry requests to AOAI and fall back to other AOAI instances. Is this concept feasible, and at which layer should we implement it?

I previously wrote an article about load balancing AOAI, but until now nobody had asked me about a retry and fallback strategy.

What they want to achieve is illustrated below.

Background of this query

Current trends in system architecture include the following:

  • Services should be stateless.
  • The caller should handle exceptions thrown by the callee.

For example, if AOAI returns 429 (rate limit reached), the calling application is typically designed to handle this error and wait for a while before calling AOAI again. By contrast, the strategy they are considering, which handles retries in a middle layer, seems a bit old-fashioned, so I asked them for more details.

  • We understand that handling state in the middle layer does not scale well.
  • However, this system is for internal use, so scalability is not required.
  • We want to design it this way because the system will be used not only from the cloud (Azure) but also from third-party clouds and on-premises environments, and we want to reduce cross-premises traffic.

Their API strategy is based on a layered architecture.

Implementation strategy

Two strategies came to mind: handling retries and fallbacks in API Management, or in controller applications such as “Business Application” and “Composite Application” in the picture above. As this is not specific to AOAI, the same strategy applies in many other cases.

1) Implement the logic in API Management (APIM)

  • Use the retry policy in the backend section.
  • It is easy because all retry/fallback logic is packaged in APIM.
  • It may be difficult to implement fine-grained retry/fallback logic.

2) Implement the logic in controller applications (Composite Application or Business Application in the above figure)

  • The logic to be implemented is the same as for handling REST API calls, which is common when implementing web applications.
  • Both simple and fine-grained retry/fallback logic can be implemented.
  • The implementation cost will be higher than with APIM, and there may be more applications to manage.

Since option 2) is a common strategy for web applications, this entry focuses on option 1).

Implementation Tips

Accessing AOAI

The following YouTube video shows how to configure APIM to retry requests to AOAI and fall back to other instances.

In this video, the content creator uses the send-request policy in the inbound section to obtain a Bearer token for AOAI. However, when Azure AD authentication is used for AOAI access control, APIM's system-assigned managed identity makes obtaining a Bearer token much easier, as I described earlier in Distribute requests to Azure OpenAI Service.

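<!-- Obtain a Bearer token for AOAI with APIM's system-assigned managed identity -->
<authentication-managed-identity resource="https://cognitiveservices.azure.com"
    output-token-variable-name="msi-access-token" ignore-error="false" />
<!-- Set the token in the Authorization header of the request forwarded to AOAI -->
<set-header name="Authorization" exists-action="override">
    <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
</set-header>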
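If the token is only needed for the Authorization header, a shorter form should also work. As I understand the documented behavior, when output-token-variable-name is omitted, authentication-managed-identity sets the Authorization header with the Bearer scheme by itself. Treat this as a sketch and verify it against your own APIM instance. In either form, APIM's managed identity also needs an RBAC role on the AOAI resource (for example, Cognitive Services OpenAI User).

```xml
<inbound>
    <base />
    <!-- Obtain a token for AOAI with the system-assigned managed identity.
         Without output-token-variable-name, the token is placed in the
         Authorization header (Bearer scheme) automatically. -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
```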

Retry/fallback to backend service(s)

a) When an L7 load balancer such as Application Gateway sits between APIM and the AOAI instances.

The behavior depends on the load-balancing rules defined in the L7 load balancer. If load balancing is pure round-robin, requests from APIM are distributed evenly across the AOAI instances, so a retried request is sent to a different instance and fallback happens automatically. Therefore, we only need to focus on retrying based on HTTP status.

In this entry, the policies in the backend section are configured according to the rules listed below.

  • Maximum number of calls: 3 (the first call is always executed, so 2 retries are required.)
  • When retry is triggered: HTTP status is 300 or greater.

Please note that only one policy can be configured in the backend section. Therefore,

  • The forward-request policy, which is already defined in the global scope, must be overridden. To do this, remove <base /> from the backend section and add a retry policy that wraps the forward-request policy.
  • To retain the request body for retries, specify buffer-request-body="true" in the forward-request policy.

<!-- Forward the request; the request body is buffered for retries -->
<forward-request buffer-request-body="true" />

The resulting policy is as follows. Values in double curly braces {{}} refer to Named Values.

<!-- retryCount is specified in Named Values. -->
<policies>
    <inbound>
        <base />
        <set-variable name="retryCount" value="@(int.Parse("{{retryCount}}"))" />
        <!-- The first call is not a retry, so the retry policy needs retryCount - 1 retries -->
        <set-variable name="maxRetryCount" value="@((int)context.Variables["retryCount"] - 1)" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com"
            output-token-variable-name="msi-access-token" ignore-error="false" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
        </set-header>
    </inbound>
    <backend>
        <retry condition="@(context.Response.StatusCode >= 300)" count="@((int)context.Variables["maxRetryCount"])"
            interval="1" max-interval="10" delta="1" first-fast-retry="false">
            <!-- Forward the request; the request body is buffered for retries -->
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
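The condition above retries on every status of 300 or greater, including client errors such as 400 or 401, which will most likely fail the same way on every attempt. A variant worth considering, sketched below as my own assumption rather than part of the configuration above, limits retries to 429 and 5xx responses. Only the backend section changes; maxRetryCount is the variable set in the inbound section.

```xml
<backend>
    <!-- Retry only on 429 (rate limit) and 5xx (server-side errors);
         any other status is returned to the caller immediately. -->
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
        count="@((int)context.Variables["maxRetryCount"])"
        interval="1" max-interval="10" delta="1" first-fast-retry="false">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```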

b) When configuring AOAI instances as APIM backend services (APIM can call AOAI instances directly).

In this case, since the backend service is specified directly in APIM, we can use APIM policies to configure request retry and fallback based on HTTP status. The base URL of the backend service can be overridden using the set-backend-service policy.

<policies>
<inbound>
<base />
...
<set-backend-service base-url="Base URL" />
</inbound>
...
</policies>

Named Values allow us to externalise the definition of constants. In this entry, the base URLs are stored in Named Values so that we don’t have to modify the policies whenever an AOAI base URL changes. For more details on Named Values, see the following URL.
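For example, assuming a hypothetical Named Value called aoai-base-url that holds the base URL of one AOAI instance, the set-backend-service policy can reference it directly, so changing the endpoint only requires updating the Named Value:

```xml
<inbound>
    <base />
    <!-- aoai-base-url is a hypothetical Named Value holding the
         base URL of one AOAI instance -->
    <set-backend-service base-url="{{aoai-base-url}}" />
</inbound>
```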

(Added on January 24, 2024)

A load-balanced pool has been introduced in APIM as a private preview. This feature lets us distribute traffic across multiple backend services without an L7 load balancer or complicated policy settings in the inbound section. Currently, only round-robin distribution is available.
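As a rough sketch, assuming a pool backend named aoai-pool (a hypothetical name) has already been created outside the policy, for example through the management API, the policy only needs to refer to the pool by its backend ID. Because the feature was in private preview when this note was added, the details may differ in your environment.

```xml
<inbound>
    <base />
    <!-- aoai-pool is a hypothetical load-balanced pool backend that lists
         the AOAI instances; APIM distributes requests across its members. -->
    <set-backend-service backend-id="aoai-pool" />
</inbound>
```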

Policies in the backend section

In this entry, the request is forwarded to the base URL of the backend service specified by the set-backend-service policy in the inbound section. Retries and fallbacks are configured according to the following rules.

  • Number of attempts for each URL: 3 (the first call is always executed, so 2 retries are required). In this example there are 3 base URLs, so at most 9 calls are made.
  • If the returned HTTP status is less than 300: the response from AOAI is returned as the response from APIM.
  • If the returned HTTP status is 300 or greater (other than 429): increment the attempt counter (up to 3 attempts are made for each URL). After the third attempt, the base URL is changed.
  • If HTTP status 429 is returned: reset the attempt counter and immediately change the base URL to which requests are forwarded.
  • If every base URL has returned a status of 300 or greater: set the continuation flag of the retry policy to false and exit the retry loop.

a) Retry

The retry condition, maximum number of retries, retry interval, etc. should be specified in the retry policy.

<retry condition="Boolean expression or literal"
count="number of retry attempts"
interval="retry interval in seconds"
max-interval="maximum retry interval in seconds"
delta="retry interval delta in seconds"
first-fast-retry="boolean expression or literal">
<!-- One or more child policies. No restrictions. -->
</retry>

In this entry, I specified interval, max-interval, and delta in the policy, so the retry interval increases exponentially.
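For comparison, this is how I understand the three interval modes of the retry policy: interval alone gives a fixed interval, interval plus delta gives a linearly increasing interval, and interval, max-interval, and delta together give an exponentially increasing interval capped at max-interval. The three fragments are alternatives, not meant to be combined, and the conditions and values are illustrative only.

```xml
<!-- Fixed: wait 2 seconds before every retry. -->
<retry condition="@(context.Response.StatusCode >= 500)" count="3" interval="2">
    <forward-request buffer-request-body="true" />
</retry>

<!-- Linear: the wait grows by delta (1 second) on each retry. -->
<retry condition="@(context.Response.StatusCode >= 500)" count="3" interval="2" delta="1">
    <forward-request buffer-request-body="true" />
</retry>

<!-- Exponential: the wait grows exponentially, capped at max-interval (10 seconds). -->
<retry condition="@(context.Response.StatusCode >= 500)" count="3" interval="1" max-interval="10" delta="1">
    <forward-request buffer-request-body="true" />
</retry>
```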

b) Forwarding requests

To retry with the same request body, the forward-request policy defined in the global scope should be overridden. This configuration is the same as in the L7 load balancer case above.

<!-- Forward the request; the request body is buffered for retries -->
<forward-request buffer-request-body="true" />

We can also use the fail-on-error-status-code attribute to route processing to the on-error section when AOAI returns a status code of 400 or greater. However, this entry does not use it because I want to focus on retry and fallback.
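For reference only, since this entry does not use it, a minimal sketch of that alternative might look like the following. The forward-request policy treats statuses 400 to 599 as errors, and the on-error section returns a simplified response. The status code and body are my own placeholders.

```xml
<backend>
    <!-- Responses with status 400-599 are treated as errors and
         processing moves to the on-error section. -->
    <forward-request buffer-request-body="true" fail-on-error-status-code="true" />
</backend>
<on-error>
    <base />
    <!-- Placeholder response; adjust the status and body to your API's contract. -->
    <return-response>
        <set-status code="502" reason="Bad Gateway" />
        <set-body>AOAI returned an error status.</set-body>
    </return-response>
</on-error>
```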

The policy set based on the rules above is as follows. Values in double curly braces {{}} refer to Named Values.

<policies>
    <inbound>
        <base />
        <!-- URLs retrieved from Named Values (comma-separated) -->
        <set-variable name="URL" value="@(JArray.FromObject("{{URLs}}".Split(',')))" />
        <!-- Number of URLs -->
        <set-variable name="urlCount" value="@(((JArray)context.Variables["URL"]).Count)" />
        <!-- Max number of attempts for each URL, retrieved from Named Values -->
        <set-variable name="retryCount" value="@(int.Parse("{{retryCount}}"))" />
        <!-- Max number of retries in the retry policy (attempts per URL x number of URLs) -->
        <set-variable name="maxRetryCount" value="@((int)context.Variables["retryCount"] * (int)context.Variables["urlCount"])" />
        <!-- Loop counter for URLs -->
        <set-variable name="urlLoop" value="@(0)" />
        <!-- Invoked URL, kept for visibility -->
        <set-variable name="OpenAI-Instance-Invoked" value="@{
            JArray jarray = (JArray)context.Variables["URL"];
            return jarray[(int)context.Variables["urlLoop"]].ToString();
        }" />
        <!-- Initialize the backend service URL -->
        <set-backend-service base-url="@((string)context.Variables["OpenAI-Instance-Invoked"])" />
        <!-- Attempt counter for the current URL -->
        <set-variable name="attempt" value="@(0)" />
        <!-- Continuation flag for the retry policy -->
        <set-variable name="continue" value="@(true)" />
    </inbound>
    <backend>
        <!-- Condition: HTTP status >= 300 and continue == true -->
        <retry condition="@(context.Response.StatusCode >= 300 && ((bool)context.Variables["continue"]))"
            count="@((int)context.Variables["maxRetryCount"])"
            interval="1" max-interval="10" delta="1" first-fast-retry="false">
            <!-- Forward the request; the request body is buffered for retries -->
            <forward-request buffer-request-body="true" />
            <!-- Increment the number of attempts -->
            <set-variable name="attempt" value="@((int)context.Variables["attempt"] + 1)" />
            <choose>
                <!-- On 429, reset the attempt counter so that the URL is switched immediately -->
                <when condition="@(context.Response.StatusCode == 429)">
                    <set-variable name="attempt" value="@(0)" />
                </when>
                <!-- In other cases, no operation -->
                <otherwise />
            </choose>
            <choose>
                <!-- On failure, if the number of attempts is a multiple of retryCount, switch to the next URL.
                     Successful responses (status < 300) leave the URL variables untouched. -->
                <when condition="@(context.Response.StatusCode >= 300 && (int)context.Variables["attempt"] % (int)context.Variables["retryCount"] == 0)">
                    <set-variable name="urlLoop" value="@((int)context.Variables["urlLoop"] + 1)" />
                    <choose>
                        <!-- If at least one untried URL remains -->
                        <when condition="@((int)context.Variables["urlLoop"] < (int)context.Variables["urlCount"])">
                            <set-variable name="OpenAI-Instance-Invoked" value="@{
                                JArray jarray = (JArray)context.Variables["URL"];
                                return jarray[(int)context.Variables["urlLoop"]].ToString();
                            }" />
                            <set-backend-service base-url="@((string)context.Variables["OpenAI-Instance-Invoked"])" />
                            <set-variable name="attempt" value="@(0)" />
                        </when>
                        <!-- If no untried URL remains, stop retrying -->
                        <otherwise>
                            <set-variable name="OpenAI-Instance-Invoked" value="All URLs were called but no response." />
                            <set-variable name="continue" value="@(false)" />
                        </otherwise>
                    </choose>
                </when>
                <!-- In other cases, no operation -->
                <otherwise />
            </choose>
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
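The policy stores the invoked base URL in the OpenAI-Instance-Invoked variable but never returns it to the caller. If you want to see it on the client side, one option is to copy the variable into a response header in the outbound section. This is my own addition for debugging, not part of the policy above; the header name simply reuses the variable name. Because it exposes backend URLs, you may want to limit it to internal callers.

```xml
<outbound>
    <base />
    <!-- Expose the value of OpenAI-Instance-Invoked: the invoked AOAI base URL,
         or the failure message set when all URLs were exhausted. -->
    <set-header name="OpenAI-Instance-Invoked" exists-action="override">
        <value>@((string)context.Variables["OpenAI-Instance-Invoked"])</value>
    </set-header>
</outbound>
```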

Conclusion

APIM policies allow us to configure retries and fallbacks that can be finely controlled by HTTP status. However, some aspects of this approach go against the principle that configuration and logic should be kept as simple as possible, so we should carefully decide when and where to apply it.

Akihiro Nishikawa

Cloud Solution Architect @ Microsoft, and JJUG (Japan Java Users Group) board member. ♥Java (JVM/GraalVM) and open-source technologies. All views are my own.