Distribute requests to Azure OpenAI Service

Akihiro Nishikawa
Published in Microsoft Azure
June 10, 2023

The original article was published in Japanese; this English version is a memo for myself as of January 24, 2024.

Many customers and prospects are interested in building systems powered by Azure OpenAI Service (AOAI). Since they often ask me about component deployment topologies, I am writing down some configurations here. To be honest, these are nothing special, just common practices, so the concepts can be applied to most systems.

Query

We want to use AOAI to integrate AI capabilities into our business. However, given the current limits of AOAI, we would soon hit the rate limit, so we would like to distribute requests across multiple instances. Do you have any recommendations?

The quotas and limits of AOAI, and the models available in each region, are described at the following URLs.

For example, the GPT-4 model can accept up to 18 requests per minute. If 18 requests/minute feels too low, it is common to distribute requests across multiple AOAI instances. You can also request a higher limit for AOAI, of course, but this may not be granted immediately.

It should be noted that the Provisioned Throughput model was announced at Build 2023, but nothing concrete had been discussed as of June 8, 2023.

Note

The quota was changed on June 9, 2023. According to the latest quota document, the default TPM quota is now defined per model and per region, whereas it was previously defined per model and per instance.

In the case of gpt-3.5-turbo, for example, we could previously provision up to 3 instances per region, and the default TPM quota per instance was 120k, so the default maximum per region was 360k TPM (3 instances × 120k TPM per instance). Under the latest quota, however, gpt-3.5-turbo falls into the “all others” category, and the default TPM quota per region is defined as 240k. We can still ask for a quota increase, but the default TPM quota for gpt-3.5-turbo has effectively been reduced.

How to use AOAI

To invoke the AOAI API, the application (or API Management) must pass an API key or a Bearer token in the HTTP header. The following image shows applications calling AOAI directly, which is the simplest style.

a. Calling AOAI directly

If direct use of AOAI is not preferred, we can also use API Management as a façade for the AOAI instance(s). In this case, the application does not need to know the AOAI API key, as API Management hides the AOAI-related operations.

b. API Management acts as a façade

Which is preferred, API Key or Azure role-based access control (Azure RBAC)?

Azure RBAC is strongly recommended. The reasons are listed below.

  • As there is no need to use API keys, there is no need to worry about API key leakage.
  • It significantly lowers operation and management costs.

To configure Azure RBAC for AOAI in the API Management inbound section:

  1. Generate a managed identity for the API Management instance.
  2. Assign the Cognitive Services User role to that identity on each AOAI instance.
  3. Obtain a Bearer token in the inbound section through the authentication-managed-identity policy.
  4. Set the Bearer token in the HTTP header of the request that invokes the backend service (AOAI).

The following is a snippet using the authentication-managed-identity policy.

<authentication-managed-identity resource="https://cognitiveservices.azure.com"
    output-token-variable-name="msi-access-token" ignore-error="false" />
<set-header name="Authorization" exists-action="override">
    <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
</set-header>

For more details, please check the following URL.

With an API key, on the other hand, the key would be stored as a secret in a vault such as Key Vault and referenced when needed. If the API key is regenerated, the key(s) stored in the vault must be updated manually.
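
For reference, here is a minimal sketch of how the façade could attach the key, assuming a named value called aoai-api-key (a hypothetical name) that is backed by a Key Vault secret; the application never sees the key, and rotating the secret in Key Vault does not require a policy change.

<!-- aoai-api-key is a hypothetical Key Vault-backed named value in API Management -->
<set-header name="api-key" exists-action="override">
    <value>{{aoai-api-key}}</value>
</set-header>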

Load balancing

[1] Based on caller apps

We can use the API key (subscription key) issued by API Management to identify the caller. Based on that key, API Management can route requests to the desired AOAI instance, as sketched below. Please note that this does not provide equal request distribution.
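
A minimal sketch of such routing, assuming hypothetical subscription IDs and AOAI endpoints:

<!-- Hypothetical example: route each caller (identified by its API Management
     subscription) to a dedicated AOAI instance. -->
<choose>
    <when condition="@(context.Subscription.Id == "team-a")">
        <set-backend-service base-url="https://aoai-eastus.openai.azure.com" />
    </when>
    <otherwise>
        <set-backend-service base-url="https://aoai-westus.openai.azure.com" />
    </otherwise>
</choose>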

[2] Based on a unique value related with requests

We can achieve more even load balancing by using a unique value, such as the request ID, to select the backend service (AOAI); a sketch follows. The topology is the same as in [1].
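
A minimal sketch under the same assumptions (hypothetical endpoints), seeding a random choice with the request ID so that each request is assigned to one of two instances:

<!-- Hypothetical example: derive a backend index from the request ID so that
     requests are spread roughly evenly across two AOAI instances. -->
<set-variable name="backend-id" value="@(new Random(context.RequestId.GetHashCode()).Next(1, 3))" />
<choose>
    <when condition="@((int)context.Variables["backend-id"] == 1)">
        <set-backend-service base-url="https://aoai-instance-1.openai.azure.com" />
    </when>
    <otherwise>
        <set-backend-service base-url="https://aoai-instance-2.openai.azure.com" />
    </otherwise>
</choose>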

[3] Using Load-balanced pool in API Management (Preview)

As of January 23, 2024, load-balanced pool is available in API Management (public preview).

This pool lets us register multiple backend services and distribute requests across them evenly; the policy then simply references the pool, as shown below. Currently only round-robin distribution is supported.
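
Once such a pool has been created (for example via Bicep, an ARM template, or the REST API), the policy only needs to reference it; the pool name below is a hypothetical example.

<!-- "aoai-pool" is assumed to be a backend pool that aggregates several AOAI
     backends; the gateway round-robins requests across the pool members. -->
<set-backend-service backend-id="aoai-pool" />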

[4] Using load balancing service for backend services

We have several options; the following services are available in Azure. Select the appropriate one depending on whether intra-regional or inter-regional load balancing is required.

  • Load Balancer (L4): regional (inter-regional is in preview)
  • Application Gateway (L7): regional
  • Front Door (L7): inter-regional (global)
  • Traffic Manager (DNS): inter-regional (global)

We can choose Application Gateway for regional load balancing, and Front Door for cross-region load balancing. The reasons why the other options are out of scope are as follows.

  • Load Balancer: because HTTP keep-alive is often enabled, L4 load balancers do not provide equal request distribution. Furthermore, Azure Standard Load Balancer does not support load balancing across Private Endpoints.
  • Traffic Manager: AOAI does not accept requests addressed to any FQDN other than its original FQDN.

Reduce edge latency

If the application is used in multiple locations, requests are routed from the edge to the nearest origin to reduce latency. This concept is also used for disaster recovery. In such cases, we can use one of the following options.

[1] Use Premium SKU to deploy API Management in multiple Azure regions

Multi-region API Management deployment means that the API gateway functionality is deployed in multiple regions, while the management service exists only in the primary region.

As quoted below, the primary gateway endpoint provides functionality equivalent to Traffic Manager, routing requests to the regional gateway where edge latency is lowest.

When API Management receives public HTTP requests to the primary gateway endpoint, traffic is routed to a regional gateway based on lowest latency, which can reduce latency experienced by geographically distributed API consumers.

https://learn.microsoft.com/azure/api-management/api-management-howto-deploy-multi-region#about-multi-region-deployment

However, it is expensive (which is probably the biggest concern), as only the Premium SKU offers this functionality. The gateways also need to be configured to call local backend services in their own region.

By default, each API routes requests to a single backend service URL. Even if you’ve configured Azure API Management gateways in various regions, the API gateway will still forward requests to the same backend service, which is deployed in only one region. In this case, the performance gain will come only from responses cached within Azure API Management in a region specific to the request; contacting the backend across the globe may still cause high latency.

To take advantage of geographical distribution of your system, you should have backend services deployed in the same regions as Azure API Management instances. Then, using policies and @(context.Deployment.Region) property, you can route the traffic to local instances of your backend.
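
A minimal sketch of that routing, with hypothetical region names and endpoints: the policy branches on the region of the gateway handling the request and calls the AOAI instance deployed in the same region.

<!-- Hypothetical example: call the AOAI instance co-located with the regional
     API Management gateway that received the request. -->
<choose>
    <when condition="@(context.Deployment.Region == "Japan East")">
        <set-backend-service base-url="https://aoai-japaneast.openai.azure.com" />
    </when>
    <otherwise>
        <set-backend-service base-url="https://aoai-eastus.openai.azure.com" />
    </otherwise>
</choose>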

[2] Deploy API Management Basic/Standard SKU instances

If the Premium SKU does not fit your budget, you can distribute traffic by placing Front Door or Traffic Manager in front of the API Management instances deployed in each region. This is a slightly modified version of [1].

Want to use AOAI from a closed network

So far I have not considered the specific requirements of closed networks. If a closed network is mandatory, the following patterns would work. The key consideration is which component you designate as the entry point to the VNet.

[1]-a Application Gateway, to which the global load balancer directs traffic

Front Door or Traffic Manager routes requests to the Application Gateway in front of API Management, and Application Gateway routes the traffic into the VNet. The diagram shows two Application Gateways, but we do not have to create more than one instance, because a single Application Gateway can route requests on both the public network and the private network.

Currently (as of July 8, 2023), we cannot use a Private Link connection between Front Door and Application Gateway. If this configuration becomes available in the future, we could improve security, since Front Door could then be the entry point to the VNet and Application Gateway would not need a public IP address.

Please note that this configuration requires the API Management Premium SKU (in other words, it is expensive).

[1]-b API Management, to which the global load balancer directs traffic

If WAF (Web Application Firewall) is enabled at Front Door, we can also configure API Management as the entry point to the VNet. Please note that this configuration also requires the API Management Premium SKU.

[2]-a Application Gateway, to which API Management routes requests

If the cost of API Management needs to be lowered, you can instead make the Application Gateway that load-balances the backend services the entry point to the VNet. In this configuration, other API Management SKUs, such as Standard and Basic, can be used. Please note that Application Gateway should be configured to accept only requests arriving from API Management, for example by checking something like the X-Azure-FDID header, which Front Door adds to the HTTP request, or by filtering by IP address; a sketch of such a header check follows.
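
For illustration, here is how that kind of header check would look as an API Management check-header policy (the Front Door ID is a placeholder); in this topology the equivalent restriction would be enforced at Application Gateway itself, for example with a WAF custom rule or IP filtering.

<!-- Placeholder Front Door ID: reject requests that do not carry the expected
     X-Azure-FDID value added by Front Door. -->
<check-header name="X-Azure-FDID" failed-check-httpcode="403"
    failed-check-error-message="Forbidden" ignore-case="true">
    <value>00000000-0000-0000-0000-000000000000</value>
</check-header>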

[2]-b API Management Standard v2 (Preview)

API Management Standard v2, which is currently in public preview, supports VNet integration, so API Management can connect directly to the VNet without an Application Gateway.

However, it should be noted that it is still in public preview and there are some limitations. For more details, please follow the URL below.

[3] Connect from an on-premises network

Here is a sample topology for using AOAI from an on-premises network via ExpressRoute or VPN. The entry point to the VNet is the ExpressRoute or VPN gateway, but the rest of the configuration is nothing special and is similar to [1]-a and [1]-b. In this case, the API Management Premium SKU is required. App Service is deployed and connected via VNet integration and a Private Endpoint, which is also a very common pattern.


Akihiro Nishikawa
Microsoft Azure

Cloud Solution Architect @ Microsoft, and JJUG (Japan Java Users Group) board member. ♥Java (JVM/GraalVM) and open-source technologies. All views are my own.