Behind the Scenes: Our Adventure Towards Azure OpenAI Services
When It All Began
When we started implementing LLM-based services on doctrine.fr, we quickly realized that the potential they offered would generate significant excitement among our base of 15,000 clients.
We already had custom machine learning models in our data acquisition pipelines, whether for detecting dates, extracting named entities, or finding dependencies (our legal graph), and for those we operate our own GPU-powered compute instances (on Kubernetes or on managed SageMaker EC2 instances).
When it comes to the LLMs provided by the major market players, you must subscribe online and interact using an API key.
Given the growing demand for LLMs and the limited infrastructure available, each provider imposes usage limits (OpenAI, Anthropic, Google DeepMind). On top of that, one of our main requirements is that our data be processed within the European Union, which OpenAI cannot currently guarantee.
We turned to Microsoft Azure, which resells access to OpenAI services and guarantees data processing within the European Union. Quotas apply here too, but there are two ways to extend them: negotiate a commercial arrangement with the sales teams, or open new Azure subscriptions. Since quotas are attached to a subscription, opening more subscriptions gets you more quota.
A Subscription?
In the Microsoft Azure cloud, a subscription is a logical association of resources/services with a dedicated billing line. Each subscription can be subject to commercial agreements with Microsoft.
A subscription then contains logical groupings, called Resource Groups, in which services such as Azure OpenAI can be created.
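To make the hierarchy concrete, here is a minimal Terraform sketch (the names and region are hypothetical, not our real resources): a resource group inside a subscription, containing an Azure OpenAI account.

variable "subscription_id" {
  type        = string
  description = "The Azure subscription carrying the billing and the quotas"
}

provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}

# A resource group is just a logical container inside the subscription.
resource "azurerm_resource_group" "openai" {
  name     = "rg-openai-france" # hypothetical name
  location = "francecentral"
}

# The Azure OpenAI service itself: a Cognitive Services account of kind "OpenAI".
resource "azurerm_cognitive_account" "openai" {
  name                  = "doctrine-openai-france" # hypothetical name
  location              = azurerm_resource_group.openai.location
  resource_group_name   = azurerm_resource_group.openai.name
  kind                  = "OpenAI"
  sku_name              = "S0"
  custom_subdomain_name = "doctrine-openai-france" # endpoint becomes https://<this>.openai.azure.com/
}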
Accessing the OpenAI Service
By default, OpenAI services were not accessible. This has changed recently, but a series of steps is still required to obtain the holy grail.
At first, an initial form had to be completed to gain access to OpenAI services: justifying the use cases, proving the company's identity, and so on. Microsoft has since allowed its customers to access OpenAI services without jumping through these hoops.
Once access is granted, further administrative steps are still required:
• Requesting Microsoft not to log queries (abuse monitoring).
The confidentiality of data entrusted to us by our clients is critical. This is why our entire infrastructure is based in Europe (GDPR protection), and we also forbid Microsoft from tracking the requests we send to the OpenAI API.
• Requesting Microsoft not to apply content filters.
By default, Microsoft adopts a responsible approach to artificial intelligence (responsible AI) based on six pillars. This means that all exchanges sent to or received from LLM systems are subject to very strict moderation.
Each request means filling out the form again and being patient: a well-defined use case is required, and we must wait for Microsoft's teams to respond. This process can take up to 10 days.
Configuring OpenAI Services with Infrastructure as Code
At Doctrine, we configure the entire infrastructure as code. All Azure services can be managed through IaC using two Terraform providers that interact with Azure's APIs.
These two providers call two different APIs: AzAPI, which is low-level and exposes new features as soon as Azure's REST API does, and AzureRM, which offers higher-level, curated resources built on the management API.
A simple example: setting up a "deployment" (configuring access to an OpenAI model) can be achieved with either provider, as in the sketch below.
We had to use the low-level AzAPI provider because the dynamic quota feature took a year to be implemented in the AzureRM provider.
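Here is a minimal sketch of the two variants, continuing the example above. The deployment name, model version, and capacity are illustrative, not our actual configuration; the block names depend on the provider version, and the dynamic quota switch is shown as the dynamicThrottlingEnabled property documented for the Cognitive Services REST API.

# Variant 1: AzureRM (high-level, typed resource).
resource "azurerm_cognitive_deployment" "gpt4o" {
  name                 = "gpt-4o"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4o"
    version = "2024-08-06"
  }

  sku {
    name     = "Standard" # this block was called "scale" in older azurerm 3.x releases
    capacity = 30         # thousands of tokens per minute, drawn from the regional quota
  }
}

# Variant 2: AzAPI (low-level, raw ARM payload), the one that let us enable
# dynamic quota before AzureRM supported it.
resource "azapi_resource" "gpt4o" {
  type      = "Microsoft.CognitiveServices/accounts/deployments@2023-05-01"
  name      = "gpt-4o"
  parent_id = azurerm_cognitive_account.openai.id

  body = jsonencode({ # recent azapi versions also accept a plain HCL object
    sku = {
      name     = "Standard"
      capacity = 30
    }
    properties = {
      model = {
        format  = "OpenAI"
        name    = "gpt-4o"
        version = "2024-08-06"
      }
      dynamicThrottlingEnabled = true # the dynamic quota switch
    }
  })
}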
Deployment Logic for Azure Cognitive Services Resources
There is no single version of the OpenAI API: the available models vary by region, and the models themselves are versioned. Making a model available in a region is done by configuring a "deployment" there.
Here is an example of model availability by region for GPT models in the Azure cloud (as of early December 2024).
This forces us to deploy endpoints in several regions to take advantage of the different versions we need. Concretely, for a single subscription, we may end up with five different endpoints and five different API keys, as illustrated below.
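Purely for illustration (these hostnames are hypothetical, not our real resources), the endpoint inventory of one subscription can look like this, each endpoint carrying its own API key:

locals {
  # One Azure OpenAI endpoint per regional resource; each has its own API key,
  # which lives in our secret manager rather than in code.
  openai_endpoints = {
    france_central = "https://doctrine-openai-francecentral.openai.azure.com/"
    sweden_central = "https://doctrine-openai-swedencentral.openai.azure.com/"
    norway_east    = "https://doctrine-openai-norwayeast.openai.azure.com/"
    west_europe    = "https://doctrine-openai-westeurope.openai.azure.com/"
    west_us        = "https://doctrine-openai-westus.openai.azure.com/"
  }
}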
Can You See the Challenges Coming?
Remember that we worked around our quota issues by opening new subscriptions. This makes the situation even more complicated: a developer no longer needs to know five different API keys, but FIFTEEN.
And they have no way of knowing the availability status of each deployment across regions:
• Some regions may provide access to certain models, while others do not;
• Some regions may face a quota-exceeded situation;
• Some regions may undergo temporary maintenance.
Deployment of a Managed Microsoft API (SmartAPIM)
To address this issue, we decided to implement a single entry point that can reach OpenAI services.
This means one API key per application (instead of fifteen) and only one path to configure in the applications.
Here's what it looks like in practice: a developer retrieves the secret they need from a secret manager (HashiCorp Vault), and the metadata attached to that secret provides the address of the endpoint to call.
First Version of the Managed API: A Rather Simplistic APIM
The logic is simple: the developer (or program) sends their API request to our Managed API, which sequentially queries one endpoint (in a given subscription and region), then another, and another, until it receives a satisfactory response.
By a satisfactory response, we mean an endpoint that actually answers the request (HTTP 200), is not rate-limited, has a correctly configured deployment, and hosts the requested LLM model in its region.
Initially, we only had one subscription and three regions (France, Sweden, and Norway). However, as the number of subscriptions tripled and the list of regions grew, we faced a major challenge: an explosion in latency.
Indeed, we had declared about fifteen different endpoints (spread across regions and subscriptions) to our load balancer, while some of the latest LLM models were only available in specific regions (West US 2 or Sweden Central). Our load balancer was simplistic: it queried the endpoints one by one, even those that could not serve the request.
Because there was a delay of one second between each new attempt, we could lose up to 14 seconds before finally querying the correct endpoint — and that was if it was even able to handle the request!
Implementing Intelligence Within the APIM (where the APIM becomes smart)
To address this issue, we developed a new load balancer alongside the first one, this time a smarter one.
This load balancer handles each request, extracts the deployment name, and attempts to query the various endpoints.
If an endpoint responds with a permanent error (e.g., “the deployment does not exist here — error 404”), the APIM will use a local cache and record that this endpoint is not capable of serving this deployment.
The cache duration can be set as desired, and there is no limit on the number of calls to the cache. Its size, however, depends on the chosen APIM tier: 10 MB in the Developer tier (used for our non-production environments) and 50 MB in the Basic tier, which is more than sufficient for our needs (a text record of a few kilobytes).
Here is the format of the records in the local cache:
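The original record is not reproduced here, so the following is only a rough reconstruction: apart from the nonServedDeployments field mentioned below, the field names and values are hypothetical.

{
  "deployment": "gpt-4o",
  "nonServedDeployments": [
    "https://doctrine-openai-francecentral.openai.azure.com/",
    "https://doctrine-openai-norwayeast.openai.azure.com/"
  ]
}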
The benefit is that, for the next request with the same deployment name, the APIM first consults its cache and excludes all the endpoints known to be unable to serve that deployment (listed in nonServedDeployments).
Thus, only the “functional” endpoints are queried by the APIM, and the correct endpoints are directly targeted.
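Our production policy is not reproduced here, but a stripped-down sketch of the caching idea, using APIM's built-in cache-lookup-value and cache-store-value policies, looks roughly like this. The cache key format, the 300-second duration, the {deployment-id} template parameter, and the chosenBackend variable (set by the routing step we omit) are assumptions made for the sake of the example.

<policies>
  <inbound>
    <base />
    <!-- The deployment name, assuming the operation template declares a {deployment-id} parameter. -->
    <set-variable name="deployment" value='@(context.Request.MatchedParameters["deployment-id"])' />
    <!-- Endpoints already known to be unable to serve this deployment (empty string if none yet). -->
    <cache-lookup-value key='@("nonServed-" + (string)context.Variables["deployment"])'
        variable-name="nonServedDeployments" default-value="" />
    <!-- Routing logic omitted: pick the first backend not listed in nonServedDeployments
         and record it in a "chosenBackend" variable. -->
  </inbound>
  <backend>
    <forward-request />
  </backend>
  <outbound>
    <base />
    <choose>
      <!-- A 404 means "this deployment does not exist on this backend": remember it. -->
      <when condition="@(context.Response.StatusCode == 404)">
        <cache-store-value key='@("nonServed-" + (string)context.Variables["deployment"])'
            value='@((string)context.Variables["nonServedDeployments"] + ";" + (string)context.Variables["chosenBackend"])'
            duration="300" />
      </when>
    </choose>
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>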
A Complex Implementation, Simplified by APIM Tracing Capabilities
SmartAPIM is written in the APIM policy language: C# expressions embedded in XML. We started from the example provided by the Microsoft teams and extensively modified it to meet our needs.
Any syntax error results in rejection by Microsoft’s REST API, and it is not possible to perform a syntax validation before submitting the policy.
The details of the error message (in the response payload of the REST API) are, fortunately, somewhat clearer:
{"error":{"code":"ValidationError","message":"One or more fields contain incorrect values:","details":[{"code":"ValidationError","target":"representation","message":"Name cannot begin with the ' ' character, hexadecimal value 0x20. Line 2, position 2."}]}}: timestamp="2024-12-10T09:49:03.348+0100"
Another significant challenge lies in the inability to test modifications upstream.
When something malfunctions, the causes of the problems can be multiple:
• Is the client’s request malformed?
• Is the APIM logic inconsistent?
• Is the backend not responding correctly?
• Has the client waited long enough?
A very useful feature that saved us is the ability to obtain detailed traces of the APIM's behavior during request processing. A specific header, Ocp-Apim-Trace: true, must be added to the request, and the API key used must have tracing enabled ("Enable Tracing" mode). The response then contains a special header, Ocp-Apim-Trace-Location, with a link to a lengthy JSON object.
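For example, the exchange looks roughly like this (the hostname, path, and API version are illustrative):

Request:
POST https://smartapim.internal.example/openai/deployments/gpt-4o/chat/completions?api-version=2024-06-01
Ocp-Apim-Subscription-Key: <an API key with tracing enabled>
Ocp-Apim-Trace: true

Response headers:
HTTP/1.1 200 OK
Ocp-Apim-Trace-Location: <a time-limited link to the JSON trace>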
In this object, there is a detailed breakdown of the request processing for each line of executed code.
On the other hand, the client originating the request to the APIM must be able to retrieve and save the response headers somewhere, and the “Enable Tracing” mode must be reactivated every 60 minutes.
Reducing Smart APIM Operating Costs
The first design of our APIM was based on the “Premium” version of the Managed API offered by Microsoft. For a very high cost, you get a load balancer capable of interconnecting with private networks and a significant SLA of 99.99%. Under optimal conditions, the system can handle up to 4,000 requests per second (scalable up to 12 times more).
Here are the trade-offs we made to choose a less expensive version of the APIM:
• The SLA is not justified since the underlying system being queried (Azure OpenAI) only has an SLA of 99.9%;
• We do not need APIM interconnection with private Azure networks: we rely on IP filtering and TLS encryption for communications between the APIM and OpenAI endpoints;
• We conducted load tests and found that the limiting factor was not the APIM but the underlying endpoints (Azure OpenAI). Beyond 150 requests per second, we exhaust all the quota capacity allocated to us.
As a result, we downgraded our load balancer to “Basic” mode for production (1,000 requests per second) and “Developer” mode for other environments, and we still have plenty of room to spare!
Conclusion and Summary
Here’s what has kept us busy over the past few months. If you use our products, such as the Legal or Tax ChatBot, productivity tools for law firms, or the automatic summarization of court decisions, your request goes through the platform we have set up.
We are satisfied with the solution implemented.
In terms of quotas, we have tripled our available capacity, and we can increase it even further if necessary.
All our requests are handled in Europe. Of course, we would have preferred to have all quotas directly available within a single subscription. Microsoft’s REST API is very well-documented, and the error messages are quite clear, which is a positive point and made it easier for us to deploy the infrastructure as code (with Terraform).
On the positive side, quota issues are gradually being resolved on Azure, notably with the introduction of Data Zones (requests are processed across a pool of regions within the European Union or the United States). However, the path to simple access to scalable managed LLM services remains long and winding.