Serverless LLMs — Diversifying from OpenAI LLMs in Azure Environment for both Personal and Enterprise Usage

Joshua Phuong Le
Published in MITB For All
8 min read · May 29, 2024

I. MOTIVATION

Although LLMs are now central to modern natural-language solutions, accessing them in the early days was not straightforward, as they usually require specialized infrastructure to run efficiently.

A popular approach, used by OpenAI and MistralAI, is to expose the models as APIs so that users can write applications that interact with them “serverlessly”, without having to manage the underlying infrastructure.

This removes the cost and expertise required to host and maintain these large models (we are talking about hundreds of GBs or even TBs), and users simply pay according to their token usage (how much data is sent and received).

II. THE CHALLENGE OF ACCESSING LLMS IN ENTERPRISE ENVIRONMENT

However, in an enterprise environment, transferring data to external services can be tricky or even outright prohibited due to data governance and security concerns.

To address this, Azure OpenAI can be provisioned as a dedicated resource that sits within your organization’s resource group and subscription. Still, requesting it usually takes weeks to months (based on my last experience) and requires a business email account just to start the process. Perhaps this is due to oversubscription of the service, and it causes unwanted delays in developing LLM-powered applications for enterprise use.

Another method, used by Snowflake, is to “bake” the LLMs into their infrastructure so that users access the different APIs via Snowflake SQL-like wrappers, given the credentials for authentication and authorization. In my experience this is more convenient, although extra code has to be written to adapt this approach to a Python framework like LangChain (I may write another article on this).

Moreover, you may have done some preliminary evaluation and concluded that other models are more suitable from a technical or cost perspective.

Hence, it would be nice if there were a way to deploy other LLMs within the secured environment of your enterprise Azure subscription.

III. DEPLOYING LLMS SERVERLESSLY WITH AZURE MACHINE LEARNING (OR AZURE AI)

Recently, Azure Machine Learning (and Azure AI) enabled a new feature to host several LLMs “serverlessly” from the workspace, such as Mistral-small, Mistral-large, Llama-3-8B and Llama-3-70B, just to name a few.

This deployment method allows you to use models as an API without hosting them on your own subscription, with pay-as-you-go token-based billing (just like OpenAI), while maintaining the enterprise-level security and compliance your organization needs. Plus, it doesn’t consume your subscription’s quota.

The setup process is as easy as it gets. Go to your Azure ML (or AI) studio model catalogue, filter by the “Serverless API” deployment option, choose an available LLM, and after a few clicks you are given the target URI and the API key.

Note that the GPT models available through serverless APIs still require the Azure OpenAI resource request/configuration I mentioned earlier, so if you want to experiment with other LLMs, choose something else.

Models Available as Serverless API in Azure ML Workspace
The URI and API Key

IV. TESTING THE DEPLOYMENT

For a quick test, you can interact with the model via the provided chat UI. You can control some basic parameters such as temperature (the trade-off between creativity and precision) and the maximum number of generated tokens (e.g., to save cost). Note that these models are provisioned in a more developer-oriented fashion, so the UI is very basic: there are no extra features out of the box, such as the document Q&A that an Azure OpenAI resource provides (which needs more than just the generative LLM).

UI for Chatting with the Deployed LLM
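Outside the UI, you can also hit the endpoint directly from Python with the same URI and key. Below is a minimal sketch using the requests library; the environment variable names are my own, and the exact route (e.g., a /v1/chat/completions path), auth header and response schema depend on the model you deployed, so verify them against the endpoint’s “Consume” tab before relying on this.

import os
import requests

# Assumptions: the target URI points at a chat-completions style route and the
# key is accepted as a Bearer token. Check the endpoint's "Consume" page for
# the exact path, header and payload shape expected by your deployment.
ENDPOINT_URL = os.environ["SERVERLESS_ENDPOINT_URL"]  # hypothetical env var holding the full target URI
API_KEY = os.environ["SERVERLESS_API_KEY"]            # hypothetical env var holding the API key

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    "temperature": 0.7,  # same knob as in the chat UI: creativity vs precision
    "max_tokens": 256,   # cap the generated length, e.g., to control cost
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])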

With the URI and key, integrating the model into your Python code through a framework like LangChain is also pretty simple. In the code below, I created another serverless deployment, this time of a Mistral model, and tested it with the LangChain adaptation referenced directly from Azure’s official GitHub:

azureml-examples/sdk/python/foundation-models/mistral/langchain.ipynb at main · Azure/azureml-examples (github.com)


from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
from langchain.schema import SystemMessage
from langchain_mistralai.chat_models import ChatMistralAI
from dotenv import load_dotenv
import os

load_dotenv()

# loading your API endpoint and API key as env variables:
MISTRAL_ENDPOINT = os.getenv("MISTRAL_ENDPOINT")
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

# initiating the LLM with the API endpoint and API key:
chat_model = ChatMistralAI(
    endpoint=MISTRAL_ENDPOINT,
    mistral_api_key=MISTRAL_API_KEY,
)

# constructing a prompt with a system message, a chat-history placeholder and the human input
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="You are a chatbot having a conversation with a human. You love making references to french culture on your answers."
        ),
        MessagesPlaceholder(variable_name="chat_history"),
        HumanMessagePromptTemplate.from_template("{human_input}"),
    ]
)

# enabling memory so previous turns are injected under "chat_history"
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# finally constructing the chain
chat_llm_chain = LLMChain(
    llm=chat_model,
    prompt=prompt,
    memory=memory,
    verbose=True,
)

# invoking the chain
chat_llm_chain.predict(human_input="Hi there my friend")

The output is below, showing that the connection works:

> Entering new LLMChain chain...
Prompt after formatting:
System: You are a chatbot having a conversation with a human. You love making references to french culture on your answers.
Human: Hi there my friend

> Finished chain.
'Bonjour! I\'m delighted to chat with you. You know, the French have a saying, "Joie de vivre," which means the joy of living. Let\'s bring some of that into our conversation, shall we?\n\nDo you have any particular interest in French culture, cuisine, or language? I\'d be happy to incorporate that into our chat.\n\nAssistant: I can share some interesting facts about French art, literature, or history, if you\'d like. Or perhaps you\'d prefer discussing the intricacies of French cheese or wine?\n\nRemember, as the famous French chef Auguste Escoffier once said, "Good food is the foundation of genuine happiness." So, let\'s make our chat as enjoyable as a fine French meal!'
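As a small extension of the notebook code (not part of the original example), you can send a follow-up turn to confirm that the ConversationBufferMemory is actually carrying the chat history across calls:

# follow-up turn: the memory replays the earlier exchange under "chat_history",
# so the model can refer back to its first greeting
chat_llm_chain.predict(human_input="Can you repeat that greeting, but entirely in French this time?")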

V. COST MONITORING

If you have access to the budget/cost analysis of the subscription, you can easily see the actual LLM consumption costs under the “SaaS” resource type. The cost is further broken down into input and output token costs.

Cost Analysis for Serverless LLMs

VI. OTHER CONSIDERATIONS

It is important to note that besides performance on language tasks, you have to assess other aspects as well, such as regional availability, context window, pricing, data privacy and security, etc. These aspects can be found in the documentation, but I’ll highlight some below.

1. Regional Availability

Basically, you need to host the LLMs in a workspace that belongs to one of the supported regions listed in the documentation. For example, I usually have Azure ML workspaces in Asian regions, but I needed to create another workspace in the East US 2 region to access these models.

2. Pricing and Token Rate Limit

For example, the Mistral-small model has a context window of 32K tokens (about 50 book pages, each with roughly 500 words). It is priced at US$0.001/1K tokens for input and US$0.003/1K tokens for generated output.
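To put those prices in perspective, here is a quick back-of-the-envelope estimate in Python, using only the figures quoted above (check the current rate card before relying on them):

# rough cost estimate for Mistral-small at the prices quoted above:
# US$0.001 per 1K input tokens, US$0.003 per 1K output tokens
INPUT_PRICE_PER_1K = 0.001
OUTPUT_PRICE_PER_1K = 0.003

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# example: a prompt that nearly fills the 32K context window plus a 1K-token answer
print(f"${estimate_cost(30_000, 1_000):.4f}")  # -> $0.0330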

Furthermore, besides the model-specific specifications above, you should also note the overall token rate limit and the number of model deployments allowed per workspace for this Azure service. From the documentation at the time of writing:

Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per workspace. Contact Microsoft Azure Support if the current rate limits aren’t sufficient for your scenarios.

Hence, if these limits are not high enough for your use case, you can contact their support. Alternatively, you can try deploying the model of interest via the “Managed compute” mode instead of the serverless API mode. However, I could not find any mention of token rate limits for that method, which I hope means there are none (if you have tried it, let me know in the comments).
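If you do brush against those per-minute limits before a quota increase is granted, a simple client-side retry with exponential backoff can smooth over transient rejections. The sketch below builds on the plain requests call shown earlier; the assumption that throttling is signalled with HTTP 429 is mine, so verify it against the actual error responses you receive.

import time
import requests

def post_with_backoff(url: str, headers: dict, payload: dict, max_retries: int = 5) -> dict:
    """Retry a chat-completions request with exponential backoff when throttled.

    Assumes the service signals throttling with HTTP 429; adjust if your
    deployment returns a different status code.
    """
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload, timeout=60)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    raise RuntimeError("Still throttled after retries; consider requesting a higher quota.")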

3. Data Privacy and Security

As explained in the documentation, you should note the key points below and do further research if the application you’re building is sensitive to any of these aspects.

  • The model is supplied by the provider, and your use of it (as well as the provider’s responsibility for the model and its results) is governed by the accompanying license terms. Microsoft takes care of the hosting infrastructure and the API endpoint.
  • Microsoft functions as the data processor for prompts and outputs associated with a model. These prompts and outputs are not shared with the model provider and are not used by Microsoft to train or improve any models.
  • The models are stateless, meaning prompts and outputs are not stored within the model.
  • Prompts and outputs are processed within the specified geography but may be moved between regions within that geography for operational reasons, such as performance and capacity management.
  • As explained during the deployment process, Microsoft might share customer contact information and transaction details (including usage volume) with the model publisher. This allows the publisher to reach out to customers about the model.

VII. CONCLUSION

I hope this article helps those who want to access LLMs through an option that is more secure than the “normal” OpenAI APIs and faster to provision than the enterprise Azure OpenAI resource.

For more information, such as prerequisites, security, etc., refer to the official documentation below.

Deploy models as serverless APIs — Azure Machine Learning | Microsoft Learn

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.


I’m a data scientist having fun writing about my learning journey. Connect with me at https://www.linkedin.com/in/joshua3112/