Azure GPT Pricing Review: Benchmarks and Usage Guide

AI Tech Guy
6 min read · Sep 19, 2024

Before we get started: if you'd like to use the power of AI models without restrictions, and without the hassle of paying for 10+ subscriptions, use Anakin AI to manage them all!

Additionally, you can easily build your AI agent workflow with a no-code app builder and turbocharge your productivity with the power of AI in no time!

Azure GPT Pricing Overview

Azure OpenAI Service offers a variety of pricing models tailored to different usage needs, which can be confusing without a detailed breakdown. Here, we will delve into the pricing structure, benchmarks, and a usage guide to help users make informed decisions.

Pricing Models

Azure OpenAI Service provides two primary pricing models: Standard (On-Demand) and Provisioned (PTU). Each model has its own set of costs and benefits.

  • Standard (On-Demand) Pricing:
    • Input and Output Tokens: The cost is based on the number of input and output tokens used. This model is ideal for users with variable or unpredictable workload patterns.
    • No Upfront Costs: Users pay only for what they use, making it suitable for those who do not want to commit to a fixed amount of resources.
    • Flexibility: Users can scale up or down as needed without long-term commitments.
  • Provisioned (PTU) Pricing:
    • Provisioned Throughput Units (PTUs): Users purchase a predefined number of PTUs, which guarantee a certain level of performance. This model is best for users who require consistent, high volumes of requests.
    • Cost Efficiency: PTUs can offer cost savings for high-volume users, as they provide a dedicated amount of compute resources.
    • Reservation Options: Users can opt for monthly or yearly reservations, which can further reduce costs for long-term commitments.
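As a back-of-the-envelope illustration, the trade-off between the two models can be sketched in a few lines of Python. All rates below are hypothetical placeholders, not current Azure prices; check the Azure pricing page for your model and region:

```python
def monthly_on_demand_cost(tokens: int, price_per_1k: float) -> float:
    """Standard model: pay per token actually processed."""
    return tokens / 1000 * price_per_1k

def monthly_ptu_cost(ptus: int, price_per_ptu_hour: float, hours: int = 730) -> float:
    """Provisioned model: pay a fixed hourly rate per PTU, used or not."""
    return ptus * price_per_ptu_hour * hours

# Hypothetical rates and volumes for illustration only.
print(f"on-demand:   ${monthly_on_demand_cost(500_000_000, 0.03):,.2f}")
print(f"provisioned: ${monthly_ptu_cost(50, 2.50):,.2f}")
```

The crossover point depends entirely on your real token volume and the published rates: the provisioned model only pays off once utilization of the reserved throughput is consistently high.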

Model-Specific Pricing

Different models within Azure OpenAI have varying pricing structures based on their context size and capabilities:

  • GPT-4 and GPT-4-32k:
    • These models are available with context sizes of 8K and 32K tokens respectively.
    • Pricing Details: For GPT-4, the minimum scaling increment is 50 PTUs; for GPT-4-32k, it is 200 PTUs.
  • GPT-3.5 Turbo:
    • This model comes in variants such as GPT-3.5-Turbo-0125 and GPT-3.5-Turbo-Instruct.
    • Pricing Details: The GPT-3.5-Turbo models have different context sizes (4K and 16K tokens) and minimum scaling increments (100 PTUs for some variants).

Benchmarking and Performance

To optimize the performance of Azure OpenAI deployments, benchmarking is crucial. Here’s how you can use the Azure OpenAI Benchmarking tool:

Setting Up the Benchmarking Tool

  1. Prerequisites:
    • An Azure OpenAI Service resource with a Provisioned or Provisioned-Managed deployment.
    • The resource's access key stored in the OPENAI_API_KEY environment variable, and the endpoint URL at hand.
  2. Installation:
    • Install the necessary packages using pip install -r requirements.txt.
    • Build and run the benchmark tool using either Python or Docker:
    • python -m benchmark.bench load --help
    • docker build -t azure-openai-benchmarking .
    • docker run azure-openai-benchmarking load --help
  3. Running the Benchmark:
    • Use the following command to run the benchmark:
    • python -m benchmark.bench load --deployment gpt-4 --rate 60 --retry exponential --endpoint https://myaccount.openai.azure.com
    • This command outputs key performance statistics, including average and 95th-percentile latencies and deployment utilization.

Example Output

The benchmark tool provides detailed statistics for performance evaluation:

2023-10-19 18:21:06 INFO using shape profile balanced: context tokens: 500, max tokens: 500
2023-10-19 18:21:06 INFO warming up prompt cache
2023-10-19 18:21:06 INFO starting load...

2023-10-19 18:21:07 rpm: 5.0 requests: 5 failures: 0 throttled: 0 ctx tpm: 2505.0 gen tpm: 515.0 ttft avg: 0.937 ttft 95th: 1.321 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.223 e2e 95th: 1.658 util avg: 0.8% util 95th: 1.6%

This output helps in optimizing the solution design by adjusting prompt size, generation size, and PTUs deployed.
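If you run the benchmark repeatedly, for example across different prompt sizes or PTU counts, it helps to pull the metrics out of these log lines programmatically. A minimal sketch, assuming the key: value format shown above stays stable:

```python
import re

def parse_benchmark_line(line: str) -> dict:
    """Turn one benchmark stats line into a dict of metric name -> float."""
    # Drop the leading "YYYY-MM-DD HH:MM:SS" timestamp.
    _, _, metrics = line.split(" ", 2)
    # Matches pairs like "rpm: 5.0", "ttft avg: 0.937", "util 95th: 1.6%".
    pairs = re.findall(r"([a-z][a-z0-9]*(?: [a-z0-9]+)?):\s*([\d.]+)%?", metrics)
    return {key: float(value) for key, value in pairs}

line = ("2023-10-19 18:21:07 rpm: 5.0 requests: 5 failures: 0 throttled: 0 "
        "ctx tpm: 2505.0 gen tpm: 515.0 ttft avg: 0.937 ttft 95th: 1.321 "
        "tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.223 e2e 95th: 1.658 "
        "util avg: 0.8% util 95th: 1.6%")
stats = parse_benchmark_line(line)
print(stats["e2e 95th"], stats["util avg"])  # 1.658 0.8
```

Collecting these dicts across runs makes it easy to chart latency and utilization against prompt size and PTU count.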

Usage Guide

Accessing Azure OpenAI Service

  1. Requesting Access:
    • For certain models like GPT-4, users need to request access through a form, especially during the preview phase.
    • Existing customers can apply for access to new models and use cases without reapplying to the service.
  2. Deploying Models:
    • Models such as GPT-4 and GPT-4-32k are now generally available in all US regions and do not require a waitlist.
    • Users can deploy these models using the Azure portal or via APIs.

Integrating with Other Azure Services

Azure OpenAI can be integrated with various Azure services to enhance functionality:

  1. Azure Cognitive Search and Blob Storage:
    • Users can integrate Azure OpenAI with Cognitive Search and Blob Storage to access company data. This can be done using custom plugins or tools like Dataherald, which supports Langchain custom tools and agents.
    • Example: Using Databricks to create tables pointing to Azure Blob Storage and then connecting Dataherald to the Databricks endpoint.
  2. Speech to Text with the Whisper Model:
    • Azure OpenAI now supports speech-to-text APIs powered by OpenAI's Whisper model, allowing users to get AI-generated text from speech audio.

Use Cases and Optimization

Enterprise Use Cases

  • Enterprise Performance: Azure OpenAI, targeted at enterprise clients, offers faster performance compared to OpenAI's public API. For example, some benchmarks report Azure GPT-4 responding 2.77 times faster than OpenAI's GPT-4 for the same prompts.
  • Scalability: Enterprise clients can leverage higher default quota limits and increased training job sizes for fine-tuning models, allowing for more extensive and complex use cases.

Fine-Tuning Models

  • Training Job Size: The maximum training job size has been increased to 2 billion tokens for all models, with a maximum training job duration of 720 hours.
  • Hosting and Training: Fine-tuning models can be hosted and trained using Azure OpenAI, with detailed pricing for training and hosting per hour.

Regional Availability and Quota Limits

  1. New Regions:
    • Azure OpenAI is now available in additional regions such as Sweden Central and Switzerland North, enhancing global access.
  2. Regional Quota Limits:
    • Quota limits have been increased for certain models and regions, allowing higher tokens per minute (TPM) and better performance for migrated workloads.

Cost Management

Standard vs Provisioned Pricing

  1. On-Demand Pricing:
    • Suitable for variable workloads with no upfront costs; users pay for each token used.
    • Example: A user processing 100,000 input tokens and 100,000 output tokens is charged at the per-token rate for the specific model used.
  2. Provisioned Pricing:
    • Ideal for high-volume, consistent workloads with dedicated PTUs.
    • Example: A user purchasing 50 PTUs for GPT-4 has a guaranteed level of performance, with costs based on the hourly PTU rate and any applicable reservations.
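The on-demand example can be made concrete. The per-1K-token rates below are hypothetical placeholders (real rates are model-specific and listed on the Azure pricing page):

```python
# Hypothetical rates for illustration only, not current Azure prices.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

def on_demand_charge(input_tokens: int, output_tokens: int) -> float:
    """Standard-tier charge: input and output tokens are billed at separate rates."""
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)

# The 100,000-input / 100,000-output example at the assumed rates:
print(f"${on_demand_charge(100_000, 100_000):.2f}")  # $9.00
```

Note that output tokens are typically priced higher than input tokens, so generation-heavy workloads cost more per token processed.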

Reservation Options

  1. Monthly and Yearly Reservations:
  • Users can opt for monthly or yearly reservations to reduce costs for long-term commitments. These reservations provide fixed costs per PTU, helping in budget planning.

Best Practices for Optimization

Prompt and Generation Size Optimization

  1. Benchmarking: Use the Azure OpenAI Benchmarking tool to determine the optimal prompt and generation sizes for your workload.
  2. Load Testing: Run load tests with varying prompt sizes and PTUs to find the sweet spot for your application’s performance and cost.

Utilizing PTUs Efficiently

  1. Dedicated Resources: Ensure that the provisioned PTUs are utilized efficiently by maintaining a consistent workload.
  2. Scaling: Adjust the number of PTUs based on the benchmark results to achieve optimal performance and cost balance.
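One way to turn benchmark results into a PTU count is to divide your target throughput by the per-PTU throughput you measured, then round up to the model's minimum scaling increment (50 PTUs for GPT-4, per the pricing section above). A sketch, where the throughput figures are assumptions for illustration:

```python
import math

def ptus_needed(target_tpm: float, measured_tpm_per_ptu: float,
                min_increment: int = 50) -> int:
    """Round the required PTU count up to the deployment's scaling increment."""
    raw = target_tpm / measured_tpm_per_ptu
    return max(min_increment, math.ceil(raw / min_increment) * min_increment)

# Assume benchmarks showed ~1,000 generated TPM per PTU and the app
# needs 120,000 TPM at peak:
print(ptus_needed(120_000, 1_000))  # 150
```

Re-run the benchmark after scaling: per-PTU throughput varies with prompt size, generation size, and traffic shape, so the measured figure should be refreshed whenever the workload changes.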

Data Integration and Access

  1. Secure Data Access: Use private endpoints and filter access to sensitive documents to secure data integration with Azure OpenAI.
  2. Automated Index Refresh: Automatically refresh the index on a schedule to ensure that the data remains up-to-date.

Conclusion

Azure OpenAI Service offers robust pricing models and performance capabilities, making it a powerful tool for enterprise and scale-intensive applications. By understanding the pricing structures, leveraging benchmarking tools, and optimizing usage, users can maximize the value from Azure OpenAI while managing costs effectively. Whether you are integrating with other Azure services or fine-tuning models, Azure OpenAI provides a comprehensive suite of features to enhance your AI-driven projects.
