Use this article to learn about calculating and understanding costs associated with PTU. For an overview of the provisioned throughput offering, see What is provisioned throughput?. When you're ready to sign up for the provisioned throughput offering, see the getting started guide.
Note
In function calling and agent use cases, token usage can be variable. You should understand your expected Tokens Per Minute (TPM) usage in detail before migrating workloads to PTU.
Provisioned throughput units
Provisioned throughput units (PTUs) are generic units of model processing capacity that you can use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions. Provisioned throughput units are granted to a subscription as quota. Each quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.
Understanding provisioned throughput billing
Azure OpenAI Provisioned (also known as regional), Data Zone Provisioned, and Global Provisioned deployments are purchased on demand at an hourly rate based on the number of deployed PTUs, with substantial term discounts available via the purchase of Azure Reservations.
The hourly model is useful for short-term deployment needs, such as validating new models or acquiring capacity for a hackathon. However, the discounts provided by the Azure Reservation for Azure OpenAI Provisioned, Data Zone Provisioned, and Global Provisioned are considerable and most customers with consistent long-term usage will find a reserved model to be a better value proposition.
Note
Azure OpenAI Provisioned customers onboarded prior to the August self-service update use a purchase model called the Commitment model. These customers can continue to use this older purchase model alongside the Hourly/reservation purchase model. The Commitment model is not available for new customers or certain new models introduced after August 2024. For details on the Commitment purchase model and options for coexistence and migration, see the Azure OpenAI Provisioned August Update.
Model independent quota
Unlike the Tokens Per Minute (TPM) quota used by other Azure OpenAI offerings, PTUs are model-independent. PTUs can be used to deploy any supported model/version in the region.
Quota for provisioned deployments shows up in Azure AI Foundry as the following deployment types: global provisioned, data zone provisioned, and standard provisioned.
Deployment type | Quota name |
---|---|
provisioned | Provisioned Managed Throughput Unit |
global provisioned | Global Provisioned Managed Throughput Unit |
data zone provisioned | Data Zone Provisioned Managed Throughput Unit |
Note
Global provisioned and data zone provisioned deployments are only supported for gpt-4o and gpt-4o-mini models at this time. For more information on model availability, review the models documentation.
Hourly usage
Provisioned, Data Zone Provisioned, and Global Provisioned deployments are charged an hourly rate ($/PTU/hr) based on the number of PTUs that have been deployed. For example, a 300 PTU deployment is charged the hourly rate times 300. All Azure OpenAI pricing is available in the Azure Pricing Calculator.
If a deployment exists for a partial hour, it will receive a prorated charge based on the number of minutes it was deployed during the hour. For example, a deployment that exists for 15 minutes during an hour will receive 1/4th the hourly charge.
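The proration described above can be sketched in a few lines; the hourly rate used here is a placeholder for illustration, not a published price:

```python
# Sketch of the prorated hourly PTU charge described above.
# hourly_rate_per_ptu is a placeholder value, not an actual Azure price.
def prorated_charge(ptus: int, minutes_deployed: int, hourly_rate_per_ptu: float) -> float:
    """Charge for a partial hour, prorated by minutes deployed during that hour."""
    return ptus * hourly_rate_per_ptu * (minutes_deployed / 60)

# A 300 PTU deployment that exists for 15 minutes incurs 1/4 of the hourly charge.
print(prorated_charge(ptus=300, minutes_deployed=15, hourly_rate_per_ptu=1.0))  # 75.0
```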
If the deployment size is changed, the costs of the deployment will adjust to match the new number of PTUs.
Paying for provisioned, data zone provisioned, and global provisioned deployments on an hourly basis is ideal for short-term deployment scenarios, such as quality and performance benchmarking of new models or temporarily increasing PTU capacity to cover an event like a hackathon.
Customers that require long-term usage of provisioned, data zone provisioned, and global provisioned deployments, however, might pay significantly less per month by purchasing a term discount via Azure Reservations, as discussed later in this article.
Important
It's not recommended to scale production deployments according to incoming traffic and pay for them purely on an hourly basis. There are two reasons for this:
- The cost savings achieved by purchasing Azure Reservations for Azure OpenAI Provisioned, Data Zone Provisioned, and Global Provisioned are significant, and it will be less expensive in many cases to maintain a deployment sized for full production volume paid for via a reservation than it would be to scale the deployment with incoming traffic.
- Having unused provisioned quota (PTUs) doesn't guarantee that capacity will be available to support an increase in the size of the deployment when required. Quota limits the maximum number of PTUs that can be deployed, but it isn't a capacity guarantee. Provisioned capacity for each region and model dynamically changes throughout the day and might not be available when required. As a result, it's recommended to maintain a permanent deployment to cover your traffic needs (paid for via a reservation). Charges for deployments on a deleted resource will continue until the resource is purged. To prevent this, delete a resource’s deployment before deleting the resource. For more information, see Recover or purge deleted Azure AI services resources.
How much throughput per PTU you get for each model
The amount of throughput (measured in tokens per minute, or TPM) a deployment gets per PTU is a function of the input and output tokens processed in a given minute. Generating output tokens requires more processing than input tokens. Starting with the GPT-4.1 models, the system matches the global standard price ratio between input and output tokens. Cached tokens are deducted 100% from utilization.
For example, for gpt-4.1 (2025-04-14), 1 output token counts as 4 input tokens toward your utilization limit, which matches the pricing. Older models use different ratios. For a deeper understanding of how different mixes of input and output tokens affect the throughput your workload needs, see the Azure OpenAI capacity calculator.
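Under the stated assumptions for gpt-4.1 (a 4:1 output-to-input ratio and full deduction of cached tokens), a call's contribution to utilization can be approximated as follows; the function name is illustrative:

```python
# Illustrative sketch: convert a call's token counts into input-token equivalents
# for utilization accounting. Assumes 1 output token counts as `output_ratio`
# input tokens, and cached input tokens are fully deducted.
def effective_input_tokens(input_tokens: int, output_tokens: int,
                           cached_tokens: int = 0, output_ratio: int = 4) -> int:
    """Input-token equivalents that a single call contributes to utilization."""
    return (input_tokens - cached_tokens) + output_tokens * output_ratio

# 1,000 prompt tokens (200 of them cached) plus 300 completion tokens:
print(effective_input_tokens(1000, 300, cached_tokens=200))  # 2000
```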
Topic | gpt-4o | gpt-4o-mini | o1 | gpt-4.1 |
---|---|---|---|---|
Global & data zone provisioned minimum deployment | 15 | 15 | 15 | 15 |
Global & data zone provisioned scale increment | 5 | 5 | 5 | 5 |
Regional provisioned minimum deployment | 50 | 25 | 50 | 50 |
Regional provisioned scale increment | 50 | 25 | 50 | 50 |
Input TPM per PTU | 2,500 | 37,000 | 230 | 3,000 |
Latency Target Value | 25 Tokens Per Second | 33 Tokens Per Second | 25 Tokens Per Second | 44 Tokens Per Second |
For a full list, see the Azure OpenAI Service in Azure AI Foundry portal calculator.
Determining the number of PTUs needed for a workload
Determining the right amount of provisioned throughput, or PTUs, you require for your workload is an essential step to optimizing performance and cost.
PTUs represent an amount of model processing capacity. Similar to your computer or databases, different workloads or requests to the model will consume different amounts of underlying processing capacity. The conversion from throughput needs to PTUs can be approximated using historical token usage data or call shape estimations (input tokens, output tokens, and requests per minute) as outlined in our performance and latency documentation. To simplify this process, you can use the Azure OpenAI Capacity calculator to size specific workload shapes.
A few high-level considerations:
- Generations require more capacity than prompts
- For GPT-4o and later models, the TPM per PTU is set for input and output tokens separately. For older models, larger calls are progressively more expensive to compute. For example, 100 calls with a 1,000-token prompt each require less capacity than one call with 100,000 tokens in the prompt. This tiering means that the distribution of call shapes is important to overall throughput. Traffic patterns with a wide distribution that includes some large calls might experience lower throughput per PTU than a narrower distribution with the same average prompt and completion token sizes.
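As a rough sketch of the TPM-to-PTU conversion, the following uses the Input TPM per PTU figures from the table earlier in this article, rounded up to an assumed minimum and scale increment. Treat the Azure OpenAI capacity calculator as the authoritative sizing tool:

```python
import math

# Rough sizing sketch: converts an input-token-equivalent TPM estimate into a
# PTU count. The per-PTU throughput values mirror the table in this article;
# minimum and increment default to the global/data zone provisioned values.
INPUT_TPM_PER_PTU = {"gpt-4o": 2500, "gpt-4o-mini": 37000, "o1": 230, "gpt-4.1": 3000}

def estimate_ptus(effective_tpm: int, model: str,
                  minimum: int = 15, increment: int = 5) -> int:
    """Estimate PTUs needed, rounded up to the scale increment and floor minimum."""
    raw = effective_tpm / INPUT_TPM_PER_PTU[model]
    rounded = math.ceil(raw / increment) * increment
    return max(rounded, minimum)

print(estimate_ptus(100_000, "gpt-4o"))  # 40
print(estimate_ptus(10_000, "gpt-4o"))   # 15 (clamped to the minimum deployment)
```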
Obtaining PTU quota
PTU quota is available by default in many regions. If more quota is required, customers can request it via the Request Quota link, found to the right of the designated provisioned deployment type quota tabs in Azure AI Foundry. The form allows the customer to request an increase in the specified PTU quota for a given region. The customer receives an email at the included address once the request is approved, typically within two business days.
Per-Model PTU minimums
The minimum PTU deployment, scale increments, and processing capacity associated with each unit vary by model type and version. See the table above for more information.
Estimate provisioned throughput units and cost
To get a quick estimate for your workload using input and output TPM, use the built-in capacity planner in the deployment details section of the deployment dialog. The built-in capacity planner is part of the deployment workflow and helps streamline the sizing and allocation of quota to a PTU deployment for a given workload. For more information on how to identify and estimate TPM data, review the recommendations in our performance and latency documentation.
To use the capacity planner, go to the Azure AI Foundry Portal and select the Deployments button. Then select Deploy model.
Choose a model, and select Confirm. Select a provisioned-managed deployment type. After filling out the input and output TPM data in the built-in capacity calculator, select the Calculate button to view your PTU allocation recommendation.
To estimate provisioned capacity using request level data, open the capacity planner in the Azure AI Foundry. The capacity calculator is under Shared resources > Model Quota > Azure OpenAI Provisioned.
The Provisioned option and the capacity planner are only available in certain regions within the Quota pane. If you don't see this option, setting the quota region to Sweden Central will make it available. Enter the following parameters based on your workload.
Input | Description |
---|---|
Model | OpenAI model you plan to use. For example: GPT-4 |
Version | Version of the model you plan to use, for example 0614 |
Peak calls per min | The number of calls per minute that are expected to be sent to the model |
Tokens in prompt call | The number of tokens in the prompt for each call to the model. Calls with larger prompts utilize more of the PTU deployment. Currently this calculator assumes a single prompt value, so for workloads with wide variance, we recommend benchmarking your deployment against your traffic to determine the most accurate estimate of the PTUs needed. |
Tokens in model response | The number of tokens generated from each call to the model. Calls with larger generation sizes utilize more of the PTU deployment. Currently this calculator assumes a single response value, so for workloads with wide variance, we recommend benchmarking your deployment against your traffic to determine the most accurate estimate of the PTUs needed. |
After you fill in the required details, select the Calculate button in the output column.
The values in the output column are the estimated PTU units required for the provided workload inputs. The first output value represents the estimated PTU units required for the workload, rounded to the nearest PTU scale increment. The second output value represents the raw estimated PTU units required for the workload. The token totals are calculated using the following equation: Total = Peak calls per minute * (Tokens in prompt call + Tokens in model response).
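The equation translates directly to code; this is simply the calculator's arithmetic restated:

```python
# Token total used by the capacity calculator:
# Total = Peak calls per minute * (Tokens in prompt call + Tokens in model response)
def total_tokens_per_minute(peak_calls_per_min: int,
                            prompt_tokens: int,
                            response_tokens: int) -> int:
    """Total tokens per minute for the given workload shape."""
    return peak_calls_per_min * (prompt_tokens + response_tokens)

# 60 calls/min, 1,500-token prompts, 500-token responses:
print(total_tokens_per_minute(60, 1500, 500))  # 120000
```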
Note
The capacity calculators provide an estimate based on simple input criteria. The most accurate way to determine your capacity is to benchmark a deployment with a representational workload for your use case.
Azure Reservations for Azure OpenAI provisioned deployments
Discounts on top of the hourly usage price can be obtained by purchasing an Azure Reservation for Azure OpenAI Provisioned, Data Zone Provisioned, and Global Provisioned. An Azure Reservation is a term-discounting mechanism shared by many Azure products (for example, Compute and Cosmos DB). For Azure OpenAI Provisioned, Data Zone Provisioned, and Global Provisioned, the reservation provides a discount in exchange for committing to payment for a fixed number of PTUs for a one-month or one-year period.
Azure Reservations are purchased via the Azure portal, not the Azure AI Foundry portal.
Reservations are purchased regionally and can be flexibly scoped to cover usage from a group of deployments. Reservation scopes include:
- Individual resource groups or subscriptions
- A group of subscriptions in a Management Group
- All subscriptions in a billing account
New reservations can be purchased to cover the same scope as existing reservations, to allow for discounting of new provisioned deployments. The scope of existing reservations can also be updated at any time without penalty, for example to cover a new subscription.
Reservations for Global, Data Zone, and Regional deployments aren't interchangeable. You need to purchase a separate reservation for each deployment type.
Reservations can be canceled after purchase, but credits are limited.
If the size of provisioned deployments within the scope of a reservation exceeds the amount of the reservation, the excess is charged at the hourly rate. For example, if deployments amounting to 250 PTUs exist within the scope of a 200 PTU reservation, 50 PTUs will be charged on an hourly basis until the deployment sizes are reduced to 200 PTUs, or a new reservation is created to cover the remaining 50.
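The overage behavior in this example can be expressed as a simple calculation (illustrative only; actual billing is handled by Azure):

```python
# Sketch of the overage rule described above: PTUs deployed beyond the
# reservation amount are billed at the hourly rate.
def hourly_billed_ptus(deployed_ptus: int, reserved_ptus: int) -> int:
    """PTUs charged hourly; the remainder are covered by the reservation."""
    return max(deployed_ptus - reserved_ptus, 0)

# 250 deployed PTUs against a 200 PTU reservation leaves 50 billed hourly:
print(hourly_billed_ptus(250, 200))  # 50
```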
Reservations guarantee a discounted price for the selected term. They don't reserve capacity on the service or guarantee that it will be available when a deployment is created. It's highly recommended that customers create deployments prior to purchasing a reservation to avoid over-purchasing a reservation.
Important
Capacity availability for model deployments is dynamic and changes frequently across regions and models. To prevent you from purchasing a reservation for more PTUs than you can use, create deployments first, and then purchase the Azure Reservation to cover the PTUs you have deployed. This best practice will ensure that you can take full advantage of the reservation discount and prevent you from purchasing a term commitment that you cannot use.
The Azure role and tenant policy requirements to purchase a reservation are different than those required to create a deployment or Azure OpenAI resource. Verify authorization to purchase reservations in advance of needing to do so. See Azure OpenAI Provisioned reservation documentation for more details.
Important: sizing Azure OpenAI provisioned reservations
The PTU amounts in reservation purchases are independent of PTUs allocated in quota or used in deployments. It's possible to purchase a reservation for more PTUs than you have in quota, or can deploy for the desired region, model, or version. Credits for over-purchasing a reservation are limited, and customers must take steps to ensure they maintain their reservation sizes in line with their deployed PTUs.
The best practice is to always purchase a reservation after deployments have been created. This prevents purchasing a reservation and then finding out that the required capacity isn't available for the desired region or model.
To assist customers with purchasing the correct reservation amounts, the total number of PTUs in a subscription and region that can be covered by a reservation is listed on the Quotas page of Azure AI Foundry. See the message "PTUs Available for reservation."
Managing Azure Reservations
After a reservation is created, it's a best practice to monitor it to ensure it's receiving the usage you expect. You can do this via the Azure Reservation Portal or Azure Monitor. For more details, see the following articles:
- View Azure reservation utilization
- View Azure Reservation purchase and refund transactions
- View amortized benefit costs
- Charge back Azure Reservation costs
- Automatically renew Azure reservations