This article summarizes the limitations and region availability for Mosaic AI Model Serving and supported endpoint types.
Resource and payload limits
Mosaic AI Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, reach out to your Databricks account team.
The following table summarizes resource and payload limitations for model serving endpoints.
Feature | Granularity | Limit |
---|---|---|
Payload size | Per request | 16 MB. For endpoints serving foundation models, external models, or AI agents the limit is 4 MB. |
Request/response size | Per request | Any request/response over 1 MB will not be logged. |
Queries per second (QPS) | Per workspace | 200, but can be increased to 25,000 or more by reaching out to your Databricks account team. |
Model execution duration | Per request | 120 seconds |
CPU endpoint model memory usage | Per endpoint | 4 GB |
GPU endpoint model memory usage | Per endpoint | Greater than or equal to assigned GPU memory, depends on the GPU workload size |
Provisioned concurrency | Per model and per workspace | 200 concurrency. Can be increased by reaching out to your Databricks account team. |
Overhead latency | Per request | Less than 50 milliseconds |
Init scripts | | Init scripts are not supported. |
Foundation Model APIs (pay-per-token) rate limits | Per workspace | If the following limits are insufficient for your use case, Databricks recommends using provisioned throughput. |
Foundation Model APIs (provisioned throughput) rate limits | Per workspace | 200 queries per second. |
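
The payload and logging limits in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not a Databricks API: the constant and function names are hypothetical, 1 MB is assumed to mean 1,048,576 bytes, and the UTF-8 size of the JSON-serialized body is assumed to approximate the payload size that the service measures.

```python
import json

# Limits taken from the table above. All names here are hypothetical.
MB = 1024 * 1024  # assumption: 1 MB is treated as 1,048,576 bytes
CUSTOM_MODEL_PAYLOAD_LIMIT = 16 * MB   # default per-request payload limit
FOUNDATION_PAYLOAD_LIMIT = 4 * MB      # foundation models, external models, AI agents
LOGGING_LIMIT = 1 * MB                 # larger requests/responses are not logged


def check_payload(payload: dict, foundation_or_agent: bool = False) -> dict:
    """Estimate the request size and compare it with the documented limits."""
    size = len(json.dumps(payload).encode("utf-8"))
    limit = FOUNDATION_PAYLOAD_LIMIT if foundation_or_agent else CUSTOM_MODEL_PAYLOAD_LIMIT
    return {
        "size_bytes": size,
        "within_payload_limit": size <= limit,
        "eligible_for_logging": size <= LOGGING_LIMIT,
    }


if __name__ == "__main__":
    print(check_payload({"inputs": [1, 2, 3]}))
```

A check like this only estimates the size on the client side; the endpoint remains the authority on whether a request is accepted.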
Networking and security limitations
- Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and Private Link.
- Private connectivity (such as Azure Private Link) is only supported for model serving endpoints that use provisioned throughput or endpoints that serve custom models.
- By default, Model Serving does not support Private Link to external endpoints (such as Azure OpenAI). Support for this functionality is evaluated and implemented on a per-region basis. Reach out to your Azure Databricks account team for more information.
- Model Serving does not provide security patches to existing model images because of the risk of destabilization to production deployments. A new model image created from a new model version will contain the latest patches. Reach out to your Databricks account team for more information.
Foundation Model APIs limits
Note
As part of providing the Foundation Model APIs, Databricks might process your data outside of the region where your data originated, but not outside of the relevant geographical location.
For both pay-per-token and provisioned throughput workloads:
- Only workspace admins can change governance settings, such as rate limits, for Foundation Model APIs endpoints. To change rate limits, use the following steps:
  1. Open the Serving UI in your workspace to see your serving endpoints.
  2. From the kebab menu on the Foundation Model APIs endpoint you want to edit, select View details.
  3. From the kebab menu on the upper-right side of the endpoint details page, select Change rate limit.
- The GTE Large (En) embedding models do not generate normalized embeddings.
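
Because GTE Large (En) embeddings are not normalized, code that relies on dot products as a similarity score may need to normalize the vectors first. The following is a minimal sketch, assuming that "normalized" means unit L2 norm; the function names are hypothetical and the vectors are plain Python lists rather than the output of a real endpoint call.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


if __name__ == "__main__":
    # Illustrative vectors only; real embeddings have many more dimensions.
    print(l2_normalize([3.0, 4.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```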
Pay-per-token limits
The following are limits relevant to Foundation Model APIs pay-per-token workloads:
- Pay-per-token workloads are HIPAA compliant.
- For customers with the Compliance Security Profile enabled, pay-per-token workloads are available provided that compliance standard HIPAA or None is selected. Other compliance standards are not supported for pay-per-token workloads.
- Meta Llama 4 Maverick, Meta Llama 3.3 70B, and GTE Large (En) models are available in pay-per-token EU and US supported regions.
- The following pay-per-token models are supported only in the Foundation Model APIs pay-per-token supported US regions:
  - Meta Llama 3.1 405B Instruct
  - BGE Large (En)
- Anthropic Claude 3.7 Sonnet is available in pay-per-token EU and US supported regions. If your workspace is not in an EU or US region, but is in a supported Model Serving region, you can enable cross-Geo data processing to access this model.
- If your workspace is in a Model Serving region but not a US or EU region, your workspace must be enabled for cross-Geo data processing. When enabled, your pay-per-token workload is routed to the U.S. Databricks Geo. To see which geographic regions process pay-per-token workloads, see Databricks Designated Services.
Provisioned throughput limits
The following are limits relevant to Foundation Model APIs provisioned throughput workloads:
- Provisioned throughput supports the HIPAA compliance profile and is recommended for workloads that require compliance certifications.
- To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in one of the following regions:
  - eastus
  - eastus2
  - westus
  - centralus
  - westeurope
  - northeurope
  - australiaeast
  - canadacentral
  - brazilsouth
- To deploy a Meta Llama model from system.ai in Unity Catalog, you must choose the applicable Instruct version. Base versions of the Meta Llama models are not supported for deployment from Unity Catalog. See [Recommended] Deploy foundation models from Unity Catalog.
- For provisioned throughput workloads that use Llama 4 Maverick:
  - Support for this model on provisioned throughput workloads is in preview. Reach out to your Databricks account team to participate in the preview.
  - Autoscaling is not supported.
  - Metrics panels are not supported.
  - Traffic splitting is not supported on an endpoint that serves Llama 4 Maverick. You cannot serve multiple models on an endpoint that serves Llama 4 Maverick.
The following table shows the region availability of the supported Meta Llama 3.1, 3.2, 3.3 and Meta Llama 4 Maverick models. See Deploy fine-tuned foundation models for guidance on how to deploy fine-tuned models.
Meta Llama model variant | Regions |
---|---|
meta-llama/Llama-3.1-8B | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.1-8B-Instruct | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.1-70B | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.1-70B-Instruct | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.1-405B | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.1-405B-Instruct | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.2-1B | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.2-1B-Instruct | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.2-3B | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.2-3B-Instruct | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-3.3-70B | australiaeast, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth, japaneast |
meta-llama/Llama-4-17B | australiaeast*, centralus, eastus, eastus2, northcentralus, southcentralus, westus, westus2, northeurope, westeurope, uksouth*, japaneast* |
* Model is available in this region only when cross geography routing is enabled.
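
The region table and its footnote can be encoded as a small lookup for pre-deployment checks. The following is a minimal sketch with hypothetical names, not part of any Databricks SDK. It assumes that the asterisk markers for meta-llama/Llama-4-17B apply to australiaeast, uksouth, and japaneast, and that every listed variant shares the same twelve regions.

```python
# Regions listed for every Meta Llama variant in the table above.
LLAMA_REGIONS = {
    "australiaeast", "centralus", "eastus", "eastus2",
    "northcentralus", "southcentralus", "westus", "westus2",
    "northeurope", "westeurope", "uksouth", "japaneast",
}

# Assumption: regions marked with an asterisk for Llama 4 require
# cross-geography routing to be enabled.
CROSS_GEO_ONLY = {
    "meta-llama/Llama-4-17B": {"australiaeast", "uksouth", "japaneast"},
}


def llama_available(model: str, region: str, cross_geo_enabled: bool = False) -> bool:
    """Return True if the table lists the model as available in the region."""
    region = region.lower()
    if region not in LLAMA_REGIONS:
        return False
    if region in CROSS_GEO_ONLY.get(model, set()):
        return cross_geo_enabled
    return True


if __name__ == "__main__":
    print(llama_available("meta-llama/Llama-4-17B", "japaneast"))
    print(llama_available("meta-llama/Llama-4-17B", "japaneast", cross_geo_enabled=True))
```

The lookup does not validate model names; an unknown name is treated like any variant without cross-geography restrictions.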
Region availability
Note
If you require an endpoint in an unsupported region, reach out to your Azure Databricks account team.
If your workspace is deployed in a region that supports model serving but is served by a control plane in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you see an error message stating that your workspace is not supported. Reach out to your Azure Databricks account team for more information.
For more information on regional availability of features, see Model serving regional availability.