Performance issues while using the Azure OpenAI API

Mahsa 15 Reputation points
2025-03-05T14:18:17.46+00:00

We are experiencing performance issues while using the Azure OpenAI API, specifically with the gpt-4o deployment. Our application processes various document types (PDFs, images, text, and spreadsheets) and makes API calls for text and image analysis. However, we have noticed the following issues:

  1. Slow Response Times: The API sometimes takes significantly longer to return a response, even for simple text queries.
  2. Inconsistent Latency: Some requests are processed quickly, while others take much longer, even under similar load conditions.
  3. Performance with PDFs & Images: Processing PDFs (converted to images) seems particularly slow. Are there any known optimizations for handling large files?
  4. Caching Options: Is there a way to enable caching to improve response times for repeated queries?

We are using the gpt-4o model and calling the API via AzureOpenAI. Please advise if there are any optimizations, best practices, or recommended configurations to enhance performance.

Azure OpenAI Service

1 answer

  1. SriLakshmi C 4,795 Reputation points Microsoft External Staff
    2025-03-06T06:06:13.81+00:00

    Hello Mahsa,

    Greetings and Welcome to Microsoft Q&A!

    I understand that you are facing performance issues with the Azure OpenAI API, particularly with the gpt-4o model. Here are some best practice recommendations to consider.

    Make sure you are using the latest and most optimized model for your needs. For lower latency, the gpt-4o-mini model is recommended; if you haven’t already, consider switching to it where its output quality is sufficient.

    Reducing the max_tokens parameter can significantly improve response time: the fewer tokens the model generates, the faster the response. Set this parameter to the smallest value that still covers your requests.
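    As a minimal sketch of this, assuming the `openai` Python SDK's AzureOpenAI client (the helper name, deployment name, and token cap below are illustrative, not part of the API):

    ```python
    def build_chat_request(prompt: str, deployment: str = "gpt-4o",
                           max_tokens: int = 256) -> dict:
        """Build kwargs for client.chat.completions.create with a tight token cap."""
        return {
            "model": deployment,  # the Azure deployment name, not the base model id
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,  # fewer generated tokens -> faster responses
            "temperature": 0,
        }

    # Usage (requires configured credentials):
    # from openai import AzureOpenAI
    # client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
    #                      api_key="<key>", api_version="2024-06-01")
    # response = client.chat.completions.create(**build_chat_request("Summarize ..."))
    ```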

    Enable streaming by setting stream=True in your requests. Tokens are then returned as they are generated, which improves the perceived response time for end users.
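    A sketch of consuming such a stream, assuming chunk objects shaped like the openai SDK's streaming deltas (client setup omitted; the helper name is my own):

    ```python
    from typing import Iterable, Iterator

    def iter_stream_text(chunks: Iterable) -> Iterator[str]:
        """Yield text fragments from a streamed chat completion as they arrive."""
        for chunk in chunks:
            # Each streamed chunk carries an incremental delta; the final
            # chunk's delta content is typically None, so skip empty deltas.
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    # Usage (with a real client):
    # stream = client.chat.completions.create(
    #     model="gpt-4o",
    #     messages=[{"role": "user", "content": "..."}],
    #     stream=True)
    # for text in iter_stream_text(stream):
    #     print(text, end="", flush=True)
    ```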

    If you need to make multiple requests, try batching them into a single call. This can help reduce the number of requests and improve overall response times.
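    One simple way to batch, sketched below: fold several small questions into a single prompt rather than making one API call each. The numbering convention and helper name are illustrative, not an SDK feature.

    ```python
    def batch_prompts(questions: list[str]) -> list[dict]:
        """Build a single messages list that asks all questions in one request."""
        numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
        content = ("Answer each of the following questions, "
                   "numbering your answers to match:\n" + numbered)
        return [{"role": "user", "content": content}]

    # messages = batch_prompts(["What is X?", "What is Y?"])
    # response = client.chat.completions.create(model="gpt-4o", messages=messages)
    ```

    Note that one large request trades per-request overhead for a longer single generation, so this helps most with many short, independent queries.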

    When processing large files like PDFs, optimize your ingestion process to minimize latency. If converting PDFs to images, evaluate whether the conversion is causing delays. Consider adjusting the chunk size or using optimized methods for handling tables and structured data when applicable.
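    A minimal chunking sketch for extracted document text; the chunk size and overlap below are illustrative and should be tuned for your documents:

    ```python
    def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
        """Split text into chunks of at most chunk_size characters.

        Consecutive chunks share `overlap` characters so context that
        straddles a boundary is not lost.
        """
        if chunk_size <= overlap:
            raise ValueError("chunk_size must exceed overlap")
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
        return chunks
    ```

    Each chunk can then be sent as its own (smaller, faster) request instead of one oversized call.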

    Combining different workloads on the same endpoint can impact latency. To enhance performance, consider separating deployments based on workload types whenever possible.

    Azure OpenAI does not currently offer built-in caching for API calls. However, you can implement a custom caching mechanism at the application level to store and retrieve responses efficiently for repeated queries.
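    A minimal application-level cache might look like the following, keyed by a hash of the deployment and prompt. Here `call_model` stands in for your real API call; it is an assumption, not an SDK function.

    ```python
    import hashlib
    import json

    _cache: dict[str, str] = {}

    def cached_completion(call_model, deployment: str, prompt: str) -> str:
        """Return the cached answer when the same prompt was seen before."""
        key = hashlib.sha256(json.dumps([deployment, prompt]).encode()).hexdigest()
        if key not in _cache:
            _cache[key] = call_model(deployment, prompt)  # API call only on cache miss
        return _cache[key]
    ```

    For production use you would likely swap the in-memory dict for a shared store (e.g. Redis) with an expiry policy, and only cache deterministic requests (temperature 0).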

    Kindly refer to this article: Improve performance.

    I hope this helps. If you have any further queries, please let us know.

    Thank you!

