Hello Mahsa,
Greetings and welcome to Microsoft Q&A!
I understand that you are facing performance issues with the Azure OpenAI API, particularly with the gpt-4o model. Here are some best practice recommendations to consider.
Make sure you are using the latest and most optimized model for your needs. For lower latency, the GPT-4o Mini model is recommended. If you haven’t already, consider switching to this model to improve performance.
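If you decide to try GPT-4o Mini, a minimal sketch of the call looks like this. It assumes the openai Python package (v1+), environment variables for your endpoint and key, and a deployment named gpt-4o-mini in your resource; adjust these to match your own setup:

```python
import os
from openai import AzureOpenAI

# Assumed setup: endpoint and key come from environment variables, and a
# deployment named "gpt-4o-mini" already exists in your Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # Azure expects the *deployment* name here, not the base model name
    messages=[{"role": "user", "content": "Summarize Azure OpenAI in one sentence."}],
)
print(response.choices[0].message.content)
```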
Reducing the max_tokens parameter can significantly improve response time. The fewer tokens the model generates, the faster the response. To enhance performance, set this parameter to the smallest value necessary for your requests.
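For example, reusing the client from the sketch above, you might cap generation like this (the deployment name and the limit of 50 are placeholders to tune for your use case):

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give a one-line definition of latency."}],
    max_tokens=50,  # cap generation; fewer generated tokens means a faster response
)
print(response.choices[0].message.content)
```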
Enable streaming by setting stream: true in your requests. This allows tokens to be returned as they are generated, improving the perceived response time for end users.
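A minimal streaming sketch, again reusing the client from the first example (the guard on empty choices accounts for chunks that carry only content-filter results on Azure):

```python
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain token streaming briefly."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. content-filter results) arrive with no choices,
    # so guard before reading the delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```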
If you need to make multiple requests, try batching them into a single call. This can help reduce the number of requests and improve overall response times.
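One simple way to do this, sketched with the same client as above, is to fold several small questions into a single prompt so three round trips become one (the questions here are just placeholders):

```python
questions = [
    "What is a vector database?",
    "What does max_tokens control?",
    "What is application-level caching?",
]
# Combine several small questions into one request instead of three round trips.
combined = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Answer each question briefly:\n{combined}"}],
)
print(response.choices[0].message.content)
```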
When processing large files like PDFs, optimize your ingestion process to minimize latency. If converting PDFs to images, evaluate whether the conversion is causing delays. Consider adjusting the chunk size or using optimized methods for handling tables and structured data when applicable.
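As a rough illustration, a fixed-size chunker with overlap might look like the sketch below. The chunk_text helper and the sizes are assumptions to tune for your documents, and the PDF text-extraction step itself (whatever library you use) is out of scope here:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split extracted PDF text into overlapping chunks to keep each request small."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# `extracted` would come from your own PDF-extraction step.
extracted = "...full text pulled from the PDF..."
for chunk in chunk_text(extracted):
    pass  # send each chunk to the model, or embed it, as your pipeline requires
```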
Combining different workloads on the same endpoint can impact latency. To enhance performance, consider separating deployments based on workload types whenever possible.
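As a sketch, the routing can be as simple as selecting a deployment name per workload type; the deployment names below are hypothetical and the client is the one created in the first example:

```python
# Hypothetical deployments: one reserved for interactive chat, one for bulk jobs,
# so heavy batch traffic does not add latency to user-facing requests.
INTERACTIVE_DEPLOYMENT = "gpt-4o-mini-chat"
BATCH_DEPLOYMENT = "gpt-4o-batch"

def ask(prompt: str, interactive: bool = True) -> str:
    response = client.chat.completions.create(
        model=INTERACTIVE_DEPLOYMENT if interactive else BATCH_DEPLOYMENT,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```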
Azure OpenAI does not currently offer built-in caching for API calls. However, you can implement a custom caching mechanism at the application level to store and retrieve responses efficiently for repeated queries.
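A minimal in-memory sketch of such a cache, using the same client as above, is shown below; for production you would likely swap the plain dict for Redis or a similar shared store:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    """Return a stored answer for repeated prompts; call the API only on a cache miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    _cache[key] = answer
    return answer
```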
For more details, kindly refer to this documentation: Improve performance.
I hope this helps. If you have any further queries, do let us know.
Thank you!