Making the token limit configurable for the AI Foundry Assistant

Tejwant Kaur 0 Reputation points
2025-04-24T13:38:12.4766667+00:00

Hello Azure Support Team,

I’m currently developing a HIPAA-aligned healthcare platform in the Australia East region that processes transcripts of up to an hour of conversation, using the Azure OpenAI Assistant (via the Threads API) for clinical report generation. Our application architecture depends heavily on:

  • Persistent thread history
  • Role-based system messages
  • Multi-turn assistant interactions

We’ve deployed our models with the maximum token limit (e.g., GPT-4o with a 20K context), but when we create Assistants through the Azure AI Foundry UI, we’re unable to configure max_tokens, and the resulting prompt is sometimes truncated. This is critically limiting our ability to generate long-form structured outputs such as geriatric assessments or dementia care plans, which are central to products like ClinicalInsightsAI and ALMA (our personalized dementia assistant).


🙏 Request:

  • Expose full assistant configuration options in Azure AI Studio, or allow Assistant creation via the SDK (as in OpenAI's platform): max_tokens, temperature, tool_choice, metadata, instructions, etc. (see the sketch below).
  • Ensure the Assistant inherits the deployment model's token ceiling and allows it to be overridden when used via the Threads API.
  • Provide guidance or a roadmap if token configuration and memory features are planned in upcoming updates.
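
For illustration, here is a minimal sketch of what SDK-based Assistant creation against an Azure OpenAI endpoint could look like with the OpenAI Python SDK. The deployment name, assistant name, metadata, and credential environment variables are placeholders, and the available parameters may vary by API version:

```python
import os
from openai import AzureOpenAI

# Sketch only: assumes the OpenAI Python SDK's AzureOpenAI client and an
# API version that exposes the Assistants (beta) surface. The endpoint/key
# environment variables and the deployment name are placeholders.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-05-01-preview",
)

assistant = client.beta.assistants.create(
    model="gpt-4o-clinical",  # Azure OpenAI deployment name (placeholder)
    name="clinical-report-assistant",
    instructions="Generate structured clinical reports from consultation transcripts.",
    temperature=0.2,
    metadata={"product": "ClinicalInsightsAI"},
)
print(assistant.id)
```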


Thank you for your support — this will enable us to build secure, compliant, and more powerful AI healthcare solutions within the Azure ecosystem.

Best regards,

Tejwant Kaur



1 answer

  1. Manas Mohanty 3,210 Reputation points Microsoft External Staff
    2025-04-28T08:27:22.3633333+00:00

    Hi Tejwant Kaur,

    Thank you for replying.

    Keeping the output under 300 words was just an example of constraining it through prompts.

    You can set 1,000 words or more, but keep it below the tokens-per-minute quota configured on the deployment side to avoid truncated outputs.

    As mentioned, you can also adjust the output through max_prompt_tokens and max_completion_tokens in the thread run.
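
    As a hedged sketch (not official guidance), this is roughly how those run-level caps can be set with the OpenAI Python SDK; the endpoint, key, thread ID, assistant ID, and token budgets below are placeholders:

    ```python
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-api-key>",                                    # placeholder
        api_version="2024-05-01-preview",
    )

    # Cap the tokens the run may consume; if a budget is exceeded the run
    # ends early (e.g. with an "incomplete" status) instead of growing.
    run = client.beta.threads.runs.create_and_poll(
        thread_id="thread_abc123",    # placeholder thread ID
        assistant_id="asst_abc123",   # placeholder assistant ID
        max_prompt_tokens=20000,
        max_completion_tokens=4000,
    )
    print(run.status, run.usage)
    ```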

    You can submit the details of your requirements through the Azure OpenAI quota increase form.

    Thank you.

