What’s the Best Way to Handle Interruptions in Azure Communication Services?

Mian Omair 0

Hi all,

We’re building a voice-based AI assistant using Azure Communication Services (ACS), and we're exploring how to enable seamless interruption handling during calls.

Use Case:

We want the assistant to speak using TTS (e.g., play_media(...)) and be interrupted mid-sentence if the user begins speaking — just like a natural phone conversation. This is essential for delivering a fluid and responsive experience in real-world scenarios like customer support or healthcare coordination.

Questions:

What are the recommended patterns or best practices in ACS for enabling interruptible speech playback?

Is there any way to implement barge-in-like behavior using the Python SDK or REST APIs?

Are there any newer or upcoming versions of azure-communication-callautomation that introduce native support for this (e.g., play_and_recognize() with bargeInAllowed)?

Are there any working examples or documented approaches for building responsive, real-time voice interactions with interruption support in ACS?

We’ve experimented with start_recognizing_media(...) in parallel with play_media(...) and monitoring for events to simulate interruption, but that seems limited or unsupported in the current SDK version (1.4.0b1).

We’re looking for Azure’s current and future guidance on handling this pattern reliably.

Thanks in advance!

Siva Nair 1,475 Reputation points Microsoft External Staff

2025-04-30T06:45:02.3766667+00:00
Hi Mian Omair,

Thank you for raising this. We welcome feedback on ACS, and your question points to an area for improvement. I understand there are some limitations for now; I'm adding your request to our product roadmap so that support for this feature can be prioritized in the future.

Please check the points below and the reference links, which might help.

As you mentioned using play_media(...), to allow users to interrupt speech mid-sentence, you can leverage a combination of speech recognition and media playback. The idea is to run start_recognizing_media(...) alongside play_media(...) to listen for user input while TTS is ongoing. Although you’ve encountered limitations with this in the current SDK version (1.4.0b1), it’s typically how interruption handling is approached.

Unfortunately, the current version of the azure-communication-callautomation SDK doesn't yet offer native “barge-in” support of the kind found in voice platforms specifically designed for it. Implementing your own interrupt logic using event listeners might be your best bet for now, even though it has its drawbacks.
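To make the "own interrupt logic" idea concrete, here is a minimal sketch of the bookkeeping involved. It is not an official ACS pattern. `BargeInController`, `MediaControl`, and `FakeMedia` are hypothetical names, and the `cancel_all_media_operations` method is modelled on the call connection client method of the same name, which you should confirm against the SDK version you have installed. How `on_user_speech` gets triggered (a recognition event, your own voice activity detection, and so on) is left open.

```python
from dataclasses import dataclass
from typing import Protocol


class MediaControl(Protocol):
    """Hypothetical stand-in for the object that can stop playback on the call."""

    def cancel_all_media_operations(self) -> None: ...


@dataclass
class BargeInController:
    """Tracks whether the assistant is speaking and stops playback on user speech."""

    media: MediaControl
    playing: bool = False
    interruptions: int = 0

    def on_play_started(self) -> None:
        # Call this when a PlayStarted event arrives.
        self.playing = True

    def on_play_finished(self) -> None:
        # Call this when playback ends (for example on PlayCompleted).
        self.playing = False

    def on_user_speech(self) -> bool:
        """Stop playback if the assistant is talking; return True if it interrupted."""
        if not self.playing:
            return False
        self.media.cancel_all_media_operations()
        self.playing = False
        self.interruptions += 1
        return True


class FakeMedia:
    """Test double so the sketch runs without a live call."""

    def __init__(self) -> None:
        self.cancel_calls = 0

    def cancel_all_media_operations(self) -> None:
        self.cancel_calls += 1
```

In a real service you would pass the call connection object instead of `FakeMedia` and wire the three methods to your event callbacks. Keeping the `playing` flag avoids sending cancel requests when nothing is being played.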

You can check the Microsoft Bot Framework samples for implementations related to handling interruptions in conversations. While they may not directly translate to ACS, they can provide valuable patterns and code snippets you can adapt for your needs.

A few patterns:

State Management: Keep track of where the user is in the conversation so you can resume the dialogue appropriately after an interruption.

Responsive Design: Use a responsive dialog design that can gracefully handle interruptions without losing the thread of the conversation.

Feedback Mechanism: Always provide feedback to the user so they know the system is listening and can bring them back to the previous context after an interruption.
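As a rough illustration of the state management and feedback patterns above, the sketch below remembers which prompt was cut off and replays it with a short lead-in once the interruption has been handled. `ConversationState` and the "As I was saying" wording are illustrative assumptions, not part of ACS or the Bot Framework.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConversationState:
    """Remembers what was being said so the dialogue can resume after an interruption."""

    pending: List[str] = field(default_factory=list)  # prompts still to be spoken
    current: Optional[str] = None                      # prompt being spoken right now
    interrupted: Optional[str] = None                  # prompt the user cut off

    def next_prompt(self) -> Optional[str]:
        # Take the next queued prompt, or None when the queue is empty.
        self.current = self.pending.pop(0) if self.pending else None
        return self.current

    def mark_interrupted(self) -> None:
        # Called when the user starts speaking while a prompt is playing.
        if self.current is not None:
            self.interrupted = self.current
            self.current = None

    def resume(self) -> Optional[str]:
        """Replay the interrupted prompt with a feedback prefix, else move on."""
        if self.interrupted is None:
            return self.next_prompt()
        text = self.interrupted
        self.interrupted = None
        self.current = text
        return f"As I was saying: {text}"
```

For example, with `pending=["Your appointment is on Monday.", "Anything else?"]`, interrupting the first prompt and then calling `resume()` repeats it with the prefix, and the next `resume()` continues with the second prompt.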

handle-user-interrupt-python

https://github.com/microsoft/botbuilder-python?tab=readme-ov-file

A different approach:

Build your own real-time voice agent

Next-Gen Voice Bots: Human-Like Interaction with Azure Speech

Conversational AI with ACS and Azure OpenAI

Thanks!

Mian Omair 0 Reputation points

2025-04-30T17:10:01.36+00:00

Hi @Siva Nair, I see only a limited set of ACS events, such as CallConnected, ParticipantsUpdated, PlayStarted, PlayCompleted, and CallDisconnected, and these all seem limited to call state and media playback.

Since you mentioned building interrupt logic on events, could you clarify which specific events are available (or expected) for recognizing user input during media playback—especially for STT?

Are there other webhook events or SDK-level hooks (e.g., from start_recognizing_media) that we can listen to in real time? Also, what would be the best way to extract those events in the current SDK version (1.4.0b1)?

Prabhavathi Manchala 1,050 Reputation points Microsoft External Staff

2025-04-30T18:26:35.3166667+00:00
Hi Mian Omair,

I understand you're looking for events that can detect user input like speech during media playback, especially to handle interruptions in Speech-to-Text (STT) with start_recognizing_media.

The current SDK (version 1.4.0b1) only provides basic call and playback events, like CallConnected and PlayStarted, and doesn't support detecting real-time speech during media playback. This makes capturing input during TTS and handling speech interruptions with start_recognizing_media and play_media unreliable.

To detect speech input, use start_recognizing_media for speech recognition and listen for events like RecognizingSpeech or RecognizedSpeech to trigger actions when the user speaks.

Since ACS doesn’t support interruptible speech playback, the best approach is to use start_recognizing_media with TTS playback, listen for speech events like RecognizingSpeech, and pause or stop playback when speech is detected to process the input.

The current SDK doesn’t fully support this, but you can manage the logic by running start_recognizing_media and play_media in parallel. This event-based approach can serve as a temporary solution, though the SDK version you’re using may not fully support the desired behaviour.
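For the event-based approach, one way to see which events actually reach your callback endpoint is to route them by type. The sketch below assumes the callback body is a JSON array of events, each with a `type` and a `data` field, and that type names carry a `Microsoft.Communication.` prefix. `EventRouter` is a hypothetical helper, and the event names registered here (including `RecognizeCompleted`) are assumptions that should be checked against the events your SDK version really emits.

```python
import json
from typing import Callable, Dict, List

# Assumed prefix on event type names; verify against real callback payloads.
EVENT_PREFIX = "Microsoft.Communication."

Handler = Callable[[dict], None]


class EventRouter:
    """Routes callback events to handlers keyed by short event name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, short_name: str, handler: Handler) -> None:
        self._handlers[short_name] = handler

    def dispatch(self, body: str) -> List[str]:
        """Handle each event in a JSON array; return the short names that matched."""
        handled = []
        for event in json.loads(body):
            event_type = event.get("type", "")
            if event_type.startswith(EVENT_PREFIX):
                short = event_type[len(EVENT_PREFIX):]
            else:
                short = event_type
            handler = self._handlers.get(short)
            if handler is not None:
                handler(event.get("data", {}))
                handled.append(short)
        return handled


# Usage with a hand-written sample payload (not captured from a real call).
log = []
router = EventRouter()
router.on("PlayStarted", lambda data: log.append(("playing", data.get("callConnectionId"))))
router.on("RecognizeCompleted", lambda data: log.append(("recognized", data.get("callConnectionId"))))

sample = json.dumps([
    {"type": "Microsoft.Communication.PlayStarted", "data": {"callConnectionId": "abc"}},
    {"type": "Microsoft.Communication.ParticipantsUpdated", "data": {"callConnectionId": "abc"}},
    {"type": "Microsoft.Communication.RecognizeCompleted", "data": {"callConnectionId": "abc"}},
])
handled = router.dispatch(sample)
```

Unregistered event types are skipped rather than treated as errors, so logging the unmatched `type` values during testing is a simple way to discover what your deployment sends.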
