Request for Technical Support & Reference Architecture for Azure-Based Voice Call Automation Pipeline

Mian Omair 0 Reputation points
2025-04-25T20:11:58.7+00:00

We are developing an end-to-end voice AI automation pipeline for a healthcare provider using Azure-native components. The solution is intended to become agentic, with AI agents orchestrating healthcare careflows, and it leverages Azure Communication Services (ACS), Azure Cognitive Services (TTS/STT), and Azure AI Foundry (LLMs or orchestration agents).

The current architecture (attached below) includes:

  • Outbound call orchestration via CallAutomationClient
  • TTS via Azure Cognitive Services
  • STT for speech analytics (coming soon)
  • Event-driven logic over HTTP using Azure Functions + FastAPI
  • Call lifecycle handling (connect, media play, disconnect) via custom webhook callbacks

User's image

Our current architecture supports outbound PSTN voice calls using CallAutomationClient, but each layer, from audio synthesis to call event handling, has been built manually because we could not find an end-to-end reference architecture or orchestration template.

Challenges & Request:

We're building every component manually because no cohesive reference implementation exists. Azure Copilot and the documentation provide only fragmented or outdated code samples, particularly around CallConnectionClient, webhook retries, and media playback timing.

What We Are Looking For:

A reference architecture or sample repo for a similar solution that demonstrates:

  • Outbound voice calls with real-time event handling
  • SMS or chat messaging flows managed by agents or functions
  • TTS and STT pipelines within active call sessions
  • A modular event system powered by Azure agents or durable workflows

Note: We are not committed to our current build—if Microsoft has a more scalable or modern reference architecture, we are open to redesigning the system around that.

Azure Communication Services
An Azure communication platform for deploying applications across devices and platforms.

1 answer

  1. Suresh Chikkam 1,330 Reputation points Microsoft External Staff
    2025-04-30T05:40:04.19+00:00

    Hi Mian Omair,

    Right now, Azure doesn’t have one complete example that shows everything you’re trying to do all in one place. But it’s still possible to build what you need by connecting the right services together.

    For making outbound phone calls, the CallAutomationClient lets you control the call and listen for events, such as when someone answers or when a message finishes playing. You can handle these events using webhooks with something like Azure Functions or FastAPI, just like you're doing now.
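
As a rough illustration of the webhook side, here is a minimal sketch in plain Python (standard library only, so it can be dropped into either a FastAPI route or an Azure Function). It assumes the callback body is a JSON array of CloudEvents whose `type` values look like `Microsoft.Communication.CallConnected`, and that each event carries an `id` that can be used to ignore redelivered callbacks. The handler functions and the action strings they return are hypothetical placeholders, not part of any SDK.

```python
import json
from typing import Callable, Dict, List, Set

# Event type strings assumed to arrive in the CloudEvents "type" field.
CALL_CONNECTED = "Microsoft.Communication.CallConnected"
PLAY_COMPLETED = "Microsoft.Communication.PlayCompleted"
CALL_DISCONNECTED = "Microsoft.Communication.CallDisconnected"

Handler = Callable[[dict], str]


def on_connected(data: dict) -> str:
    # Hypothetical: this is where playback of the greeting would be started.
    return f"play-greeting:{data.get('callConnectionId')}"


def on_play_completed(data: dict) -> str:
    # Hypothetical: this is where the app would start listening for input.
    return f"await-input:{data.get('callConnectionId')}"


def on_disconnected(data: dict) -> str:
    # Hypothetical: release any per-call state here.
    return f"cleanup:{data.get('callConnectionId')}"


HANDLERS: Dict[str, Handler] = {
    CALL_CONNECTED: on_connected,
    PLAY_COMPLETED: on_play_completed,
    CALL_DISCONNECTED: on_disconnected,
}


def dispatch(body: str, seen_ids: Set[str]) -> List[str]:
    """Route each event in a callback body to its handler.

    seen_ids holds event ids that were already processed, so a retried
    webhook delivery does not trigger the same action twice.
    """
    actions: List[str] = []
    for event in json.loads(body):
        event_id = event.get("id")
        if event_id is not None:
            if event_id in seen_ids:
                continue  # duplicate delivery, already handled
            seen_ids.add(event_id)
        handler = HANDLERS.get(event.get("type"))
        if handler is None:
            continue  # event type this sketch does not handle
        actions.append(handler(event.get("data", {})))
    return actions
```

In a real deployment the `seen_ids` set would need to live in shared storage (for example a cache or table keyed by call), since an in-memory set does not survive across function instances.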

    To speak to the caller, you can use Text-to-Speech (TTS) to turn messages into audio. If you want to understand what the caller says, you’ll need to use ACS Media Streaming. This lets you stream the audio and send it to Speech-to-Text (STT) for live transcription.
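
For the TTS half, one common approach is to describe the prompt in SSML so the voice and speaking rate can be controlled. The sketch below only builds the SSML string with the standard library; it does not call any Azure SDK. The voice name `en-US-JennyNeural` and the helper name `build_ssml` are assumptions for illustration, so substitute whatever voice your Speech resource actually offers.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(
    text: str,
    voice: str = "en-US-JennyNeural",  # assumed voice name
    lang: str = "en-US",
    rate: str = "medium",
) -> str:
    """Wrap plain text in a minimal SSML document.

    The text is XML-escaped so characters such as & or < in a patient
    message cannot break the markup.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</voice></speak>"
    )
```

The resulting string can then be handed to whichever synthesis or playback call your pipeline uses, which keeps prompt construction separate from call control and easy to unit test.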

    To manage the full flow (playing a message, waiting for input, then responding), you can use Azure Durable Functions or Logic Apps. These help you organize and control what happens step by step.
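
To make the step-by-step idea concrete, here is a small state machine in plain Python. It is not Durable Functions or Logic Apps code; it only sketches the kind of transition table an orchestrator would encode. The step names and the event names (`play_completed`, `input_received`, and so on) are hypothetical labels that your webhook layer would map real call events onto.

```python
from enum import Enum, auto
from typing import Dict, Iterable, Tuple


class Step(Enum):
    PLAY_PROMPT = auto()
    WAIT_FOR_INPUT = auto()
    RESPOND = auto()
    HANG_UP = auto()
    DONE = auto()


# (current step, event) -> next step. Event names are illustrative only.
TRANSITIONS: Dict[Tuple[Step, str], Step] = {
    (Step.PLAY_PROMPT, "play_completed"): Step.WAIT_FOR_INPUT,
    (Step.PLAY_PROMPT, "play_failed"): Step.HANG_UP,
    (Step.WAIT_FOR_INPUT, "input_received"): Step.RESPOND,
    (Step.WAIT_FOR_INPUT, "input_timeout"): Step.PLAY_PROMPT,  # re-prompt
    (Step.RESPOND, "play_completed"): Step.HANG_UP,
}


def advance(step: Step, event: str) -> Step:
    """Return the next step for an incoming event."""
    if event == "call_disconnected":
        return Step.DONE  # the caller can hang up at any point
    # Unknown or out-of-order events leave the flow where it is.
    return TRANSITIONS.get((step, event), step)


def run(events: Iterable[str], start: Step = Step.PLAY_PROMPT) -> Step:
    """Fold a sequence of events into the final step."""
    step = start
    for event in events:
        step = advance(step, event)
    return step
```

Keeping the transitions in a table like this makes it straightforward to persist the current step per call and to port the same logic into a durable orchestrator later.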

    Even though there isn’t a single GitHub repo with all of this combined, Microsoft does have a helpful sample that shows how to connect ACS with Azure OpenAI for dynamic voice calls - Call Automation + Azure OpenAI Sample (JavaScript)

    And here’s a good reference on how call automation works in general - ACS Call Automation Concepts

    Hope it helps!


    Please do not forget to click "Accept the answer" and "Yes" wherever the information provided helps you; this can be beneficial to other community members.

    If you have any other questions or are still running into issues, let me know in the comments and I would be happy to help.

