Understanding fail over and fail back in disaster recovery

Question

Handinata Tanudjaja 230 Reputation points

Hi everyone,

I would like to make sure I have a proper understanding of the disaster recovery process.
The following is my understanding:

When Microsoft declares a disaster in the primary region, the fail over from the primary to the secondary region is initiated.
This makes the secondary become the new primary, and the original primary becomes the new secondary.
Fail over cannot be initiated before Microsoft declares a disaster in the region unless Cross Region Restore is enabled.
It's possible to initiate a fail back from the new primary to the new secondary, which makes the new secondary become the primary again.
However, it's important to check the Last Sync Time value before failing back in order to avoid major data loss.

Am I capturing the high-level points of fail over and fail back in disaster recovery?
Thank you

Mohan Krishna T Sreeramulu 80 Reputation points Microsoft External Staff

2025-04-29T23:12:24.1633333+00:00
Hello Handinata Tanudjaja,

Welcome to Microsoft Q&A!

You're definitely on the right track with the failover and failback processes in disaster recovery. Here’s a breakdown of your understanding with some clarifications:

Failover initiation: You mentioned that failover occurs when Microsoft declares a disaster in the primary region. That’s correct. The secondary region becomes primary, and the original primary becomes secondary. As you noted, customers cannot initiate failover before a disaster is declared unless Cross Region Restore is enabled.

Failback: Your understanding of failback is also spot on. It’s crucial to check the Last Sync Time before failing back to ensure data consistency and prevent data loss, because writes made after the Last Sync Time may not have been replicated.
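
To make the Last Sync Time check concrete, here is a minimal Python sketch. It assumes you have already read the Last Sync Time from your service and that you know when the last write happened on the current primary. The function names and the `max_loss` tolerance are illustrative only and are not part of any Azure SDK.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalize it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def unreplicated_window(last_sync_time: str, last_write_time: str) -> timedelta:
    """Return how much write activity may not be replicated yet.

    Writes made after last_sync_time may not exist in the other region.
    """
    gap = parse_utc(last_write_time) - parse_utc(last_sync_time)
    return max(gap, timedelta(0))


def safe_to_fail_back(last_sync_time: str, last_write_time: str,
                      max_loss: timedelta = timedelta(0)) -> bool:
    """True when the unreplicated window is within the tolerated loss."""
    return unreplicated_window(last_sync_time, last_write_time) <= max_loss


window = unreplicated_window("2025-04-30T00:30:00+00:00",
                             "2025-04-30T00:41:00+00:00")
print(window)  # 0:11:00
```

With the default tolerance of zero, the example above would report that failing back is not yet safe, since 11 minutes of writes could be lost.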

In addition to what you've shared, here are a few key points:

Data synchronization: During failover, some data may still be written to the original primary. This can lead to synchronization issues that need to be resolved during failback.

Downtime: Failback may sometimes involve extra downtime compared to failover, due to data verification and remediation steps.

Overall, you’ve captured the high-level points quite well! If you need more details or have specific scenarios or concerns you’d like to discuss, feel free to share them with us.

References:

What are business continuity, high availability, and disaster recovery?

Common questions about Azure-to-Azure disaster recovery

If this answers your query, please click on 'Accept Answer'. This helps us to continuously improve the quality and relevance of our solutions. If you need further clarification, please feel free to let us know, as we are always here to help whenever you need us.

Handinata Tanudjaja 230 Reputation points

2025-04-30T00:41:08.6566667+00:00

Hi @Mohan Krishna T Sreeramulu,
Thank you for your reply!

Regarding your additional point about data synchronization issues, does Microsoft provide any service to resolve them during fail back?
And when you mentioned synchronization, were you referring to the data verification and remediation steps?

Thanks

Mohan Krishna T Sreeramulu 80 Reputation points Microsoft External Staff

2025-04-30T20:21:27.41+00:00

Hi Handinata Tanudjaja,

During the failback process, Microsoft does not provide a specific service to automatically resolve data synchronization issues that may arise. Instead, it is often necessary to manage these issues manually. Data synchronization issues can occur if the original primary instance has continued to write data during the failover. Part of the failback process involves ensuring data consistency, which may require manual intervention to handle conflicts or duplications.

The synchronization mentioned refers to both data verification and remediation steps. Remediation may involve resetting databases or other states if there are conflicts and ensuring that the primary instance is in a known good state before proceeding with the failback.
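
As a rough illustration of the manual verification step, the sketch below compares two snapshots of the same data set, one taken from the original primary and one from the current primary, and reports what needs remediation. It assumes you can export each side as a mapping of record key to content hash; the function name, the keys, and the hash values are hypothetical.

```python
from typing import Dict, List, Tuple


def find_conflicts(original_primary: Dict[str, str],
                   current_primary: Dict[str, str]
                   ) -> Tuple[List[str], List[str], List[str]]:
    """Compare two key -> content-hash maps.

    Returns (conflicting, only_on_original, only_on_current), each sorted.
    """
    shared = original_primary.keys() & current_primary.keys()
    # Same key on both sides but different content: needs a manual decision.
    conflicting = sorted(k for k in shared
                         if original_primary[k] != current_primary[k])
    # Records written to one side only during the failover period.
    only_on_original = sorted(original_primary.keys() - current_primary.keys())
    only_on_current = sorted(current_primary.keys() - original_primary.keys())
    return conflicting, only_on_original, only_on_current


original = {"order-1": "a1", "order-2": "b2", "order-3": "c3"}
current = {"order-1": "a1", "order-2": "zz", "order-4": "d4"}
print(find_conflicts(original, current))
# (['order-2'], ['order-3'], ['order-4'])
```

Records in the first list were changed on both sides and need a decision about which version wins; the other two lists are candidates for copying across before the failback proceeds.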

As Ashok Gandhi Kotnana mentioned, for virtual machines (VMs) protected with Azure Site Recovery (ASR), failover can be initiated at any time, without needing Microsoft to declare a disaster.

If you want to know more, please refer to the following links for detailed information on failover, reprotection, and failback processes in Azure Site Recovery.

References:
https://learn.microsoft.com/en-us/azure/site-recovery/azure-to-azure-tutorial-failback

https://learn.microsoft.com/en-us/azure/site-recovery/site-recovery-faq

If you found the answer helpful, please click on 'Accept Answer'. If you encounter any issues or need further clarification, please feel free to let us know, as we are always here to help whenever you need us.

Mohan Krishna T Sreeramulu 80 Reputation points Microsoft External Staff

2025-05-02T20:06:33.7933333+00:00

Hi Handinata Tanudjaja,

Just checking in to see if the answer provided by Ashok Gandhi Kotnana helped.


Answer 1

Hi @Handinata Tanudjaja

Thanks for your response.

Services like Azure Storage (with GRS) and Azure SQL (with auto-failover groups) support geo-replication to a paired region. For these platform-managed services, a Microsoft-initiated failover promotes the secondary region to primary only if the source region is officially declared a disaster. Both services also support customer-initiated failover (customer-managed account failover for Azure Storage, and manual failover of a failover group for Azure SQL), so check each service's documentation for what applies to your configuration.

For Virtual Machines (VMs) protected with Azure Site Recovery (ASR), failover can be initiated at any time — without needing Microsoft to declare a disaster.

You have full control over the failover process, which includes:

Test Failover: Simulates failover to validate your recovery plans.

Planned Failover: Used for maintenance or pre-scheduled downtime.

Unplanned Failover: Triggered during unexpected outages.

The replication policy plays a key role in defining the Recovery Point Objective (RPO) and governs the types of recovery points:

1. Crash-consistent recovery points: Capture the on-disk state at the time of the snapshot, like pulling the power plug on a server. They do not include data in memory or in-flight transactions.

2. Application-consistent recovery points: Include all the data in a crash-consistent snapshot, plus in-memory data, in-flight transactions, and application state. These are created using VSS snapshots (Windows) or custom scripts (Linux).

By default, ASR creates crash-consistent snapshots every 5 minutes. App-consistent snapshots are created less frequently, based on OS and workload capability.
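
To show how the two recovery point types trade consistency against data loss, here is a small Python sketch. It is not the ASR API: the `RecoveryPoint` class and `pick_recovery_point` function are hypothetical, and the list of points is made-up sample data. It picks the latest point taken at or before the outage, optionally preferring an app-consistent one.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool  # False means crash-consistent


def pick_recovery_point(points: List[RecoveryPoint], outage_start: datetime,
                        prefer_app_consistent: bool = True
                        ) -> Optional[RecoveryPoint]:
    """Pick the latest recovery point taken at or before the outage.

    With prefer_app_consistent=True, the latest app-consistent point wins
    if one exists; otherwise the latest point of any type is returned.
    """
    candidates = [p for p in points if p.taken_at <= outage_start]
    if not candidates:
        return None
    if prefer_app_consistent:
        app_points = [p for p in candidates if p.app_consistent]
        if app_points:
            return max(app_points, key=lambda p: p.taken_at)
    return max(candidates, key=lambda p: p.taken_at)


points = [
    RecoveryPoint(datetime(2025, 5, 2, 10, 0), app_consistent=True),
    RecoveryPoint(datetime(2025, 5, 2, 10, 5), app_consistent=False),
    RecoveryPoint(datetime(2025, 5, 2, 10, 10), app_consistent=False),
]
outage = datetime(2025, 5, 2, 10, 12)

app = pick_recovery_point(points, outage)
crash = pick_recovery_point(points, outage, prefer_app_consistent=False)
print(outage - app.taken_at)    # 0:12:00
print(outage - crash.taken_at)  # 0:02:00
```

In this sample, the app-consistent point gives a cleaner application state but loses 12 minutes of data, while the latest crash-consistent point loses only 2 minutes.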

Reprotection and Failback:

1. Commit failover: After validating that the failed-over VM is functioning correctly, commit the failover. This designates the VM as the new primary, and the original primary is deactivated.

2. Reprotection: The replication direction is reversed, and the failed-over VM begins replicating back to the original region using the same replication policy.

3. Failback: Once the original region is operational, you can initiate failback. The Recovery Services vault automatically synchronizes all delta changes between the failed-over VM and the original source VM.

Note: Azure Site Recovery uses disk-level replication, ensuring disk consistency throughout reprotection and failback.
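
The ordering of these steps can be sketched as a small state machine. This is only an illustration of the sequence described above, not how ASR is implemented; the state names, action names, and `apply_action` function are all hypothetical.

```python
from enum import Enum


class VmState(Enum):
    PROTECTED = "replicating from the original region to the secondary"
    FAILED_OVER = "running in the secondary, failover not yet committed"
    COMMITTED = "failover committed, secondary is the new primary"
    REPROTECTED = "replicating from the secondary back to the original region"
    FAILED_BACK = "running in the original region again"


# Each action is valid from exactly one state, which enforces the order
# failover -> commit -> reprotect -> failback.
TRANSITIONS = {
    ("failover", VmState.PROTECTED): VmState.FAILED_OVER,
    ("commit", VmState.FAILED_OVER): VmState.COMMITTED,
    ("reprotect", VmState.COMMITTED): VmState.REPROTECTED,
    ("failback", VmState.REPROTECTED): VmState.FAILED_BACK,
}


def apply_action(state: VmState, action: str) -> VmState:
    """Return the next state, or raise ValueError if the action is out of order."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} while VM is {state.name}") from None


state = VmState.PROTECTED
for action in ["failover", "commit", "reprotect", "failback"]:
    state = apply_action(state, action)
print(state.name)  # FAILED_BACK
```

Trying to fail back straight after failover, without committing and reprotecting first, raises a `ValueError` in this sketch, which mirrors the point that reprotection has to happen before failback.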

Please refer to the following link for detailed information on failover, reprotection, and failback processes in Azure Site Recovery.

Reference:
https://learn.microsoft.com/en-us/azure/site-recovery/azure-to-azure-tutorial-failback

Hope I have answered your questions!

If you found this informative, please consider accepting the answer and upvoting it. And don't forget to give it a thumbs up 👍 if it was helpful.

Ashok Gandhi Kotnana 6,355 Reputation points Microsoft External Staff

2025-05-02T05:25:51.9733333+00:00

Hi @Handinata Tanudjaja

Just checking whether the above answer worked for you. If not, please let us know; we are always here to help whenever you need us.

Please do not forget to "Accept the answer" and "upvote it" wherever the information provided helps you. This can be beneficial to other community members.

Handinata Tanudjaja 230 Reputation points

2025-05-02T19:22:37.8+00:00

Hi @Ashok Gandhi Kotnana,
Thank you very much for your explanation!
I would like to confirm one thing about replication.
Replication can be done for any Azure resource, correct? Not just VMs, Azure Storage, and Azure SQL.

Thank you

Ashok Gandhi Kotnana 6,355 Reputation points Microsoft External Staff

2025-05-02T19:36:57.0366667+00:00

Hi @Handinata Tanudjaja

Replication in Azure is primarily focused on specific resources such as virtual machines (VMs), Azure Storage, and Azure SQL. While Azure offers various services that can be replicated for disaster recovery, not all Azure resources support replication. For instance, Azure Site Recovery specifically provides support for VMs and their disks, and integrates with certain applications like SharePoint, Exchange, and SQL Server for application-aware replication. However, the replication capabilities may vary depending on the type of resource and its specific requirements.

While VM replication, Azure Storage replication, and Azure SQL Database replication are well-known examples, other Azure services like Azure Kubernetes Service (AKS), Azure Cosmos DB, and Azure Event Grid also offer replication features.

However, the exact replication method and availability depend on the service. Some services use geo-redundancy, while others rely on zone-redundancy or data synchronization mechanisms.

Please let me know if you face any challenges here; I can help you resolve the issue.

