Azure MySQL Flexible Server v8.0 switching to failover instance frequently stating health check failure

Youvashri 0 Reputation points
2025-04-28T16:02:41.3066667+00:00

Hello!

We are hosting our production database on Azure Database for MySQL – Flexible Server (D2ads v5) with zone redundant High Availability (HA) enabled. Recently, we upgraded the database engine from MySQL 5.7 to MySQL 8.0.

Since the upgrade, we have been experiencing frequent failovers (approximately 10-15 times), each reported as an unplanned failover triggered by a health check failure. The standby server also takes 2-3 minutes to come online. Monitoring shows that CPU and memory usage remain stable, typically between 60%–80%. Additionally, we do not observe a significant number of slow-running queries.

Upon checking the database error logs, we noticed entries such as:

"Database was not shut down normally!"

"Starting crash recovery."

Despite this, we have been unable to pinpoint the root cause of the issue or find a clear path to troubleshoot further.

Could anyone advise on potential causes or suggest further steps to diagnose and resolve this problem? Attaching the error log file for reference.

Thank you in advance!

Azure Database for MySQL

1 answer

  1. Sina Salam 19,616 Reputation points
    2025-04-29T17:37:20.2833333+00:00

    Hello Youvashri,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your Azure MySQL Flexible Server v8.0 instance is frequently switching to the failover instance and reporting health check failures.

    Based on your description, there are a couple of things you can do to resolve the issue:

    1. Check InnoDB crash recovery behavior. MySQL 8.0 enhances crash recovery but may require tuning. Verify that innodb_force_recovery is still at its default of 0, since nonzero values are meant only for emergency recovery - https://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html MySQL 8.0 also uses atomic DDL, which can lengthen recovery if a DDL statement is interrupted, so check the logs for DDL transaction errors - https://dev.mysql.com/doc/refman/8.0/en/atomic-ddl.html
    2. Use SHOW REPLICA STATUS and check Seconds_Behind_Source to monitor replication lag. Azure HA uses replication; lag >5s can trigger failovers - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-high-availability#monitoring Then use Azure Metrics to monitor IOPS and storage latency, because high latency can cause health check timeouts - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-monitoring
    3. Investigate the unclean shutdowns. Where the service lets you change it, setting innodb_fast_shutdown = 0 makes InnoDB perform a slow shutdown (a full purge and change buffer merge) before stopping, which can reduce the work left for the next startup - https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_fast_shutdown This only helps with clean shutdowns, so also check for OOM (Out of Memory) events or forced instance terminations. Note that sys.dm_os_ring_buffers is a SQL Server view (https://learn.microsoft.com/en-us/azure/azure-sql/database/monitoring-tuning) and is not available in MySQL, so use the MySQL statements below together with the server logs and Azure metrics:
         -- Run these statements one after the other.
         
         -- List open InnoDB transactions; long-running ones can lengthen crash recovery.
         SELECT * FROM information_schema.innodb_trx;
         
         -- Show InnoDB internals such as pending I/O and the latest detected deadlock.
         SHOW ENGINE INNODB STATUS;
         
         -- sys.dm_os_ring_buffers is SQL Server only and would fail on MySQL,
         -- so the query against it is left out here.
         
      
      For Azure Flexible Server Monitoring - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-monitoring and, for reference only, the SQL Server documentation for sys.dm_os_ring_buffers - https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-ring-buffers-transact-sql
    4. Manually trigger a failover via the Azure Portal and measure the recovery time. If it matches the observed 2–3 minutes, the delay most likely comes from Azure-side HA orchestration rather than from your workload - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/how-to-manage-high-availability-portal Then make sure the MySQL 8.0 server parameters align with Azure's HA requirements (e.g., server_id, read_only settings) - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-server-parameters
    5. If the issue persists, escalate to Microsoft with your diagnostic logs via Priority Customer Support (PCS).
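
    To line up the crash-recovery entries you quoted with the failover times shown in the portal, a small script can pull their timestamps out of a downloaded error log. The following is only a minimal sketch: it assumes each log line starts with a UTC timestamp in the default MySQL 8.0 error log layout, and the function name find_crash_recoveries, the 5-minute grouping window, and the sample lines are all hypothetical, not taken from your attached log.

```python
import re
from datetime import datetime, timedelta

# Messages quoted in the question; InnoDB logs them when it starts
# after an unclean stop.
CRASH_MARKERS = (
    "Database was not shut down normally!",
    "Starting crash recovery.",
)

# Assumption: every line begins with an ISO 8601 UTC timestamp such as
# 2025-04-28T15:58:01.123456Z (default MySQL 8.0 error log layout).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def find_crash_recoveries(lines, window=timedelta(minutes=5)):
    """Return one datetime per restart that needed crash recovery.

    Lines are expected in chronological order. Marker lines within
    `window` of the first marker of a restart are folded into that
    restart, so the two messages are counted only once.
    """
    events = []
    for line in lines:
        if not any(marker in line for marker in CRASH_MARKERS):
            continue
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # no leading timestamp: skip rather than guess
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if not events or stamp - events[-1] > window:
            events.append(stamp)
    return events


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    sample = [
        "2025-04-28T10:00:00.000001Z 0 [Note] [InnoDB] Database was not shut down normally!",
        "2025-04-28T10:00:00.000002Z 0 [Note] [InnoDB] Starting crash recovery.",
        "2025-04-28T12:30:00.500000Z 0 [Note] [InnoDB] Database was not shut down normally!",
    ]
    for event in find_crash_recoveries(sample):
        print(event.isoformat())
```

    To use it on a real log, pass the opened file instead of the sample list, for example find_crash_recoveries(open("error.log", encoding="utf-8")), where error.log stands for whatever file you downloaded. If the timestamps cluster around particular jobs or times of day, that narrows down what is killing the instance.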

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread by upvoting this and accepting it as the answer if it was helpful.

