Azure MySQL Flexible Server v8.0 switching to failover instance frequently stating health check failure

Question

Youvashri 0

Hello!

We are hosting our production database on Azure Database for MySQL – Flexible Server (D2ads v5) with zone redundant High Availability (HA) enabled. Recently, we upgraded the database engine from MySQL 5.7 to MySQL 8.0.

Since the upgrade, we have been experiencing frequent failovers, approximately 10-15 times a day, each reported as an unplanned failover triggered by a health check failure. The standby also takes 2-3 minutes to come online after each failover. Monitoring shows that CPU and memory usage remain stable, typically between 60% and 80%. We also do not observe a significant number of slow-running queries.

Upon checking the database error logs, we noticed entries such as:

"Database was not shut down normally!"

"Starting crash recovery."

Despite this, we have been unable to pinpoint the root cause of the issue or find a clear path to troubleshoot further.
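
One way to quantify these entries is to count them per hour, which makes them easier to line up with failover times in the Azure activity log. A minimal Python sketch, assuming the downloaded error log uses the standard MySQL 8.0 line format that starts with an ISO 8601 UTC timestamp; the function name and the sample lines are made up for illustration and are not taken from the attached log:

```python
import re
from collections import Counter

# Markers quoted from the error log entries above.
MARKERS = ("Database was not shut down normally!", "Starting crash recovery.")

# Assumed line prefix, e.g. 2025-04-27T06:12:01.123456Z; captures date + hour.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}")

def crash_recovery_by_hour(lines):
    """Count lines containing a crash-recovery marker, grouped by hour (UTC)."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in MARKERS):
            continue
        match = TS_RE.match(line)
        hour = match.group(1) if match else "unknown"
        counts[hour] += 1
    return dict(counts)

# Hypothetical sample lines for illustration only.
sample = [
    "2025-04-27T06:12:01.100000Z 0 [Note] [InnoDB] Database was not shut down normally!",
    "2025-04-27T06:12:01.200000Z 0 [Note] [InnoDB] Starting crash recovery.",
    "2025-04-27T09:40:15.000000Z 0 [System] [Server] ready for connections.",
]
print(crash_recovery_by_hour(sample))
```

Hours that show markers but no matching failover in the activity log (or the reverse) are the ones worth a closer look.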

Could anyone advise on potential causes or suggest further steps to diagnose and resolve this problem? Attaching the error log file for reference.

Thank you in advance!

Mallaiah Sangi 650 Reputation points Microsoft External Staff

2025-04-28T16:38:10.9466667+00:00
Hi Youvashri

Greetings!

As I understand it, you have been experiencing frequent failovers since upgrading from MySQL 5.7 to MySQL 8.0.

It sounds like your Azure MySQL Flexible Server is experiencing frequent failovers due to health check failures. This could be related to high availability configurations, recent upgrades, or underlying infrastructure issues.

Here are a few things to check:

High Availability Configuration: Azure MySQL Flexible Server offers Zone-Redundant HA and Same-Zone HA options. If your server is configured for high availability, failovers can occur automatically when the primary instance fails health checks.

Recent Upgrades: Some users have reported frequent failovers after upgrading to MySQL 8.0. If this applies to your setup, reviewing logs and error messages might help pinpoint the issue.

Connection Pooling Considerations: If you're using connection pooling, failover events may require clearing connections to ensure new connections point to the correct primary server.

You can find more details on handling failover events and troubleshooting in the official Azure documentation link below.

https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-high-availability

If the issue persists, reviewing logs or sharing the error logs with us will allow us to investigate further. Let me know if you need help identifying relevant logs or analyzing specific errors.

I hope this information helps. Please let us know if you have any further queries.
Youvashri 0 Reputation points

2025-04-28T17:10:02.8533333+00:00

Hello Mallaiah Sangi,

Thank you for your response.

To add more context, we are not using any external connection pooling mechanisms. Our applications implement their own retry logic for database connections.
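
For readers comparing notes, a minimal sketch of what application-side retry logic of this kind typically looks like. It is an illustration, not the code running in the applications described here, and it assumes a hypothetical `connect` callable that stands in for the database driver's connect function and raises `ConnectionError` while the server is failing over:

```python
import random
import time

def connect_with_retry(connect, attempts=6, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially with jitter.

    `connect` is a hypothetical zero-argument callable standing in for the
    driver's connect function; it should raise ConnectionError on failure.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads reconnects so clients do not all return at once.
            sleep(delay * random.uniform(0.5, 1.0))
```

With the defaults shown, the total wait is at most 15.5 seconds, which is shorter than the 2-3 minute failover described above, so `attempts` or `max_delay` would need raising to ride out a failover.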

At this stage, we are looking to find whether the root cause is related to an OS-level or hardware-level issue on the Azure Flexible Server side, or if it is something caused by our application behavior.

Any suggestions for specific debugging steps or diagnostic checks that could help us classify and resolve the issue would be greatly appreciated.

I am also attaching the database error log file for reference. mysql-error-toolsaksdatabase-2025042706.log
PratikLad 960 Reputation points Microsoft External Staff

2025-05-01T12:00:06.2966667+00:00

Hi Youvashri, we haven't heard back from you on the last response and are just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.
Youvashri 0 Reputation points

2025-05-02T02:30:52.4966667+00:00

Hi Pratik,

We still have not found the exact root cause. However, we noticed that the DB fails over whenever active connections rise above 300. We also tried scaling the DB up to D4as_v5 and still have the issue. We strongly suspect that some factor, combined with this connection count, is causing the issue :(

We also tried implementing connection pooling in our applications. That does not solve the issue; it only reduced the failover count from about 15 to about 12 per day.
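
To check the suspected link between connection count and failover, one option is to export the active-connections metric from Azure Monitor and look at the peak just before each failover. A minimal sketch, assuming the (timestamp, active connections) samples and the failover timestamps are already available as Python values; all names and numbers are illustrative:

```python
from datetime import datetime, timedelta

def peak_before_failovers(samples, failovers, window=timedelta(minutes=5)):
    """For each failover time, return the highest connection count seen in
    the `window` leading up to it (None if no samples fall in that window)."""
    result = {}
    for failover in failovers:
        in_window = [count for ts, count in samples
                     if failover - window <= ts <= failover]
        result[failover] = max(in_window) if in_window else None
    return result

# Made-up data for illustration only.
t = datetime(2025, 5, 1, 10, 0)
samples = [(t + timedelta(minutes=m), c)
           for m, c in [(0, 120), (1, 180), (2, 260), (3, 310), (10, 90)]]
failovers = [t + timedelta(minutes=4), t + timedelta(minutes=30)]
for when, peak in peak_before_failovers(samples, failovers).items():
    print(when.isoformat(), peak)
```

If every failover shows a peak near the same value while quieter periods have none, that supports the connection-count theory; failovers with low peaks point elsewhere.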

Answer 1

Hello Youvashri,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that your Azure MySQL Flexible Server v8.0 is frequently switching to the failover instance, with health check failure given as the reason.

Based on your description, there are a few things you can do to narrow down the issue:

Check InnoDB crash recovery behavior. The "Starting crash recovery" entries mean the previous mysqld process stopped without a clean shutdown, so look at what the error log records just before each one. Verify that innodb_force_recovery has not been set to a non-zero value - https://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html MySQL 8.0 also uses atomic DDL, so a DDL statement interrupted by a crash is completed or rolled back during recovery, which can lengthen recovery. Check the logs for DDL errors - https://dev.mysql.com/doc/refman/8.0/en/atomic-ddl.html
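Check replication lag to the standby. Azure HA keeps the standby in sync with the primary, and a large lag gives the standby more to replay at failover - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-high-availability#monitoring The standby cannot be queried directly, so watch the HA Replication Lag, HA IO Status and HA SQL Status metrics in Azure Monitor; SHOW REPLICA STATUS and Seconds_Behind_Source apply only if you also run read replicas. Then use Azure Metrics to monitor IOPS and storage latency. High latency can cause health check timeouts - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-monitoring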
Investigate the unclean shutdowns. Where the parameter is configurable, setting innodb_fast_shutdown = 0 makes InnoDB do a full purge and change buffer merge on a clean shutdown - https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_fast_shutdown It has no effect when mysqld is killed, so on its own it will not stop crash recovery after an unplanned failover. Also check for OOM (Out of Memory) or VM kill events. MySQL has no system table for host-level events (sys.dm_os_ring_buffers exists only in SQL Server and Azure SQL), so look for these in Resource Health and the activity log in the Azure portal. On the MySQL side, run:
```
   -- Run these statements one after the other.
   
   -- Long-running or stuck transactions:
   SELECT * FROM information_schema.innodb_trx;
   
   -- InnoDB internals (latest deadlock, buffer pool, pending I/O):
   SHOW ENGINE INNODB STATUS;
   
   -- Seconds since mysqld started; a low value right after a failover
   -- confirms that the instance was restarted:
   SHOW GLOBAL STATUS LIKE 'Uptime';
   
```
For Azure Flexible Server Monitoring - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-monitoring
Manually trigger a forced failover from the Azure portal and measure the recovery time. If it also takes 2-3 minutes, that duration comes from the HA failover process itself rather than from your workload - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/how-to-manage-high-availability-portal Then make sure the MySQL 8.0 server parameters align with Azure's HA requirements (e.g., server_id, read_only settings) - https://learn.microsoft.com/en-us/azure/mysql/flexible-server/concepts-server-parameters
If the failovers continue, you can escalate to Microsoft with diagnostic logs via Priority Customer Support (PCS).
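
The recovery-time measurement in the forced-failover step can be scripted by probing the server at a fixed interval and recording the longest stretch of failed probes. A minimal sketch, assuming a hypothetical `probe` callable that returns True when a test connection and a trivial query succeed; the clock and sleep functions are injectable so the logic can be tried without a server:

```python
import time

def longest_outage(probe, duration, interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Probe every `interval` seconds for `duration` seconds and return the
    longest continuous period (in seconds) during which probes failed."""
    start = clock()
    down_since = None
    longest = 0.0
    while True:
        now = clock()
        if now - start >= duration:
            break
        if probe():
            if down_since is not None:
                # Server is back: close the current outage.
                longest = max(longest, now - down_since)
                down_since = None
        elif down_since is None:
            down_since = now  # first failed probe of a new outage
        sleep(interval)
    if down_since is not None:
        # Still down when the observation window ended.
        longest = max(longest, clock() - down_since)
    return longest
```

The result is only as precise as `interval`, so a one-second interval is enough to tell a 60-second failover from a 3-minute one. Start the script shortly before triggering the failover and let it run for several minutes.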

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close the thread by upvoting and accepting this as the answer if it is helpful.

Youvashri 0 Reputation points

2025-04-30T05:02:09.7133333+00:00

Hello Sina,

Thanks for your inputs. Let me try these and get back to you.
