A failover time profile is a specific combination of failover parameters that determines how long failover may take and how aggressive it is. These parameters include `failover_timeout_sec` and `failover_reader_connect_timeout_sec`. By default, failover must complete within 5 minutes. If the connection is not re-established in that time, the failover process times out and fails. Users can configure the failover parameters to adjust the aggressiveness of failover to suit their application. For example, a user could shorten the failover time limit to promote a fail-fast approach for an application that does not tolerate database outages. Examples of normal and aggressive failover time profiles are shown below.
Normal failover time profile:

Parameter | Value |
---|---|
`failover_timeout_sec` | 300 |
`failover_writer_reconnect_interval_sec` | 2 |
`failover_reader_connect_timeout_sec` | 30 |
`failover_cluster_topology_refresh_rate_sec` | 2 |

Aggressive failover time profile:

Parameter | Value |
---|---|
`failover_timeout_sec` | 30 |
`failover_writer_reconnect_interval_sec` | 2 |
`failover_reader_connect_timeout_sec` | 10 |
`failover_cluster_topology_refresh_rate_sec` | 2 |

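
As a rough illustration, the two profiles can be kept as plain Python dictionaries and merged into an application's connection settings. Only the parameter names and values come from the tables; the helper `build_connect_params`, the placeholder host, and the way the resulting settings are handed to a driver are assumptions made for this sketch.

```python
# Parameter names and values are copied from the failover time profile tables.
NORMAL_PROFILE = {
    "failover_timeout_sec": 300,
    "failover_writer_reconnect_interval_sec": 2,
    "failover_reader_connect_timeout_sec": 30,
    "failover_cluster_topology_refresh_rate_sec": 2,
}

AGGRESSIVE_PROFILE = {
    "failover_timeout_sec": 30,
    "failover_writer_reconnect_interval_sec": 2,
    "failover_reader_connect_timeout_sec": 10,
    "failover_cluster_topology_refresh_rate_sec": 2,
}


def build_connect_params(base: dict, profile: dict) -> dict:
    """Hypothetical helper: merge a failover time profile into base settings.

    Rejects non-positive values, then returns a new dictionary so that
    neither input is modified.
    """
    for name, value in profile.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return {**base, **profile}


# Placeholder host and port, for illustration only.
params = build_connect_params(
    {"host": "my-cluster.example", "port": 5432},
    AGGRESSIVE_PROFILE,
)
```

Keeping each profile as a named constant makes it easy to switch between a tolerant and a fail-fast configuration per environment.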
Connecting to the writer cluster endpoint immediately after failover can result in a faulty connection, because there can be a delay before the endpoint is updated to point to the new writer. The AWS DNS server usually reflects this change after 15-20 seconds, but other DNS servers sitting between the application and the AWS DNS server may take longer to update. Connecting with stale DNS data will most likely cause problems, for example by reaching an instance that is no longer the writer, so it is important to keep this delay in mind.
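
One way to observe this delay from the application side is to poll DNS until the endpoint resolves to a different address than the one recorded before failover. The sketch below is an application-level illustration, not a description of how the failover plugin works; the function name `wait_for_dns_change` and its parameters are assumptions. A changed address only shows that the DNS record has moved, and it does not prove that the new target is the writer.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_dns_change(
    hostname: str,
    stale_ip: str,
    timeout_sec: float = 60.0,
    poll_interval_sec: float = 2.0,
    resolve: Callable[[str], str] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll DNS until `hostname` stops resolving to `stale_ip`.

    Returns the new IP address, or None if the record has not changed
    by the time `timeout_sec` has elapsed. The resolver, sleep, and clock
    are injectable so the function can be exercised without a network.
    """
    deadline = clock() + timeout_sec
    while True:
        try:
            current_ip = resolve(hostname)
        except OSError:
            # Resolution failures (socket.gaierror is an OSError) are
            # treated as "not updated yet".
            current_ip = None
        if current_ip is not None and current_ip != stale_ip:
            return current_ip
        if clock() >= deadline:
            return None
        sleep(poll_interval_sec)
```

Note that resolver caches on the application host can also return stale answers, so the observed delay may exceed the 15-20 seconds mentioned above.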
The failover process offers limited benefit for a 2-host cluster because fewer instances are available to replace the instance that has failed. In particular, when the reader instance fails, there is no other reader to fail over to. Instead, Aurora must revive the same instance that failed. To improve the stability of the cluster, we recommend that your database cluster have at least 3 instances.
A common misconception about failover is that only one host is unavailable during the failover process. In fact, when failover is triggered, all hosts become unavailable for a short time. This is because the control plane, which orchestrates the failover process, first shuts down all hosts, then starts the writer host, and finally starts the remaining hosts and connects them to the writer. In short, failover requires every host to be reconfigured. With this in mind, note that aggressive failover configurations may cause failover to fail, because some hosts may still be unavailable when your failover timeout is reached.
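
The interaction between the failover timeout and host recovery time can be illustrated with a simplified model: retry a connection at a fixed interval until it succeeds or a deadline passes. This is a sketch under stated assumptions, not the plugin's actual algorithm. The names `reconnect_with_deadline` and `simulate` are hypothetical, and the 45-second recovery time is a made-up figure chosen only to show the effect.

```python
import time
from typing import Callable


def reconnect_with_deadline(
    try_connect: Callable[[], bool],
    timeout_sec: float,
    interval_sec: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry `try_connect` every `interval_sec` until it succeeds or
    the next attempt would fall after the deadline."""
    deadline = clock() + timeout_sec
    while True:
        if try_connect():
            return True
        if clock() + interval_sec > deadline:
            return False
        sleep(interval_sec)


def simulate(timeout_sec: float,
             hosts_back_after_sec: float = 45.0,  # made-up recovery time
             interval_sec: float = 2.0) -> bool:
    """Run the retry loop against a simulated clock in which every host
    is unavailable until `hosts_back_after_sec` has elapsed."""
    now = [0.0]

    def fake_sleep(seconds: float) -> None:
        now[0] += seconds

    return reconnect_with_deadline(
        try_connect=lambda: now[0] >= hosts_back_after_sec,
        timeout_sec=timeout_sec,
        interval_sec=interval_sec,
        sleep=fake_sleep,
        clock=lambda: now[0],
    )
```

In this model, `simulate(30)` gives up before the hosts return, while `simulate(300)` succeeds, which mirrors the trade-off between the aggressive and normal profiles.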
If you are experiencing difficulties with the failover plugin, try the following:
- Enable logging to find the cause of the failure. If it is a timeout, review the failover time profiles section and adjust the timeout values.
- For additional assistance, visit the getting help page.