It may be necessary to tune the load balancer health check parameters in order to achieve the desired switchover behavior. There are three typical issues where you might need to tune parameters:
- The Healthy Threshold is set too high, so the new active node is not marked healthy and traffic is not routed to it even after the switchover is complete.
- The Unhealthy Threshold is set too low, making the load balancer overly sensitive to temporary server resource limitations or network interruptions.
- The Unhealthy Threshold is set too high, so the previous active node is not marked unhealthy and traffic continues to be routed to it even after the switchover is complete.
There are typically four health check parameters that may be tuned with a cloud load balancer:
- Health Check Interval
How often the health check servers send health check probes to the backend servers.
- Timeout
How long a health check server will wait to receive a response before considering health check probes failed.
- Healthy Threshold
The number of consecutive successful health check probes required for an unhealthy backend server to be marked as healthy.
- Unhealthy Threshold
The number of consecutive health check probes that must fail in order for a backend server to be marked as unhealthy.
See the official documentation of each cloud platform for more details about these parameters.
From these parameters, we can derive the following values:
Time to Mark Healthy = Health Check Interval × (Healthy Threshold - 1)
The amount of time after the initial health check probe of a healthy server before it is marked healthy by the load balancer, assuming low network latency between the backend server and the health check servers.
Maximum Time to Mark Unhealthy = Health Check Interval × Unhealthy Threshold + Timeout
The maximum amount of time before a failed server is marked unhealthy by the load balancer.
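The two derived values can be computed directly from the four parameters. The following is a minimal sketch in Python; the function and argument names are illustrative assumptions and are not part of any load balancer or LifeKeeper API. All times are in seconds.

```python
def time_to_mark_healthy(interval_s: float, healthy_threshold: int) -> float:
    """Time to Mark Healthy = Health Check Interval x (Healthy Threshold - 1)."""
    return interval_s * (healthy_threshold - 1)


def max_time_to_mark_unhealthy(interval_s: float,
                               unhealthy_threshold: int,
                               timeout_s: float) -> float:
    """Maximum Time to Mark Unhealthy = Interval x Unhealthy Threshold + Timeout."""
    return interval_s * unhealthy_threshold + timeout_s


if __name__ == "__main__":
    # Illustrative values only: 5 s interval, 5 s timeout,
    # Healthy Threshold = 2, Unhealthy Threshold = 3.
    print(time_to_mark_healthy(5, 2))           # 5
    print(max_time_to_mark_unhealthy(5, 3, 5))  # 20
```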
While the exact values for these parameters will vary depending on the environment, some general guidelines are given below.
If the combination of Health Check Interval and Healthy Threshold is set too high, the load balancer may take an unnecessarily long time to begin routing traffic to the active node after a switchover or failover, prolonging the recovery time for the application. A Healthy Threshold of 2 consecutive successes should be appropriate for most situations.
If the combination of Health Check Interval and Unhealthy Threshold is set too low, it may make the load balancer overly sensitive to temporary server resource limitations or network interruptions. For example, setting extreme values such as Timeout = 1 second and Unhealthy Threshold = 1 failure would cause a backend server to be marked unhealthy even if the network became unresponsive for only a few seconds, which is not uncommon in cloud environments. It is recommended to leave the Timeout at a reasonable value (e.g., 5 seconds) so that the load balancer can tolerate minor transient issues.
On the other hand, if the combination of Health Check Interval and Unhealthy Threshold is set too high, it will take the load balancer a long time to mark the previous active node unhealthy after a switchover, resulting in a period of time where load balancer traffic is being routed to both the active and standby servers. In order to avoid this situation, it is recommended to determine the minimum time between the LB Health Check resource being taken out-of-service on the previous active node and being brought in-service on the new active node.
Assuming that the time is synchronized between the cluster nodes, this can be found by inspecting /var/log/lifekeeper.log on each node and checking the amount of time between the end of the LB Health Check remove script on the previous active node and the end of the LB Health Check restore script on the new active node. We will denote this amount of time (in seconds) as Minimum LB Health Check Switchover Time. The recommendation then is to set the load balancer health check parameters such that:
Maximum Time to Mark Unhealthy <= Minimum LB Health Check Switchover Time + Time to Mark Healthy
Therefore, the Unhealthy Threshold can be derived as follows:
Unhealthy Threshold <= (Minimum LB Health Check Switchover Time - Timeout) / Health Check Interval + Healthy Threshold - 1
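Because the Unhealthy Threshold must be a whole number, the right-hand side of this bound is rounded down in practice. The following is a minimal sketch in Python of that calculation; the function and argument names are illustrative assumptions, and the input values are hypothetical.

```python
import math


def max_unhealthy_threshold(switchover_s: float,
                            timeout_s: float,
                            interval_s: float,
                            healthy_threshold: int) -> int:
    """Largest whole Unhealthy Threshold that satisfies the bound.

    switchover_s is the Minimum LB Health Check Switchover Time in seconds.
    """
    if interval_s <= 0:
        raise ValueError("Health Check Interval must be positive")
    bound = (switchover_s - timeout_s) / interval_s + healthy_threshold - 1
    # Rounding down is the conservative direction: a smaller threshold
    # still satisfies the inequality.
    return math.floor(bound)


if __name__ == "__main__":
    # Hypothetical inputs: 32 s switchover time, 5 s timeout,
    # 10 s interval, Healthy Threshold = 2. The bound is 3.7.
    print(max_unhealthy_threshold(32, 5, 10, 2))  # 3
```

A result below 1 would indicate that no Unhealthy Threshold satisfies the recommendation with the chosen Health Check Interval and Timeout, so those values would need to be reduced first.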
As an example, suppose that we have configured our load balancer health check parameters with Health Check Interval = Timeout = 5 seconds and Healthy Threshold = 2 consecutive successes. Suppose that by gathering data from repeated switchover tests with our resource hierarchy, we find that Minimum LB Health Check Switchover Time = 20 seconds. Using the recommendation given above, we should set Unhealthy Threshold so that:
Unhealthy Threshold <= (20 - 5) / 5 + 2 - 1 = 4
Based on this, any value from 2 to 4 could be a reasonable choice for Unhealthy Threshold. If we find when repeating the switchover tests that the load balancer is not marking the previous active node as unhealthy quickly enough after switchover, then we would select a lower value (e.g., 2 or 3). If we find during testing that the load balancer regularly marks the active node as unhealthy due to transient server or network issues, then we would select a higher value (e.g., 3 or 4). Choosing Unhealthy Threshold = 3 could be a reasonable compromise in this example.
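The selection process in this example can be sketched in Python as follows. The sketch takes switchover times measured over repeated tests, uses the smallest one as the Minimum LB Health Check Switchover Time, and lists every Unhealthy Threshold that satisfies the recommended inequality. The function name, the `max_candidate` cutoff, and the measured values are hypothetical and chosen only to match the numbers above.

```python
def candidate_unhealthy_thresholds(measured_switchover_s,
                                   interval_s,
                                   timeout_s,
                                   healthy_threshold,
                                   max_candidate=10):
    """Return the Unhealthy Threshold values that satisfy
    Maximum Time to Mark Unhealthy <= Minimum Switchover Time + Time to Mark Healthy.
    """
    min_switchover_s = min(measured_switchover_s)
    time_to_mark_healthy = interval_s * (healthy_threshold - 1)
    budget_s = min_switchover_s + time_to_mark_healthy
    return [
        u for u in range(1, max_candidate + 1)
        if interval_s * u + timeout_s <= budget_s
    ]


if __name__ == "__main__":
    # Hypothetical switchover times (seconds) from three test runs.
    measurements = [23.0, 20.0, 26.5]
    print(candidate_unhealthy_thresholds(measurements, 5, 5, 2))  # [1, 2, 3, 4]
```

A threshold of 1 also satisfies the inequality, but as noted earlier it makes the load balancer overly sensitive to transient issues, which is why the practical range starts at 2.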