Overview of the Tunable Heartbeat
The LifeKeeper heartbeat is the signal sent between LifeKeeper servers over the communications path(s) to ensure each server is “alive”. There are two aspects of the heartbeat that determine how quickly LifeKeeper detects a failure:
- Interval: the time between heartbeat signals, in seconds. If the LCM signal, which includes the heartbeat signal, is not received from another server within the interval, it counts as a missed heartbeat.
- Number of Heartbeats: the number of consecutive missed heartbeats after which the communications path is considered dead, which can trigger a failover.
The heartbeat values are specified by two tunables in the LifeKeeper defaults file /etc/default/LifeKeeper. These tunables can be changed if you wish LifeKeeper to detect a server failure sooner than it would using the default values:
- LCMHBEATTIME (interval)
- LCMNUMHBEATS (number of heartbeats)
The following table summarizes the default and minimum values for the tunables over both TCP and TTY heartbeats. The interval for a TTY communications path cannot be set below 2 seconds because of the slower nature of the medium.
|Tunable|Default Value|Minimum Value|
|LCMHBEATTIME|5|1 (TCP), 2 (TTY)|
|LCMNUMHBEATS|3|2 (TCP or TTY)|
Consider a LifeKeeper cluster in which both tunables are set to their default values. LifeKeeper sends a heartbeat between servers every 5 seconds. If a communications problem causes two heartbeats to be missed but the heartbeat resumes on the third, LifeKeeper takes no action. However, if the communications path misses 3 consecutive heartbeats, LifeKeeper will label that communications path as dead, but will initiate a failover only if the redundant communications path is also dead.
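The timing in this scenario can be worked out with a little arithmetic. The sketch below assumes that the time to mark a path dead is simply the interval multiplied by the heartbeat count, which is the same arithmetic this document uses for its 625-second example; the variable detection_seconds is introduced here only for illustration.

```shell
# Default tunable values, as described in the text above.
LCMHBEATTIME=5
LCMNUMHBEATS=3

# Approximate time for a communications path to be marked dead,
# assuming it is interval multiplied by heartbeat count.
detection_seconds=$(( LCMHBEATTIME * LCMNUMHBEATS ))
echo "Path marked dead after about ${detection_seconds} seconds"
```

With the defaults, a path is therefore marked dead after roughly 15 seconds without a heartbeat.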
Configuring the Heartbeat
You must manually edit the file /etc/default/LifeKeeper to add each tunable and its associated value. Normally, the defaults file contains no entry for these tunables; you simply append one NAME=value line for each tunable you want to change.
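As an illustration, the appended lines might look like the following. This is a sketch that assumes the NAME=value form of the defaults file; the values shown are just the defaults described above (5 seconds and 3 heartbeats) written out explicitly, and should be replaced with the values you want.

```shell
# Illustrative entries for /etc/default/LifeKeeper (example values only).
# Interval between heartbeats, in seconds:
LCMHBEATTIME=5
# Consecutive missed heartbeats before a path is marked dead:
LCMNUMHBEATS=3
```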
If you set a tunable to a value below its minimum, LifeKeeper will ignore that value and use the minimum value instead.
- If you wish to set the interval to less than 5 seconds, ensure that the communications path is configured on a private network, since values lower than 5 seconds create a high risk of false failovers due to network interruptions. The servers should also be able to respond quickly to LCM requests even under the heaviest expected load.
- Testing has shown that setting the number of heartbeats to less than 2 creates a high risk of false failovers. This is why the value has been restricted to 2 or higher.
- In order to avoid false failovers, both the interval and the heartbeat count must be the same on all servers in the cluster. For this reason, LifeKeeper must be stopped on all servers before these values are modified. If LifeKeeper is already running with protected applications, you can stop it with the command /opt/LifeKeeper/bin/lkstop -f before editing the heartbeat settings. This command stops LifeKeeper but does not stop the protected application.
- LifeKeeper does not impose an upper limit on the LCMHBEATTIME and LCMNUMHBEATS values, but setting them very high can effectively disable LifeKeeper’s ability to detect a failure. For instance, setting both values to 25 would instruct LifeKeeper to wait 625 seconds (over 10 minutes) before detecting a server failure, which may be enough time for the server to reboot and rejoin the cluster.
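Because the values must match on every server, it can help to print the heartbeat entries from each server's defaults file and compare them by eye. The following is a minimal sketch, assuming the entries are plain NAME=value lines; the function name show_heartbeat_tunables is hypothetical and not part of LifeKeeper.

```shell
# Hypothetical helper: print the heartbeat tunables set in a defaults file.
# Prints nothing when neither tunable is present (built-in defaults apply).
show_heartbeat_tunables() {
    hb_file="${1:-/etc/default/LifeKeeper}"
    grep -E '^[[:space:]]*(LCMHBEATTIME|LCMNUMHBEATS)=' "$hb_file" || true
}
```

Running show_heartbeat_tunables with no argument on each server reads /etc/default/LifeKeeper; the output should be identical across the cluster.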
For example, suppose you specify the lowest values allowed by LifeKeeper, LCMHBEATTIME=1 and LCMNUMHBEATS=2, in order to detect a failure as quickly as possible.
LifeKeeper will use a 1-second interval for the TCP communications path and a 2-second interval for TTY, since the TTY interval cannot go below 2 seconds. In the case of a server failure, LifeKeeper will detect the TCP failure first because its interval is shorter (2 heartbeats that are 1 second apart), but will then do nothing until it detects the TTY failure, which will be after 2 heartbeats that are 2 seconds apart.
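The per-path timing in this example can be sketched as follows. The helper functions effective_interval and effective_count are hypothetical; they only mimic the documented rule that a value below the minimum is replaced by the minimum, and the time to mark a path dead is again assumed to be interval multiplied by heartbeat count.

```shell
# Hypothetical helper: raise a requested interval to the documented minimum
# for the given medium (1 second for TCP, 2 seconds for TTY).
effective_interval() {
    case "$2" in
        TCP) min=1 ;;
        TTY) min=2 ;;
        *) echo "unknown medium: $2" >&2; return 1 ;;
    esac
    if [ "$1" -lt "$min" ]; then echo "$min"; else echo "$1"; fi
}

# Hypothetical helper: raise a requested heartbeat count to the minimum of 2.
effective_count() {
    if [ "$1" -lt 2 ]; then echo 2; else echo "$1"; fi
}

# The lowest values LifeKeeper allows for a TCP path.
LCMHBEATTIME=1
LCMNUMHBEATS=2

count=$(effective_count "$LCMNUMHBEATS")
for medium in TCP TTY; do
    interval=$(effective_interval "$LCMHBEATTIME" "$medium")
    echo "$medium: interval ${interval}s, dead after about $(( interval * count ))s"
done
```

Under these assumptions the TCP path is marked dead after about 2 seconds and the TTY path after about 4 seconds, so a failover would not begin until roughly 4 seconds after the server fails.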