The LifeKeeper Communications Manager (LCM) has two functions:
- Messaging. The LCM serves as a conduit through which LifeKeeper sends messages during recovery, during configuration, or while running an audit.
- Failure detection. The LCM also plays a role in detecting whether or not a server has failed.
LifeKeeper has a built-in heartbeat signal that periodically notifies each server in the configuration that its paired server is operating. If a server fails to receive the heartbeat message through one of the communications paths, LifeKeeper marks that path DEAD.
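The per-path bookkeeping described above can be pictured with a minimal Python sketch. It is not LifeKeeper's implementation: the `CommPath` class, the `all_paths_dead` helper, and the `HEARTBEAT_INTERVAL` and `MISSED_LIMIT` values are hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed values for illustration only; they are not taken from this topic.
HEARTBEAT_INTERVAL = 5.0  # seconds between expected heartbeats
MISSED_LIMIT = 3          # consecutive missed heartbeats tolerated per path


@dataclass
class CommPath:
    """One communications path to the paired server (hypothetical model)."""
    name: str
    last_heartbeat: float  # timestamp of the most recent heartbeat received
    state: str = "ALIVE"

    def record_heartbeat(self, now: float) -> None:
        # A heartbeat arriving on a path brings that path back to ALIVE.
        self.last_heartbeat = now
        self.state = "ALIVE"

    def check(self, now: float) -> str:
        # Mark the path DEAD once the silence exceeds the allowed window.
        if now - self.last_heartbeat > HEARTBEAT_INTERVAL * MISSED_LIMIT:
            self.state = "DEAD"
        return self.state


def all_paths_dead(paths, now: float) -> bool:
    """True only when every configured path to the server is DEAD."""
    states = [p.check(now) for p in paths]  # evaluate every path, no short-circuit
    return len(states) > 0 and all(s == "DEAD" for s in states)
```

In this sketch a single silent path changes only that path's state. The server as a whole is treated as unreachable only when `all_paths_dead` returns `True`, which is why a second, physically independent path matters.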
The following figure illustrates the recovery tasks when the LCM heartbeat mechanism detects a server failure.
The following steps describe the recovery scenario, illustrated above, if LifeKeeper marks all communications connections to a server DEAD.
- LCM activates eventslcm. When LifeKeeper marks all communications paths DEAD, the LCM initiates the eventslcm process.
Only one activity stops the eventslcm process:
- Communication path alive. If one of the communications paths begins sending the heartbeat signal again, the LCM stops the eventslcm process.
It is critical that you configure two or more physically independent, redundant communications paths between each pair of servers to prevent failovers and possible system panics due to communication failures.
- Message to sendevent. eventslcm sends the system failure alarm by calling sendevent with the event type machfail.
- sendevent initiates failover recovery. The sendevent program determines that LifeKeeper can handle the system failure event and executes the LifeKeeper failover recovery process lcdmachfail.
- lcdmachfail checks. The lcdmachfail process first checks to ensure that the non-responding server was not shut down. Failovers are inhibited if the other system was shut down gracefully before system failure. Then lcdmachfail determines all resources that have a shared equivalency with the failed system. This is the commit point for the recovery.
- lcdmachfail restores resources. lcdmachfail determines all resources on the backup server that have shared equivalencies with the failed primary server. It also determines whether the backup server is the highest priority alive server for which a given resource is configured. All backup servers perform this check, so that only one server will attempt to recover a given hierarchy. For each equivalent resource that passes this check, lcdmachfail invokes the associated restore program. Then, lcdmachfail also restores each resource dependent on a restored resource, until it brings the entire hierarchy into service on the backup server.
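The checks and restore order in the last two steps can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the lcdmachfail code: `plan_failover`, `highest_priority_alive`, and the `equivalencies` and `dependents` dictionaries are hypothetical, and the sketch assumes that a lower number means a higher priority.

```python
from collections import deque


def highest_priority_alive(priorities, alive):
    """Return the alive server with the best priority, or None.

    Assumption: a lower number means a higher priority.
    """
    candidates = {s: p for s, p in priorities.items() if s in alive}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def plan_failover(me, failed, alive, equivalencies, dependents,
                  graceful_shutdown=False):
    """Return the ordered list of resources this server would restore.

    equivalencies: resource -> {server: priority} (hypothetical layout)
    dependents:    resource -> list of resources that depend on it
    """
    if graceful_shutdown:
        return []  # the failed server was shut down; failover is inhibited

    alive = set(alive) - {failed}
    queue = deque()
    for resource, priorities in equivalencies.items():
        shared_with_failed = failed in priorities and me in priorities
        # Every backup runs this check, so only one server claims a resource.
        if shared_with_failed and highest_priority_alive(priorities, alive) == me:
            queue.append(resource)

    order, seen = [], set()
    while queue:
        resource = queue.popleft()
        if resource in seen:
            continue
        seen.add(resource)
        order.append(resource)  # stands in for invoking the restore program
        queue.extend(dependents.get(resource, []))  # then restore its dependents
    return order
```

For example, with `equivalencies = {"disk": {"A": 1, "B": 2, "C": 3}}` and `dependents = {"disk": ["fs"], "fs": ["app"]}`, a failure of server A leads server B to plan `disk`, `fs`, and `app` in that order, while server C plans nothing because B outranks it.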