When a failure occurs, SPS has two methods of recovery: local recovery and inter-server recovery. If local recovery fails, a “failover” is performed. A failover is the automatic switch to a backup server upon the failure or abnormal termination of the previously active application, server, system, hardware component or network. Failover and switchover are essentially the same operation, except that failover is automatic and usually happens without warning, while switchover requires human intervention. An automatic failover can occur for a number of reasons. The most common causes of an SPS-initiated failover are listed below.
Server Level Causes
SPS has a built-in heartbeat signal that periodically notifies each server in the configuration that its paired server is operating. A failure is detected if a server fails to receive the heartbeat message.
- Primary server loses power or is turned off.
- High CPU usage caused by excessive load – Under very heavy I/O loads, delays and low-memory conditions can cause the system to become unresponsive, so that SPS may detect the server as down and initiate a failover.
- Quorum/Witness – As part of the I/O fencing mechanism of quorum/witness, when a primary server loses quorum, a “fastboot”, “fastkill” or “osu” is performed (based on settings) and a failover is initiated. When determining whether to fail over, the witness server allows resources to be brought in service on a backup server only when it verifies that the primary server has failed and is no longer part of the cluster. This prevents failovers caused by simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node.
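The witness check described above can be illustrated with a minimal sketch. This is a simplified model for illustration only, not SPS code: the function name `may_fail_over` and its two boolean inputs are hypothetical, and the real quorum/witness logic depends on the configured quorum mode and settings.

```python
def may_fail_over(backup_sees_primary: bool, witness_sees_primary: bool) -> bool:
    """Hypothetical model of the witness check (not the SPS implementation).

    The backup may bring resources in service only when it has lost contact
    with the primary AND the witness confirms the primary is gone.
    """
    return (not backup_sees_primary) and (not witness_sees_primary)


# A broken link between backup and primary alone does not cause a failover,
# because the witness can still see the primary.
link_failure_only = may_fail_over(backup_sees_primary=False, witness_sees_primary=True)

# The primary is unreachable from both the backup and the witness.
primary_down = may_fail_over(backup_sees_primary=False, witness_sees_primary=False)
```

In this model, `link_failure_only` is `False` and `primary_down` is `True`, which matches the behavior described above: a communication failure between two nodes is not, by itself, enough to trigger a failover.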
|Supported Storage List|
|Server Failure Recovery Scenario|
|Tuning the LifeKeeper Heartbeat|
Communication Failures/Network Failures
SPS sends the heartbeat between servers every five seconds. If a communication problem causes the heartbeat to skip two beats but it resumes on the third heartbeat, SPS takes no action. However, if the communication path remains dead for three beats, SPS will label that communication path as dead but will initiate a failover only if the redundant communication path is also dead.
- Network connection to the primary server is lost.
- Network latency.
- Heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and LifeKeeper initialization problems.
- Using STONITH, when SPS detects a communication failure with a node, that node will be powered off and a failover will occur.
- Failed NIC.
- Failed network switch.
- Manually pulling/removing network connectivity.
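The heartbeat rule described at the start of this section (a heartbeat every five seconds, a path marked dead after three consecutive missed beats, and failover only when every path is dead) can be sketched as follows. This is an illustrative model under stated assumptions, not the SPS implementation: the class `CommPath`, the function `should_fail_over` and the constant names are hypothetical, and the behavior of a path that resumes after being marked dead is an assumption.

```python
HEARTBEAT_INTERVAL_S = 5  # from the text: one heartbeat every five seconds
MISSED_BEATS_LIMIT = 3    # from the text: three consecutive misses mark a path dead


class CommPath:
    """Hypothetical model of one communication path between two servers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.missed = 0  # consecutive missed heartbeats

    def record(self, beat_received: bool) -> None:
        # Assumption: any received heartbeat resets the miss counter.
        self.missed = 0 if beat_received else self.missed + 1

    @property
    def dead(self) -> bool:
        return self.missed >= MISSED_BEATS_LIMIT


def should_fail_over(paths: list) -> bool:
    """Fail over only if at least one path exists and every path is dead."""
    return bool(paths) and all(path.dead for path in paths)


primary_path = CommPath("primary")
redundant_path = CommPath("redundant")

# Two missed beats followed by a received beat: no action is taken.
for beat in (False, False, True):
    primary_path.record(beat)
recovered_in_time = not primary_path.dead

# Three consecutive misses: the path is dead, but the redundant path is alive.
for _ in range(MISSED_BEATS_LIMIT):
    primary_path.record(False)
one_path_dead = should_fail_over([primary_path, redundant_path])

# The redundant path also misses three beats: a failover is initiated.
for _ in range(MISSED_BEATS_LIMIT):
    redundant_path.record(False)
all_paths_dead = should_fail_over([primary_path, redundant_path])
```

With these values, a path is declared dead roughly `HEARTBEAT_INTERVAL_S * MISSED_BEATS_LIMIT` = 15 seconds after the last heartbeat was received. The sketch also shows why a redundant communication path matters: a single dead path does not trigger a failover while another path is still alive.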
|Creating a Communication Path|
|Tuning the LifeKeeper Heartbeat|
|Verifying Network Configuration|
|LifeKeeper Event Forwarding via SNMP|
|Running LifeKeeper With a Firewall|
If a single comm path is used and that comm path fails, SPS hierarchies may try to come into service on multiple systems simultaneously. This is known as a false failover or a “split-brain” scenario. In a split-brain scenario, each server believes it is in control of the application and thus may try to access and write data to the shared storage device. To resolve the split-brain scenario, SPS may cause servers to be powered off or rebooted, or may leave hierarchies out of service, to ensure data integrity on all shared data. Additionally, heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and the failure of LifeKeeper to initialize properly.
The following are scenarios that can cause split-brain:
- Any of the comm failures listed above
- Improper shutdown of LifeKeeper
- Server resource starvation
- Losing all network paths
- DNS or other network glitch
- System lockup/thaw
Resource Level Causes
SPS is designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications are, for example, hierarchies in which the primary application depends on lower-level storage or network resources. SPS monitors the status and health of these protected resources. If a resource is determined to be in a failed state, SPS attempts to restore the resource or application on the current system (the in-service node) without external intervention. If this local recovery fails, a resource failover is initiated.
- An application failure is detected, but the local recovery process fails.
- Remove Failure – During the resource failover process, certain resources need to be removed from service on the primary server and then brought into service on the selected backup server to provide full functionality of the critical applications. If this remove process fails, a reboot of the primary server will be performed resulting in a complete server failover.
Examples of remove failures:
- Unable to unmount file system
- Unable to shut down protected application (Oracle, MySQL, PostgreSQL, etc.)
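The escalation described in this section (local recovery first, then resource failover, then a server reboot if the remove step fails) can be summarized in a short sketch. It is a simplified illustration, not SPS code: `Outcome`, `handle_resource_failure` and the two callables are hypothetical names standing in for the recovery and remove scripts.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RECOVERED_LOCALLY = "resource restored on the in-service node"
    RESOURCE_FAILOVER = "resource moved to the backup server"
    SERVER_FAILOVER = "primary rebooted, complete server failover"


def handle_resource_failure(
    local_recovery: Callable[[], bool],
    remove_from_service: Callable[[], bool],
) -> Outcome:
    """Hypothetical model of the escalation path for a failed resource."""
    # Step 1: try to restore the resource on the current server.
    if local_recovery():
        return Outcome.RECOVERED_LOCALLY
    # Step 2: local recovery failed, so take the resource out of service
    # on the primary so that the backup can bring it in service.
    if remove_from_service():
        return Outcome.RESOURCE_FAILOVER
    # Step 3: the remove step failed (for example, the file system could
    # not be unmounted), so the primary is rebooted.
    return Outcome.SERVER_FAILOVER


# Example: local recovery fails and the file system cannot be unmounted.
result = handle_resource_failure(
    local_recovery=lambda: False,
    remove_from_service=lambda: False,
)
```

Here `result` is `Outcome.SERVER_FAILOVER`, which corresponds to the remove failure case above: a failed unmount or application shutdown escalates a resource failover into a complete server failover.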
|File System Health Monitoring|
|Resource Error Recovery Scenario|
- Disk Full – SPS’s File System Health Monitoring can detect disk-full conditions on a file system, which may result in failover of the file system resource.
- Unmounted or Improperly Mounted File System – A user manually unmounts, or changes the mount options of, an in-service file system protected by LifeKeeper.
- Remount Failure – The following are common causes of a remount failure that would lead to a failover:
  - corrupted file system (fsck failure)
  - failure to create mount point directory
  - mount point is busy
  - mount failure
  - SPS internal error
|File System Health Monitoring|
IP Address Failure
When a failure of an IP address is detected by the IP Recovery Kit, the resulting failure triggers the execution of the IP local recovery script. SPS first attempts to bring the IP address back in service on the current network interface. If the local recovery attempt fails, SPS will perform a failover of the IP address and all dependent resources to a backup server. During failover, the remove process will un-configure the IP address on the current server so that it can be configured on the backup server. Failure of this remove process will cause the system to reboot.
- IP conflict
- IP collision
- DNS resolution failure
- NIC or Switch Failures
|Creating Switchable IP Address|
|IP Local Recovery|
- A reservation to a protected device is lost or stolen
- Unable to regain reservation or control of a protected resource device (caused by manual user intervention, HBA or switch failure)
- Protected SCSI device could not be opened. The device may be failing or may have been removed from the system.