Common Causes of an SPS/LifeKeeper Initiated Failover

In the event of a failure, SPS has two methods of recovery: local recovery and inter-server recovery. If local recovery fails, a “failover” is implemented. A failover is defined as automatic switching to a backup server upon the failure or abnormal termination of the previously active application, server, system, hardware component or network. Failover and switchover are essentially the same operation, except that failover is automatic and usually operates without warning, while switchover requires human intervention. This automatic failover can occur for a number of reasons. Below is a list of the most common examples of an SPS initiated failover.

Server Level Causes

Server Failure

SPS has a built-in heartbeat signal that periodically notifies each server in the configuration that its paired server is operating. A failure is detected if a server fails to receive the heartbeat message.

Primary server loses power or is turned off.

CPU usage caused by excessive load – Under very heavy I/O loads, delays and low-memory conditions can make the system so unresponsive that SPS detects the server as down and initiates a failover.

Quorum/Witness – As part of the I/O fencing mechanism of quorum/witness, when a primary server loses quorum, a “fastboot”, “fastkill” or “osu” is performed (based on settings) and a failover is initiated. When determining when to fail over, the witness server allows resources to be brought in service on a backup server only in cases where it verifies the primary server has failed and is no longer part of the cluster. This will prevent failovers from happening due to simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node.
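The witness rule above can be read as a small decision function. The following Python sketch is illustrative only: the function name and its parameters are hypothetical and are not part of any SPS interface, and treating an unreachable witness as "no confirmation" is an assumption of the sketch.

```python
from typing import Optional


def backup_may_take_over(heartbeat_seen: bool,
                         witness_says_primary_failed: Optional[bool]) -> bool:
    """Decide whether a backup node may bring resources in service.

    witness_says_primary_failed is None when the witness cannot be reached
    (an assumption of this sketch), True when the witness also considers the
    primary failed, and False when the witness can still see the primary.
    """
    if heartbeat_seen:
        # The primary is still communicating with the backup: no failover.
        return False
    # A lost heartbeat alone is not enough; the witness must confirm it.
    return witness_says_primary_failed is True
```

With this rule, a broken link between the two nodes alone does not trigger a failover, because the witness still sees the primary and the function returns False.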

Relevant Topics
Supported Storage List
Server Failure Recovery Scenario
Tuning the LifeKeeper Heartbeat
Quorum/Witness

Communication Failures/Network Failures

SPS sends the heartbeat between servers every five seconds. If a communication problem causes the heartbeat to skip two beats but it resumes on the third heartbeat, SPS takes no action. However, if the communication path remains dead for three beats, SPS will label that communication path as dead but will initiate a failover only if the redundant communication path is also dead.
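The heartbeat rule above can be sketched in a few lines of Python. This is a simplified model, not SPS code: the CommPath class and should_fail_over function are hypothetical names, and the two constants simply restate the values given in the text.

```python
from dataclasses import dataclass
from typing import List

HEARTBEAT_INTERVAL_S = 5   # from the text: one heartbeat every five seconds
MISSED_BEATS_LIMIT = 3     # from the text: a path is dead after three missed beats


@dataclass
class CommPath:
    """Hypothetical model of one communication path (not an SPS type)."""
    name: str
    missed: int = 0

    def record_beat(self, received: bool) -> None:
        # A received heartbeat resets the counter; a missed one increments it.
        self.missed = 0 if received else self.missed + 1

    @property
    def dead(self) -> bool:
        return self.missed >= MISSED_BEATS_LIMIT


def should_fail_over(paths: List[CommPath]) -> bool:
    """Fail over only when every configured communication path is dead."""
    return bool(paths) and all(path.dead for path in paths)
```

In this model, two missed beats followed by a received one leave the path alive. Three consecutive misses mark the path dead, but should_fail_over stays False while a redundant path is still receiving heartbeats.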

Network connection to the primary server is lost.

Network latency.

Heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and LifeKeeper initialization problems.

When STONITH is configured and SPS detects a communication failure with a node, that node is powered off and a failover occurs.

Failed NIC.

Failed network switch.

Manually pulling/removing network connectivity.

Relevant Topics
Creating a Communication Path
Tuning the LifeKeeper Heartbeat
Network Configuration
Verifying Network Configuration
LifeKeeper Event Forwarding via SNMP
Network-Related Troubleshooting
Running LifeKeeper With a Firewall
STONITH

Split-Brain

If a single comm path is used and the comm path fails, then SPS hierarchies may try to come into service on multiple systems simultaneously. This is known as a false failover or a “split-brain” scenario. In the “split-brain” scenario, each server believes it is in control of the application and thus may try to access and write data to the shared storage device. To resolve the split-brain scenario, SPS may cause servers to be powered off or rebooted or leave hierarchies out-of-service to assure data integrity on all shared data. Additionally, heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and the failure of LifeKeeper to initialize properly.

The following are scenarios that can cause split-brain:

Any of the comm failures listed above

Improper shutdown of LifeKeeper

Server resource starvation

Losing all network paths

DNS or other network glitch

System lockup/thaw

Resource Level Causes

SPS is designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications are, for example, hierarchies where the primary application depends on lower-level storage or network resources. SPS monitors the status and health of these protected resources. If a resource is determined to be in a failed state, an attempt will be made to restore the resource or application on the current system (in-service node) without external intervention. If this local recovery fails, a resource failover will be initiated.
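The order of operations described above (local recovery first, failover only if it fails) can be sketched as follows. The function and the two callbacks are hypothetical stand-ins for whatever recovery and failover actions a resource defines. They are not SPS interfaces.

```python
from typing import Callable


def handle_failed_resource(local_recovery: Callable[[], bool],
                           fail_over: Callable[[], None]) -> str:
    """Try local recovery first; fail over only if it does not succeed."""
    if local_recovery():
        # The resource is back in service on the current node.
        return "recovered locally"
    # Local recovery failed: hand the resource to a backup server.
    fail_over()
    return "failed over"
```

The failover callback is never invoked when local recovery succeeds, which mirrors the behavior described in the text.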

Application Failure

An application failure is detected, but the local recovery process fails.

Remove Failure – During the resource failover process, certain resources need to be removed from service on the primary server and then brought into service on the selected backup server to provide full functionality of the critical applications. If this remove process fails, a reboot of the primary server will be performed resulting in a complete server failover.

Examples of remove failures:

Unable to unmount file system

Unable to shut down protected application (Oracle, MySQL, PostgreSQL, etc.)
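The remove-failure behavior described above can be sketched as a short Python function. All names here are hypothetical. Each remove step stands in for an action such as unmounting a file system or stopping an application, and returns True on success.

```python
from typing import Callable, Iterable


def remove_then_restore(remove_steps: Iterable[Callable[[], bool]],
                        restore_on_backup: Callable[[], None],
                        reboot_primary: Callable[[], None]) -> str:
    """Remove resources from the primary, then restore them on the backup."""
    for step in remove_steps:
        if not step():
            # A resource could not be taken out of service cleanly, so the
            # primary is rebooted and the whole server fails over.
            reboot_primary()
            return "server failover"
    # Every resource was removed cleanly: bring them up on the backup.
    restore_on_backup()
    return "resource failover"
```

Stopping at the first failed step reflects the text: once a remove fails, the primary is rebooted and the outcome is a complete server failover.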

Relevant Topics
File System Health Monitoring
Resource Error Recovery Scenario

File System

Disk Full – SPS’s File System Health Monitoring can detect disk-full conditions, which may result in failover of the file system resource.
Unmounted or Improperly Mounted File System – A user manually unmounts, or changes the mount options on, an in-service, LifeKeeper-protected file system.
Remount Failure – Common causes of a remount failure, which would lead to a failover:

corrupted file system (fsck failure)

failure to create mount point directory

mount point is busy

mount failure

SPS internal error
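The disk-full and unmounted conditions listed earlier in this section can be checked by hand with standard Python library calls. This sketch does not show how File System Health Monitoring works internally. The function names and the 95 percent default threshold are assumptions chosen for illustration.

```python
import os
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of the file system holding `path` that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_disk_full(path: str, threshold_pct: float = 95.0) -> bool:
    """True when usage reaches the threshold (95 is an assumed default)."""
    return percent_used(path) >= threshold_pct


def is_mounted(mount_point: str) -> bool:
    """True when `mount_point` is currently a mount point."""
    return os.path.ismount(mount_point)
```

A call such as is_disk_full("/data", 90.0) reports whether the file system holding the hypothetical path /data has reached 90 percent usage.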

Relevant Topics
File System Health Monitoring

IP Address Failure

When a failure of an IP address is detected by the IP Recovery Kit, the resulting failure triggers the execution of the IP local recovery script. SPS first attempts to bring the IP address back in service on the current network interface. If the local recovery attempt fails, SPS will perform a failover of the IP address and all dependent resources to a backup server. During failover, the remove process will un-configure the IP address on the current server so that it can be configured on the backup server. Failure of this remove process will cause the system to reboot.

IP conflict

IP collision

DNS resolution failure

NIC or Switch Failures

Relevant Topics
Creating Switchable IP Address
IP Local Recovery

Reservation Conflict

A reservation to a protected device is lost or stolen

Unable to regain reservation or control of a protected resource device (caused by manual user intervention, HBA or switch failure)

Relevant Topics
SCSI Reservations
Disabling Reservations

SCSI Device

Protected SCSI device could not be opened. The device may be failing or may have been removed from the system.
