
Common Causes of an SPS-Initiated Failover

In the event of a failure, SPS has two methods of recovery: local recovery and inter-server recovery. If local recovery fails, SPS initiates a failover. A failover is the automatic switch to a backup server upon the failure or abnormal termination of the previously active application, server, system, hardware component or network. Failover and switchover are essentially the same operation, except that failover is automatic and usually occurs without warning, while switchover requires human intervention. An automatic failover can occur for a number of reasons. The most common causes of an SPS-initiated failover are listed below.

Server Level Causes

Server Failure

SPS has a built-in heartbeat signal that periodically notifies each server in the configuration that its paired server is operating. A failure is detected if a server fails to receive the heartbeat message.

Relevant Topics
Supported Storage List
Server Failure Recovery Scenario
Tuning the LifeKeeper Heartbeat
Quorum/Witness

 

Communication Failures/Network Failures

SPS sends the heartbeat between servers every five seconds. If a communication problem causes the heartbeat to skip two beats but it resumes on the third heartbeat, SPS takes no action. However, if the communication path remains dead for three beats, SPS will label that communication path as dead but will initiate a failover only if the redundant communication path is also dead.
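
The decision rule described above can be sketched in a few lines of Python. This is an illustrative model only, not SPS code: the names CommPath, record and should_failover are hypothetical, and the constants simply restate the five-second interval and three-beat threshold given in the text.

```python
from dataclasses import dataclass
from typing import Sequence

HEARTBEAT_INTERVAL_S = 5  # from the text: one heartbeat every five seconds
MISSED_BEATS_LIMIT = 3    # from the text: three consecutive misses mark a path dead


@dataclass
class CommPath:
    """Hypothetical model of one communication path (not an SPS type)."""
    name: str
    missed: int = 0

    def record(self, received: bool) -> None:
        # A received heartbeat resets the counter; a missed one increments it.
        self.missed = 0 if received else self.missed + 1

    @property
    def dead(self) -> bool:
        # The path is labeled dead once the miss counter reaches the limit.
        return self.missed >= MISSED_BEATS_LIMIT


def should_failover(paths: Sequence[CommPath]) -> bool:
    # Failover is initiated only when every configured path is dead.
    return len(paths) > 0 and all(p.dead for p in paths)


if __name__ == "__main__":
    primary, backup = CommPath("primary"), CommPath("backup")
    for received in (False, False, True):   # two skipped beats, then recovery
        primary.record(received)
    print(primary.dead)                      # False: no action taken
    for _ in range(3):                       # three consecutive misses
        primary.record(False)
    print(primary.dead, should_failover([primary, backup]))  # True False
```

In this sketch, a path that skips two beats and then resumes is never marked dead, and a single dead path does not trigger a failover while the redundant path is still receiving heartbeats.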

Relevant Topics
Creating a Communication Path
Tuning the LifeKeeper Heartbeat
Network Configuration
Verifying Network Configuration
LifeKeeper Event Forwarding via SNMP
Network-Related Troubleshooting (GUI)
Running LifeKeeper With a Firewall
STONITH

 

Split-Brain

If a single comm path is used and that comm path fails, SPS hierarchies may try to come into service on multiple systems simultaneously. This is known as a false failover or a “split-brain” scenario. In the “split-brain” scenario, each server believes it is in control of the application and thus may try to access and write data to the shared storage device. To resolve the split-brain scenario, SPS may cause servers to be powered off or rebooted, or may leave hierarchies out of service, to ensure data integrity on all shared data. Additionally, heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and the failure of LifeKeeper to initialize properly.

The following are scenarios that can cause split-brain:

 

Resource Level Causes

SPS is designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications are, for example, hierarchies where the primary application depends on lower-level storage or network resources. SPS monitors the status and health of these protected resources. If a resource is determined to be in a failed state, SPS attempts to restore the resource or application on the current system (the in-service node) without external intervention. If this local recovery fails, a resource failover is initiated.

Application Failure

Examples of remove failures:

Relevant Topics
File System Health Monitoring
Resource Error Recovery Scenario

 

File System

Relevant Topics
File System Health Monitoring

 

IP Address Failure

When a failure of an IP address is detected by the IP Recovery Kit, the resulting failure triggers the execution of the IP local recovery script. SPS first attempts to bring the IP address back in service on the current network interface. If the local recovery attempt fails, SPS will perform a failover of the IP address and all dependent resources to a backup server. During failover, the remove process will un-configure the IP address on the current server so that it can be configured on the backup server. Failure of this remove process will cause the system to reboot.
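
The sequence above (local recovery first, then remove and failover, with a reboot if the remove fails) can be summarized as a small decision function. This is a hedged sketch of the control flow only: Outcome, handle_ip_failure and the two callables are hypothetical names introduced for illustration and do not correspond to the IP Recovery Kit scripts or any SPS interface.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RECOVERED_LOCALLY = "IP address restored on the current interface"
    FAILED_OVER = "IP address and dependents moved to the backup server"
    REBOOT = "remove failed; system reboots"


def handle_ip_failure(
    local_recovery: Callable[[], bool],
    remove: Callable[[], bool],
) -> Outcome:
    """Model the order of steps described in the text (illustrative only)."""
    # Step 1: try to bring the address back in service on the current interface.
    if local_recovery():
        return Outcome.RECOVERED_LOCALLY
    # Step 2: local recovery failed, so un-configure the address here
    # so that it can be configured on the backup server.
    if remove():
        return Outcome.FAILED_OVER
    # Step 3: a failed remove causes the system to reboot.
    return Outcome.REBOOT


if __name__ == "__main__":
    # Stand-in callables that simulate success or failure of each step.
    print(handle_ip_failure(lambda: True, lambda: True).name)
    print(handle_ip_failure(lambda: False, lambda: True).name)
    print(handle_ip_failure(lambda: False, lambda: False).name)
```

Note that the remove step is attempted only after local recovery has failed, which mirrors the order given in the paragraph above.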

Relevant Topics
Creating Switchable IP Address
IP Local Recovery

 

Reservation Conflict

Relevant Topics
SCSI Reservations
Disabling Reservations

 

SCSI Device


© 2018 SIOS Technology Corp., the industry's leading provider of business continuity solutions, data replication for continuous data protection.