In the event of a failure, SPS has two methods of recovery: local recovery and inter-server recovery. If local recovery fails, a "failover" is implemented. A failover is defined as automatic switching to a backup server upon the failure or abnormal termination of the previously active application, server, system, hardware component or network. Failover and switchover are essentially the same operation, except that failover is automatic and usually operates without warning, while switchover requires human intervention. This automatic failover can occur for a number of reasons. Below is a list of the most common examples of an SPS initiated failover.
SPS has a built-in heartbeat signal that periodically notifies each server in the configuration that its paired server is operating. A failure is detected if a server fails to receive the heartbeat message.
Relevant Topics |
---|
Supported Storage List |
Server Failure Recovery Scenario |
Tuning the LifeKeeper Heartbeat |
Quorum/Witness |
SPS sends the heartbeat between servers every five seconds. If a communication problem causes the heartbeat to skip two beats but it resumes on the third heartbeat, SPS takes no action. However, if the communication path remains dead for three beats, SPS will label that communication path as dead but will initiate a failover only if the redundant communication path is also dead.
If a single comm path is used and the comm path fails, then SPS hierarchies may try to come into service on multiple systems simultaneously. This is known as a false failover or a “split-brain” scenario. In the “split-brain” scenario, each server believes it is in control of the application and thus may try to access and write data to the shared storage device. To resolve the split-brain scenario, SPS may cause servers to be powered off or rebooted or leave hierarchies out-of-service to assure data integrity on all shared data. Additionally, heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and the failure of LifeKeeper to initialize properly.
The following are scenarios that can cause split-brain:
SPS is designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications, by example, are hierarchies where the primary application depends on lower-level storage or network resources. SPS monitors the status and health of these protected resources. If the resource is determined to be in a failed state, an attempt will be made to restore the resource or application on the current system (in-service node) without external intervention. If this local recovery fails, a resource failover will be initiated.
Examples of remove failures:
- Unable to unmount file system
- Unable to shut down protected application (oracle, mysql, postgres, etc)
Relevant Topics |
---|
File System Health Monitoring |
Resource Error Recovery Scenario |
- corrupted file system (fsck failure)
- failure to create mount point directory
- mount point is busy
- mount failure
- SPS internal error
Relevant Topics |
---|
File System Health Monitoring |
When a failure of an IP address is detected by the IP Recovery Kit, the resulting failure triggers the execution of the IP local recovery script. SPS first attempts to bring the IP address back in service on the current network interface. If the local recovery attempt fails, SPS will perform a failover of the IP address and all dependent resources to a backup server. During failover, the remove process will un-configure the IP address on the current server so that it can be configured on the backup server. Failure of this remove process will cause the system to reboot.
Relevant Topics |
---|
Creating Switchable IP Address |
IP Local Recovery |
Relevant Topics |
---|
SCSI Reservations |
Disabling Reservations |
© 2016 SIOS Technology Corp., the industry's leading provider of business continuity solutions, data replication for continuous data protection.