LifeKeeper provides a real-time daemon monitor, lkcheck, to check the status and health of LifeKeeper-protected resources. For each in-service resource, lkcheck periodically calls the quickCheck script for that resource type. The quickCheck script performs a quick health check of the resource, and if the resource is determined to be in a failed state, the quickCheck script calls the event notification mechanism, sendevent.
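The monitoring cycle described above can be sketched as a small Python model. This is only an illustration, not LifeKeeper code: `Resource`, `quick_check`, `run_check_cycle`, and the `probe` and `send_event` callbacks are hypothetical stand-ins for the lkcheck daemon, the per-type quickCheck scripts, and sendevent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Resource:
    tag: str          # hypothetical resource name
    rtype: str        # resource type, e.g. "filesys"
    in_service: bool  # only in-service resources are checked


def quick_check(res: Resource,
                probe: Callable[[Resource], bool],
                send_event: Callable[[str, str], None]) -> bool:
    """Stand-in for a quickCheck script: probe the resource and
    raise an event through send_event if it looks failed."""
    healthy = probe(res)
    if not healthy:
        send_event(res.rtype, res.tag)
    return healthy


def run_check_cycle(resources: List[Resource],
                    probe: Callable[[Resource], bool],
                    send_event: Callable[[str, str], None]) -> List[str]:
    """One lkcheck-style pass; returns the tags of resources found failed."""
    failed = []
    for res in resources:
        if not res.in_service:
            continue  # out-of-service resources are skipped
        if not quick_check(res, probe, send_event):
            failed.append(res.tag)
    return failed
```

In this model a real deployment would repeat `run_check_cycle` on a timer; the sketch leaves scheduling out so that a single pass is easy to follow.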
The following figure illustrates the recovery process tasks when lkcheck initiates the process:
- lkcheck runs. By default, the lkcheck process runs once every two minutes. When lkcheck runs, it invokes the appropriate quickCheck script for each in-service resource on the system.
- quickCheck script checks resource. The nature of the checks performed by the quickCheck script is unique to each resource type. Typically, the script simply verifies that the resource is available to perform its intended task by imitating a client of the resource and verifying that it receives the expected response.
- quickCheck script invokes sendevent. If the quickCheck script determines that the resource is in a failed state, it initiates an event of the appropriate class and type by calling sendevent.
- Recovery instruction search. The system event notification mechanism, sendevent, first attempts to determine whether the LCD contains a resource and/or recovery instructions for the event type or component. To make this determination, the is_recoverable process scans the resource hierarchy in LCD for a resource instance that corresponds to the event (in this example, the filesys name).
The action in the next step depends upon whether the scan finds resource-level recovery instructions:
- Not found. If resource recovery instructions are not found, is_recoverable returns to sendevent and sendevent continues with basic event notification.
- Found. If the scan finds the resource, is_recoverable forks the recover process into the background. The is_recoverable process returns and sendevent continues with basic event notification, passing an advisory flag “-A” to the basic alarming event response scripts, indicating that LifeKeeper is performing recovery.
- Recover process initiated. Assuming that recovery continues, is_recoverable initiates the recover process, which first attempts local recovery.
- Local recovery attempt. If the instance was found, the recover process attempts local recovery by accessing the resource hierarchy in LCD to search the hierarchy tree for a resource that knows how to respond to the event. For each resource type, it looks for a recovery subdirectory containing a subdirectory named for the event class, which in turn contains a recovery script for the event type.
The recover process runs the recovery script associated with the resource that is farthest above the failing resource in the resource hierarchy. If the recovery script succeeds, recovery halts. If the script fails, recover runs the script associated with the next resource, continuing until a recovery script succeeds or until recover attempts the recovery script associated with the failed instance.
If local recovery succeeds, the recovery process halts.
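The order in which recovery scripts are tried during local recovery can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: each resource has at most one parent (real hierarchies may branch), the hierarchy has no cycles, and `parent_of`, `recovery_scripts`, and `local_recovery` are hypothetical names rather than LifeKeeper interfaces.

```python
from typing import Callable, Dict, List, Tuple


def local_recovery(failed: str,
                   parent_of: Dict[str, str],
                   recovery_scripts: Dict[str, Callable[[], bool]]
                   ) -> Tuple[bool, List[str]]:
    """Try recovery scripts from the topmost ancestor down to the failed
    instance; stop at the first script that succeeds."""
    # Build the chain: failed resource first, topmost ancestor last.
    chain = [failed]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])

    attempted = []
    for tag in reversed(chain):       # farthest above the failure first
        script = recovery_scripts.get(tag)
        if script is None:
            continue                  # this resource has no script for the event
        attempted.append(tag)
        if script():
            return True, attempted    # recovery succeeded, so recovery halts
    return False, attempted           # every script failed: escalate
```

A `False` result corresponds to the point at which the event would escalate to inter-server recovery.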
- Inter-server recovery begins. If local recovery fails, the event then escalates to inter-server recovery.
- Recovery continues. Because local recovery failed, the recover process marks the failed instance as Out-of-Service-FAILED (OSF) and marks all resources that depend upon the failed resource as Out-of-Service-UNIMPAIRED (OSU). The recover process then determines whether the failing resource, or a resource that depends upon it, has any shared equivalencies with a resource on any other system, and selects the equivalent resource on the highest-priority server that is alive. Only one equivalent resource can be active at a time.
If no equivalency exists, the recover process halts.
If a shared equivalency is found and selected, LifeKeeper initiates inter-server recovery. The recover process sends a message through the LCM to the LCD process on the selected backup system containing the shared equivalent resource.
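The state marking and backup selection can be sketched in Python as follows. The data shapes are assumptions made for illustration: `equivalencies` maps a resource tag to a list of `(server, priority)` pairs, and a lower priority number is taken to mean a higher priority. The function name `plan_inter_server_recovery` is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def plan_inter_server_recovery(
        failed: str,
        dependents: List[str],
        equivalencies: Dict[str, List[Tuple[str, int]]],
        alive_servers: Set[str],
) -> Tuple[Dict[str, str], Optional[Tuple[str, str]]]:
    """Mark resource states after a failed local recovery and pick the
    backup server, if any, that should take over."""
    # The failed instance goes to OSF; everything depending on it goes to OSU.
    states = {failed: "OSF"}
    states.update({tag: "OSU" for tag in dependents})

    # Collect shared equivalencies of the failed resource or its dependents,
    # keeping only those on servers that are currently alive.
    candidates = [
        (priority, server, tag)
        for tag in [failed, *dependents]
        for server, priority in equivalencies.get(tag, [])
        if server in alive_servers
    ]
    if not candidates:
        return states, None           # no equivalency: recovery halts

    # Assumption: the smallest number is the highest priority.
    _, server, tag = min(candidates)
    return states, (server, tag)
```

Returning a single `(server, tag)` pair reflects the rule that only one equivalent resource can be active at a time.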
- lcdrecover process coordinates transfer. The LCD process on the backup server forks the process lcdrecover to coordinate the transfer of the equivalent resource.
- Activation on backup server. The lcdrecover process finds the equivalent resource and determines whether it depends upon any resources that are not in-service. lcdrecover runs the restore script (part of the resource recovery action scripts) for each required resource, placing the resources in-service.
The act of restoring a resource on a backup server may result in the need for more shared resources to be transferred from the primary system. Messages pass to and from the primary system, indicating resources that need to be removed from service on the primary server and then brought into service on the selected backup server to provide full functionality of the critical applications. This activity continues until no new shared resources are needed and all necessary resource instances on the backup are restored.
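The activation step on the backup server amounts to restoring a resource only after everything it depends upon is in service. The Python sketch below models that ordering alone. It does not model the messages exchanged with the primary server, and `depends_on`, `restore`, and `bring_in_service` are hypothetical names, not part of lcdrecover.

```python
from typing import Callable, Dict, List, Set


def bring_in_service(tag: str,
                     depends_on: Dict[str, List[str]],
                     in_service: Set[str],
                     restore: Callable[[str], None]) -> List[str]:
    """Restore `tag` and any resources it requires that are not yet
    in service, dependencies first. Returns the restore order."""
    order: List[str] = []

    def visit(current: str) -> None:
        if current in in_service:
            return                    # already in service, nothing to do
        for child in depends_on.get(current, []):
            visit(child)              # required resources come up first
        restore(current)              # stand-in for running the restore script
        in_service.add(current)
        order.append(current)

    visit(tag)
    return order
```

The sketch assumes the dependency graph has no cycles; it stops once every required resource has been restored, which mirrors the condition under which the transfer activity ends.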