The following should be considered before operating the LifeKeeper for Linux NAS Recovery Kit:
- Install the NAS Recovery Kit on the server(s) in your cluster configuration where you wish to mount your exported file systems and where you will extend your NAS resource hierarchy. You can export your file system either from an NFS server, which may be protected by LifeKeeper (this is the recommended configuration), or from a Network Attached Storage device.
- To ensure proper execution of this kit, it is highly recommended that you mount your exported NFS file system using the server’s IP address in place of the server name and that you perform your mount operation before you place your file system under LifeKeeper protection. Additionally, if you are mounting a file system that is currently protected by the LifeKeeper for Linux NFS Server Recovery Kit, we strongly suggest that the IP address used to create the NFS Server hierarchy be used to mount the file system on the LifeKeeper NAS server.
- To eliminate the possibility of split-brain problems (i.e., more than one node in the cluster has the hierarchy In Service Protected (ISP)), we highly recommend that you establish one of the communication paths between nodes in the cluster on the same network used to access the exported file system. Otherwise, a communication path failure can result in multiple nodes bringing the hierarchy ISP (split-brain). To recover from a split-brain scenario, take all but one of the ISP hierarchies out of service. This ensures that only one node has access to the exported file system.
- The built-in file system recovery kit used to build NAS hierarchies cannot detect and remove processes that are not protected by LifeKeeper but are using the mounted file system when a failover occurs. Therefore, it is highly recommended that only LifeKeeper-protected processes use the NAS-protected file system.
- The LKNFSTIMEOUT tunable is the timeout, in seconds, that the NAS Recovery Kit uses when attempting to determine the status of an NFS-mounted file system. The default value for this tunable is 2 minutes (120 seconds). The LKNFSSYSCALLTO tunable is the timeout, in seconds, that the NAS Recovery Kit uses for alarms that interrupt system calls when attempting to determine the status of a mount point. Choose a value for LKNFSSYSCALLTO that satisfies the following relationship:
(3 × LKNFSSYSCALLTO) + 5 < LKNFSTIMEOUT
- The LKNASERROR tunable controls the action the NAS Recovery Kit takes when access to the NAS device fails. The tunable accepts two values, switch and halt, with switch being the default. If the value is set to switch and access fails, the NAS Recovery Kit initiates a transfer of the resource hierarchy to a backup server when the failure is detected. This transfer can hang if any of the resources above the NAS resource in the hierarchy attempt to access anything on the NAS file system. To avoid this problem, set the tunable to halt, which immediately halts the system when an access failure is detected. This action forces a failover of all resource hierarchies to the backup server.
- STONITH devices or the Quorum/Witness package should be used so that a situation that looks like a machine failure (all comm paths are down) does not result in a split-brain condition in which the NAS resources are in service on every node in the cluster. This condition can lead to data corruption. More details on the Quorum/Witness package can be found in the LifeKeeper for Linux Technical Documentation.
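
The tunable constraints described above can be checked mechanically before a configuration change is applied. The following is a minimal Python sketch, not part of the NAS Recovery Kit: the helper names `parse_tunables` and `check_nas_tunables` are hypothetical, and it assumes the tunables are written as simple KEY=value lines. It reads the timeout rule as (3 × LKNFSSYSCALLTO) + 5 < LKNFSTIMEOUT and uses the defaults stated above (120 seconds and switch) when a tunable is absent.

```python
def parse_tunables(text):
    """Parse KEY=value lines, skipping blank lines and '#' comments.

    Assumption: tunables are stored one per line as KEY=value.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def check_nas_tunables(settings):
    """Return a list of problems found; an empty list means none."""
    problems = []
    # Defaults taken from the text: 2 minutes (120 s) and "switch".
    timeout = int(settings.get("LKNFSTIMEOUT", 120))
    error_action = settings.get("LKNASERROR", "switch")

    # Only check the timeout rule when LKNFSSYSCALLTO is given explicitly.
    if "LKNFSSYSCALLTO" in settings:
        syscall_to = int(settings["LKNFSSYSCALLTO"])
        bound = 3 * syscall_to + 5
        if not bound < timeout:
            problems.append(
                f"3 * LKNFSSYSCALLTO + 5 = {bound} must be less than "
                f"LKNFSTIMEOUT = {timeout}"
            )

    if error_action not in ("switch", "halt"):
        problems.append(
            f"LKNASERROR must be 'switch' or 'halt', got {error_action!r}"
        )
    return problems


# Hypothetical excerpt of a tunables file, for illustration only.
sample = """
# NAS Recovery Kit tunables
LKNFSTIMEOUT=120
LKNFSSYSCALLTO=40
LKNASERROR=halt
"""

for problem in check_nas_tunables(parse_tunables(sample)):
    print(problem)
```

With the sample values, the check reports the timeout relationship, because 3 × 40 + 5 = 125 is not less than 120. Lowering LKNFSSYSCALLTO to 38 (which gives 119) or raising LKNFSTIMEOUT would satisfy it.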