| (Added February 2022)
If you set your shutdown strategy to “Do not Switchover Resources” (default), do not start LifeKeeper immediately after stopping it. If the time between stopping and starting LifeKeeper is too short, a split brain may occur. This is especially important for Quorum configurations in storage mode.
| If quorum is configured, a split brain may occur if a resource fails while a restored comm path is being processed
If a cluster is configured with quorum and both quickCheck and local recovery fail while a restored communication path is being processed, a race condition between the hierarchy failover and quorum processing can occur, resulting in a split brain between the nodes.
| If quorum is configured and the active node fails or is rebooted with the Shutdown Strategy set to Switchover Resources, resources do switch over to the secondary node. However, when the original active node comes back up after the reboot, an attempt is made to switch the hierarchy back to the original active node, leaving the hierarchy in a failed / out of service state.
With large hierarchies, a race condition occurs during the LifeKeeper restart on the original active node that results in an attempt to restore the hierarchy on that node. This removes the hierarchy from the new active node and leaves the hierarchy out of service on every node. Please contact Support for a patch for this issue.
| If there is a problem with a network connection, stop the service that automatically configures the network
In an environment where IP addresses are protected using LifeKeeper, IP resources may conflict with daemons and services that automatically configure the network, such as avahi-daemon. If there is a problem when restoring communication paths or starting IP resources, stop the services that automatically configure the network.
| Do not disconnect the network using the ifconfig down or the ip link down command
When a network interface is disconnected using the ifconfig down or ip link down command, a communication path may not be restored after reconnecting, if a virtual IP resource is configured on the interface.
| LifeKeeper does not start with systemd target set to multi-user
In order for LifeKeeper to function properly, when running systemctl set-default or systemctl isolate, you must use the lifekeeper-graphical.target (for graphical mode) or lifekeeper-multi-user.target (for console mode). Do not use the normal graphical.target and multi-user.target systemd targets.
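The target substitution described above can be sketched in Python. This is only an illustration: the helper names `lifekeeper_target` and `set_default_command` are hypothetical, and the code builds the `systemctl set-default` argument list without running it.

```python
# Standard systemd targets mapped to the LifeKeeper targets named in this entry.
LIFEKEEPER_TARGETS = {
    "graphical.target": "lifekeeper-graphical.target",
    "multi-user.target": "lifekeeper-multi-user.target",
}


def lifekeeper_target(target):
    """Return the LifeKeeper target to use in place of a standard target."""
    if target in LIFEKEEPER_TARGETS.values():
        return target  # already a LifeKeeper target
    # Raises KeyError for targets that have no documented LifeKeeper equivalent.
    return LIFEKEEPER_TARGETS[target]


def set_default_command(target):
    """Build (but do not run) the systemctl command for the given target."""
    return ["systemctl", "set-default", lifekeeper_target(target)]


print(" ".join(set_default_command("multi-user.target")))
```

The same mapping applies when building a `systemctl isolate` command.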
| DataKeeper Disk UUID Restriction
Starting in version 9.5.0, DataKeeper can no longer mirror disks that do not present a UUID to the operating system. The best way to mirror such a disk is to partition it with a GPT (GUID Partition Table). The “parted” tool can be used for this purpose. Caution: partitioning a disk will destroy any data that is already stored on the disk.
Workaround: See DataKeeper for Linux Troubleshooting
| On SLES 15, LifeKeeper logging may not appear in the LifeKeeper log file following a log rotation
If logrotate is run on the command line or if a background log rotation occurs due to the size of the log, LifeKeeper will stop logging.
Workaround: Run systemctl reload rsyslog to resume LifeKeeper logging.
| When running lkbackup, an error may appear in the LifeKeeper log
lkbackup[30809: ERROR:lkbackup:::010064:Possible Configuration error: More than one LifeKeeper version is installed on this host
This error message can safely be ignored.
| File system labels should not be used in large configurations
The use of file system labels can cause performance problems during boot-up with large clusters. The problems generally arise because using labels requires that all devices connected to a system be scanned. For systems connected to a SAN, especially those with LifeKeeper where accessing a device is blocked, this scanning can be very slow.
To avoid this performance problem on Red Hat systems, edit /etc/fstab and replace the labels with the path names.
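As a rough sketch of that edit, the Python function below rewrites `LABEL=` entries in fstab-formatted text. The function name `replace_labels` and the label-to-device mapping are assumptions made for illustration; the mapping has to be supplied by the administrator, and the sample device names are invented.

```python
def replace_labels(fstab_text, label_to_device):
    """Replace LABEL=<name> in the first fstab field with a device path.

    label_to_device is a caller-supplied dict (hypothetical example data);
    entries whose label is not in the dict are left unchanged.
    """
    rewritten = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("LABEL="):
            label = fields[0][len("LABEL="):]
            device = label_to_device.get(label)
            if device is not None:
                # Only the first field is replaced; spacing is preserved.
                line = line.replace(fields[0], device, 1)
        rewritten.append(line)
    result = "\n".join(rewritten)
    if fstab_text.endswith("\n"):
        result += "\n"
    return result


sample = "LABEL=/data  /data  ext3  defaults  1 2\n"
print(replace_labels(sample, {"/data": "/dev/sdb1"}), end="")
```

Review the rewritten text before copying it over /etc/fstab, and keep a backup of the original file.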
| lkscsid will halt the system when it should issue a sendevent when a disk fails in certain environments
When lkscsid detects a disk failure, it should, by default, issue a sendevent to LifeKeeper to recover from the failure. The sendevent will first try to recover the failure locally and if that fails, will try to recover the failure by switching the hierarchy with the disk to another server. On some versions of Linux (RHEL 5 and SLES11), lkscsid will not be able to issue the sendevent but instead will immediately halt the system. This only affects hierarchies using the SCSI device nodes such as /dev/sda in a shared storage configuration.
| DataKeeper Create Resource fails
When using DataKeeper in certain environments (e.g., virtualized environments with IDE disk emulation, or servers with HP CCISS storage), an error may occur when a mirror is created:
This is because LifeKeeper does not recognize the disk in question and cannot get a unique ID to associate with the device.
Workaround: Use a GUID Partition so that LifeKeeper can recognize the disk in question.
| Specifying hostnames for API access
The key name used to store LifeKeeper server credentials must match the hostname of the other LifeKeeper server exactly (as displayed by the hostname command on that server). If the hostname is an FQDN, then the credential key must also be the FQDN. If the hostname is a short name, then the key must also be the short name.
Workaround: Make sure that the hostname(s) stored by credstore match the hostname exactly.
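The exact-match rule can be illustrated with a small Python check. The function name `check_credential_key` and the sample hostnames are hypothetical; the code only compares strings and does not call credstore.

```python
def check_credential_key(key, hostname):
    """Compare a stored credential key with the hostname reported by the other server.

    Returns "ok" only on an exact match, as the restriction above requires.
    """
    if key == hostname:
        return "ok"
    # Same host part but different form: one is a short name, the other an FQDN.
    if key.split(".")[0] == hostname.split(".")[0]:
        return "short-name/FQDN mismatch"
    return "different host"


print(check_credential_key("node2", "node2.example.com"))
```

Here the key is a short name while the server reports an FQDN, so the check reports a mismatch even though both refer to the same machine.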
| Restore of an lkbackup after a resource has been created may leave broken equivalencies
The configuration files for created resources are saved during an lkbackup. If a resource is created for the first time after an lkbackup has been taken, that resource may not be properly accounted for when restoring from this previous backup.
Solution: Restore from lkbackup prior to adding a new resource for the first time. If a new resource has been added after an lkbackup, either delete the resource prior to performing the restore, or delete an instance of the resource hierarchy and then re-extend the hierarchy after the restore. Note: It is recommended that an lkbackup be run when a resource of a particular type is created for the first time.
| Resources removed in the wrong order during failover
In cases where a hierarchy shares a common resource instance with another root hierarchy, resources are sometimes removed in the wrong order during a cascading failover or resource failover.
Solution: Creating a common root will ensure that resource removals in the hierarchy occur from the top down.
Note: Using /bin/true for the restore and remove script would accomplish this.
| Delete of nested file system hierarchy generates “Object does not exist” message
Solution: This message can be disregarded as it does not create any issues.
| filesyshier returns the wrong tag on a nested mount create
When a database has nested file system resources, the file system kit will create the file system for both the parent and the nested child. However, filesyshier returns only the child tag. This causes the application to create a dependency on the child but not the parent.
Solution: When multiple file systems are nested within a single mount point, it may be necessary to manually create the additional dependencies to the parent application tag using dep_create or via the UI Create Dependency.
| DataKeeper: Nested file system create will fail with DataKeeper
When creating a DataKeeper mirror for replicating an existing file system, if a file system is nested within this structure, you must unmount it first before creating the File System resource.
Workaround: Manually unmount the nested file systems and remount / create each nested mount.
| Changing the mount point of the device protected by the File System resource may lead to data corruption
The mount point of the device protected by LifeKeeper via the File System resource (filesys) must not be changed. Doing so may lead to the device being mounted on multiple nodes, and if a switchover is then performed, this could lead to data corruption.
| XFS file system usage may cause quickCheck to fail
If the CHECK_FS_QUOTAS setting is enabled for LifeKeeper installed on Red Hat Enterprise Linux 7 / Oracle Linux 7 / CentOS 7, quickCheck fails when the uquota or gquota mount option is set on the XFS file system resource to be protected.
Solution: Use usrquota and grpquota instead of uquota and gquota as mount options for the XFS file system, or disable the CHECK_FS_QUOTAS setting.
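The option substitution in the solution above can be sketched as follows. The function name `normalize_quota_options` is hypothetical, and the code only rewrites a comma-separated option string; applying the result to /etc/fstab or to the resource is left to the administrator.

```python
# Option spellings to replace, taken from the solution above.
QUOTA_ALIASES = {"uquota": "usrquota", "gquota": "grpquota"}


def normalize_quota_options(options):
    """Rewrite uquota/gquota to usrquota/grpquota in a mount option string."""
    return ",".join(QUOTA_ALIASES.get(opt, opt) for opt in options.split(","))


print(normalize_quota_options("rw,uquota,gquota"))
```

Options other than uquota and gquota pass through unchanged.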
| Btrfs is not supported
Btrfs (or any other file system not supported by SPS for Linux) cannot be used for LifeKeeper files (/opt/LifeKeeper), bitmap files if they are not in /opt/LifeKeeper, lkbackup files, or any other LifeKeeper related files. In addition, LifeKeeper does not support protecting Btrfs (or any other file system not supported by SPS for Linux) within a resource hierarchy.
Solution: A simple workaround to avoid placing /opt/LifeKeeper on a Btrfs file system is to add a small disk to your instance, format that disk with ext4 or xfs, and mount this file system as /opt/LifeKeeper.
| SLES12 SP1 or later on AWS
The following restrictions apply with SLES12 SP1 or later on AWS:
• Cannot set static routing configuration
Solution: Update the routing information in the configuration file by modifying the “ROUTE” parameter in /etc/sysconfig/network/ifroute-ethX
• Hostname is changed even if the “Change Hostname via DHCP” setting is disabled.
| Shutdown Strategy set to “Switchover Resources” may fail when using Quorum/Witness Kit in Witness mode
Hierarchy switchover during LifeKeeper shutdown may fail to occur when using the Quorum/Witness Kit in Witness mode.
Workaround: Manually switchover resource hierarchies before shutdown.
| Edit /etc/services
If the following entry in /etc/services is deleted, LifeKeeper cannot start up.
Don’t delete this entry when editing the file.
| Any storage unit that returns a string including a space for the SCSI ID cannot be protected by LifeKeeper.
| Using bind mounts is not supported
Bind mounts (mount --bind) cannot be used for the file system protected by LifeKeeper.
| On SLES running on AWS or Azure, change the network interface configuration file in order to prevent a cloud network plug-in from removing the virtual IP address.
Click here for more details.
| log ID 4739 – Switchover request failed when the core hierarchy reservation lock for the switchover was reset by a recover event for a mirror resync.
The switchover completed successfully, but when the core attempted to clear the lock, it found the lock owned by another process. It then failed the switchover and marked the root resource as failed, even though the resource had been restored successfully.
Solution/Workaround: Raise the value of the RESRVRECTIMEOUT tunable. If RESRVRECTIMEOUT is raised above 600, then the value of RESRVTIMEOUT should also be raised to match it.
Note: The appropriate timeout for the RESRVRECTIMEOUT and RESRVTIMEOUT tunables will vary depending on the number of resources as well as the type of each resource in the hierarchy. The value must include the time to remove each resource on the current source node plus the time to restore each resource on the new source node. Resource restore and remove times can vary from cluster to cluster and over time, so to get an estimated value for these tunables, follow the steps below.
If the value x chosen for RESRVRECTIMEOUT is greater than 600, then set RESRVTIMEOUT to x as well. [If x > 600, then y = x, where y is the value of RESRVTIMEOUT]
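The arithmetic in the note above can be sketched in Python. The function names and the sample timings are hypothetical: the per-resource remove and restore times must be measured on the actual cluster, and the current RESRVTIMEOUT value is passed in by the caller rather than assumed.

```python
def estimate_timeout(remove_seconds, restore_seconds):
    """Sum the time to remove each resource on the current source node
    and the time to restore each resource on the new source node."""
    return sum(remove_seconds) + sum(restore_seconds)


def reservation_timeouts(resrvrectimeout, current_resrvtimeout):
    """Apply the rule above: if RESRVRECTIMEOUT exceeds 600,
    RESRVTIMEOUT is set to the same value; otherwise it is unchanged."""
    if resrvrectimeout > 600:
        resrvtimeout = resrvrectimeout
    else:
        resrvtimeout = current_resrvtimeout
    return {"RESRVRECTIMEOUT": resrvrectimeout, "RESRVTIMEOUT": resrvtimeout}


# Invented sample timings (seconds) for a three-resource hierarchy.
x = estimate_timeout([120, 90, 60], [200, 150, 130])
print(reservation_timeouts(x, current_resrvtimeout=600))
```

With these sample timings the estimate exceeds 600, so both tunables end up with the same value. In practice, add headroom to the measured times, since the note above points out that they change over time.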