Description |
DataKeeper synchronization fails with certain kernel versions.
The following logs are repeatedly output to /var/log/messages:
Apr 18 13:05:59 node1 nbd-client: Begin Negotiation
Apr 18 13:05:59 node1 nbd-client: size = 53684994048
Apr 18 13:05:59 node1 nbd-client: Negotiation Complete
Apr 18 13:05:59 node1 nbd-client: Ioctl/2 failed: Device or resource busy
Apr 18 13:05:59 node1 kernel: block nbd1: Device being setup by another task
If you are using RHEL 8.0 to 8.2 or Oracle Linux UEK R5 (kernel-4.14.35-*), update to RHEL 8.3 or later, or to UEK R6 or later, respectively. |
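To confirm whether a node is hitting this issue, the log signature above can be counted with a small helper. This is a minimal sketch: the function name `count_nbd_setup_failures` is hypothetical, and it assumes the messages appear in the log exactly as in the excerpt above.

```shell
#!/bin/sh
# Count log lines that match the nbd setup failure shown above.
# Usage: count_nbd_setup_failures [logfile]  (defaults to /var/log/messages)
# The function name is illustrative, not part of LifeKeeper.
count_nbd_setup_failures() {
    logfile="${1:-/var/log/messages}"
    # -c prints the number of lines matching either pattern.
    grep -c -e 'nbd-client: Ioctl/2 failed: Device or resource busy' \
            -e 'Device being setup by another task' "$logfile"
}
```

A count that keeps growing between runs suggests the negotiation is being retried repeatedly, as described above.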
Extending to the third node fails when a resync is in progress
If you attempt to extend a DataKeeper (DK) mirror resource to a third node while a resync is in progress, the extend fails because the mirror cannot be “grown” during a resync.
The result is an error similar to the following:
removing hierarchy remnants
getId: /opt/LifeKeeper/lkadm/subsys/scsi/CPQARRAY/bin/getId -i “/dev/sdb1” returned “”
getId: /opt/LifeKeeper/lkadm/subsys/scsi/device/bin/getId -i “/dev/sdb1” returned “360022480e2671ceb01246bb1c5d67ebd-1”
mdadm: /dev/md0 is performing resync/recovery and cannot be reshaped
Failed to grow array (1) |
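Before extending, it may help to check whether the md device is still resyncing. The sketch below parses /proc/mdstat for a resync or recovery status line under the given device. The function name `md_resync_in_progress` is hypothetical, and the parsing assumes the usual /proc/mdstat layout, in which the status line follows the device's header line.

```shell
#!/bin/sh
# Return 0 if the given md device shows a resync/recovery line in mdstat.
# Usage: md_resync_in_progress md0 [mdstat-file]  (defaults to /proc/mdstat)
# The function name is illustrative, not part of LifeKeeper or mdadm.
md_resync_in_progress() {
    dev="$1"
    mdstat="${2:-/proc/mdstat}"
    awk -v dev="$dev" '
        /^md[0-9]+ :/ { cur = $1 }                        # start of a device stanza
        cur == dev && /(resync|recovery) *=/ { found = 1 } # progress or pending line
        END { exit (found ? 0 : 1) }
    ' "$mdstat"
}
```

For example, `md_resync_in_progress md0 && echo "wait for the resync to finish before extending"`.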
When using DataKeeper on Oracle Linux 8.5 (RHCK), the HADR package must be replaced when installing LifeKeeper.
If you are using DataKeeper for Linux v9.6.1, perform the following steps on all nodes in the cluster. These steps are not needed if using UEK (Unbreakable Enterprise Kernel).
Uninstall the existing HADR package.
# rpm -e HADR-RHAS-4.18.0-all-9.4.0-6882.x86_64
Install the HADR package used for Oracle Linux 8.5. The following is an example where the LifeKeeper installation image (sps.img) is mounted on /mnt.
# rpm -i /mnt/RHAS/HADR-RHAS-4.18.0-348.el8.x86_64-9.6.1-7412.x86_64.rpm
Reload the nbd module.
# modprobe -r nbd; modprobe nbd
Now a DataKeeper resource can be created and utilized. |
A DataKeeper resource configuration where the resource is created with asynchronous mode and extended with synchronous mode is not supported.
In the DataKeeper resource configuration where the resource is created with asynchronous mode and extended with synchronous mode, the read/write process for mirrors may hang within the kernel.
Run the following command on each node to determine whether the DataKeeper resource is synchronous or asynchronous. A value of 0 indicates synchronous mode; a nonzero value indicates asynchronous mode. A resource is acceptable if it is synchronous on all nodes or asynchronous on all nodes. To avoid this issue, do not mix synchronous and asynchronous modes.
perl -nle 'my @x = split(/\x01/, $_); print "$x[0]:$x[3]";' /opt/LifeKeeper/subsys/scsi/resources/netraid/mirrorinfo_<md num>
Solution: There is currently no workaround for an existing mixed-mode resource. Recreate the DataKeeper resource, selecting synchronous mode both when creating and when extending it. |
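Once the output of the perl command has been collected from every node, the comparison can be automated. This sketch assumes each input line has the `name:value` form printed by that command, with the mode value as the last colon-separated field. The function name `classify_mirror_modes` is hypothetical.

```shell
#!/bin/sh
# Read "name:mode" lines on stdin (one per node, as printed by the perl
# command above) and report whether the modes are consistent.
# The function name is illustrative, not part of LifeKeeper.
classify_mirror_modes() {
    awk -F: '
        NF == 0 { next }                                  # skip blank lines
        { if ($NF == 0) n_sync++; else n_async++ }        # 0 = synchronous
        END {
            if (n_sync && n_async) print "mixed"
            else if (n_async)      print "asynchronous"
            else if (n_sync)       print "synchronous"
            else                   print "none"
        }'
}
```

A result of `mixed` corresponds to the unsupported configuration described above.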
Partitions with an odd number of sectors are not supported when running kernel 4.12 or later
The use of a partition with an odd number of sectors is not supported in a DataKeeper mirror in environments running kernel 4.12 or later. This is due to an issue where a resync may fail when attempting to write past the end of the disk. |
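A partition can be checked before it is used in a mirror. The sketch below tests whether a sector count is odd. It assumes the count is expressed in 512-byte sectors, as reported by `blockdev --getsz`. The function name `has_odd_sector_count` and the device path in the usage note are illustrative only.

```shell
#!/bin/sh
# Return 0 if the given sector count is odd, 1 if it is even.
# The function name is illustrative, not part of LifeKeeper.
has_odd_sector_count() {
    sectors="$1"
    [ $((sectors % 2)) -eq 1 ]
}
```

For example, as root: `has_odd_sector_count "$(blockdev --getsz /dev/sdb1)" && echo "odd sector count"`.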
Important reminder about DataKeeper for Linux asynchronous mode in an LVM over DataKeeper configuration
Kernel panics may occur in configurations where LVM resources sit above multiple asynchronous mirrors. In these configurations, data consistency may be an issue if a panic occurs. Therefore, use either a single DataKeeper mirror or multiple synchronous DataKeeper mirrors. |
In symmetric active SDR configurations with significant I/O traffic on both servers, the filesystem mounted on the mirror stops responding and eventually the whole system hangs
Due to the single-threaded nature of the Linux buffer cache, the buffer cache flushing daemon can hang while trying to flush a buffer that needs to be committed remotely. While the flushing daemon is hung, all activity in the Linux system that involves dirty buffers stops if the number of dirty buffers exceeds the system limit (set in /proc/sys/vm/bdflush).
Usually this is not a serious problem unless something happens to prevent the remote system from clearing remote buffers (e.g. a network failure). LifeKeeper will detect a network failure and stop replication in that event, thus clearing a hang condition. However, if the remote system is also replicating to the local system (i.e. they are both symmetrically replicating to each other), they can deadlock forever if they both get into this flushing daemon hang situation.
The deadlock can be released by manually killing the nbd-client daemons on both systems (which will break the mirrors). To avoid this potential deadlock entirely, however, symmetric active replication is not recommended. |
High CPU usage reported by top for md_raid1 process with large mirror sizes
On some OS distributions, top may report high CPU usage for the mdX_raid1 process (where X is the mirror number) when working with very large mirrors (500GB or more).
Solution: To reduce CPU usage, set the chunk size to 1024 via the LifeKeeper tunable LKDR_CHUNK_SIZE, then delete and recreate the mirror so that it uses the new setting. |
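One way to apply a tunable is to set it idempotently in the tunables file. The sketch below assumes that tunables are stored as `KEY=value` lines and that the file is /etc/default/LifeKeeper; verify both for your installation. The helper name `set_tunable` is hypothetical, and `sed -i` assumes GNU sed.

```shell
#!/bin/sh
# Set KEY=value in a file, replacing an existing KEY line or appending one.
# Usage: set_tunable <file> <key> <value>
# The function name is illustrative, not a LifeKeeper command.
set_tunable() {
    file="$1"; key="$2"; value="$3"
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite the line in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key absent: append a new line.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

For example, `set_tunable /etc/default/LifeKeeper LKDR_CHUNK_SIZE 1024`, followed by deleting and recreating the mirror as described above.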
The use of lkbackup with DataKeeper resources requires a full resync
Although lkbackup will save the instance and mirror_info files, it is best practice to perform a full resync of DataKeeper mirrors after a restore from lkbackup as the status of source and target cannot be guaranteed while a resource does not exist. |
DataKeeper does not support using Network Compression on SLES12 SP1 or later
DataKeeper does not support using Network Compression on SLES12 SP1 or later due to a disk I/O performance problem. |
Certain kernel versions do not support DataKeeper asynchronous mode.
Kernel panics have been observed with certain kernel versions when using DataKeeper resources in asynchronous mode with LifeKeeper for Linux. Because this is a kernel-dependent problem, it cannot be resolved within LifeKeeper. To use a DataKeeper asynchronous mode configuration, you must update or downgrade the kernel.
The kernel versions that do not support the DataKeeper asynchronous mode are as follows.
3.10.0-693 series: 3.10.0-693.24.1.el7.x86_64 or later
3.10.0-862.el7.x86_64 ~ 3.10.0-862.26.x.el7.x86_64
3.10.0-957.el7.x86_64 ~ 3.10.0-957.3.x.el7.x86_64
If you use one of the kernel versions listed above and use DataKeeper resources in asynchronous mode, update (or downgrade) to one of the following kernel versions.
3.10.0-693 series: earlier than 3.10.0-693.24.1.el7.x86_64
3.10.0-862.29.1.el7.x86_64 or later
3.10.0-957.4.1.el7.x86_64 or later
If you cannot update (or downgrade) the kernel, do not use DataKeeper asynchronous mode. |
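The version ranges above can be checked against the output of `uname -r`. This is a minimal sketch, not an official compatibility check. The function names are hypothetical, `sort -V` assumes GNU coreutils, and builds between the last listed affected version and the first listed fixed version are conservatively treated as affected.

```shell
#!/bin/sh
# version_lt A B: return 0 if version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# async_mode_affected RELEASE: return 0 if the el7 kernel release falls in
# one of the ranges listed above as not supporting asynchronous mode.
# Both function names are illustrative, not part of LifeKeeper.
async_mode_affected() {
    rel="$1"
    case "$rel" in
        3.10.0-*.el7*) ;;
        *) return 1 ;;                # not a listed kernel series
    esac
    build="${rel#3.10.0-}"            # e.g. 862.14.4.el7.x86_64
    build="${build%%.el7*}"           # e.g. 862.14.4
    series="${build%%.*}"             # e.g. 862
    case "$series" in
        693) ! version_lt "$build" "693.24.1" ;;  # 693.24.1 or later
        862) version_lt "$build" "862.29.1" ;;    # before the fixed build
        957) version_lt "$build" "957.4.1" ;;     # before the fixed build
        *)   return 1 ;;
    esac
}
```

For example, `async_mode_affected "$(uname -r)" && echo "asynchronous mode is not supported on this kernel"`.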
Some kernel versions do not support the Secure Boot feature
If Secure Boot is enabled on RHEL7 or later, CentOS7 or later, or Oracle Linux 7 or later, the nbd module fails to load. Also, in some kernel versions of SUSE Linux Enterprise Server and Oracle Linux UEK kernel, loading of the md / raid1 kernel module fails when Secure Boot is enabled.
Solution: Take one of the following actions:
- Disable Secure Boot – Disable Secure Boot in the UEFI configuration.
- Disable signature verification – Disable signature verification with the “mokutil --disable-validation” command. See the mokutil documentation for details.
The first option (disabling Secure Boot) is recommended. Both options require a system reboot.
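To see whether Secure Boot is currently enabled, the output of `mokutil --sb-state` can be inspected. The sketch below assumes that the output contains the text "SecureBoot enabled" when Secure Boot is on. The function name `secure_boot_enabled` is hypothetical.

```shell
#!/bin/sh
# Return 0 if the given state text (or, when no argument is passed, the
# output of "mokutil --sb-state") reports Secure Boot as enabled.
# The function name is illustrative, not part of LifeKeeper or mokutil.
secure_boot_enabled() {
    state="${1-$(mokutil --sb-state 2>/dev/null)}"
    case "$state" in
        *"SecureBoot enabled"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `secure_boot_enabled && echo "Secure Boot is enabled; the nbd module may fail to load"`.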
|