Description
DataKeeper synchronization fails with certain kernel versions.

The following logs are repeatedly output to /var/log/messages:
Apr 18 13:05:59 node1 nbd-client: Begin Negotiation
Apr 18 13:05:59 node1 nbd-client: size = 53684994048
Apr 18 13:05:59 node1 nbd-client: Negotiation Complete
Apr 18 13:05:59 node1 nbd-client: Ioctl/2 failed: Device or resource busy
Apr 18 13:05:59 node1 kernel: block nbd1: Device being setup by another task

If you are using RHEL 8.0 through 8.2, or Oracle Linux UEK R5 (kernel-4.14.35-*), please update to RHEL 8.3 or later, or to UEK R6 or later, respectively.
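
To check whether a system is affected, one simple approach (assuming the default log location shown above) is to search for the error signatures and confirm the running kernel:

  grep -E 'Ioctl/2 failed|Device being setup by another task' /var/log/messages
  uname -r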
Extending to the third node fails when a resync is in progress

With a DataKeeper resource (mirror), an attempt to extend the mirror to a third node while a resync is in progress will fail, because the mirror cannot be "grown" during the resync.

It will result in an error similar to the following:

removing hierarchy remnants
getId: /opt/LifeKeeper/lkadm/subsys/scsi/CPQARRAY/bin/getId -i "/dev/sdb1" returned ""
getId: /opt/LifeKeeper/lkadm/subsys/scsi/device/bin/getId -i "/dev/sdb1" returned "360022480e2671ceb01246bb1c5d67ebd-1"
mdadm: /dev/md0 is performing resync/recovery and cannot be reshaped
Failed to grow array (1)
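
Before extending to the third node, you can verify that no resync is in progress. Since the mirror is an md device (md0 in the error above), one way is to inspect /proc/mdstat and confirm that no resync/recovery progress line is shown:

  cat /proc/mdstat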
RHEL 8.6 / 9.0 does not support DataKeeper asynchronous mode on disks that are thin provisioned

It has been observed that a kernel panic will occur with the kernels provided for Red Hat EL 8.6 / 9.0. While we are working with Red Hat to provide an updated kernel with the upstream fix, we recommend not configuring asynchronous mirrors on a RHEL 8.6 / 9.0 kernel where the disk is thin provisioned. A warning will be displayed when installing or updating LifeKeeper on a RHEL 8.6 / 9.0 system.
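
For LVM-backed disks, one way to check whether a volume is thin provisioned is to inspect the logical volume attributes ('V' denotes a thin volume and 't' a thin pool in the lv_attr field):

  lvs -o lv_name,vg_name,lv_attr

For disks presented by a storage array or hypervisor, check the provisioning type in that layer instead.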
A DataKeeper resource configuration where the resource is created with asynchronous mode and extended with synchronous mode is not supported.

In a DataKeeper resource configuration where the resource is created in asynchronous mode and extended in synchronous mode, mirror read/write operations may hang within the kernel.

Run the following command on each node to determine whether the DataKeeper resource is synchronous or asynchronous: 0 indicates synchronous mode and a non-zero value indicates asynchronous mode. A resource that is synchronous on all nodes, or asynchronous on all nodes, is acceptable. To avoid this issue, do not mix synchronous and asynchronous modes.

perl -nle 'my @x = split(/\x01/, $_); print "$x[0]:$x[3]";' /opt/LifeKeeper/subsys/scsi/resources/netraid/mirrorinfo_<md num>
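
To check every mirror on a node at once, the same command can be run over all of the mirrorinfo files (assuming the mirrorinfo_* naming shown above):

  for f in /opt/LifeKeeper/subsys/scsi/resources/netraid/mirrorinfo_*; do
      perl -nle 'my @x = split(/\x01/, $_); print "$x[0]:$x[3]";' "$f"
  done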

Solution: No workaround is currently available for an existing resource. Recreate the DataKeeper resource and select synchronous mode both when creating and when extending it.
Partitions with an odd number of sectors are not supported when running kernel 4.12 or later

The use of a partition with an odd number of sectors is not supported in a DataKeeper mirror in environments running kernel 4.12 or later. This is due to an issue where a resync may fail when attempting to write past the end of the disk.
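
To check whether a partition has an odd number of sectors (the device name below is an example), query its size in 512-byte sectors; an odd result means the partition should not be used in a mirror:

  blockdev --getsz /dev/sdb1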
Important reminder about DataKeeper for Linux asynchronous mode in an LVM over DataKeeper configuration
Kernel panics may occur in configurations where LVM resources sit above multiple asynchronous mirrors. In these configurations, data consistency may be an issue if a panic occurs. Therefore, the supported configurations are a single DataKeeper mirror or multiple synchronous DataKeeper mirrors.
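
To see whether a volume group sits above more than one DataKeeper mirror (mirror devices appear as md devices such as /dev/md0), list the physical volumes in each volume group:

  pvs -o pv_name,vg_name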
In symmetric active SDR configurations with significant I/O traffic on both servers, the filesystem mounted on the mirror stops responding and eventually the whole system hangs

Due to the single-threaded nature of the Linux buffer cache, the buffer cache flushing daemon can hang trying to flush out a buffer that needs to be committed remotely. While the flushing daemon is hung, all activity on the system that generates dirty buffers will stop once the number of dirty buffers exceeds the system limit (set in /proc/sys/vm/bdflush).

Usually this is not a serious problem unless something prevents the remote system from clearing remote buffers (e.g., a network failure). LifeKeeper will detect a network failure and stop replication in that event, thus clearing the hang condition. However, if the remote system is also replicating to the local system (i.e., they are symmetrically replicating to each other), the two systems can deadlock forever if they both get into this flushing-daemon hang.

The deadlock can be released by manually killing the nbd-client daemons on both systems (which will break the mirrors). To avoid this potential deadlock entirely, however, symmetric active replication is not recommended.
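
A sketch of manually releasing the deadlock (this breaks the mirrors, which will then need to be resynced), run on both systems:

  ps -ef | grep '[n]bd-client'   # identify the nbd-client daemons
  pkill nbd-client               # kill them, breaking the mirrors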
High CPU usage reported by top for md_raid1 process with large mirror sizes

High CPU usage, as reported by top, can be seen for the mdX_raid1 process (where X is the mirror number) on some OS distributions when working with very large mirrors (500 GB or more).

Solution: To reduce the CPU usage, set the chunk size to 1024 via the LifeKeeper tunable LKDR_CHUNK_SIZE, then delete and recreate the mirror so that the new setting takes effect.
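
For example, assuming tunables are read from /etc/default/LifeKeeper (the usual location for LifeKeeper tunables):

  echo 'LKDR_CHUNK_SIZE=1024' >> /etc/default/LifeKeeper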
The use of lkbackup with DataKeeper resources requires a full resync

Although lkbackup will save the instance and mirror_info files, it is best practice to perform a full resync of DataKeeper mirrors after a restore from lkbackup, because the synchronization state of source and target cannot be guaranteed for the period during which the resource did not exist.
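
One way to force a full resync is the mirror_action command (the tag is a placeholder; verify the exact syntax against your LifeKeeper version):

  /opt/LifeKeeper/bin/mirror_action <tag> fullresync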
DataKeeper does not support using Network Compression on SLES12 SP1 or later

DataKeeper does not support using Network Compression on SLES12 SP1 or later due to a disk I/O performance problem.
Certain kernel versions do not support DataKeeper asynchronous mode.

It has been observed that a kernel panic will occur with certain kernel versions when using a DataKeeper resource in asynchronous mode with LifeKeeper for Linux. Because this is a kernel-dependent problem, there is no fundamental solution within LifeKeeper. To use a DataKeeper asynchronous mode configuration, it is necessary to update or downgrade the kernel.

The kernel versions that do not support DataKeeper asynchronous mode are as follows:
3.10.0-693 series: 3.10.0-693.24.1.el7.x86_64 and later
3.10.0-862.el7.x86_64 through 3.10.0-862.26.x.el7.x86_64
3.10.0-957.el7.x86_64 through 3.10.0-957.3.x.el7.x86_64

If you are using one of the kernel versions listed above and want to use DataKeeper resources in asynchronous mode, please update (or downgrade) to one of the following kernel versions:
3.10.0-693 series: a kernel earlier than 3.10.0-693.24.1.el7.x86_64
3.10.0-862.29.1.el7.x86_64 or later
3.10.0-957.4.1.el7.x86_64 or later

If you cannot update (or downgrade) the kernel, do not use DataKeeper asynchronous mode.
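
To check which kernel is running and which kernel packages are installed before deciding whether to update or downgrade:

  uname -r          # currently running kernel
  rpm -q kernel     # installed kernel packages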
Some kernel versions do not support the Secure Boot feature

If Secure Boot is enabled on RHEL 7, RHEL 8, or their compatible operating systems, the nbd module fails to load. Also, with some kernel versions of SUSE Linux Enterprise Server and the Oracle Linux UEK kernel, the md/raid1 kernel modules fail to load when Secure Boot is enabled.

Solution: Take one of the following actions:
  1. Disable Secure Boot – Disable Secure Boot in the UEFI configuration.
  2. Disable signature verification – Disable signature verification with the "mokutil --disable-validation" command. See the mokutil documentation for details.

  Solution 1 is recommended. Both solutions require a system reboot.
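
To check whether Secure Boot is currently enabled:

  mokutil --sb-state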

Mirroring occasionally stops when the communication paths recover

When LifeKeeper recovers from a communication failure and the resync starts, multiple processes for the resync run. Depending on the timing, one of the processes can block the initialization of another, leaving the mirror in an Out of Sync state.

This issue is most likely to occur when all of the following conditions are satisfied after recovering from a communication failure.

  • The mirror state of a DataKeeper resource is “Out of Sync”
  • The following log message is displayed at quickCheck intervals
    INFO:lkcheck:::006041:recover is being in progress: [Tag name of the DataKeeper resource]
  • One or more recover processes are still running. For example, you can check this with the following command.
    top -b -n 1 | grep recover

Solution:

Reboot the node on which the DataKeeper resource is In Service.

If this solution is not effective, please contact our support desk.
