DataKeeper synchronization fails with certain kernel versions.

The following logs are repeatedly output to /var/log/messages:
Apr 18 13:05:59 node1 nbd-client: Begin Negotiation
Apr 18 13:05:59 node1 nbd-client: size = 53684994048
Apr 18 13:05:59 node1 nbd-client: Negotiation Complete
Apr 18 13:05:59 node1 nbd-client: Ioctl/2 failed: Device or resource busy
Apr 18 13:05:59 node1 kernel: block nbd1: Device being setup by another task

If you are using RHEL 8.0 through 8.2 or Oracle Linux UEK R5 (kernel-4.14.35-*), update to RHEL 8.3 or later or to UEK R6 or later, respectively.
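To confirm that a node is hitting this issue, the log can be scanned for the two failure messages shown in the excerpt above. The following is a minimal sketch; the function name `count_nbd_busy_errors` is hypothetical, and the patterns are copied from the sample log lines rather than from any official message catalog.

```python
import re

# Failure signatures copied from the /var/log/messages excerpt above.
BUSY_PATTERNS = (
    re.compile(r"nbd-client: Ioctl/2 failed: Device or resource busy"),
    re.compile(r"kernel: block nbd\d+: Device being setup by another task"),
)


def count_nbd_busy_errors(lines):
    """Count the log lines that match either failure signature."""
    return sum(
        1 for line in lines if any(p.search(line) for p in BUSY_PATTERNS)
    )
```

Passing an open file object for /var/log/messages as `lines` works, since the function only iterates over it. A count that keeps growing suggests the negotiation is being retried repeatedly.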
Partitions with an odd number of sectors are not supported when running kernel 4.12 or later

The use of a partition with an odd number of sectors is not supported in a DataKeeper mirror in environments running kernel 4.12 or later. This is due to an issue where a resync may fail when attempting to write past the end of the disk.
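A partition can be checked before it is used in a mirror. The sketch below assumes that "sectors" means the 512-byte units the kernel reports in `/sys/class/block/<name>/size`; the helper names are hypothetical and are not part of DataKeeper.

```python
from pathlib import Path


def has_odd_sector_count(sectors: int) -> bool:
    """Return True when the sector count is odd."""
    return sectors % 2 == 1


def partition_sectors(name: str) -> int:
    """Read a block device's size in 512-byte sectors from sysfs.

    `name` is a kernel device name such as "sdb1" (example value only).
    """
    return int(Path(f"/sys/class/block/{name}/size").read_text().strip())
```

For example, `has_odd_sector_count(partition_sectors("sdb1"))` returning True would mean the partition should be resized to an even number of sectors before the mirror is created.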
Important reminder about DataKeeper for Linux asynchronous mode in an LVM over DataKeeper configuration

Kernel panics may occur in configurations where LVM resources sit above multiple asynchronous mirrors. In these configurations, data consistency may be at risk if a panic occurs. Therefore, the required configurations are a single DataKeeper mirror or multiple synchronous DataKeeper mirrors.
In symmetric active SDR configurations with significant I/O traffic on both servers, the filesystem mounted on the mirror stops responding and eventually the whole system hangs

Due to the single-threaded nature of the Linux buffer cache, the buffer cache flushing daemon can hang while trying to flush a buffer that needs to be committed remotely. While the flushing daemon is hung, all activity in the Linux system involving dirty buffers will stop if the number of dirty buffers exceeds the system limit (set in /proc/sys/vm/bdflush).

Usually this is not a serious problem unless something happens to prevent the remote system from clearing remote buffers (e.g. a network failure). LifeKeeper will detect a network failure and stop replication in that event, thus clearing a hang condition. However, if the remote system is also replicating to the local system (i.e. they are both symmetrically replicating to each other), they can deadlock forever if they both get into this flushing daemon hang situation.

The deadlock can be released by manually killing the nbd-client daemons on both systems (which will break the mirrors). To avoid this potential deadlock entirely, however, symmetric active replication is not recommended.
Mirror breaks and fills up /var/log/messages with errors

This issue has been seen occasionally (on Red Hat EL 6.x and CentOS 6.x) during stress tests with induced failures, especially when killing the nbd-server process that runs on a mirror target system. Upgrading to the latest kernel for your distribution (for example, kernel-2.6.32-131.17.1 or later) may lower the risk of seeing this issue. Rebooting the source system will clear up this issue.

With the default kernel that comes with CentOS 6 (2.6.32-71), this issue may occur much more frequently (even when the mirror is just under a heavy load).

Note: Beginning with SPS 8.1, when performing a kernel upgrade on Red Hat Enterprise Linux systems, it is no longer a requirement that the setup script (./setup) from the installation image be rerun. Modules should be automatically available to the upgraded kernel without any intervention as long as the kernel was installed from a proper Red Hat package (rpm file).
High CPU usage reported by top for md_raid1 process with large mirror sizes

High CPU usage by the mdX_raid1 process (where X is the mirror number), as reported by top, can be seen on some OS distributions when working with very large mirrors (500GB or more).

Solution: To reduce the CPU usage percentage, change the chunk size to 1024 via the LifeKeeper tunable LKDR_CHUNK_SIZE, then delete and recreate the mirror so that it uses the new setting.
The use of lkbackup with DataKeeper resources requires a full resync

Although lkbackup will save the instance and mirror_info files, it is best practice to perform a full resync of DataKeeper mirrors after a restore from lkbackup as the status of source and target cannot be guaranteed while a resource does not exist.
Mirror resyncs may hang in early Red Hat/CentOS 6.x kernels with a “Failed to remove device” message in the LifeKeeper log

Kernel versions prior to 2.6.32-131.17.1 (for example, the RHEL 6.1 kernel 2.6.32-131.0.15 before updates) contain a problem in the md driver used for replication. This problem prevents the release of the nbd device from the mirror, which results in multiple “Failed to remove device” messages being logged and the mirror resync being aborted. A system reboot may be required to clear the condition. This problem has been observed during initial resyncs after mirror creation and when the mirror is under stress.

Solution: Kernel 2.6.32-131.17.1 has been verified to contain the fix for this problem. If you are using DataKeeper with Red Hat or CentOS 6 kernels earlier than 2.6.32-131.17.1, we recommend updating to this version or to the latest available version.
DataKeeper does not support using Network Compression on SLES11 SP4 and SLES12 SP1 or later

DataKeeper does not support using Network Compression on SLES11 SP4 and SLES12 SP1 or later due to a disk I/O performance problem.
Certain kernel versions do not support DataKeeper asynchronous mode.

Kernel panics have been observed with certain kernel versions when using DataKeeper resources in asynchronous mode with LifeKeeper for Linux. Because this is a kernel-dependent problem, it cannot be fixed within LifeKeeper. To use a DataKeeper asynchronous mode configuration, you must update or downgrade the kernel.

The kernel versions that do not support the DataKeeper asynchronous mode are as follows.
3.10.0-693 series: 3.10.0-693.24.1.el7.x86_64 or later
3.10.0-862.el7.x86_64 ~ 3.10.0-862.26.x.el7.x86_64
3.10.0-957.el7.x86_64 ~ 3.10.0-957.3.x.el7.x86_64

If you are running one of the kernel versions listed above and use DataKeeper resources in asynchronous mode, update (or downgrade) to one of the following kernel versions.
3.10.0-693 series: kernels earlier than 3.10.0-693.24.1.el7.x86_64
3.10.0-862.29.1.el7.x86_64 or later
3.10.0-957.4.1.el7.x86_64 or later

If you cannot update (or downgrade) the kernel, do not use DataKeeper asynchronous mode.
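The version ranges above can be expressed as a small check against a kernel release string. This is a rough sketch, not an official compatibility test: the function name `async_mode_unsupported` is hypothetical, and releases that fall between a listed unsupported range and the first listed fixed version (for example, a 3.10.0-862.27.x build) are conservatively treated as unsupported, which is an assumption.

```python
import re

# Matches RHEL 7 style releases such as "3.10.0-862.14.4.el7.x86_64".
_EL7_RELEASE = re.compile(r"^3\.10\.0-(\d+(?:\.\d+)*)\.el7")


def async_mode_unsupported(release: str) -> bool:
    """Return True if the release falls in a range listed as unsupported."""
    match = _EL7_RELEASE.match(release)
    if match is None:
        return False  # not a 3.10.0 el7 kernel, so not covered by the list
    build = tuple(int(part) for part in match.group(1).split("."))
    if build[0] == 693:
        return build >= (693, 24, 1)
    if build[0] == 862:
        # Assumption: everything before the fixed 862.29.1 is affected.
        return build < (862, 29, 1)
    if build[0] == 957:
        return build < (957, 4, 1)
    return False
```

On a running node, the release string can be obtained with `platform.release()` from the Python standard library or with `uname -r`.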
Secure Boot is not supported on RHEL/CentOS/Oracle Linux

If Secure Boot is enabled on RHEL7 or later, CentOS7 or later, or Oracle Linux 7 or later, the nbd module fails to load. For this reason, Secure Boot cannot be enabled in a DataKeeper environment. The exception is Oracle Linux running UEK, where Secure Boot can be enabled.

Solution: Take one of the following actions:
  1. Disable Secure Boot – Disable Secure Boot in the UEFI configuration.
  2. Disable signature verification – Disable signature verification with the "mokutil --disable-validation" command. See the mokutil documentation for details.

    Solution 1 is recommended. Both require a system reboot.
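Whether Secure Boot is currently active can be checked with `mokutil --sb-state`, which typically prints "SecureBoot enabled" or "SecureBoot disabled". The sketch below wraps that command; the function names are hypothetical, and the expected output strings are an assumption based on common mokutil builds.

```python
import subprocess


def secure_boot_enabled(sb_state_output: str) -> bool:
    """Interpret the text printed by `mokutil --sb-state`."""
    return "secureboot enabled" in sb_state_output.lower()


def read_secure_boot_state() -> str:
    """Run `mokutil --sb-state` and return everything it printed."""
    result = subprocess.run(
        ["mokutil", "--sb-state"],
        capture_output=True,
        text=True,
        check=False,  # non-EFI systems exit non-zero; caller inspects text
    )
    return result.stdout + result.stderr
```

Calling `secure_boot_enabled(read_secure_boot_state())` after the reboot gives a quick confirmation that the chosen solution took effect.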

