Symptom: Switching over a VMDK resource causes a system down on the original active node.
Possible Cause: When the quickCheck daemon PID file /tmp/LK-vmdk-* of a VMDK resource is deleted, the recovery process starts a new quickCheck daemon but does not stop the old one. Because only the daemon corresponding to the PID file is stopped when the resource is removed, the orphaned daemon later detects the VMDK detachment and brings the system down.

Action: Do not delete the PID file.
In general, files under /tmp that have not been updated for a certain period are deleted automatically, typically by a periodic tmpwatch cron job or by systemd-tmpfiles. Configure an exclusion appropriate to your environment so that the PID files are not deleted.

Example: for systemd-tmpfiles
echo "x /tmp/LK-vmdk-*" > /etc/tmpfiles.d/lifekeeper.conf
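If your distribution cleans /tmp with a periodic tmpwatch job instead of systemd-tmpfiles, tmpwatch's -X (--exclude-pattern) option can skip the PID files. The fragment below is illustrative only; the tmpwatch path, the 10d retention period, and the cron script it would live in are examples to adapt to your environment:

```shell
# Example fragment for a cron-driven tmpwatch cleanup (e.g. in
# /etc/cron.daily/tmpwatch): exclude the LifeKeeper VMDK PID files.
# -X skips any path matching the glob; 10d keeps other files 10 days.
/usr/sbin/tmpwatch -X '/tmp/LK-vmdk-*' 10d /tmp
```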

How to check if multiple quickCheck daemons are running and what to do if they are running
Run
for tag in `ins_list -f, -a scsi -r vmdk | cut -d, -f4`; do echo -n "$tag: "; pgrep -f "vmdk_quickCheck.ps1.*$tag$" | wc -l; done
and if the result is not “<vmdk tag name>: 1”, execute the following:
pkill -INT -f "^/opt/LifeKeeper/bin/pwsh /opt/LifeKeeper/lkadm/subsys/scsi/vmdk/bin/vmdk_quickCheck.ps1"
pkill lkcheck
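As a hypothetical, self-contained illustration of this check-and-kill pattern (it does not touch the real LifeKeeper daemons), the sketch below starts two identical background processes tagged with a unique marker, counts them with pgrep -f, and then stops them all with pkill -f, mirroring the commands above:

```shell
#!/bin/sh
# Stand-ins for duplicate quickCheck daemons: two identical background
# processes whose command line carries a unique marker string.
marker="demo-quickcheck-$$"
sh -c "sleep 15; : $marker" &
sh -c "sleep 15; : $marker" &
sleep 1   # give both processes time to start

# Count processes whose full command line matches the marker,
# analogous to: pgrep -f "vmdk_quickCheck.ps1.*$tag$" | wc -l
count=$(pgrep -f "$marker" | wc -l)
echo "running: $count"

# A count other than 1 means duplicates; stop every match by full
# command line, as pkill -f does for the real daemons.
pkill -f "$marker"
sleep 1
after=$(pgrep -f "$marker" | wc -l)
echo "after pkill: $after"
```

On a healthy node the per-tag count in the real check should always be 1; this demo deliberately creates two matches to show what the duplicate case looks like.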
Symptom: The mount point is not included in the selection when creating resources.
Possible Cause: One of the following:
  • PowerShell/PowerCLI is not installed
  • An ESXi host is not registered
  • The disk.enableUUID parameter is not set
  • The virtual hard disk is on a datastore that is not shared
  • SCSI controller sharing is configured as “virtual” or “physical”

Error details are recorded in /var/log/lifekeeper.log. Check the log and review the settings.
Symptom: It takes longer to bring the VMDK resource in service.
Cause: The processing performed while bringing the resource in service takes time in proportion to the number of virtual machines running on the ESXi host.

Action: A fundamental fix is under consideration for a future release. As an immediate workaround for this issue, please consider the following:
  • Reduce the number of VMDK resources.
    Since the process is performed for each VMDK resource, the time required for the process increases in proportion to the number of VMDK resources. If multiple partitions or file systems are required for a single resource hierarchy, create multiple partitions or file systems on a single VMDK resource rather than using multiple VMDK resources.
  • If ESXi hosts that are not related to the cluster node are registered with the VMDK resource, unregister them (see “How to manage ESXi host information”).
  • Reduce the number of running virtual machines.
