To avoid replicating corrupt or inconsistent data to targets, LifeKeeper can wait for resources to be in-service before replicating data. Starting with version 9.5.2, LifeKeeper by default waits for parent resources of type ‘filesys’ to be in-service before replicating data. This provides assurance that the file system is consistent before data is replicated. If the file system fails to mount, the appropriate recovery action could be to repair the file system on the current system or to check the data on another system before the data is replicated.
The “Wait to Resync” feature is configured in /etc/default/LifeKeeper with the setting “LKDR_WAIT_TO_RESYNC”. There are 3 options to configure the “Wait to Resync” feature:
- False. This will disable the feature. Parent resources will not be checked before initializing resynchronization. Resynchronization will begin during the initial restore of the DataKeeper resource if there are no other issues blocking resynchronization.
- <resource type>. Specify a resource type. By default, the ‘filesys’ resource type is specified. This setting can be any installed resource type. The command “/opt/LifeKeeper/bin/typ_list -f:” displays the installed resource types as a list of the form “application:type”. For example,
The first field is the application and the second field, following the “:”, is the type (see Short Status Display for more information). For example, the first entry in the list above is application “gen” and type “app”. Any type on the right side of the “:” can be specified. LifeKeeper will wait for parents of the DataKeeper resource which have the specified type to be in-service before resynchronization begins. For example, suppose that “LKDR_WAIT_TO_RESYNC=app” is set in /etc/default/LifeKeeper and that the resource hierarchy is as follows:
where “app1”, “app2”, and “app3” have resource type “app”. In this example, netraid resource “datarep-maxdb1” will not synchronize until all three gen:app parent resources are in-service. The netraid resource “datarep-maxdb2” will not synchronize until “app1” is in-service since “app1” is its only gen:app parent.
- Hierarchy. All parent resources of a DataKeeper resource must be in-service before synchronization will begin.
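As a rough illustration of the three options above, the excerpt below shows how the setting might look in /etc/default/LifeKeeper. The spelling and letter case of the values are taken from this text (“False”, “filesys”, “hierarchy”) and are an assumption; “app” is only the example type from the “gen:app” entry discussed above.

```shell
# /etc/default/LifeKeeper (excerpt); keep exactly one line uncommented.

# Option 1: disable the feature. Resynchronization does not wait for parents.
#LKDR_WAIT_TO_RESYNC=False

# Option 2 (default): wait for parents of type "filesys" to be in-service.
LKDR_WAIT_TO_RESYNC=filesys

# Option 2, other type: wait for parents of type "app" (from "gen:app").
#LKDR_WAIT_TO_RESYNC=app

# Option 3: wait for every parent resource of the DataKeeper resource.
#LKDR_WAIT_TO_RESYNC=hierarchy
```

Only the type (the part after the “:” in the typ_list output) is given as the value, not the full “application:type” pair.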
There are 3 ways a user will see information about this feature:
- The GUI will show the target in the “Wait to Resync” state when a required parent resource is out of service:
- A warning message is logged during the in-service operation when a required parent resource is out of service. The message will specify the type that is being checked, and will also specify that the full hierarchy is being checked in the case that “LKDR_WAIT_TO_RESYNC=hierarchy”.
a. Example log message shown during the in-service operation where LKDR_WAIT_TO_RESYNC=filesys:
b. Example log message in the log file where LKDR_WAIT_TO_RESYNC=hierarchy:
WARN:dr:recover:datarep-maxdb:104237:Mirror “datarep-maxdb” will wait to reconnect targets until parent file system “/maxdb” is in-service. To reconnect targets immediately run: “/opt/LifeKeeper/bin/mirror_action datarep-maxdb resume” on “ip-10-0-2-128” (see “LKDR_WAIT_FOR_FILE_SYSTEM_TO_MOUNT” in /etc/default/LifeKeeper).
- An emergency message is logged both to the LifeKeeper log file as well as all open terminals when a required parent resource is in the failed state (OSF).
a. In this case, the GUI will show the same “Wait to Resync” status.
b. The following message will be logged to all terminals and to the LifeKeeper log file:
EMERG:dr:recover:datarep-maxdb:104236:Resource “/maxdb” is “OSF”. The mirror “datarep-maxdb” will wait to reconnect targets until parent file system “/maxdb” is in-service. This may indicate inconsistent data. Do not bring the resource “/maxdb” in-service until the data has been verified; replication will continue when “/maxdb” is in-service. A full resync may be necessary (see “LKDR_WAIT_FOR_FILE_SYSTEM_TO_MOUNT” in /etc/default/LifeKeeper).
c. Do not bring the file system resource in-service until the data on the file system is verified. Once the file system resource is brought in-service, replication will resume during the next quickCheck cycle for the DataKeeper resource.
d. The failed file system resource indicates that mount, log replay, and fsck are unable to repair the file system. This may indicate inconsistent data on the disk if the file system is able to come in-service on another node. Once the file system is repaired either by recovery on the node that failed or by switching to another node, a full resync may be necessary to ensure that all nodes have consistent data.
e. The mirror can be mounted on a target to check if the file system can be mounted and the data verified. This can be done using the “Pause Mirror” feature.
When the mirror is paused, LifeKeeper will automatically mount the file system on the target. If this is successful, verify that the data is correct. The mirror can be “forced online” by choosing the “Force Mirror Online” option on the target (“paused” server). WARNING: This operation will not resume replication until the appropriate parent resources are in-service, as defined by the LKDR_WAIT_TO_RESYNC setting on the server where the “force online” is being performed.
DO NOT choose the “Resume Replication” option until there is high confidence that the data on the source is correct. Choosing “Resume Replication” after pausing the mirror will resume replication on the source even though parent resources are not in-service (even if they have failed).
f. Once the data has been verified and the faulty parent resource has been repaired and brought in-service, the next quickCheck cycle for the DataKeeper resource will detect that it should no longer wait to resynchronize the data. Until that quickCheck cycle runs (up to 2 minutes), the “Wait to Resync” state may still be displayed even though the parent resource is in-service.
g. If corruption is found (for example, the file system will mount on one node but not on another) or suspected, then a full resync is advised. The mirror must first be paused to force a full resync. Once the mirror is paused, run the following command on the source system:
/opt/LifeKeeper/bin/mirror_action <tag> fullresync
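The warning and emergency messages described earlier carry the message codes 104237 and 104236. A minimal sketch for finding them in a log file is shown below. The function name wait_to_resync_events is hypothetical, and the log file location is installation-specific, so the path must be passed in explicitly.

```shell
#!/bin/sh
# Sketch: print the "Wait to Resync" messages found in a LifeKeeper log file.
# wait_to_resync_events is a hypothetical helper, not a LifeKeeper command.

wait_to_resync_events() {
    # $1: path to the LifeKeeper log file (required, installation-specific).
    log_file="${1:?usage: wait_to_resync_events <log file>}"

    # The code is the field that follows the resource tag, as in
    # "WARN:dr:recover:datarep-maxdb:104237:...". Match it with both colons
    # so that other numbers in the line are not picked up by accident.
    grep -E ':(104236|104237):' "$log_file"
}
```

Calling wait_to_resync_events with the path to the log prints the matching lines. grep returns a non-zero exit status when nothing matches, which a calling script can treat as “no mirror is waiting to resync”.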