- LifeKeeper Core -

Allowing LifeKeeper sufficient time to start up before performing actions

Solution Details

PROBLEM:

If LifeKeeper is not allowed sufficient time to start up before actions are taken through it, those actions may fail when they should not, because a required process may not have finished initializing. It is important to allow LifeKeeper sufficient time to start up and initialize its processes.

When starting LifeKeeper on multiple nodes as part of the same activity, give lkstart enough time to finish on one node before running lkstart on the next node.

SOLUTION:

Because several processes run during startup and can take a variable amount of time, there is no set amount of time to wait before starting LifeKeeper elsewhere or performing actions. Startup time also varies from one start to the next, so do not measure it once and assume that time will always be accurate.

Instead, use the LifeKeeper logs to determine when startup and initialization have finished. The last thing LifeKeeper does on startup is initialize the LifeKeeper resources. Once you see a log message stating “RESOURCE INITIALIZATION FINISHED” after the most recent start of LifeKeeper, it is okay to start it on the next node in the cluster, or to begin performing actions through the LifeKeeper GUI or CLI.
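
For example (a minimal sketch, assuming the default log location of /var/log/lifekeeper.log; adjust the path if your installation logs elsewhere), you can watch for this message from a shell:

# tail -f /var/log/lifekeeper.log | grep "RESOURCE INITIALIZATION FINISHED"

or check after the fact whether it has already been logged:

# grep "RESOURCE INITIALIZATION FINISHED" /var/log/lifekeeper.log | tail -1

Make sure the timestamp on the message is later than the most recent lkstart.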

Cannot failover when using VEEAM Backup in conjunction with LifeKeeper for Linux

SOLUTION:

To work around this issue, a gen app resource needs to be created to remove the ‘veeamsnap’ kernel module. When the veeamsnap kernel module is loaded, it prevents LifeKeeper from stopping the mirror via ‘mdadm --stop’ because the mirror still appears to be “in use”. The gen app resource for Veeam needs two scripts.

  • The first is a restore script that does an ‘exit 0’ and is used to bring the new gen app resource in service.
  • The second script is a remove script that will unload the ‘veeamsnap’ kernel module via the command ‘rmmod veeamsnap’.

    Once the ‘veeamsnap’ module is unloaded, the ‘mdadm --stop’ command can stop the mirror and the DataKeeper resource can be successfully taken out of service. The remove script takes the resource out of service.


Example of the gen app restore script (3 lines including blank lines):
#!/bin/bash

exit 0

Example of the gen app remove script (5 lines including blank lines):
#!/bin/bash

rmmod veeamsnap

exit 0

When creating the gen app resource, these scripts are used as the restore and remove script inputs when prompted during creation via the LifeKeeper GUI. No quickCheck or recover script is needed for the gen app resource; those fields can be left blank.

The placement of this new resource instance is critical to ensure the proper operation of LifeKeeper and the protected resources. The gen app cannot be placed between the file system resource and the data replication resource, as that would cause failures within LifeKeeper. Because of this, the new gen app resource for Veeam must be placed as a parent of the file system resource.

A typical hierarchy contains an application resource (such as a database) above the file system resources; this application resource will be referred to as the Parent. Under the Parent are one or more file system resources. The new Veeam gen app resource is placed as a dependent child of the Parent, and each file system with a data replication resource is then attached as a dependent child of the new Veeam gen app. This requires removing existing resource dependencies and creating new ones.

Here is an example of how the hierarchy should look before and after the creation of the gen app:

Before:
App
–File system
– –datarep-123

After:
App
–Veeam gen app
– –File system
– – –datarep-123

The dependency between the ‘App’ resource and the ‘File system’ resource is removed and two new dependencies are created:

  • One between the ‘App’ resource and the ‘Veeam gen app’ resource
  • One between the ‘Veeam gen app’ resource and the ‘File system’ resource

    If more than one ‘File system’ resource exists as a dependent of the ‘App’ resource, then those dependencies will need to be removed and new dependencies created from the ‘Veeam gen app’ to the additional ‘File system’ resources.

    For more information on creating and deleting resource dependencies see Creating a Resource Dependency and Deleting a Resource Dependency.
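
If you prefer the command line to the GUI, the dependency changes can also be sketched with LifeKeeper’s dependency utilities. This is an illustration only: the tags ‘App’, ‘Veeam-genapp’ and ‘Filesystem’ are placeholders for your actual resource tags, and the dep_remove/dep_create usage should be confirmed against the documentation for your LifeKeeper version.

# /opt/LifeKeeper/bin/dep_remove -p App -c Filesystem
# /opt/LifeKeeper/bin/dep_create -p App -c Veeam-genapp
# /opt/LifeKeeper/bin/dep_create -p Veeam-genapp -c Filesystem

Repeat the dep_remove/dep_create pair for each additional file system resource that was a dependent of the ‘App’ resource.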

- SAP/Oracle Patching -

Patching Oracle nodes (SAP/Oracle) with DataKeeper

Solution Details

Most Oracle updates / patches only require access to the system table drives and not to the data drives (the DataKeeper-protected data drives).

The following procedure can be used in general:

Prior to the procedure, set the ‘Block Failover’ flag on the primary and the ‘Confirm Failover’ flag on the target.

  1. On the standby / target server:
    1. Apply the appropriate upgrades, patches, etc.
    2. Reboot your server
  2. Then, switch over your resources from the source / active server to the standby / target server:
    1. Verify that you can access the Oracle database
  3. On the other node (which now is a standby after the switchover and the original source / active server):
    1. Apply the appropriate upgrades, patches, etc.
    2. Reboot your server
  4. (Optional) Switch over again to verify connectivity and access to Oracle on the original source / active server

In some cases where the patches require access to the DataKeeper volumes (e.g., when running catsbp), you need to pause / unlock the mirrors to perform the upgrades on the standby / target system.

Verify that Oracle can be started on the backup, then stop Oracle on the backup node and continue the mirrors. Then repeat the upgrade on the source system.
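
As an illustration only (the exact command and arguments depend on your LifeKeeper/DataKeeper for Linux version, and the tag ‘datarep-data’ is a placeholder for your actual DataKeeper resource tag), pausing and resuming a mirror from the command line generally looks like this with the mirror_action utility:

# /opt/LifeKeeper/bin/mirror_action datarep-data pause

(apply the patches that require access to the replicated volume, verify Oracle on the backup, then stop Oracle on the backup)

# /opt/LifeKeeper/bin/mirror_action datarep-data resume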

At the end of the procedure, remove the ‘Block Failover’ flag on the primary and the ‘Confirm Failover’ flag on the target.

- Storage Quorum -

Heartbeat recommendations for using Storage Quorum

Solution Details

ISSUE:
What are SIOS Heartbeat recommendations for using Storage Quorum?

SOLUTION:

In order to use storage quorum, SIOS recommends that you increase LCMNUMHBEATS to allow for a longer time before the comm path is marked as failed. This changes the timeout period from the default of 15 seconds to 45 seconds.

Edit the /etc/default/LifeKeeper file and change LCMNUMHBEATS to 9:

LCMNUMHBEATS=9

Since you will be making a change to a LifeKeeper core parameter, you will need to recycle LifeKeeper.

To minimize downtime, you can use “lkstop -f”, which will leave the protected resources running.
While LifeKeeper is stopped, failures of protected resources will not be detected or acted upon.

# lkstop -f

# lkstart
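
Putting this together (a sketch; the sed command assumes the LCMNUMHBEATS line already exists in /etc/default/LifeKeeper; if it does not, append the line instead):

# grep LCMNUMHBEATS /etc/default/LifeKeeper
# sed -i 's/^LCMNUMHBEATS=.*/LCMNUMHBEATS=9/' /etc/default/LifeKeeper

Then recycle LifeKeeper with “lkstop -f” followed by “lkstart” as shown above.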

Storage quorum failed to prevent failover when communication between nodes is lost

Solution Details

ISSUE:
Incomplete storage quorum configuration caused failures during lost communication processing

SOLUTION:

All comm paths between cluster nodes must be created and “ALIVE” before running qwk_storage_init on each node in the cluster.

If this is not the case, execute the following commands on each node to reinitialize the storage quorum configuration once all comm paths are ALIVE.

# /opt/LifeKeeper/bin/qwk_storage_exit

# /opt/LifeKeeper/bin/qwk_storage_init
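
Before rerunning the initialization, you can confirm the comm path states from the command line (a sketch; assuming the standard lcdstatus utility, whose output format varies slightly by version):

# /opt/LifeKeeper/bin/lcdstatus -q

Every comm path to every other node in the cluster should be reported as ALIVE before qwk_storage_init is run.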

How do you tune parameters for storage quorum when using Amazon S3 storage?

Solution Details

ISSUE:
How do you tune parameters for storage quorum when using Amazon S3 storage?

SOLUTION:

Once you have set up storage quorum using the documentation, there may be questions about how to tune the heartbeat parameters.

Here are several things that you can do to help determine if the default values are sufficient in your environment:

There are two main parameters in /etc/default/LifeKeeper that affect the storage quorum timeout value:

  • QWK_STORAGE_HBEATTIME (default is 6) – Specifies the interval in seconds between reading and writing the QWK objects.
  • QWK_STORAGE_NUMHBEATS (default is 4) – Specifies the number of consecutive heartbeat checks that, when missed, indicates that the target node has failed. A missed heartbeat occurs when the QWK object has not been updated since the last check.

When using an Amazon S3 bucket to store the QWK objects (i.e., QWK_STORAGE_TYPE=aws_s3), SIOS suggests running the following commands to ensure good connectivity in your environment:

  1. Execute ping s3.amazonaws.com and make sure the round-trip time is under a second. This ensures good connectivity from the EC2 node to the global AWS domain.

  2. Execute ping <bucketname>.s3.amazonaws.com, which will resolve to the IP address of the hosting S3 service. The round-trip time should also be less than a second.

    Another thing to consider is the amount of data being transferred for overall S3 activity on this node; it is possible that file transfers are taking place. You can measure the response time using ping, as in the examples above and the sketch below.
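
For example (an illustrative sketch; replace <bucketname> with your actual S3 bucket name), capture the average round-trip time during both a high-traffic window and a low-traffic window and compare the results:

# ping -c 20 s3.amazonaws.com | tail -1
# ping -c 20 <bucketname>.s3.amazonaws.com | tail -1

The final line of the ping output reports the min/avg/max round-trip times for the 20 packets.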


By comparing network responsiveness in both high-traffic and low-traffic situations, you can tune the number of missed heartbeats (QWK_STORAGE_NUMHBEATS) and the heartbeat time (QWK_STORAGE_HBEATTIME) mentioned above.

The parameters above must be specified in /etc/default/LifeKeeper before running qwk_storage_init.

In most cases where your ping to the Amazon S3 service completes in less than a second, the default is sufficient. However, if the connection to the Amazon S3 service is slow or you see degradation, we recommend increasing QWK_STORAGE_HBEATTIME (see parameter above) from 6 to 7. This will increase the loss timeout from 24 seconds (6 × 4) to 28 seconds (7 × 4). SIOS does not recommend increasing the timeout much beyond 30 seconds.
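
For example, the corresponding entries in /etc/default/LifeKeeper would look like this (illustrative values based on the recommendation above):

QWK_STORAGE_HBEATTIME=7
QWK_STORAGE_NUMHBEATS=4

With these values the loss timeout is 7 × 4 = 28 seconds.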

If you change the default settings, be sure to change them on each system in the cluster and reinitialize storage quorum by executing the following commands on each system in the cluster while all comm paths are “ALIVE”:

# /opt/LifeKeeper/bin/qwk_storage_exit

# /opt/LifeKeeper/bin/qwk_storage_init

- Majority Mode -

Shutdown and restart procedure for LifeKeeper with a witness node

1. Stop LifeKeeper on the backup (target) server using lkstop.

Wait for the lkstop command to complete. Verify that lkstop has finished by checking for the following log entry:

NOTIFY:shutdown:::010055:LifeKeeper stopped

Quorum message:

NOTIFY:event.comm_down:::010469:We do have quorum on comm_down, continuing

Typically this takes less than 2 minutes.

2. Stop LifeKeeper on the primary (source) server using lkstop.

Wait for the lkstop command to complete. Verify that lkstop has finished by checking for the following log entry:

NOTIFY:shutdown:::010055:LifeKeeper stopped

Quorum message:

NOTIFY:event.comm_down:::010469:We do have quorum on comm_down, continuing

Typically this takes less than 2 minutes.

3. Stop LifeKeeper on the witness node using lkstop.

Wait for the lkstop command to complete. Verify that lkstop has finished by checking for the following log entry:

NOTIFY:shutdown:::010055:LifeKeeper stopped

Typically this takes less than 2 minutes.

4. Bring the witness server up using lkstart.

Wait for the lkstart command to complete. Verify that lkstart has finished by checking for the following log entry:

INFO:event.lcm_avail:::010479:RESOURCE INITIALIZATION FINISHED

Typically this takes less than 2 minutes.

5. Bring the backup (target) server up using lkstart.

Wait for the lkstart command to complete. Verify that lkstart has finished by checking for the following log entry:

INFO:event.lcm_avail:::010479:RESOURCE INITIALIZATION FINISHED

Quorum message:

INFO:event.comm_up:::010490:We do have quorum on comm_up to <node>, putting resources into service if needed

Typically this takes less than 2 minutes.

6. Bring the primary (source) server up using lkstart.

Wait for the lkstart command to complete. Verify that lkstart has finished by checking for the following log entry:

INFO:event.lcm_avail:::010479:RESOURCE INITIALIZATION FINISHED

Quorum message:

INFO:event.comm_up:::010490:We do have quorum on comm_up to <node>, putting resources into service if needed

Typically this takes less than 2 minutes.

Note: By taking the servers down in this order, the resources will stay in service on the primary server (no failover will be prompted). Since the mirrors were not in service on the backup server when it went down, they will not go into service when the server comes back up. They will stay in the out-of-service (OSU) state.

Note: When LifeKeeper starts on the primary server, LifeKeeper will bring the mirrors into service because they were ISP (in service) when LifeKeeper went down and they are not in service on the backup server. LifeKeeper will determine that the backup server is ready and will start the resync from the primary server to the backup server.
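
To confirm the resulting states from the command line (a sketch; assuming the standard lcdstatus utility), run the following on each node:

# /opt/LifeKeeper/bin/lcdstatus -q

The resources should report ISP on the primary server and OSU on the backup server while the resync runs.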

- SAP HANA -

Instructions for patching the HANA database

Solution Details

ISSUE:
What are the step by step instructions for patching the HANA database?

SOLUTION:

Node 1 = source
Node 2 = backup

  1. Disable quickCheck on node 1 (set LKCHECKINTERVAL=0, then killall lkcheck; see the sketch after this list)
  2. Stop LifeKeeper using “lkstop -f” on node 2
    1. Verify that HSR is running from node 1 to 2
  3. Stop HANA on node 2, patch
  4. Start HANA on node 2
    1. Verify that HSR is still running
  5. Start LifeKeeper using “lkstart” on node 2
  6. Switch over resources to node 2
  7. Repeat steps above to patch node 1
  8. Switch to node 1 if necessary
  9. Re-enable quickCheck (set LKCHECKINTERVAL to previous value)
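
A sketch of steps 1 and 9 (illustrative only; this assumes the LKCHECKINTERVAL line is present in /etc/default/LifeKeeper and that you record its current value before changing it, e.g. the default of 120):

# grep LKCHECKINTERVAL /etc/default/LifeKeeper
# sed -i 's/^LKCHECKINTERVAL=.*/LKCHECKINTERVAL=0/' /etc/default/LifeKeeper
# killall lkcheck

After patching is complete, restore the value recorded earlier, for example:

# sed -i 's/^LKCHECKINTERVAL=.*/LKCHECKINTERVAL=120/' /etc/default/LifeKeeper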

- Quickcheck for mirror is constantly failing and recovering -

Quickcheck for mirror is constantly failing and recovering

Solution Details

ISSUE:
Looking at the lifekeeper.log, you can see that the mirrors are constantly failing quickCheck, but the recovery always works.

Sample messages:

NOTIFY:lcd.recmain:recover:datarep-data:011115:BEGIN recover of “datarep-data” (class=netraid event=recover)
INFO:dr:recover:datarep-data:104008:/dev/md0: merging bitmap from target “SV-GCS-LIVEB”
Oct 5 05:57:07 SP-GCS-LIVEA recover [11754]: INFO:dr:recover:datarep-data:104009:/dev/md0: bitmap merged, resyncing 2.3
Oct 5 05:57:12 SP-GCS-LIVEA recover [11754]: INFO:dr:recover:datarep-data:104095:Partial resynchronization of component “/dev/nbd1” has begun for mirror “/dev/md0”

This usually coincides with nbd errors in the message logs:

Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.410998] nbd (pid 7278: nbd-client) got signal 9
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.411003] nbd1: shutting down socket
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.411015] nbd1: Receive control failed (result -4)
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.411039] nbd1: queue cleared

Oct 5 05:57:11 SP-GCS-LIVEA nbd-client: Begin Negotiation
Oct 5 05:57:11 SP-GCS-LIVEA nbd-client: size = 268434407424
Oct 5 05:57:11 SP-GCS-LIVEA nbd… (remaining log output truncated)

SOLUTION:

These messages seem to indicate nbd issues that are causing the replication connections to drop. The resync recoveries in the LifeKeeper logs are the reaction to a connection being dropped and the need to re-establish it.

When a mirror is created, its state is monitored via quickCheck and via an mdadm monitoring process on the source. The quickCheck process checks a number of items and will issue a recovery event if it finds the mirror out of sync (based on /proc/mdstat), if it cannot ping the target or the target state is not alive, or if it finds that the nbd-client/nbd-server processes are not running.

A recovery will also be initiated based on events from the md driver via the mdadm monitoring process. These include Fail, FailSpare and DegradedArray events (these are documented in the mdadm man page).

These events indicate issues detected by the md driver that LifeKeeper must react to so that it has the same state as the driver. Additionally, if the comm path that the mirror is using goes down, this will also lead to a recovery event.

There is one tuning option:

Increase the NBD_XMIT_TIMEOUT parameter. The default value for NBD_XMIT_TIMEOUT is 6 seconds.

Keep in mind that you do not want to raise this value by much, so that a true hang condition on packet transmissions is still detected and an abort is done to reset the connection and restart mirroring. Waiting too long could lead to hung writes on the source and eventually a hung system.
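
For example, to raise the timeout modestly, add or edit the following line in /etc/default/LifeKeeper (the value 10 is only an illustration; keep any increase small for the reasons above):

NBD_XMIT_TIMEOUT=10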

- EC2 -

Curl call failed (err=7) (output=curl: (7) Failed to connect) (EC2 Kit)

Solution Details

ISSUE:
The problem occurs with the AWS CLI and the EC2 Recovery Kit.
The logs show that a curl call (AWS CLI to IMDS) failed with err=7, and subsequently the quickCheck script missed a check and triggered a failover. The failover was unnecessary: the curl call timeout was set too short at 5 seconds, and the failure was caused by an intermittent network issue with AWS. If the curl call had been given a little more time, the failover could have been prevented.

Error Message:
ERROR:ec2:quickCheck:ec2-1.1.1.1:123456:curl call failed
(err=7)(output=curl: (7) Failed to connect to 169.254.169.254 port 80:
Connection refused)

169.254.169.254 port 80 is the AWS Instance Metadata Service (IMDS).

SOLUTION:
To prevent these unneeded failovers from occurring, there is a tunable that may be modified in the /etc/default/LifeKeeper file. The parameter to modify is:
LK_CURL_TIMEOUT=X

  • The default value of this tunable is 5.
  • The value is treated as 5 if the tunable line is not present in the /etc/default/LifeKeeper file.


To modify it, either change the numerical value or, if the tunable is not present, add the following line:
LK_CURL_TIMEOUT=10

Increase the timeout value in small increments based on the length of the intermittent network failures that your environment may experience.
Generally a timeout value of 10 seems to prevent the issue for most network hiccups.

Note:
If you set the value to 10 and experience the issue again, increase the value by a few more seconds.
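
To check IMDS responsiveness from the instance itself (an illustrative test; the 10-second limit mirrors the LK_CURL_TIMEOUT example above, and if IMDSv2 is enforced on the instance a session token must be added to the request), you can time a simple metadata request:

# curl -sS -m 10 -o /dev/null -w "%{time_total}\n" http://169.254.169.254/latest/meta-data/

If this regularly takes several seconds, increasing LK_CURL_TIMEOUT alone may not be enough and the network path to the IMDS should be investigated.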
