SPS for Linux: Heartbeat recommendations for using Storage Quorum

Solution Details
To use storage quorum, SIOS recommends increasing LCMNUMHBEATS so that more time passes before a communication path is marked as failed. Setting it to 9 changes the timeout period from the default of 15 seconds to 45 seconds.

Edit the /etc/default/LifeKeeper file and change LCMNUMHBEATS to 9:

LCMNUMHBEATS=9
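
The edit above can also be scripted. The following is a minimal sketch, assuming the file holds plain KEY=value lines like the one shown; set_param is a hypothetical helper written for this example, not a LifeKeeper command. Back up the file before changing it.

```shell
#!/bin/sh
# set_param FILE KEY VALUE  (hypothetical helper, not part of LifeKeeper)
# Replace an existing KEY=... line in FILE, or append one if it is missing.
set_param() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Write to a temporary file first, then copy the result back so the
        # original file keeps its ownership and permissions.
        sed "s/^${key}=.*/${key}=${value}/" "$file" > "${file}.tmp" &&
            cat "${file}.tmp" > "$file" &&
            rm -f "${file}.tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (run as root):
#   cp /etc/default/LifeKeeper /etc/default/LifeKeeper.bak
#   set_param /etc/default/LifeKeeper LCMNUMHBEATS 9
```

The move from 15 to 45 seconds is consistent with a 5-second heartbeat interval (5 × 9 = 45). That interval is inferred from the figures above, so confirm it against your own configuration.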

Because this changes a LifeKeeper core parameter, you must recycle (stop and restart) LifeKeeper for the change to take effect.

To minimize downtime, use "lkstop -f", which stops LifeKeeper but leaves the resources running:

  • lkstop -f
  • lkstart

SPS for Linux: Storage quorum failed to prevent failover when communication between nodes is lost

Solution Details

ISSUE:
An incomplete storage quorum configuration caused failures during lost-communication processing.

SOLUTION:

All comm paths between cluster nodes must be created and “ALIVE” before running qwk_storage_init on each node in the cluster.

If this was not the case, perform the following steps to reinitialize the storage quorum configuration:

  1. /opt/LifeKeeper/bin/qwk_storage_exit
  2. /opt/LifeKeeper/bin/qwk_storage_init

SPS for Linux: How do you tune parameters for storage quorum

Once storage quorum is set up as described in the documentation, you may need to tune the heartbeat parameters.

The checks below help determine whether the default values are sufficient in your environment.

There are two main parameters:

  • QWK_STORAGE_HBEATTIME (default is 4) – Specifies the interval in seconds between reading and writing the QWK objects.
  • QWK_STORAGE_NUMHBEATS (default is 6) – Specifies the number of consecutive missed heartbeat checks that indicates the target node has failed. A missed heartbeat occurs when the QWK object has not been updated since the last check.

We suggest running the following commands to ensure good connectivity in your environment:

  1. Run ping s3.amazonaws.com and make sure the round-trip time is under one second. This confirms good connectivity from the EC2 node to the global AWS domain.

  2. Even though S3 is a global service, S3 buckets reside in a region.
    Run ping <bucketname>.s3.amazonaws.com, which resolves to the IP address of the hosting S3 service. This round-trip time should also be under one second.

  3. Also consider the amount of data this node transfers for its other S3 activities. File transfers may be taking place at the same time. Measure the response time with ping, using the commands above, both while that traffic is running and while it is not.
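
The ping checks above can be scripted as follows. This is a minimal sketch, assuming the Linux iputils version of ping, whose summary line has the form "rtt min/avg/max/mdev = ... ms". The helper names avg_rtt_ms and under_one_second are hypothetical and exist only for this example.

```shell
#!/bin/sh
# avg_rtt_ms: read ping output on stdin and print the average round-trip
# time in milliseconds, taken from the iputils summary line, for example:
#   rtt min/avg/max/mdev = 0.512/0.634/0.801/0.101 ms
# Split on "/", the fifth field of that line is the average.
avg_rtt_ms() {
    awk -F'/' '/^rtt / { print $5 }'
}

# under_one_second AVG_MS: succeed only if a value was found and it is
# below 1000 ms (one second).
under_one_second() {
    awk -v avg="$1" 'BEGIN { exit !(avg != "" && avg + 0 < 1000) }'
}

# Example (replace <bucketname> with your bucket for the second check):
#   avg=$(ping -c 5 s3.amazonaws.com | avg_rtt_ms)
#   if under_one_second "$avg"; then
#       echo "OK: average ${avg} ms"
#   else
#       echo "Slow or no reply"
#   fi
```

Running the same check once while S3 transfers are active and once while the node is idle gives the two figures to compare.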


Based on a comparison of the ping times with and without the added traffic, you can tune the heartbeat time and the number of heartbeats described above.
Set these parameters before running qwk_storage_init.
Defaults are 6 seconds (minimum of 5, maximum of 10) for the heartbeat time and 4 (minimum of 3) missed heartbeats.

In most cases, and if your ping times are under one second, the defaults are sufficient. If S3 is slow or you see degradation, we recommend increasing QWK_STORAGE_HBEATTIME (see the parameters above) from 4 to 5. This increases the loss timeout from 24 seconds (6 × 4) to 30 seconds (6 × 5). SIOS does not recommend increasing the timeout much beyond 30 seconds.

If you change the default settings, make sure you change them on each system in the cluster.
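
As a quick check of the timeout arithmetic above, the loss timeout is the heartbeat interval multiplied by the number of missed heartbeats. The sketch below only reproduces that multiplication. loss_timeout is a hypothetical helper for this example, not a LifeKeeper command.

```shell
#!/bin/sh
# loss_timeout INTERVAL_SECONDS MISSED_HEARTBEATS
# Print the number of seconds that pass before the target node is
# treated as failed.
loss_timeout() {
    echo $(( $1 * $2 ))
}

loss_timeout 4 6   # prints 24
loss_timeout 5 6   # prints 30
```

Trying candidate values this way before editing the configuration makes it easy to keep the result near the 30-second guideline.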
