- Storage Quorum -
- SPS for Linux: Heartbeat recommendations for using Storage Quorum
What are SIOS Heartbeat recommendations for using Storage Quorum?
To use storage quorum, SIOS recommends that you increase LCMNUMHBEATS so that more time elapses before a comm path is marked as failed. This raises the timeout period from the default of 15 seconds to 45 seconds.
Edit the /etc/default/LifeKeeper file and change LCMNUMHBEATS to 9 (LCMNUMHBEATS=9).
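The arithmetic behind this change can be sketched as follows. The timeout is the heartbeat interval multiplied by the number of heartbeats. The 5-second interval and the default count of 3 are inferred from the 15-second and 45-second figures above, not read from SIOS documentation. The helper names `read_param` and `lcm_timeout` are hypothetical and not part of LifeKeeper.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of LifeKeeper.

# Print the value of NAME=value from a defaults file, or a fallback if unset.
read_param() {
    file=$1; name=$2; fallback=$3
    value=$(sed -n "s/^${name}=//p" "$file" 2>/dev/null | tail -n 1)
    echo "${value:-$fallback}"
}

# Timeout in seconds = heartbeat interval (s) x number of heartbeats.
lcm_timeout() {
    echo $(( $1 * $2 ))
}

# Report the effective timeout on a node where the defaults file exists.
# Interval of 5 s and default count of 3 are inferred assumptions.
defaults=/etc/default/LifeKeeper
if [ -r "$defaults" ]; then
    beats=$(read_param "$defaults" LCMNUMHBEATS 3)
    echo "comm path timeout: $(lcm_timeout 5 "$beats") seconds"
fi
```

With LCMNUMHBEATS=9 in place, `lcm_timeout 5 9` gives the 45 seconds mentioned above.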
Since you will be making a change to a LifeKeeper core parameter, you will need to recycle LifeKeeper.
To minimize downtime, you can use “lkstop -f”, which leaves the resources running.
While LifeKeeper is stopped, failures of protected resources will not be detected or acted upon.
# lkstop -f
- SPS for Linux: Storage quorum failed to prevent failover when communication between nodes is lost
Incomplete storage quorum configuration caused failures during lost communication processing
All comm paths between cluster nodes must be created and “ALIVE” before running qwk_storage_init on each node in the cluster.
If this is not the case, execute the following commands to reinitialize the storage quorum configuration once all comm paths are ALIVE.
- SPS for Linux: How do you tune parameters for storage quorum when using Amazon S3 storage
How do you tune parameters for storage quorum when using Amazon S3 storage?
Once you have set up storage quorum by following the documentation, you may have questions about how to tune the heartbeat parameters.
Here are several things that you can do to help determine if the default values are sufficient in your environment:
There are two main parameters in /etc/default/LifeKeeper that affect the storage quorum timeout value:
- QWK_STORAGE_HBEATTIME (default is 6) – Specifies the interval in seconds between reading and writing the QWK objects.
- QWK_STORAGE_NUMHBEATS (default is 4) – Specifies the number of consecutive heartbeat checks that, when missed, indicates that the target node has failed. A missed heartbeat occurs when the QWK object has not been updated since the last check.
When using an Amazon S3 bucket to store the QWK objects (i.e., QWK_STORAGE_TYPE=aws_s3), SIOS suggests running the following commands to ensure good connectivity in your environment:
- Execute ping s3.amazonaws.com and make sure the time is under a second. This ensures good connectivity from the EC2 node to the global AWS domain.
- Execute ping <bucketname>.s3.amazonaws.com, which will resolve to the IP address of the hosting S3 service. This should also be less than a second.
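The two ping checks above can be sketched as a small script that extracts the average round-trip time from the ping summary line and compares it with the one-second guideline. The helper names `avg_rtt_ms` and `under_one_second` are hypothetical, and the sketch assumes the usual `min/avg/max` summary line printed by common ping implementations.

```shell
#!/bin/sh
# Sketch only: avg_rtt_ms and under_one_second are hypothetical helpers.

# Read ping output on stdin and print the average round-trip time in ms,
# taken from the summary line ("rtt min/avg/max/mdev = a/b/c/d ms").
avg_rtt_ms() {
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Succeed only if a time in milliseconds was given and is below one second.
under_one_second() {
    awk -v ms="$1" 'BEGIN { exit !(ms != "" && ms + 0 < 1000) }'
}

# Usage (needs network access; replace <bucketname> with your bucket):
#   avg=$(ping -c 5 s3.amazonaws.com | avg_rtt_ms)
#   under_one_second "$avg" && echo "OK: ${avg} ms" || echo "SLOW: ${avg} ms"
#   avg=$(ping -c 5 "<bucketname>.s3.amazonaws.com" | avg_rtt_ms)
```

Running the same check during busy and quiet periods gives a rough basis for comparing the two situations.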
Another thing to consider is the amount of data being transferred by other S3 activity on this node, such as file transfers that may be in progress. You can measure the response time under that load using the same ping commands shown above.
By comparing network responsiveness in both high-traffic and low-traffic situations, you can tune the number of missed heartbeats (QWK_STORAGE_NUMHBEATS) and the heartbeat time (QWK_STORAGE_HBEATTIME) mentioned above.
The parameters above must be specified in /etc/default/LifeKeeper before running qwk_storage_init.
In most cases where your ping to the Amazon S3 service responds in less than a second, the defaults are sufficient. However, if the connection to the Amazon S3 service is slow or you see degradation, we recommend increasing QWK_STORAGE_HBEATTIME (see the parameter above) from 6 to 7. This increases the loss timeout from 24 seconds (6 × 4) to 28 seconds (7 × 4). SIOS does not recommend increasing the timeout much beyond 30 seconds.
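The timeout arithmetic above can be sketched as a quick check to run before editing /etc/default/LifeKeeper. The helper names `qwk_timeout` and `check_qwk_timeout` are hypothetical, and the 30-second limit is treated here as a hard threshold only for illustration. The guidance itself is softer than that.

```shell
#!/bin/sh
# Sketch only: qwk_timeout and check_qwk_timeout are hypothetical helpers.

# Loss timeout in seconds = QWK_STORAGE_HBEATTIME x QWK_STORAGE_NUMHBEATS.
qwk_timeout() {
    echo $(( $1 * $2 ))
}

# Print the timeout and fail for values beyond the 30-second guidance.
check_qwk_timeout() {
    t=$(qwk_timeout "$1" "$2")
    if [ "$t" -gt 30 ]; then
        echo "timeout ${t}s: above the 30s guidance"
        return 1
    fi
    echo "timeout ${t}s: ok"
}

check_qwk_timeout 6 4   # default values
check_qwk_timeout 7 4   # suggested increase for a slow S3 connection
```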
If you change the default settings, be sure to change them on each system in the cluster and reinitialize storage quorum by executing the following commands on each system in the cluster while all comm paths are “ALIVE”:
- Quickcheck for mirror is constantly failing and recovering -
- Quickcheck for mirror is constantly failing and recovering
In lifekeeper.log, you can see that the mirrors constantly fail the quickCheck, but the recovery always succeeds.
NOTIFY:lcd.recmain:recover:datarep-data:011115:BEGIN recover of “datarep-data” (class=netraid event=recover)
INFO:dr:recover:datarep-data:104008:/dev/md0: merging bitmap from target “SV-GCS-LIVEB”
Oct 5 05:57:07 SP-GCS-LIVEA recover : INFO:dr:recover:datarep-data:104009:/dev/md0: bitmap merged, resyncing 2.3
Oct 5 05:57:12 SP-GCS-LIVEA recover : INFO:dr:recover:datarep-data:104095:Partial resynchronization of component “/dev/nbd1” has begun for mirror “/dev/md0”
This usually coincides with nbd errors in the messages log:
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.410998] nbd (pid 7278: nbd-client) got signal 9
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.411003] nbd1: shutting down socket
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.411015] nbd1: Receive control failed (result -4)
Oct 5 05:56:59 SP-GCS-LIVEA kernel: [5499433.411039] nbd1: queue cleared
Oct 5 05:57:11 SP-GCS-LIVEA nbd-client: Begin Negotiation
Oct 5 05:57:11 SP-GCS-LIVEA nbd-client: size = 268434407424
Oct 5 05:57:11 SP-GCS-LIVEA nbd… (truncated, see original email for full text)
These messages seem to indicate nbd issues that are causing the replication connection to drop. The resync recoveries in the LifeKeeper log are the reaction to the connection being dropped and the need to re-establish it.
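To see how often the two kinds of events occur, the log patterns shown above can be counted with grep. This is a minimal sketch: the helper names are hypothetical, and the log paths in the usage comments are assumptions that may differ on your distribution.

```shell
#!/bin/sh
# Sketch only: helper names and log paths are assumptions.

# Count nbd socket shutdowns recorded by the kernel in a messages log.
count_nbd_drops() {
    grep -c 'nbd[0-9]*: shutting down socket' "$1"
}

# Count recovery events started by LifeKeeper in its log.
count_recoveries() {
    grep -c 'BEGIN recover of' "$1"
}

# Usage (paths assumed; adjust for your system):
#   count_nbd_drops /var/log/messages
#   count_recoveries /var/log/lifekeeper.log
```

Similar counts over the same time window would support the reading that the recoveries follow the nbd connection drops.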
When a mirror is created, its state is monitored via quickCheck and via an mdadm process on the source. The quickCheck process checks a number of items and issues a recovery event if it finds the mirror out of sync (from /proc/mdstat info), if it cannot ping the target or the target state is not alive, or if it finds that the nbd-client/nbd-server processes are not running.
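Some of the conditions described above can be checked by hand while troubleshooting. The following is a rough sketch with hypothetical helper names. It approximates two of the checks and is not LifeKeeper's quickCheck implementation.

```shell
#!/bin/sh
# Sketch only: these helpers approximate some of the checks described above;
# they are not LifeKeeper's quickCheck implementation.

# Read /proc/mdstat-style text on stdin; succeed if any array shows a
# missing member, i.e. an underscore in its status field such as [U_].
mdstat_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}

# Succeed if a process with exactly this name is running.
proc_running() {
    pgrep -x "$1" >/dev/null
}

# Usage:
#   mdstat_degraded < /proc/mdstat && echo "mirror out of sync"
#   proc_running nbd-client || echo "nbd-client is not running"
```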
A recovery will also be initiated based on events from the md driver via the mdadm monitoring process. These include Fail, FailSpare and DegradedArray events (these are documented in the mdadm man page).
These events indicate issues that the md driver detected and that LifeKeeper must react to so that its state matches the driver's. Additionally, if the comm path that the mirror is using goes down, this also leads to a recovery event.
There is one tuning option:
Increase the NBD_XMIT_TIMEOUT parameter. The default value for NBD_XMIT_TIMEOUT is 6 seconds.
Keep in mind that you should not raise this value by much, so that a true hang condition on packet transmissions is still detected and the connection is aborted and reset so that mirroring can restart. Waiting too long could lead to hung writes on the source and eventually a hung system.