Quorum/Witness

Quorum/Witness functionality, combined with the existing failover process of the LifeKeeper core, allows system failover to occur with greater confidence in environments where total network failure may be common. As a result, both local-site failovers and failovers to nodes across a WAN can be performed with a greatly reduced risk of split-brain situations.

Quorum

Distributed systems that must tolerate network partitioning use a concept called quorum to reach consensus across the cluster. A node has quorum when it can obtain the consensus of a majority of the cluster's member nodes; such a node is allowed to bring resources in service. A node that cannot obtain the consensus of a majority does not have quorum and is not allowed to bring resources in service, which prevents split-brain from occurring.
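
The majority rule can be sketched in a few lines of Python. This is an illustration only, not LifeKeeper's implementation: the function name has_quorum and the convention that a node counts itself toward the majority are assumptions made for the example.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node can see a strict majority of the cluster.

    Assumption: the node counts itself, so it sees reachable_peers + 1 members.
    """
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be in [0, cluster_size - 1]")
    visible = reachable_peers + 1        # peers we can talk to, plus ourselves
    return visible > cluster_size / 2    # strict majority; a tie is not quorum


# A 3-node cluster split 2/1: the pair keeps quorum, the isolated node loses it.
print(has_quorum(reachable_peers=1, cluster_size=3))  # True
print(has_quorum(reachable_peers=0, cluster_size=3))  # False
# A 4-node cluster split 2/2: neither half has a strict majority.
print(has_quorum(reachable_peers=1, cluster_size=4))  # False
```

The last call shows the tie case: when an even-sized cluster splits in half, neither side holds a majority.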

Witness

In the case of a communication failure, a surviving node can use status reports from other cluster nodes, or from quorum devices, to get a “second opinion” on the status of the failing node. The node or quorum device that provides the “second opinion” is called a witness node (or a witness device), and getting a “second opinion” is called witness checking. When deciding whether to fail over, the witness node (or witness device) allows resources to be brought in service on a backup server only when it verifies that the primary server has failed and is no longer part of the cluster. This prevents failovers caused by simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node. During actual operation, the witness node (or witness device) is consulted when LifeKeeper is started or when a failed communication path is restored. Witness checking can only be performed for nodes having quorum.

Configurable Components

LifeKeeper provides two configurable components: quorum and witness. By default, all quorum and witness behavior is disabled and must be configured by the user in order to be activated.

The behavior of these modes can be customized via the %LKROOT%/etc/default/LifeKeeper configuration file, and the quorum and witness modes can be individually adjusted. LifeKeeper adds the entry “QUORUM_MODE = none” to this file; it can be manually changed later, and the updated settings will be preserved across LifeKeeper repair and upgrade installations.

Available Quorum Modes

Three quorum checking modes are available which can be set via the QUORUM_MODE setting in %LKROOT%/etc/default/LifeKeeper.

QUORUM_MODE | Description
none/off | Quorum checking is disabled. With this configuration, quorum checking is always determined to be successful.
majority | Quorum checks occur via LifeKeeper communication paths. A node has quorum when it is able to communicate with the majority of the nodes in the cluster. This quorum mode is available on clusters with three or more nodes and is recommended for clusters with an odd number of nodes. When majority quorum mode is selected, witness mode must be set to “remote_verify”.
storage | Quorum checks use a “shared storage” file location (an SMB share or an AWS S3 bucket). Storage quorum mode is recommended for clusters with an even number of nodes. See Storage Mode for details. When storage is selected for the quorum mode, the witness mode, which is described later, must also be set to storage.

Available Witness Modes

Three witness modes are available which can be set via the WITNESS_MODE setting in %LKROOT%/etc/default/LifeKeeper.

WITNESS_MODE | Description
none/off (default) | Witness checking is disabled. With this setting, it is always determined that there is no failure.
remote_verify | Consults all the other nodes in the cluster about their view of the status of a node that appears to be failing. If any node determines that there is no failure, witness checking determines that there is no failure. If all the nodes determine that there is a failure, witness checking determines that the node is failing. When remote_verify is chosen for witness mode, quorum mode must be set to “majority”.
storage | Shared storage is used as a witness device to share status information between nodes in the cluster. Each node updates its own information and reads the other nodes' information. If a node detects that the information for another node is not being updated, that node is considered failed. See Storage Mode for details. When storage is selected for the witness mode, storage must also be selected for quorum mode (see above).
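
The two active witness decisions can be sketched as follows. This is a minimal Python illustration under stated assumptions, not LifeKeeper's code: the function names, the shape of the report data, the handling of an empty report set, and the timeout_secs threshold are all hypothetical.

```python
from typing import Dict


def remote_verify(peer_reports: Dict[str, bool]) -> bool:
    """Return True if the suspect node should be considered failed.

    peer_reports maps each consulted node's name to True when that node
    also sees the suspect as failed, and False when it sees no failure.
    """
    if not peer_reports:
        # Assumption: the documentation does not say what happens when no
        # node responds, so this sketch refuses to decide.
        raise ValueError("no witness reports available")
    # A single "no failure" report is enough to rule out a failure.
    return all(peer_reports.values())


def storage_witness_failed(last_update: float, now: float, timeout_secs: float) -> bool:
    """Return True if a node's shared-storage record has gone stale.

    timeout_secs is a hypothetical threshold, not a documented setting.
    """
    return (now - last_update) > timeout_secs


print(remote_verify({"node2": True, "node3": True}))    # True
print(remote_verify({"node2": True, "node3": False}))   # False
print(storage_witness_failed(last_update=100.0, now=130.0, timeout_secs=20.0))  # True
```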

Supported Combinations of Quorum Mode and Witness Mode

LifeKeeper supports the following combinations.

WITNESS_MODE \ QUORUM_MODE | majority | storage | none/off
remote_verify | Supported for 3 nodes | Not supported | Supported for 3 nodes
storage | Not supported | Supported between 2 and 4 nodes | Not supported
none/off | Supported for 3 nodes | Not supported | Supported
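
As a rough aid, the combinations can be encoded as a lookup and checked against a pair of settings before LifeKeeper is restarted. The Python sketch below is hypothetical: the helper names, the "KEY = value" parsing (including skipping lines that start with "#"), and the fallback to none for a missing entry are assumptions, and the node-count notes are copied from the table without further interpretation.

```python
from typing import Dict, Optional, Tuple

# (QUORUM_MODE, WITNESS_MODE) -> node-count note copied from the table.
# Pairs that are absent from this mapping are "Not supported".
SUPPORTED: Dict[Tuple[str, str], str] = {
    ("majority", "remote_verify"): "3 nodes",
    ("majority", "none"): "3 nodes",
    ("storage", "storage"): "between 2 and 4 nodes",
    ("none", "remote_verify"): "3 nodes",
    ("none", "none"): "no node-count note",
}


def parse_settings(text: str) -> Dict[str, str]:
    """Parse 'KEY = value' lines; the exact file syntax is an assumption."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def normalize(mode: str) -> str:
    """Treat 'off' as a synonym for 'none', as the tables list them together."""
    mode = mode.strip().lower()
    return "none" if mode == "off" else mode


def support_note(settings: Dict[str, str]) -> Optional[str]:
    """Return the table's note for the configured pair, or None if unsupported."""
    quorum = normalize(settings.get("QUORUM_MODE", "none"))
    witness = normalize(settings.get("WITNESS_MODE", "none"))
    return SUPPORTED.get((quorum, witness))


sample = "QUORUM_MODE = storage\nWITNESS_MODE = storage\n"
print(support_note(parse_settings(sample)))  # between 2 and 4 nodes
print(support_note({"QUORUM_MODE": "majority", "WITNESS_MODE": "storage"}))  # None
```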

“osu” Action When Quorum is Lost

LifeKeeper’s Quorum feature performs the “osu” action on a node when it loses quorum. The QUORUM_LOSS_ACTION setting in %LKROOT%/etc/default/LifeKeeper must be set to osu, which is the only option available in LifeKeeper for Windows.

The osu action will result in the following behavior:

  • Resources that are in service on the node will be taken out of service.
  • LifeKeeper will shut down for a period of time determined by the setting QUORUM_QUARANTINE_SECS, after which LifeKeeper will resume.
