Quorum/Witness functionality, combined with the existing failover process of the LifeKeeper core, allows system failover to occur with a greater degree of confidence in situations where total network failure could be common. This effectively means that local site failovers and failovers to nodes across a WAN can be done while greatly reducing the risk of split-brain situations.
Distributed systems that must tolerate network partitioning use a concept called quorum to reach consensus across the cluster. A node has quorum when it can obtain the consensus of a majority of the cluster's member nodes; such a node is allowed to bring resources in service. A node that cannot obtain the consensus of a majority of cluster nodes does not have quorum and is not allowed to bring resources in service, which prevents split-brain from occurring.
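The majority rule can be illustrated with a short sketch. It is not LifeKeeper's implementation; the function name `has_quorum` is hypothetical, and it assumes that a node counts itself when tallying the nodes it can reach.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus its reachable peers form a strict majority.

    Assumption: the node counts itself as one of the reachable members.
    """
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be between 0 and cluster_size - 1")
    votes = reachable_peers + 1  # the local node plus the peers it can reach
    return votes * 2 > cluster_size  # strictly more than half of the cluster


# Three-node cluster: a node that still reaches one peer keeps quorum,
# while an isolated node does not.
print(has_quorum(1, 3), has_quorum(0, 3))
# Four-node cluster split two and two: neither half has a strict majority.
print(has_quorum(1, 4))
```

The even split in the last call shows why a plain majority vote suits clusters with an odd number of nodes better than clusters with an even number.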
In the case of a communication failure, a surviving node can use status reporting from other cluster nodes, or from quorum devices, to get a “second opinion” on the status of the failing node. The node or quorum device that reports the “second opinion” is called a witness node (or a witness device), and getting a “second opinion” is called witness checking. When deciding whether to fail over, the witness node (the witness device) allows resources to be brought in service on a backup server only when it verifies that the primary server has failed and is no longer part of the cluster. This prevents failovers caused by simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node. During actual operation, the witness node (the witness device) is consulted when LifeKeeper is started or when a failed communication path is restored. Witness checking can only be performed by nodes that have quorum.
LifeKeeper provides two configurable components: quorum and witness. By default, all quorum and witness behavior is disabled and must be configured by the user in order to be activated.
The behavior of these modes can be customized via the %LKROOT%/etc/default/LifeKeeper configuration file, and the quorum and witness modes can be individually adjusted. LifeKeeper adds the entry “QUORUM_MODE = none” to this file; it can be manually changed later, and the updated settings will be preserved across LifeKeeper repair and upgrade installations.
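As a rough illustration of inspecting these settings, the sketch below reads mode entries from text shaped like the entry quoted above. The exact syntax of the file is an assumption here (one `NAME = value` pair per line, with `#` starting a comment), and `parse_settings` is a hypothetical helper, not a LifeKeeper tool.

```python
def parse_settings(text: str) -> dict:
    """Collect NAME = value pairs from configuration text.

    Assumptions: one setting per line, '#' starts a comment line,
    and values may optionally be wrapped in double quotes.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without a setting
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Hypothetical excerpt, not copied from a real installation.
sample = """
# quorum and witness settings
QUORUM_MODE = none
WITNESS_MODE = none
"""
modes = parse_settings(sample)
print(modes["QUORUM_MODE"], modes["WITNESS_MODE"])
```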
Available Quorum Modes
Three quorum checking modes are available which can be set via the QUORUM_MODE setting in %LKROOT%/etc/default/LifeKeeper.
|none/off||Quorum checking is disabled. With this configuration, quorum checking is always determined to be successful.|
|majority||With majority as the quorum mode setting, quorum checks occur via LifeKeeper communication paths. A node has quorum when it is able to communicate with the majority of the nodes in the cluster. This quorum mode is available on clusters with three or more nodes. Majority quorum mode is recommended for clusters with an odd number of nodes. When majority quorum mode is selected, witness mode must be set to “remote_verify”.|
|storage||Storage quorum mode is recommended for clusters with an even number of nodes. Quorum checks use a “shared storage” (SMB share or AWS S3 bucket) file location. See Storage Mode for details. When storage is selected for the quorum mode, the witness mode, which is described later, must also be set to storage.|
Available Witness Modes
Three witness modes are available which can be set via the WITNESS_MODE setting in %LKROOT%/etc/default/LifeKeeper.
|none/off (default)||In this mode, witness checking is disabled. With this setting, it is always determined that there is no failure.|
|remote_verify||Consults all the other nodes in the cluster about their view of the status of a node that appears to be failing. If any node determines that there is no failure, witness checking determines that there is no failure. If all the nodes determine that there is a failure, witness checking determines that the node is failing. When remote_verify is chosen for witness mode, quorum mode must be set to “majority”.|
|storage||A witness mode where shared storage is used as a witness device. The shared storage device is used to “share” status information between nodes in the cluster. Each node updates its own information and reads the other nodes' information. If a node detects that the information for another node is not being updated, that node is considered failed. See Storage Mode for details. When storage is selected for the witness mode, storage must also be selected for quorum mode. See above.|
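The two witness decisions described in the table can be sketched as follows. This is only an illustration of the stated rules, not LifeKeeper code: both function names are hypothetical, the handling of an empty set of reports is an assumption, and so is the `max_age_secs` staleness threshold, which the text does not specify.

```python
from typing import Iterable


def remote_verify_confirms_failure(peer_sees_failure: Iterable[bool]) -> bool:
    """remote_verify rule: the node is failing only if every consulted node agrees.

    A single peer reporting "no failure" is enough to reject the failure.
    """
    reports = list(peer_sees_failure)
    if not reports:
        # Assumption: with no second opinion available, do not confirm a failure.
        return False
    return all(reports)


def storage_peer_failed(last_update: float, now: float, max_age_secs: float) -> bool:
    """storage rule: a peer whose status is no longer being updated is considered failed.

    max_age_secs is a hypothetical threshold used only for this sketch.
    """
    return (now - last_update) > max_age_secs


print(remote_verify_confirms_failure([True, True]))   # all peers agree
print(remote_verify_confirms_failure([True, False]))  # one peer still sees the node
print(storage_peer_failed(last_update=100.0, now=130.0, max_age_secs=20.0))
```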
Supported Combinations of Quorum Mode and Witness Mode
LifeKeeper supports the following combinations.
|WITNESS_MODE||remote_verify|| Supported for
|Not supported|| Supported for
|storage||Not Supported|| Supported between 2 and 4 nodes
|none/off|| Supported for
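A configuration check along these lines could catch some unsupported settings before they are applied. The sketch below is deliberately partial and uses a hypothetical `check_modes` helper: it encodes only three constraints stated on this page (storage must be paired with storage, majority quorum needs three or more nodes, and the storage combination is supported between 2 and 4 nodes) and does not attempt to cover every cell of the table.

```python
def check_modes(quorum_mode: str, witness_mode: str, node_count: int) -> list:
    """Return a list of problems found; an empty list means none of the
    constraints checked here were violated. Other combinations are not judged.
    """
    problems = []
    # Storage quorum and storage witness must be selected together.
    if (quorum_mode == "storage") != (witness_mode == "storage"):
        problems.append("storage must be selected for both quorum and witness mode")
    # Majority quorum mode is available on clusters with three or more nodes.
    if quorum_mode == "majority" and node_count < 3:
        problems.append("majority quorum mode needs three or more nodes")
    # The storage/storage combination is supported between 2 and 4 nodes.
    if quorum_mode == "storage" and witness_mode == "storage" and not 2 <= node_count <= 4:
        problems.append("storage quorum and witness modes support between 2 and 4 nodes")
    return problems


print(check_modes("storage", "storage", 2))
print(check_modes("storage", "none", 2))
print(check_modes("majority", "remote_verify", 2))
```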
“osu” Action When Quorum is Lost
LifeKeeper’s Quorum feature will perform the “osu” action on a node when it loses quorum. The QUORUM_LOSS_ACTION setting in %LKROOT%/etc/default/LifeKeeper must be set to osu – it is the only option available in LifeKeeper for Windows.
The osu action will result in the following behavior:
- Resources that are in service on the node will be taken out of service.
- LifeKeeper will shut down for a period of time determined by the setting QUORUM_QUARANTINE_SECS, after which LifeKeeper will resume.