Quorum/Witness

Quorum/Witness Server Support Package for LifeKeeper

Feature Summary

The Quorum/Witness Server Support Package for LifeKeeper (steeleye-lkQWK), combined with the existing failover process of the LifeKeeper core, allows system failover to occur with a greater degree of confidence in situations where total network failure could be common. This means that local site failovers and failovers to nodes across a WAN can be performed while greatly reducing the risk of “split-brain” situations. The package provides a majority-based quorum check to handle clusters with more than two nodes. This additional quorum logic is enabled only if the witness support package is installed.

Using one or more witness servers allows a node, prior to bringing resources in service after a communication failure, to get a “second opinion” on the status of the failing node. The witness server is an additional server that acts as an intermediary to determine which servers are part of the cluster. When determining whether to fail over, the witness server allows resources to be brought in service on a backup server only in cases where it verifies that the primary server has failed and is no longer part of the cluster. This prevents failovers caused by simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node. During actual operation, for the initial implementation, all other nodes in the cluster will be consulted, including the witness node(s).

Package Requirements

In addition to the requirements already discussed, this package requires that a standard, licensed LifeKeeper core be installed on the server(s) that will act as the witness server(s). Note: As long as communication paths are configured correctly, multiple clusters can share a single quorum/witness server (for more information, see “Additional Configuration for Shared-Witness Topologies” below).

All nodes that will participate in a quorum/witness mode cluster, including witness-only nodes, should have the Quorum/Witness Server Support Package for LifeKeeper installed. If the tcp_remote quorum mode is used, the hosts configured in QUORUM_HOSTS within /etc/default/LifeKeeper are not required to have the Quorum/Witness Server Support Package for LifeKeeper installed.

Package Installation and Configuration

The Quorum/Witness Server Support Package for LifeKeeper will need to be installed on each server in the quorum/witness mode cluster, including any witness-only servers. The only configuration requirement for the witness node is to create appropriate comm paths.
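A minimal installation sketch is shown below, assuming the package is delivered as an RPM named after steeleye-lkQWK; the exact file name and delivery method depend on the LifeKeeper media and version in use.

# Run on every node in the quorum/witness mode cluster, including
# witness-only servers. The package file name is an assumption made
# for illustration only.
rpm -ivh steeleye-lkQWK-*.rpm

# Confirm that the package is installed:
rpm -q steeleye-lkQWK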

The general process for adding one or more witness servers will involve the following steps:

  1. Set up the witness server, making sure network communications are available to the other nodes in the cluster.

  2. Install the LifeKeeper core on the witness node and properly license/activate it.

  3. Install the Quorum/Witness Server Support Package for LifeKeeper on all nodes, including the witness node.

  4. Create comm paths between all nodes, including the witness node.

  5. Set the desired quorum and witness modes in /etc/default/LifeKeeper on each node.

Once this is complete, the cluster should behave in quorum/witness mode, and failovers will consult other nodes including the witness node prior to a failover being allowed. The default configuration after installing the package will enable majority-based quorum and witness checks.

Note: Due to majority-based quorum, it is recommended that the clusters always be configured with an odd number of nodes.

See the Configurable Components section below for additional configuration options.

Note: Any node with the witness package installed can participate in witness functionality. Witness-only nodes simply have a compatible version of the LifeKeeper core and the witness package installed, and do not host any protected resources.

Configurable Components

The quorum/witness package contains two configurable modes: quorum and witness. By default, installing the quorum/witness package enables both quorum and witness modes with settings suitable for most environments that need witness features.

The behavior of these modes can be customized via the /etc/default/LifeKeeper configuration file, and the quorum and witness modes can be adjusted individually. The package installs default settings into the configuration file, with majority as the default quorum mode and remote_verify as the default witness mode. An example is shown below:

QUORUM_MODE=majority
WITNESS_MODE=remote_verify

Note: Although each cluster node can have an entirely different witness/quorum configuration, it is recommended that all nodes have the same configuration to avoid unexpected, and difficult to diagnose, situations.
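One simple way to confirm that the nodes are configured identically is to compare the relevant settings across the cluster. The sketch below assumes passwordless SSH and hypothetical node names nodeA, nodeB and nodeW:

# Compare the quorum/witness settings on each node; the output should
# match on every node in the cluster.
for h in nodeA nodeB nodeW; do
    echo "== $h =="
    ssh "$h" 'grep -E "^(QUORUM_MODE|WITNESS_MODE|QUORUM_LOSS_ACTION)=" /etc/default/LifeKeeper'
done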

Available Quorum Modes

Three quorum checking modes are available which can be set via the QUORUM_MODE setting in /etc/default/LifeKeeper: majority (the default), tcp_remote and none/off. Each of these is described below:

majority

The majority setting, which is the default, will determine quorum based on the number of visible/alive LifeKeeper nodes at the time of the check. This check is a simple majority: if more than half of the total nodes are visible, then the node has quorum.
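The arithmetic is strictly “more than half”: for example, a three-node cluster requires two visible nodes for quorum, a four-node cluster requires three and a five-node cluster requires three. The following illustrative shell sketch (not a LifeKeeper command) shows the test:

# Simple-majority test: a node has quorum when the nodes it can see
# (including itself) number more than half of the total cluster nodes.
total_nodes=3
visible_nodes=2
if [ $(( visible_nodes * 2 )) -gt "$total_nodes" ]; then
    echo "quorum: yes"
else
    echo "quorum: no"
fi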

tcp_remote

The tcp_remote quorum mode is similar to majority mode, except that the node determines quorum by checking whether it can reach more than half of the hosts listed in QUORUM_HOSTS via TCP/IP on the specified ports, rather than by counting the visible LifeKeeper nodes. The hosts consulted do not need to be LifeKeeper cluster nodes.

Additional configuration is required for this mode since the TCP timeout allowance (QUORUM_TIMEOUT_SECS) and the hosts to consult (QUORUM_HOSTS) must be added to /etc/default/LifeKeeper. An example configuration for tcp_remote is shown below:

QUORUM_MODE=tcp_remote

# What style of quorum verification do we do in comm_up/down
# and lcm_avail (maybe other) event handlers.
# The possible values are:
# - none/off: Do nothing, skip the check, assume all is well.
# - majority: Verify that this node and the nodes it can reach
# have more than half the cluster nodes.
# - tcp_remote: Verify that this node can reach more than half
# of the QUORUM_HOSTS via tcp/ip.

QUORUM_HOSTS=myhost:80,router1:443,router2:22

# If QUORUM_MODE eq tcp_remote, this should be a comma delimited
# list of host:port values - like myhost:80,router1:443,router2:22.
# This doesn't matter if the QUORUM_MODE is something else.

QUORUM_TIMEOUT_SECS=20

# The time allowed for tcp/ip witness connections to complete.
# Connections that don't complete within this time are treated
# as failed/unavailable.
# This only applies when the QUORUM_MODE is tcp_remote.

WITNESS_MODE=remote_verify

# This can be either off/none or remote_verify. In remote_verify
# mode, core event handlers (comm_down) will doublecheck the
# death of a system by seeing if other visible nodes
# also think it is dead.

QUORUM_LOSS_ACTION=fastboot

# This can be one of osu, fastkill or fastboot.
# fastboot will IMMEDIATELY reboot the system if a loss of quorum
# is detected.
# fastkill will IMMEDIATELY halt/power off the system upon
# loss of quorum.
# osu will just take any in-service resources out of service.
# Note: this action does not sync disks or unmount filesystems.

QUORUM_DEBUG=

# Set to true/on/1 to enable debug messages from the Quorum
# modules.

HIDE_GUI_SYS_LIST=true

Note: Due to the inherent flexibility and complexity of this mode, it should be used with caution by someone experienced with both LifeKeeper and the particular network/cluster configuration involved.
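Conceptually, the tcp_remote check is a timed TCP connection to each host:port pair listed in QUORUM_HOSTS. Reachability can be verified by hand with a standard tool such as nc (not part of LifeKeeper), using the hosts and timeout from the example above:

# Each command succeeds (exit status 0) if a TCP connection can be
# established within the 20-second timeout.
nc -z -w 20 myhost 80
nc -z -w 20 router1 443
nc -z -w 20 router2 22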

none/off

In this mode, all quorum checking is disabled. This causes the quorum checks to operate as if the node always has quorum regardless of the true state of the cluster.

Available Witness Modes

Two witness modes are available which can be set via the WITNESS_MODE setting in /etc/default/LifeKeeper: remote_verify and none/off. Each of these is described below:

remote_verify

In this default mode, witness checks are done to verify the status of a node. This is typically done when a node appears to be failing. It enables a node to consult all the other visible nodes in the cluster about their view of the status of the failing machine to double-check the communications.

none/off

In this mode, witness checking is disabled. In the case of a communication failure, this causes the logic to behave exactly as if there was no witness functionality installed.

Note: Witness checks never need to be performed by servers acting as dedicated quorum/witness nodes that do not host resources; therefore, this setting should be set to none/off on those servers.

Available Actions When Quorum is Lost

The witness package offers three different options for how the system should react if quorum is lost: “fastboot”, “fastkill” and “osu”. These options can be selected via the QUORUM_LOSS_ACTION setting in /etc/default/LifeKeeper. All three options take the system’s resources out of service; however, each behaves differently. The default option, when the quorum package is installed, is fastboot. Each of these options is described below:

fastboot

If the fastboot option is selected, the system will be immediately rebooted when a loss of quorum is detected (from a communication failure). Although this is an aggressive option, it ensures that the system will be disconnected from any external resources right away. In many cases, such as with storage-level replication, this immediate release of resources is desired.

Two important notes on this option are:

  1. The system performs an immediate hard reboot without first performing any shut-down procedure; no tasks are performed (disk syncing, etc.).

  2. The system will come back up performing normal startup routines, including negotiating storage and resource access, etc.

fastkill

The fastkill option is very similar to the fastboot option, but instead of a hard reboot, the system will immediately halt when quorum is lost. As with the fastboot option, no tasks are performed (disk syncing, etc.), and the system will then need to be manually rebooted and will come back up performing normal startup routines, including negotiating storage and resource access, etc.

osu

The osu option is the least aggressive option, leaving the system operational but taking resources out of service on the system where quorum is lost. In some cluster configurations, this is all that is needed, but it may not be strong enough or fast enough in others.
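For example, to select this least aggressive behavior, the default setting in /etc/default/LifeKeeper would be changed to:

QUORUM_LOSS_ACTION=osu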

Additional Configuration for Shared-Witness Topologies

When a quorum witness server will be shared by more than one cluster, it can be configured to simplify individual cluster management. In standard operation, the LifeKeeper GUI will try to connect to all cluster nodes at once when connected to the first node. It connects to all the systems that can be seen by each system in the cluster. Since the shared witness server is connected to all clusters, this will cause the GUI to connect to all systems in all clusters visible to the witness node.

To avoid this situation, the HIDE_GUI_SYS_LIST configuration parameter should be set to “true” on any shared witness server. This effectively hides the servers that are visible to the witness server, resulting in the GUI only connecting to servers in the cluster that are associated with the first server connected to. Note: This should be set only on the witness server.
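For example, the entry in /etc/default/LifeKeeper on the shared witness server would be:

HIDE_GUI_SYS_LIST=true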

Since the GUI connects only to servers in the cluster that are associated with the first server connected to, if that first server is the witness server, and HIDE_GUI_SYS_LIST is set to “true,” the GUI will not automatically connect to the other servers with established communication paths. As this behavior is not typical LifeKeeper GUI behavior, it may lead an installer to incorrectly conclude that there is a network or other configuration problem. To use the LifeKeeper GUI on a witness server with this setting, connect manually to one of the other nodes in the cluster, and the remaining nodes in the cluster will be shown in the GUI correctly.

Note: To prevent witness checks from being performed on all systems in all clusters, WITNESS_MODE should always be set to none/off on shared, dedicated quorum witness nodes.

Adding a Witness Node to a Two-Node Cluster 

The following is an example of a two-node cluster utilizing the Quorum/Witness Server Support Package for LifeKeeper by adding a third “witness” node.

Figure: Simple Two-Node Cluster with Witness Node

Server A and Server B should already be set up with LifeKeeper core with resource hierarchies created on Server A and extended to Server B (Server W will have no resource hierarchies extended to it). Using the following steps, a third node will be added as the witness node.

  1. Set up the witness node, making sure network communications are available to the other two nodes.

  2. Install LifeKeeper core on the witness node and properly license/activate it.

  3. Install the Quorum/Witness Server Support Package on all three nodes.

  4. Create comm paths between all three nodes.

  5. Set desired quorum checking mode in /etc/default/LifeKeeper (majority, tcp_remote, none/off) (select majority for this example). See Available Quorum Modes for an explanation of these modes.

  6. Set desired witness mode in /etc/default/LifeKeeper (remote_verify, none/off). See Available Witness Modes for an explanation of these modes.
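For this example, the relevant entries in /etc/default/LifeKeeper on each node would look like the following sketch, based on the defaults described above:

# Server A and Server B:
QUORUM_MODE=majority
WITNESS_MODE=remote_verify

# Server W (witness-only node). remote_verify is the default; per the
# note in Available Witness Modes, WITNESS_MODE=none could also be used
# here since Server W hosts no protected resources.
QUORUM_MODE=majority
WITNESS_MODE=remote_verify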

Expected Behaviors (Assuming Default Modes)

Scenario 1

Communications fail between Servers A and B

If the communications fail between Server A and Server B, the following will happen:

  1. Both servers will detect the comm path failure and check for quorum. Since each can still communicate with Witness Server W, each can see two of the three nodes and still has quorum.

  2. Server B will perform a witness check, consulting Server W about the status of Server A. Since Server W can still communicate with Server A, it will report that Server A is still alive.

  3. No failover will occur; Server A will keep its resources in service, and Server B will not bring them in service.

Scenario 2

Communications fail between Servers A and W

Since all nodes can and will act as witness nodes when the witness package is installed, this scenario behaves the same as the previous one. In this case, Server A and Witness Server W will each determine that the other is still alive by consulting Server B.

Scenario 3

Communications fail between Server A and all other nodes (A fails)

In this case, Server B will do the following:

  1. Detect the comm path failure with Server A and check for quorum. Since it can still communicate with Server W, it can see two of the three nodes and has quorum.

  2. Perform a witness check, consulting Server W about the status of Server A. Since Server W can no longer communicate with Server A either, the failure is confirmed.

  3. Bring the protected resources in service.

With B now acting as Source, communication resumes for Server A

Based on the previous scenario, Server A now resumes communications. Server B will process a comm_up event, determine that it has quorum (all three of the nodes are visible) and that it has the resources in service. Server A will process a comm_up event, determine that it also has quorum and that the resources are in service elsewhere. Server A will not bring resources in service at this time.

With B now acting as Source, Server A is powered on with communications to the other nodes

In this case, Server B will respond just like in the previous scenario, but Server A will process an lcm_avail event. Server A will determine that it has quorum and respond normally in this case by not bringing resources in service that are currently in service on Server B.

With B now acting as Source, Server A is powered on without communications

In this case, Server A will process an lcm_avail event and Servers B and W will do nothing since they can’t communicate with Server A. Server A will determine that it does not have quorum since it can only communicate with one of the three nodes. In the case of not having quorum, Server A will not bring resources in service.

Scenario 4

Communications fail between Server A and all other nodes (A's network fails but A is still running)

In this case, Server B will behave just as in Scenario 3: it will confirm that it still has quorum, verify with Server W that Server A appears to be dead and bring the protected resources in service.

Also, in this case, Server A will do the following: it will detect that it has lost communications with both Server B and Server W, determine that it does not have quorum (it can see only one of the three nodes) and perform the configured QUORUM_LOSS_ACTION. With the default fastboot setting, Server A will immediately reboot, releasing any resources it had in service.
