Quorum checking is performed over the SPS for Linux communication paths. A node has quorum when it can communicate with a majority of the nodes in the cluster. This quorum mode is available on clusters with three or more nodes. A two-node configuration requires an additional node dedicated to witness checking.
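
The majority rule can be illustrated with a short Python sketch. This is not SPS for Linux code: the function name `has_quorum` and its parameters are hypothetical, and the sketch assumes that a node counts itself among the nodes it can see.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    Illustrative only; assumes the node counts itself as visible.
    """
    visible = reachable_peers + 1  # the node itself plus reachable peers
    return 2 * visible > cluster_size


# Three-node cluster: seeing one peer is enough (2 of 3).
three_node_one_peer = has_quorum(reachable_peers=1, cluster_size=3)

# Two-node cluster: a node that loses its only peer sees 1 of 2, which is
# not a majority. This is why a two-node setup needs an added witness node.
two_node_isolated = has_quorum(reachable_peers=0, cluster_size=2)
```

Under this assumption, a node in a three-node cluster keeps quorum as long as it can reach at least one other node.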

Majority Mode Configuration

Set QUORUM_MODE to majority in /etc/default/LifeKeeper. No other setting is required for this mode.

Available Witness Mode Settings for Majority Mode

The following witness modes are available for majority mode. For details on each mode, please refer to “Available Witness Mode”.

  • remote_verify
  • none/off

Expected Behaviors for Majority Mode (Assuming Default Modes)

The scenarios listed below show the SPS for Linux behavior of a three-node cluster with Node A (resources are in service), Node B (resources are on standby), and Node W (a witness-only node without protected resources).

The following three events may change the resource status on a node failure.

  • COMM_DOWN event
    An event called when all the communication paths between nodes are disconnected.
  • COMM_UP event
    An event called when communication paths are recovered from a COMM_DOWN state.
  • LCM_AVAIL event
    An event called after LCM initialization is completed. It is called only once, when LifeKeeper starts. Once this state has been reached, heartbeat transmission to the other nodes in the cluster begins over the established communication paths, and the node is ready to receive heartbeat requests from the other nodes in the cluster. LCM_AVAIL is always processed before a COMM_UP event.

Scenario 1

A communication path fails between Node A and Node B

In this case, the following will happen:

  1. Both Node A and Node B will begin processing COMM_DOWN events, though not necessarily at exactly the same time.

  2. Both nodes will perform the quorum check and determine that they still have quorum. Node A and Node B can each still see Node W, so each one sees two of the three known nodes (itself and Node W) and concludes that it is in the majority.

  3. Each node will ask the other nodes it can still reach about the true status of the node it has lost communication with (witness checking). In this scenario, Node A will consult Node W about Node B’s status, and Node B will consult Node W about Node A’s status.

  4. Node A and Node B will both determine that the other is still alive by having consulted Node W and no failover processing will occur. Resources will be left in service on Node A.
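
The witness check in steps 3 and 4 can be sketched as follows. This is a simplified model, not the product's implementation: the `links` dictionary, the function name `peer_seems_alive`, and the rule "the lost node is treated as alive if any reachable witness can still see it" are assumptions made for illustration.

```python
from typing import Dict, Set


def peer_seems_alive(suspect: str, asker: str, links: Dict[str, Set[str]]) -> bool:
    """Ask every node the asker can still reach whether it can see the suspect.

    Assumed rule: one positive answer is enough to treat the suspect as alive.
    """
    witnesses = links.get(asker, set()) - {suspect}
    return any(suspect in links.get(w, set()) for w in witnesses)


# Scenario 1: only the path between Node A and Node B is down.
# Each entry lists the nodes that the key node can still communicate with.
scenario_1 = {"A": {"W"}, "B": {"W"}, "W": {"A", "B"}}

a_thinks_b_alive = peer_seems_alive("B", "A", scenario_1)
b_thinks_a_alive = peer_seems_alive("A", "B", scenario_1)
```

Both calls report the other node as alive because Node W still has working paths to Node A and Node B, so neither node starts failover processing.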

Scenario 2

A communication path fails between Node A and Node W

Since all nodes can and will act as witness nodes when the quorum/witness package is installed, this scenario is the same as the previous. In this case, Node A and Node W will determine that the other is still alive by consulting with Node B.

Scenario 3

Node A fails and stops

In this case, Node B will do the following:

  1. Begin processing the COMM_DOWN event from Node A.

  2. Determine that it can still communicate with Node W and thus has quorum.

  3. Verify via Node W that Node A really appears to be lost and begin the usual failover activity.

  4. Node B will continue processing the event and bring the protected resources in service.

With resources being in-service on Node B, Node A is powered on and establishes communications with the other nodes

In this case, Node A will process an LCM_AVAIL event. Node A will determine that it has quorum and will not bring resources in service because they are currently in service on Node B. Next, a COMM_UP event will be processed between Node A and Node B and also between Node A and Node W (so COMM_UP is processed twice on Node A). Each node will determine that it has quorum during the COMM_UP events and will not bring resources in service because they are currently in service on Node B.

With resources being in-service on Node B, Node A is powered on and cannot establish communications to the other nodes

In this case, Node A will process an LCM_AVAIL event and Node B and Node W will do nothing since they can’t communicate with Node A. Node A will determine that it does not have quorum since it can only communicate with one of the three nodes (Node A itself). Because it does not have quorum, Node A will not bring resources in service.
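
The two restart cases above can be summarized in a small sketch. It encodes only the two reasons given in the text for not bringing resources in service; the function name `startup_block_reason` and its parameters are hypothetical, and the sketch says nothing about what the product does when neither reason applies.

```python
from typing import Optional


def startup_block_reason(
    reachable_peers: int, cluster_size: int, in_service_elsewhere: bool
) -> Optional[str]:
    """Return why a restarted node leaves resources out of service, or None.

    Illustrative only; assumes the node counts itself when checking quorum.
    """
    if 2 * (reachable_peers + 1) <= cluster_size:
        return "no quorum"
    if in_service_elsewhere:
        return "in service on another node"
    return None


# Node A restarts and reaches Node B and Node W; resources are on Node B.
connected_case = startup_block_reason(2, 3, in_service_elsewhere=True)

# Node A restarts but reaches no one: it sees only itself (1 of 3).
isolated_case = startup_block_reason(0, 3, in_service_elsewhere=False)
```

In the isolated case the node cannot learn where the resources are running, so the quorum check is what keeps it from bringing them in service.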

Scenario 4

A failure occurs with the network for Node A (Node A is running without communications to other nodes)

In this case, Node A will do the following:

  1. Begin processing a COMM_DOWN event from Node B (processing of a COMM_DOWN event from Node W is started almost simultaneously).

  2. Determine that it cannot communicate with Node B or Node W and thus does not have quorum.

  3. Immediately force-quit (“fastkill”, default behavior of QUORUM_LOSS_ACTION).

Node B will do the following:

  1. Begin processing a COMM_DOWN event from Node A.

  2. Determine that it can still communicate with Node W and thus has quorum.

  3. Verify via Node W that Node A really appears to be lost (witness checking) and begin the usual failover activity.

  4. Node B will now have the protected resources in service.
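
The COMM_DOWN handling described for Node A and Node B follows the same order: quorum check first, then witness check. The sketch below models that order under stated assumptions. The `Action` enum, the function name `on_comm_down`, and the boolean `witness_sees_lost_node` are hypothetical, and the no-quorum branch assumes the default “fastkill” QUORUM_LOSS_ACTION mentioned above.

```python
from enum import Enum


class Action(Enum):
    FASTKILL = "fastkill"        # quorum lost: force-quit immediately
    FAILOVER = "failover"        # quorum held and the lost node is confirmed lost
    NO_FAILOVER = "no_failover"  # quorum held but a witness still sees the node


def on_comm_down(
    reachable_peers: int, cluster_size: int, witness_sees_lost_node: bool
) -> Action:
    """Decide what a node does after a COMM_DOWN event (illustrative model)."""
    # Step 1: quorum check; the node counts itself.
    if 2 * (reachable_peers + 1) <= cluster_size:
        return Action.FASTKILL
    # Step 2: witness check through the nodes that are still reachable.
    if witness_sees_lost_node:
        return Action.NO_FAILOVER
    return Action.FAILOVER


# Scenario 4, Node A: cut off from Node B and Node W.
node_a_action = on_comm_down(0, 3, witness_sees_lost_node=False)

# Scenario 4, Node B: still reaches Node W, which cannot see Node A either.
node_b_action = on_comm_down(1, 3, witness_sees_lost_node=False)
```

With these inputs the model returns `Action.FASTKILL` for Node A and `Action.FAILOVER` for Node B, matching the two lists above. Passing `witness_sees_lost_node=True` with one reachable peer reproduces the Scenario 1 outcome, where no failover occurs.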

With resources being in-service on Node B, communication resumes for Node A

In this case, Node B will process a COMM_UP event, determine that it has quorum (all three of the nodes are visible) and that it has the resources in service. Node A will process a COMM_UP event, determine that it also has quorum and that the resources are in service on Node B. Node A will not bring resources in service at this time.
