Quorum checking is performed via LifeKeeper communication paths. A node has quorum when it is able to communicate with the majority of the nodes in the cluster. This quorum mode is recommended for clusters with an odd number of nodes.
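
As a minimal sketch of the majority rule described above, the following Python function treats "majority" as strictly more than half of the cluster, with a node counting itself. The function name and parameters are hypothetical and are not part of LifeKeeper; the strict-majority interpretation is an assumption.

```python
def has_majority_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True when this node plus the peers it can reach form a strict majority.

    Assumption: a node counts itself, and "majority" means more than half.
    """
    visible = reachable_peers + 1  # the node itself plus reachable peers
    return visible > cluster_size / 2


# Three-node cluster: seeing one peer is enough, seeing none is not.
print(has_majority_quorum(1, 3))
print(has_majority_quorum(0, 3))
# Four-node cluster split two and two: neither half has a majority.
print(has_majority_quorum(1, 4))
```

The four-node case suggests why an odd number of nodes is recommended: an even split leaves no side with a majority.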

Majority Mode Configuration

Set QUORUM_MODE to majority in %LKROOT%/etc/default/LifeKeeper. The WITNESS_MODE should be set to “remote_verify” when you set QUORUM_MODE to majority.

Witness Mode Settings for Majority Mode

When using the majority QUORUM_MODE, you must set WITNESS_MODE to remote_verify.
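
The pairing of the two settings can be checked with a small script. The sketch below assumes the defaults file holds plain KEY=value lines; that format, the helper names, and the sample text are assumptions for illustration, not a LifeKeeper tool.

```python
def parse_defaults(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and '#' comments (assumed format)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def check_majority_settings(settings: dict) -> list:
    """Return a list of problems with the quorum and witness settings."""
    problems = []
    if (settings.get("QUORUM_MODE") == "majority"
            and settings.get("WITNESS_MODE") != "remote_verify"):
        problems.append(
            "WITNESS_MODE must be remote_verify when QUORUM_MODE is majority")
    return problems


sample = "QUORUM_MODE=majority\nWITNESS_MODE=remote_verify\n"
print(check_majority_settings(parse_defaults(sample)))
```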

Expected Behaviors for Majority Mode (Assuming Default Modes)

The scenarios illustrated below explain the behavior of a three-node cluster with Node A (resources are in-service), Node B (resources are on stand-by), and Node C (resources are on stand-by).

The following three events can change the resource status when a node fails.

  • COMM_DOWN event
    An event called when all the communication paths between nodes are disconnected.
  • COMM_UP event
    An event called when communication paths are recovered from a COMM_DOWN state.
  • LCM_AVAIL event
    An event called once, when LifeKeeper starts, after LCM initialization is completed. Once this state has been reached, heartbeat transmission to other nodes in the cluster begins over the established communication paths, and the node is also ready to receive heartbeat requests from other nodes in the cluster. LCM_AVAIL is always processed before a COMM_UP event.

Scenario 1

A communication path fails between Node A and B

In this case, the following will happen:

  1. Both Node A and Node B will begin processing COMM_DOWN events, though not necessarily at exactly the same time.
  2. Both nodes will perform the quorum check and determine that they still have quorum (Node A and Node B can each still see Node C, so each communicates with two of the three known nodes, counting itself, and concludes that it is in the majority).
  3. Each will consult the other nodes with whom they can still communicate about the true status of the server with whom they’ve lost communications (witness checking). In this scenario, this means that Node A will consult Node C about Node B’s status and Node B will also consult Node C about Node A’s status.
  4. Node A and Node B will both determine that the other is still alive by having consulted Node C and no failover processing will occur. Resources will be left in service on Node A.
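
The quorum check and witness check in the steps above can be modeled roughly as follows. This is a simplified sketch, not LifeKeeper's implementation: the cluster is a set of working links, and all function names are hypothetical.

```python
NODES = {"A", "B", "C"}


def link(a, b):
    """An undirected communication path between two nodes."""
    return frozenset((a, b))


def peers(node, links):
    """Nodes that `node` can still reach directly."""
    return {n for n in NODES if n != node and link(node, n) in links}


def has_quorum(node, links):
    """Strict majority, counting the node itself (assumed rule)."""
    return len(peers(node, links)) + 1 > len(NODES) / 2


def witness_sees(node, lost, links):
    """Ask each reachable peer whether it can still reach the lost node."""
    return any(link(w, lost) in links for w in peers(node, links))


def should_fail_over(node, lost, links):
    """Fail over only with quorum and with no witness reporting the node alive."""
    return has_quorum(node, links) and not witness_sees(node, lost, links)


# Scenario 1: only the A-B path is down.
scenario_1 = {link("A", "C"), link("B", "C")}
print(should_fail_over("A", "B", scenario_1))
print(should_fail_over("B", "A", scenario_1))
```

In this model both calls report that no failover should occur, because Node C can still reach both nodes and acts as the witness for each.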

Scenario 2

Node A fails and stops

In this case, Node B will do the following:

  1. Begin processing the COMM_DOWN event from Node A.
  2. Determine that it can still communicate with Node C and thus has quorum.
  3. Verify via Node C that Node A really appears to be lost and begin the usual failover activity.
  4. Whichever node has the lowest equivalency value will continue processing the event and bring the protected resources in service.

Node C will do the following:

  1. Begin processing the COMM_DOWN event from Node A.
  2. Determine that it can still communicate with Node B and thus has quorum.
  3. Verify via Node B that Node A really appears to be lost and begin the usual failover activity.
  4. Whichever node has the lowest equivalency value will continue processing the event and bring the protected resources in service.
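
Step 4 in both lists above amounts to picking the surviving node with the lowest equivalency value. A minimal sketch follows, assuming the values are integers. The numbers and the alphabetical tie-break are hypothetical and only keep the example deterministic.

```python
def select_recovery_node(equivalency_values: dict) -> str:
    """Return the node with the lowest equivalency value.

    Ties are broken by node name; that tie-break is an assumption of this
    sketch, not documented LifeKeeper behavior.
    """
    return min(sorted(equivalency_values), key=equivalency_values.get)


# Hypothetical values for the two surviving nodes in Scenario 2.
survivors = {"B": 10, "C": 20}
print(select_recovery_node(survivors))
```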

With resources being in-service on Node B, Node A is powered on and establishes communications with the other nodes

In this case, Node A will process an LCM_AVAIL event. Node A will determine that it has quorum and not bring resources in service because they are currently in service on Node B. Next, a COMM_UP event will be processed between Node A and Node B and also between Node A and Node C (processed twice at Node A). Each node will determine that it has quorum during the COMM_UP events and will not bring resources in service because they are currently in service on Node B.

With resources being in-service on Node B, Node A is powered on and cannot establish communications to the other nodes

In this case, Node A will process an LCM_AVAIL event and Node B and Node C will do nothing since they can’t communicate with Node A. Node A will determine that it does not have quorum since it can only communicate with one of the three nodes (Node A itself). Because it does not have quorum, Node A will not bring resources in service.

Scenario 3

A failure occurs with the network for Node A (Node A is running without communications to other nodes)

In this case, Node A will do the following:

  1. Begin processing a COMM_DOWN event from Node B (processing of a COMM_DOWN event from Node C is started almost simultaneously).
  2. Determine that it cannot communicate with Node B or Node C and thus does not have quorum.
  3. LifeKeeper takes action based on the QUORUM_LOSS_ACTION (osu) and takes all resources out of service.
  4. LifeKeeper pauses communication with all nodes for a period of time specified by QUORUM_QUARANTINE_SECS.
  5. After the quarantine period expires, LifeKeeper resumes communication.

Node B will do the following:

  1. Begin processing a COMM_DOWN event from Node A.
  2. Determine that it can still communicate with Node C and thus has quorum.
  3. Verify via Node C that Node A really appears to be lost (witness checking) and begin the usual failover activity.
  4. Whichever node has the lowest equivalency value will continue processing the event and bring the protected resources in service.

Node C will do the following:

  1. Begin processing a COMM_DOWN event from Node A.
  2. Determine that it can still communicate with Node B and thus has quorum.
  3. Verify via Node B that Node A really appears to be lost (witness checking) and begin the usual failover activity.
  4. Whichever node has the lowest equivalency value will continue processing the event and bring the protected resources in service.

With resources being in-service at Node B, communication resumes for Node A

In this case, Node B will process a COMM_UP event, determine that it has quorum (all three of the nodes are visible) and that it has the resources in service. Node A will process a COMM_UP event, determine that it also has quorum and that the resources are in service on Node B. Node A will not bring resources in service at this time.

Scenario 4

All three nodes lose communications with each other

In this case, Node A will do the following:

  1. Begin processing a COMM_DOWN event from Node B (processing of a COMM_DOWN event from Node C is started almost simultaneously).
  2. Determine that it cannot communicate with Node B or Node C and thus does not have quorum.
  3. LifeKeeper takes action based on the QUORUM_LOSS_ACTION.

Node B will do the following:

  1. Begin processing a COMM_DOWN event from Node A (processing of a COMM_DOWN event from Node C is started almost simultaneously).
  2. Determine that it cannot communicate with Node A or Node C and thus does not have quorum.
  3. Since it does not have the resources in service, no QUORUM_LOSS_ACTION will occur.

Node C will do the following:

  1. Begin processing a COMM_DOWN event from Node A (processing of a COMM_DOWN event from Node B is started almost simultaneously).
  2. Determine that it cannot communicate with Node A or Node B and thus does not have quorum.
  3. Since it does not have the resources in service, no QUORUM_LOSS_ACTION will occur.
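
The per-node decisions in this scenario can be summarized in a short sketch. It is a simplification under stated assumptions: quorum is a strict majority counting the node itself, and the function name and returned labels are hypothetical.

```python
def react_to_comm_down(reachable_peers: int, cluster_size: int,
                       resources_in_service: bool) -> str:
    """Decide what a node does after processing COMM_DOWN events."""
    if reachable_peers + 1 > cluster_size / 2:
        # Quorum is held, so the node goes on to witness checking.
        return "witness_check"
    if resources_in_service:
        # Quorum is lost while holding resources: apply QUORUM_LOSS_ACTION.
        return "quorum_loss_action"
    # Quorum is lost but nothing is in service, so there is nothing to do.
    return "no_action"


# Scenario 4: every node is isolated; only Node A holds resources.
for name, in_service in (("A", True), ("B", False), ("C", False)):
    print(name, react_to_comm_down(0, 3, in_service))
```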

If all the communication paths are recovered, Node A will bring the resources back in service, provided that the following requirements are met:

  • As initialization behavior, AUTORES_ISP is set for the resources on Node A.
  • The Resource Priority value is the highest on Node A.
