Normally, LifeKeeper will automatically switch operations to a backup node when a node failure or a resource failure occurs. However, depending on the environment, requiring manual confirmation by a system administrator may be desirable, instead of an automatic failover recovery initiated by LifeKeeper. In these cases, the Confirm Failover or Block Resource Failover settings are available. By using these functions, automatic failover can be blocked and a time to wait for failover can be set when a resource failure or a node failure occurs.

Set Confirm Failover or Block Resource Failover in your SPS environment after carefully reading the descriptions, examples, and considerations below. These settings are available from the Server Properties dialog of the GUI and via the command line of LifeKeeper.

Set Confirm Failover On

When a failover occurs because a node in the LifeKeeper cluster fails (Note: a node failure is identified by a failure of all LifeKeeper communication paths to that system), the time to wait before LifeKeeper switches resources to a backup node can be set with the Confirm Failover setting (see the discussion on the CONFIRMSOTO variable later in this document). Also, a user can decide whether to automatically switch to the backup node or not after the time to wait expires (see the discussion of the CONFIRMSODEF variable later in this document).

To enable the Confirm Failover setting via the GUI, use the General tab of the Server Properties dialog. An example of the General tab for Server Properties is shown below. The part outlined in red on the screen addresses the Confirm Failover setting.

In this example, the setting is viewed from the host named lktestA. The node names for the HA cluster are displayed vertically. The standby node for lktestA is lktestB.

The screen shows the configuration status for server lktestA, with the checkbox for lktestB set. In this case, the confirm failover flag is created on lktestB. When a failover from lktestA to lktestB is executed, the confirmation process for executing a failover occurs on lktestB. This process includes checking the default action to take based on the CONFIRMSODEF variable setting (see the discussion later in this document) and how long to wait before taking that action based on the CONFIRMSOTO variable setting.

The Confirm Failover flag creation status can be checked via the command line. When the checkbox for lktestB is set on the host named lktestA, the Confirm Failover flag is created on lktestB. (NOTE: In this example, the flag is not created on lktestA, only on lktestB.) An example of the command line output is below.

[root@lktestB~]# /opt/LifeKeeper/bin/flg_list

confirmso!lktestA

The “confirmso!lktestA” output is the result of the flg_list command, and indicates that the Confirm Failover flag is set on node lktestB to confirm lktestA failures.
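As a minimal sketch, the host names covered by Confirm Failover on a node can be extracted from the flg_list output with standard tools. The function name confirmso_hosts is hypothetical (it is not part of LifeKeeper), and the sketch assumes that flg_list prints one flag per line, as in the example above.

```shell
# confirmso_hosts (hypothetical helper, not a LifeKeeper command):
# reads flg_list output on stdin and prints the host part of every
# confirmso!<host> flag, one host per line.
# Assumption: flg_list prints exactly one flag name per line.
confirmso_hosts() {
    sed -n 's/^confirmso!//p'
}
```

On lktestB in this example, piping /opt/LifeKeeper/bin/flg_list into confirmso_hosts would be expected to print lktestA.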

When failover occurs with the confirmso flag, the following messages are recorded in the LifeKeeper log file.

INFO:lcd.recover:::004113:

chk_man_interv: Flag confirmso!hostname is set, issuing confirmso event and waiting for switchover instruction.

NOTIFY:event.confirmso:::010464:

LifeKeeper: FAILOVER RECOVERY OF MACHINE lktestA requires manual confirmation! Execute ‘/opt/LifeKeeper/bin/lk_confirmso -y -s lktestA ‘ to allow this failover, or execute ‘/opt/LifeKeeper/bin/lk_confirmso -n -s lktestA’ to prevent it. If no instruction is provided, LifeKeeper will timeout in 600 seconds and the failover will be allowed to proceed.

Execute one of the following commands to confirm the failover:

To proceed with the failover:

# /opt/LifeKeeper/bin/lk_confirmso -y -s hostname

To block the failover:

# /opt/LifeKeeper/bin/lk_confirmso -n -s hostname

The host name specified when executing the command is the host name listed in the Confirm Failover flag, which in this example is lktestA. Refer to the example commands provided in the log output when executing the command.

In the case where the set time to wait is exceeded, the default failover action is executed (allow failover or block failover). The default failover action is determined by the CONFIRMSODEF variable (see discussion later in this document).

The following message is output to the LifeKeeper log when the timeout expires.

lcdrecover[xxxx]: INFO:lcd.recover:::004408:chk_man_interv: Timed out waiting for instruction, using default CONFIRMSODEF value 0.

The LifeKeeper operation when the time to wait is exceeded is controlled by the setting of the variable “CONFIRMSODEF” which is set in the /etc/default/LifeKeeper file with a value of “1” or “0”. A value of “0” is set by default, and this indicates that the failover will proceed when the time to wait is exceeded. If the value is set to a “1”, the failover is blocked when the time to wait is exceeded.

The time to wait for confirmation of a failover can be changed by adjusting the value of the CONFIRMSOTO variable in the /etc/default/LifeKeeper file. The value of the variable specifies the number of seconds to wait for a manual confirmation from the user before proceeding with or blocking the failover, as determined by the value of the “CONFIRMSODEF” variable (see above).

Restarting LifeKeeper or rebooting the OS is not required for changes to these variables to take effect. If the value of CONFIRMSOTO is set to 0 seconds, then the operation based on the CONFIRMSODEF setting will occur immediately.
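The interaction of the two variables can be summarized in a short sketch that reads them from the defaults file and reports what would happen if no instruction is given. The function name describe_confirmso is hypothetical. The sketch assumes the variables appear as plain NAME=value lines without quotes, and it treats a missing CONFIRMSODEF as the documented default of 0.

```shell
# describe_confirmso (hypothetical helper): report the wait time and the
# default action taken when no lk_confirmso instruction arrives in time.
# Assumption: settings are plain NAME=value lines in the defaults file.
describe_confirmso() {
    file=${1:-/etc/default/LifeKeeper}
    # Use the last assignment of each variable found in the file.
    def=$(sed -n 's/^CONFIRMSODEF=//p' "$file" | tail -n 1)
    to=$(sed -n 's/^CONFIRMSOTO=//p' "$file" | tail -n 1)
    case "${def:-0}" in
        0) action="proceed with failover" ;;
        1) action="block failover" ;;
        *) action="unknown (CONFIRMSODEF should be 0 or 1)" ;;
    esac
    echo "wait=${to:-unset} default_action=$action"
}
```

With CONFIRMSODEF=1 and CONFIRMSOTO=0 in the file, the function prints wait=0 default_action=block failover, which corresponds to blocking the failover immediately.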

When to Select [Confirm Failover] Setting

This setting is used for Disaster Recovery or WAN configurations in environments where the communication paths are not redundant.

  • Open the Properties page from one server and then select the server that you want the Confirm Failover flag to be set on.

Block Resource Failover On

The Block Resource Failover On setting blocks all resource transfers due to a resource failure from the given system.

By default, the recovery of resource failures in a local system (local recovery) is performed when a resource failure is detected. When the local recovery has failed or is not enabled, a failover is initiated to the next highest priority standby node defined for the resource. The Block Resource Failover On setting will prevent this failover attempt.

To enable the Block Resource Failover On setting via the GUI, use the General tab of the Server Properties dialog. An example of the General tab for Server Properties is below. The part outlined in red on the screen addresses the Block Resource Failover On setting.

In this example, the setting is viewed from the host named lktestA. The node names for the HA cluster are displayed vertically. The standby node for lktestA is lktestB.

In this case, the Block Resource Failover flag is created on lktestB. The “block_failover” flag can be verified on the command line by executing the flg_list command. An example of the output is below.

[root@lktestB~]# /opt/LifeKeeper/bin/flg_list

block_failover

When the block_failover flag is set, the failover to the other node (lktestA) is blocked when a resource failure occurs on lktestB.

The block_failover flag prevents resource failovers from occurring on the node where the flag is set. The following log message is output to the LifeKeeper log when the failover is blocked by this setting.

ERROR:lcd.recover:::004787:Failover is blocked by current settings. MANUAL INTERVENTION IS REQUIRED
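A blocked resource failover can be spotted by searching the LifeKeeper log for the message code shown above. The following is a minimal sketch: the function name blocked_failovers is hypothetical, and the log file location is passed as an argument because it is not stated in this topic.

```shell
# blocked_failovers LOGFILE (hypothetical helper): print every log line
# that carries message code 004787, "Failover is blocked by current settings".
# Assumption: the message code appears between colons, as in the sample above.
blocked_failovers() {
    grep -F ':004787:' "$1"
}
```

The function returns a non-zero status when no matching line is found, so it can also be used as a condition in a monitoring script.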

Configuration examples

Some configuration examples are described below.

Block All Automatic Failovers

In this example, the failover is blocked when a node failure or a resource failure is detected on either lktestA or lktestB. Use the Confirm Failover and Block Resource Failover settings for this. LKCLI allows you to configure the setting with just one command. Refer to lkcli server block-all-failovers for more information. The configuration example with the GUI is as follows.

  1. Select lktestA and view Server Properties. On the General tab, check the “Set Confirm Failover On” box for lktestB and the “Set Block Resource Failover On” box for both lktestA and lktestB. The setting status in the GUI is below.

The configuration of Server Properties for lktestA as displayed in the GUI once set.

           Set Confirm Failover On    Set Block Resource Failover On
lktestA    (Not checked)              Checked
lktestB    Checked                    Checked

*When viewing the Server Properties in the GUI the node name can be found near the top of the properties panel display.

  2. Select lktestB and view Server Properties. On the General tab, check the “Set Confirm Failover On” box for lktestA. The “Set Block Resource Failover On” property will already be set based on the actions taken in step 1.

The configuration of Server Properties for lktestB as displayed in the GUI once set.

           Set Confirm Failover On    Set Block Resource Failover On
lktestB    (Not checked)              Checked
lktestA    Checked                    Checked

*When viewing the Server Properties in the GUI the node name can be found near the top of the properties panel display.

After completing these steps, use the flg_list command to confirm that the “confirmso!hostname” and “block_failover” flags are set on each node. For the confirmso flag, verify that the flag name contains the host name of the node whose failures require confirmation. For example, to block failover from lktestB to lktestA, the flag on lktestA must be confirmso!lktestB (see the table below).

           Confirm Failover Flag    Block Resource Failover Flag
lktestA    confirmso!lktestB        block_failover
lktestB    confirmso!lktestA        block_failover

  3. Set the values for “CONFIRMSOTO” and “CONFIRMSODEF” in /etc/default/LifeKeeper on each node. (Restarting LifeKeeper or rebooting the OS is not required.)

CONFIRMSODEF=1

CONFIRMSOTO=0

CONFIRMSOTO specifies the time to wait in seconds. CONFIRMSODEF specifies the default action taken when that time expires and must be either 0 (failover is executed) or 1 (failover is blocked).

With the above settings, failover after any node failure is blocked immediately, without operator intervention.
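The flag verification described in this example can be scripted. The following is a minimal sketch, assuming flg_list prints one flag per line; the function name missing_flags is hypothetical and not part of LifeKeeper.

```shell
# missing_flags FLAG... (hypothetical helper): read flg_list output on stdin
# and print each expected flag that is not present. Returns 1 if any flag
# is missing, 0 otherwise.
# Assumption: flg_list prints exactly one flag name per line.
missing_flags() {
    actual=$(cat)
    status=0
    for flag in "$@"; do
        # -x matches the whole line, -F treats the flag name literally.
        if ! printf '%s\n' "$actual" | grep -qxF -- "$flag"; then
            echo "missing: $flag"
            status=1
        fi
    done
    return $status
}
```

On lktestA in this example, the output of /opt/LifeKeeper/bin/flg_list would be piped into missing_flags 'confirmso!lktestB' block_failover, and on lktestB into missing_flags 'confirmso!lktestA' block_failover. Single quotes keep an interactive shell from interpreting the exclamation mark.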

Block Failovers in One Direction

In this example, the failover to lktestB is blocked when a node failure or a resource failure is detected on lktestA. Conversely, the failover to lktestA is allowed when a node failure or a resource failure is detected on lktestB.

  1. Select lktestA and view Server Properties. On the General tab, check the “Set Confirm Failover On” box for lktestB and the “Set Block Resource Failover On” box for lktestA.

The configuration of Server Properties for lktestA as displayed in the GUI once set.

           Set Confirm Failover On    Set Block Resource Failover On
lktestA    (Not checked)              Checked
lktestB    Checked                    (Not checked)

  2. Select lktestB and view Server Properties. On the General tab, the “Set Block Resource Failover On” box for lktestA should already be set (from the action taken on lktestA).

The configuration of Server Properties for lktestB as displayed in the GUI once set.

           Set Confirm Failover On    Set Block Resource Failover On
lktestB    (Not checked)              (Not checked)
lktestA    (Not checked)              Checked

*In the GUI, the local host name is listed first.

For this configuration, verify that the “confirmso!lktestA” flag is set on lktestB (no confirmso flag should be set on lktestA) and that the block_failover flag is set on lktestA.

           Confirm Failover Flag    Block Resource Failover Flag
lktestA    N/A                      block_failover
lktestB    confirmso!lktestA        N/A

  3. Set the values for “CONFIRMSODEF” and “CONFIRMSOTO” in /etc/default/LifeKeeper on lktestB.

CONFIRMSODEF=1

CONFIRMSOTO=0

For this configuration, resource and machine failovers from lktestA to lktestB are blocked. Resource and machine failovers from lktestB to lktestA are allowed.
