Normally, LifeKeeper will automatically switch operations to a backup node when a node failure or a resource failure occurs. However, depending on the environment, requiring manual confirmation by a system administrator may be desirable, instead of an automatic failover recovery initiated by LifeKeeper. In these cases, the Confirm Failover or Block Resource Failover settings are available. By using these functions, automatic failover can be blocked and a time to wait for failover can be set when a resource failure or a node failure occurs.
Set Confirm Failover or Block Resource Failover in your SPS environment after carefully reading the descriptions, examples, and considerations below. These settings are available from the Server Properties dialog of the GUI and via the command line of LifeKeeper.
Set Confirm Failover On
When a failover occurs because a node in the LifeKeeper cluster fails (Note: a node failure is identified by a failure of all LifeKeeper communication paths to that system), the time to wait before LifeKeeper switches resources to a backup node can be set with the Confirm Failover setting (see the discussion on the CONFIRMSOTO variable later in this document). Also, a user can decide whether to automatically switch to the backup node or not after the time to wait expires (see the discussion of the CONFIRMSODEF variable later in this document).
To enable the Confirm Failover setting via the GUI, use the General tab of the Server Properties dialog. An example of the General tab for Server Properties is shown below. The part outlined in red on the screen addresses the Confirm Failover setting.
In this example, the settings are viewed from the host named lktestA. The node names in the HA cluster are listed vertically. The standby node for lktestA is lktestB.
The screen shows the configuration status for server lktestA, with the checkbox for lktestB set. In this case, the confirm failover flag is created on lktestB. When a failover from lktestA to lktestB is executed, the confirmation process for executing a failover occurs on lktestB. This process includes checking the default action to take based on the CONFIRMSODEF variable setting (see the discussion later in this document) and how long to wait before taking that action based on the CONFIRMSOTO variable setting.
The Confirm Failover flag creation status can be checked via the command line. When the checkbox for lktestB is set on the host named lktestA, the Confirm Failover flag is created on lktestB. (NOTE: In this example, the flag is not created on lktestA, only on lktestB.) An example of the command line output is below.
[root@lktestB~]# /opt/LifeKeeper/bin/flg_list
confirmso!lktestA
The “confirmso!lktestA” output is the result of the flg_list command, and indicates that the Confirm Failover flag is set on node lktestB to confirm lktestA failures.
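The check described above can be scripted. The following is a minimal sketch, assuming only that flg_list prints one flag per line as in the example output; the helper name has_flag is hypothetical and not part of LifeKeeper.

```shell
#!/bin/sh
# has_flag FLAG: read flg_list-style output on stdin (one flag per line)
# and succeed only if FLAG appears as an exact, whole line.
# (has_flag is a hypothetical helper, not a LifeKeeper command.)
has_flag() {
    grep -Fxq -- "$1"
}

# Demo with sample text copied from the example above; on a real node you
# would pipe the output of /opt/LifeKeeper/bin/flg_list instead.
if printf '%s\n' 'confirmso!lktestA' | has_flag 'confirmso!lktestA'; then
    echo "Confirm Failover flag for lktestA is set"
else
    echo "Confirm Failover flag for lktestA is NOT set"
fi
```

On lktestB, the equivalent call would be /opt/LifeKeeper/bin/flg_list | has_flag 'confirmso!lktestA'. Matching the whole line avoids a false positive from a flag whose name merely contains the search text.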
When failover occurs with the confirmso flag, the following messages are recorded in the LifeKeeper log file.
INFO:lcd.recover:::004113:
chk_man_interv: Flag confirmso!hostname is set, issuing confirmso event and waiting for switchover instruction.
NOTIFY:event.confirmso:::010464:
LifeKeeper: FAILOVER RECOVERY OF MACHINE lktestA requires manual confirmation! Execute ‘/opt/LifeKeeper/bin/lk_confirmso -y -s lktestA ‘ to allow this failover, or execute ‘/opt/LifeKeeper/bin/lk_confirmso -n -s lktestA’ to prevent it. If no instruction is provided, LifeKeeper will timeout in 600 seconds and the failover will be allowed to proceed.
Execute one of the following commands to confirm the failover:
To proceed with the failover:
# /opt/LifeKeeper/bin/lk_confirmso -y -s hostname
To block the failover:
# /opt/LifeKeeper/bin/lk_confirmso -n -s hostname
The host name specified in the command is the one listed in the Confirm Failover flag, which in this example is lktestA. The log output includes example commands that can be used as a reference.
If the wait time expires without an instruction, the default failover action (allow failover or block failover) is executed. The default failover action is determined by the CONFIRMSODEF variable (see discussion later in this document).
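As an illustration of how the host name in the flag maps to the confirmation commands, here is a small sketch. It assumes the flag format confirmso!hostname shown above; the function names confirmso_hosts and print_confirm_commands are hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Path as printed in the log message above.
LK_CONFIRMSO=/opt/LifeKeeper/bin/lk_confirmso

# confirmso_hosts: read flg_list-style output on stdin and print the host
# name embedded in each confirmso!<host> flag, one per line.
confirmso_hosts() {
    sed -n 's/^confirmso!//p'
}

# print_confirm_commands: for each host read from stdin, print (but do not
# run) the command that allows and the command that blocks its failover.
print_confirm_commands() {
    while IFS= read -r host; do
        echo "allow: $LK_CONFIRMSO -y -s $host"
        echo "block: $LK_CONFIRMSO -n -s $host"
    done
}

# Demo with the sample flag from this document; on a real node, pipe the
# output of /opt/LifeKeeper/bin/flg_list into confirmso_hosts instead.
printf '%s\n' 'confirmso!lktestA' 'block_failover' |
    confirmso_hosts | print_confirm_commands
```

Printing instead of executing is deliberate: allowing or blocking a failover is an operator decision, so the sketch leaves the final step to a person.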
The following message is output to the LifeKeeper log when the timeout expires.
lcdrecover[xxxx]: INFO:lcd.recover:::004408:chk_man_interv: Timed out waiting for instruction, using default CONFIRMSODEF value 0.
The LifeKeeper operation when the time to wait is exceeded is controlled by the setting of the variable “CONFIRMSODEF” which is set in the /etc/default/LifeKeeper file with a value of “1” or “0”. A value of “0” is set by default, and this indicates that the failover will proceed when the time to wait is exceeded. If the value is set to a “1”, the failover is blocked when the time to wait is exceeded.
The time to wait for confirmation of a failover can be changed by adjusting the value of the CONFIRMSOTO variable in the /etc/default/LifeKeeper file. The value specifies the number of seconds to wait for a manual confirmation from the user before proceeding with or blocking the failover, as determined by the value of the “CONFIRMSODEF” variable (see above).
Restarting LifeKeeper or rebooting the OS is not required for changes to these variables to take effect. If the value of CONFIRMSOTO is set to 0 seconds, then the operation based on the CONFIRMSODEF setting will occur immediately.
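The interaction of the two variables can be summarised with a short script. This is a sketch under stated assumptions: values are written as plain, unquoted NAME=value lines; the 600-second fallback is taken from the example log message and is an assumption, not a documented default; get_default, describe_confirmso, and the LK_DEFAULTS override are hypothetical names.

```shell
#!/bin/sh
# get_default FILE NAME FALLBACK: print the last NAME=value assignment in
# FILE, or FALLBACK if the file is unreadable or NAME is not set there.
get_default() {
    gd_value=
    if [ -r "$1" ]; then
        gd_value=$(sed -n "s/^$2=//p" "$1" | tail -n 1)
    fi
    printf '%s\n' "${gd_value:-$3}"
}

# describe_confirmso FILE: summarise what happens once the wait expires.
describe_confirmso() {
    dc_def=$(get_default "$1" CONFIRMSODEF 0)    # 0 is the documented default
    dc_tmo=$(get_default "$1" CONFIRMSOTO 600)   # 600 is an assumed fallback
    case "$dc_def" in
        0) dc_action="proceeds" ;;
        1) dc_action="is blocked" ;;
        *) echo "unexpected CONFIRMSODEF value: $dc_def" >&2; return 1 ;;
    esac
    echo "After $dc_tmo seconds without confirmation, the failover $dc_action."
}

# LK_DEFAULTS is a hypothetical override so the sketch can be tried on a copy.
describe_confirmso "${LK_DEFAULTS:-/etc/default/LifeKeeper}"
```

With CONFIRMSODEF=1 and CONFIRMSOTO=0 in the file, the summary reports that the failover is blocked after 0 seconds, which corresponds to the immediate default action described above.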
When to Select [Confirm Failover] Setting
This setting is intended for Disaster Recovery or WAN configurations in which the communication paths are not redundant.
- In a regular site (non-multi-site cluster), open the Properties page from one server and then select the server that you want the Confirm Failover flag to be set on.
- For a Multi-site WAN configuration: Enable manual failover confirmation by setting the Confirm Failover flag.
- For a Multi-site LAN configuration: Do not enable manual failover confirmation (leave the Confirm Failover flag unset).
- In a multi-site cluster environment: from the non-disaster system, select the DR system and check the Set Confirm Failover On box. Open the Properties panel and select this setting for each non-disaster server in the cluster.
Block Resource Failover On
The Block Resource Failover On setting blocks all resource transfers due to a resource failure from the given system.
By default, the recovery of resource failures in a local system (local recovery) is performed when a resource failure is detected. When the local recovery has failed or is not enabled, a failover is initiated to the next highest priority standby node defined for the resource. The Block Resource Failover On setting will prevent this failover attempt.
To enable the Block Resource Failover On setting via the GUI, use the General tab of the Server Properties dialog. An example of the General tab for Server Properties is below. The part outlined in red on the screen addresses the Block Resource Failover On setting.
In this example, the settings are viewed from the host named lktestA. The node names in the HA cluster are listed vertically. The standby node for lktestA is lktestB.
In this case, the Block Resource Failover flag is created on lktestB. The “block_failover” flag can be verified on the command line by executing the flg_list command. An example of the output is below.
[root@lktestB~]# /opt/LifeKeeper/bin/flg_list
block_failover
When the block_failover flag is set, failover to the other node (lktestA) is blocked when a resource failure occurs on lktestB.
The block_failover flag prevents resource failovers from occurring on the node where the flag is set. The following log message is output to the LifeKeeper log when the failover is blocked by this setting.
ERROR:lcd.recover:::004787:Failover is blocked by current settings. MANUAL INTERVENTION IS REQUIRED
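To spot blocked failovers after the fact, the log can be searched for this message code. The sketch below assumes the message code 004787 appears exactly as in the sample line; count_blocked_failovers is a hypothetical helper, and the log file location is not stated in this document, so the function reads from standard input.

```shell
#!/bin/sh
# count_blocked_failovers: read LifeKeeper log text on stdin and print how
# many lines carry the "failover is blocked" message code 004787.
count_blocked_failovers() {
    # grep -c exits non-zero when the count is 0, so neutralise that status.
    grep -cF 'lcd.recover:::004787:' || true
}

# Demo using the sample message quoted above plus one unrelated line.
printf '%s\n' \
    'ERROR:lcd.recover:::004787:Failover is blocked by current settings. MANUAL INTERVENTION IS REQUIRED' \
    'INFO:lcd.recover:::004113:' |
    count_blocked_failovers
```

On a real node, redirect the LifeKeeper log file into the function; the path depends on the installation (for example /var/log/lifekeeper.log, which is an assumption here).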
Conditions/Considerations
- In a multi-site cluster configuration, do not check the Set Block Resource Failover On box for any server in the configuration.
Configuration examples
Some configuration examples are described below.
Block All Automatic Failovers
In this example, failover is blocked when a node failure or a resource failure is detected on either lktestA or lktestB. Both the Confirm Failover and Block Resource Failover settings are used for this. The configuration example is below.
- Select lktestA and view Server Properties. On the General tab, check the “Set Confirm Failover On” box for lktestB and the “Set Block Resource Failover On” box for both lktestA and lktestB. The setting status in the GUI is below.
The configuration of Server Properties for lktestA as displayed in the GUI once set.
Server | Set Confirm Failover On | Set Block Resource Failover On |
---|---|---|
lktestA | not checked | checked |
lktestB | checked | checked |
*When viewing the Server Properties in the GUI the node name can be found near the top of the properties panel display.
- Select lktestB and view Server Properties.
On the General tab, check the “Set Confirm Failover On” box for lktestA. The “Set Block Resource Failover On” boxes will already be set based on the actions taken in step 1.
The configuration of Server Properties for lktestB as displayed in the GUI once set.
Server | Set Confirm Failover On | Set Block Resource Failover On |
---|---|---|
lktestB | not checked | checked |
lktestA | checked | checked |
*When viewing the Server Properties in the GUI the node name can be found near the top of the properties panel display.
After completing these steps, confirm that the “confirmso!hostname” and “block_failover” flags are set on each node using the flg_list command. For the confirmso flag, verify that the flag name contains the host whose failover is to be confirmed. For example, to block failover from lktestB on lktestA, the flag on lktestA must be confirmso!lktestB (see the table below).
Server | Confirm Failover Flag | Block Resource Failover Flag |
---|---|---|
lktestA | confirmso!lktestB | block_failover |
lktestB | confirmso!lktestA | block_failover |
- Set the values for “CONFIRMSOTO” and “CONFIRMSODEF” in /etc/default/LifeKeeper on each node. (Restarting LifeKeeper or rebooting the OS is not required.)
CONFIRMSODEF=1
CONFIRMSOTO=0
CONFIRMSOTO specifies the time to wait in seconds. CONFIRMSODEF specifies the default action taken on failover and must be either 0 (failover is executed) or 1 (failover is blocked).
With the above settings, any failover triggered by a node failure is blocked immediately, without operator intervention.
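Editing the two variables can also be scripted. The following is a sketch only, assuming the file uses plain NAME=value lines; set_default is a hypothetical helper, and the demo deliberately works on a scratch file rather than on /etc/default/LifeKeeper.

```shell
#!/bin/sh
# set_default FILE NAME VALUE: make NAME=VALUE the only assignment of NAME
# in FILE. Other lines, including comments, are left untouched.
set_default() {
    sd_file=$1 sd_name=$2 sd_value=$3
    [ -f "$sd_file" ] || { echo "no such file: $sd_file" >&2; return 1; }
    sd_tmp=$(mktemp) || return 1
    # Copy everything except existing NAME= lines; grep -v exits 1 when it
    # prints nothing, which is not an error here.
    grep -v "^$sd_name=" "$sd_file" > "$sd_tmp" || true
    printf '%s=%s\n' "$sd_name" "$sd_value" >> "$sd_tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$sd_tmp" > "$sd_file"
    rm -f "$sd_tmp"
}

# Demo on a scratch file, never on the real /etc/default/LifeKeeper.
scratch=$(mktemp)
printf '%s\n' 'CONFIRMSODEF=0' 'CONFIRMSOTO=600' > "$scratch"
set_default "$scratch" CONFIRMSODEF 1
set_default "$scratch" CONFIRMSOTO 0
cat "$scratch"
rm -f "$scratch"
```

Before applying anything like this to a live node, back up the original file and run the function on each node, since the values are read locally.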
Block Failovers in One Direction
In this example, failover to lktestB is blocked when a node failure or a resource failure is detected on lktestA. Conversely, failover to lktestA is allowed when a node failure or a resource failure is detected on lktestB.
- Select lktestA and view Server Properties.
- On the General tab, check the “Set Confirm Failover On” box for lktestB and the “Set Block Resource Failover On” box for lktestA.
The configuration of Server Properties for lktestA as displayed in the GUI once set.
Server | Set Confirm Failover On | Set Block Resource Failover On |
---|---|---|
lktestA | not checked | checked |
lktestB | checked | not checked |
- Select lktestB and view Server Properties.
In the General tab, the “Set Block Resource Failover On” box for lktestA should already be set (from the action taken on lktestA).
The configuration of Server Properties for lktestB as displayed in the GUI once set.
Server | Set Confirm Failover On | Set Block Resource Failover On |
---|---|---|
lktestB | not checked | not checked |
lktestA | not checked | checked |
*In the GUI, the local host name is listed first.
For this configuration, verify that the “confirmso!lktestA” flag is set on lktestB (no confirmso flag should be set on lktestA) and that the block_failover flag is set on lktestA.
Server | Confirm Failover Flag | Block Resource Failover Flag |
---|---|---|
lktestA | (not set) | block_failover |
lktestB | confirmso!lktestA | (not set) |
- Set the values for “CONFIRMSODEF” and “CONFIRMSOTO” in /etc/default/LifeKeeper on lktestB.
CONFIRMSODEF=1
CONFIRMSOTO=0
With this configuration, resource and machine failovers from lktestA to lktestB are blocked. Resource and machine failovers from lktestB to lktestA are allowed.
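The expected flag layout for this one-direction configuration can be checked with a small script. This is a sketch, assuming flg_list prints one flag per line; check_flags and its +flag / -flag notation are hypothetical conventions introduced here, and the demo uses sample listings in place of real flg_list output.

```shell
#!/bin/sh
# check_flags LISTING SPEC...: LISTING is flg_list-style text. Each SPEC is
# +flag (must be present) or -flag (must be absent). Problems are printed
# and the function returns non-zero if any SPEC is not satisfied.
check_flags() {
    cf_listing=$1; shift
    cf_rc=0
    for cf_spec in "$@"; do
        cf_flag=${cf_spec#?}          # strip the leading + or -
        if printf '%s\n' "$cf_listing" | grep -Fxq -- "$cf_flag"; then
            cf_present=yes
        else
            cf_present=no
        fi
        case "$cf_spec:$cf_present" in
            +*:yes|-*:no) ;;          # expectation met
            +*) echo "missing flag: $cf_flag"; cf_rc=1 ;;
            -*) echo "unexpected flag: $cf_flag"; cf_rc=1 ;;
        esac
    done
    return $cf_rc
}

# Sample listings standing in for flg_list output collected on each node.
lktestA_flags='block_failover'
lktestB_flags='confirmso!lktestA'

check_flags "$lktestA_flags" '+block_failover' '-confirmso!lktestB' && echo "lktestA: OK"
check_flags "$lktestB_flags" '+confirmso!lktestA' '-block_failover' && echo "lktestB: OK"
```

On a live cluster, each listing would come from running /opt/LifeKeeper/bin/flg_list on the corresponding node. Checking for absent flags as well as present ones matters here, because a stray confirmso or block_failover flag would silently block the direction that is meant to stay open.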