Switch over of SAP HANA resource
If a switch over of the primary database instance is started, the recovery kit performs the following steps:
- The database instance on the previous primary node will be stopped
- A takeover of the system replication will be executed on the previous secondary node
- The database instance on the new secondary node will be re-enabled in system replication to the new primary node
- The database on the new secondary node will be started
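The steps above can be pictured as the manual commands an administrator would run as user <sid>adm. The following is only a dry-run sketch that prints such commands without executing anything. The mapping of each step to `HDB stop`, `hdbnsutil -sr_takeover`, `hdbnsutil -sr_register` and `HDB start` is an assumption, not a description of the recovery kit's internals, and the node names, site name and replication mode are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: prints a manual equivalent of each switch over step.
# Nothing is executed. All argument values below are hypothetical.
print_switchover_plan() {
    old_primary=$1   # node that was primary before the switch over
    new_primary=$2   # node that was secondary before the switch over
    instance_nr=$3   # two-digit HANA instance number, e.g. 00
    repl_mode=$4     # replication mode, e.g. sync, syncmem or async
    site_name=$5     # site name used when re-registering the old primary

    echo "[$old_primary] HDB stop"
    echo "[$new_primary] hdbnsutil -sr_takeover"
    echo "[$old_primary] hdbnsutil -sr_register --remoteHost=$new_primary --remoteInstance=$instance_nr --replicationMode=$repl_mode --name=$site_name"
    echo "[$old_primary] HDB start"
}

# Example call with placeholder values (vmlx-sha2 and SiteA are made up).
print_switchover_plan vmlx-sha1 vmlx-sha2 00 sync SiteA
```

Printing the plan instead of running it keeps the sketch safe to try on any machine. On a protected system the recovery kit performs these steps itself.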
Stop SAP HANA resource
When a stop of the SAP HANA resource is executed, only the primary database instance is terminated. The secondary database instance keeps running.
To stop both the primary and the secondary database instance, the “!volatile!noHANAremove_<tag>” flag must be removed first. This is done with the following commands:
Example:
vmlx-sha1:~ # flg_list
!volatile!noHANAremove_HANA-DB_HN1_00
vmlx-sha1:~ # flg_remove -f '!volatile!noHANAremove_HANA-DB_HN1_00'
vmlx-sha1:~ # flg_list
Now it is possible to stop the primary and the secondary database instance. When the SAP HANA resource is restarted, the “!volatile!noHANAremove_<tag>” flag is created automatically.
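When the tag is not known in advance, the flag name can be picked out of the flg_list output instead of being typed by hand. The following is a minimal sketch, assuming flg_list prints one flag name per line as in the example above. The helper name hana_remove_flags is hypothetical and not part of LifeKeeper.

```shell
#!/bin/sh
# Reads flag names on stdin (one per line, as printed by flg_list) and
# prints only those containing the noHANAremove marker.
hana_remove_flags() {
    # -F treats the pattern as a fixed string; "|| true" keeps the exit
    # status zero when no such flag is set.
    grep -F '!volatile!noHANAremove_' || true
}

# Possible usage on a LifeKeeper node (not executed here):
#   flg_list | hana_remove_flags | while IFS= read -r flag; do
#       flg_remove -f "$flag"
#   done
```

The single quotes around the pattern matter: they keep an interactive shell from interpreting the exclamation marks.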
Test Scenarios
To understand the behavior of the recovery kit, perform the following tests. The following requirements must be fulfilled:
- LifeKeeper and the SAP HANA database are installed and configured according to the installation instructions.
- The SAP HANA database resource is active and running in one of the possible replication modes.
- The active replication mode is entered correctly in the properties of the HANA DB resource. See section “Creating SAP HANA Resource”, point 8 and point 17.
- A dependency on a previously created protected IP address has been created.
Test the recovery of the SAP Host Agent
Determine the status and the process numbers of the SAP Host Agent processes by using:
vmlx-sha1:~ # /usr/sap/hostctrl/exe/saphostexec -status
saphostexec running (pid = 3818)
sapstartsrv running (pid = 3867)
11:30:49 17.11.2016 LOG: Using PerfDir (DIR_PERF) = /usr/sap/tmp
saposcol running (pid = 3965)
Kill one of the processes, or execute
/usr/sap/hostctrl/exe/saphostexec -stop
to disrupt the SAP Host Agent. The SAP HANA recovery kit will recognize that the SAP Host Agent is not working properly and restart the SAP Host Agent on that node. The behavior can be observed with the following command:
tail -f /var/log/lifekeeper.log
The SAP HANA resource does not change its state. After a successful recovery, the SAP Host Agent is fully functional again. If the recovery kit is unable to restart the SAP Host Agent, the HANA database and the resource remain in their previous state. The SAP Host Agent is checked again later and, if possible, restarted.
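To pick a process to kill for this test, the process IDs can be extracted from the status output. The following is a minimal sketch, assuming the output lines have the form shown in the example above ("<name> running (pid = <n>)"). The helper name host_agent_pids is hypothetical.

```shell
#!/bin/sh
# Reads `saphostexec -status` output on stdin and prints one
# "<process name> <pid>" pair per running process. Lines that do not
# match the "<name> running (pid = <n>)" pattern, such as LOG lines,
# are ignored.
host_agent_pids() {
    sed -n 's/^\([A-Za-z_]*\) running (pid = \([0-9]*\))[[:space:]]*$/\1 \2/p'
}

# Possible usage (not executed here):
#   /usr/sap/hostctrl/exe/saphostexec -status | host_agent_pids
```

Running the status command again after the recovery should show new process IDs for the restarted processes.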
Test the recovery of sapstartsrv of the SAP HANA instance
To test the recovery of sapstartsrv, the service must be stopped. One method is to terminate sapstartsrv with the command
sapcontrol -nr <ID> -function StopService
as user <sid>adm. The other method is to kill the sapstartsrv process. The SAP HANA recovery kit then restarts sapstartsrv, and the resource does not change its state.
Test the recovery of the secondary SAP HANA DB (replication target)
In the event of a failure of the secondary database instance (replication target), the replication mode of the database instance on that node is determined. If the replication mode is set correctly, the database instance is started again. If the mode is set to primary, the recovery kit enables the correct replication mode and then starts the database instance.
The behavior can be observed by looking at the log file /var/log/lifekeeper.log. After the recovery, the state of the database and the replication can be tested as user <sid>adm by using:
sapcontrol -nr 0 -function GetProcessList
hdbnsutil -sr_state
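For a scripted check, the replication mode can be read from the hdbnsutil output. The following is a minimal sketch, assuming the output contains a line of the form "mode: <value>" (for example "mode: primary" on the source and the configured replication mode on the target). The helper name replication_mode is hypothetical.

```shell
#!/bin/sh
# Reads `hdbnsutil -sr_state` output on stdin and prints the value of
# the first line that starts with "mode:". Lines such as
# "operation mode: ..." do not match because they start differently.
replication_mode() {
    sed -n 's/^[[:space:]]*mode:[[:space:]]*//p' | head -n 1
}

# Possible usage as user <sid>adm (not executed here):
#   hdbnsutil -sr_state | replication_mode
```

On the recovered secondary node, the printed value is expected to be the replication mode configured for the resource rather than "primary".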
In the event that the secondary database instance cannot be started by the recovery kit, the SAP HANA resource is flagged as faulty (OSF) on the corresponding node.
Once the cause of the unsuccessful start has been eliminated, the SAP HANA recovery kit starts the database instance in the subsequent check cycle. The resource is then flagged as standby on the corresponding node.
Test the recovery of the primary SAP HANA DB
In the event of a failure of the primary database instance (replication source), the replication mode of the database instance on the primary node is determined. When the replication mode is set to primary, the database instance will be started again. If the mode is not set to primary, the recovery kit will take over the replication mode and start the database instance. The behavior can be observed by looking at the log file /var/log/lifekeeper.log. After the recovery, the state of the database and the replication can be tested as user <sid>adm by using:
sapcontrol -nr 0 -function GetProcessList
hdbnsutil -sr_state
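To see at a glance whether all database processes came back, the process list can be filtered for entries that are not green. The following is a minimal sketch, assuming that GetProcessList prints comma-separated rows with the process name in the first field and the display status (GREEN, YELLOW, RED or GRAY) in the third field, preceded by a header row starting with "name". The helper name non_green_processes is hypothetical.

```shell
#!/bin/sh
# Reads `sapcontrol -nr <nr> -function GetProcessList` output on stdin
# and prints "<process>: <status>" for every process row whose display
# status is not GREEN. Rows with fewer than 7 fields and the header row
# are skipped.
non_green_processes() {
    awk -F', *' 'NF >= 7 && $1 != "name" && $3 != "GREEN" { print $1 ": " $3 }'
}

# Possible usage as user <sid>adm (not executed here):
#   sapcontrol -nr 0 -function GetProcessList | non_green_processes
```

Empty output means every listed process reports GREEN under the assumed format.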
If the primary database instance cannot be started by the recovery kit on that node, LifeKeeper initiates a failover of the entire hierarchy to the former secondary node. On this node, the HANA recovery kit performs a takeover of the replication mode, and the node becomes the primary node of the replication. For the faulty node, a re-enable to the new primary replication node is started. If this is successful, the secondary database is also restarted. If re-enabling the replication mode or restarting the database instance on the secondary node fails, the SAP HANA resource is flagged as failed (OSF) on the corresponding node, and manual intervention is necessary to eliminate the cause. If the failover of the primary database instance fails, the resource is flagged as faulty (OSF) and remains in this state.
Test a node failure of the secondary node (reboot -f, power off)
If an error causes the secondary node to fail, the resource remains on the active node and the replication is disrupted. Once the secondary node is restarted and LifeKeeper is active, the database instance is restarted as a replication target.
Test a failover in case of a node failure of the primary node (reboot -f, power off)
If an error causes the primary node to fail, the remaining node starts a takeover of the resources. The HANA recovery kit also takes over the role of primary source of the replication. Once the faulty node is restarted and LifeKeeper is active, the database instance is restarted as a replication target on that node.