Testing your SAP HANA Resource Hierarchy

Test Scenarios

To understand the behavior of the SAP HANA Recovery Kit, perform the following tests. The following prerequisites must be completed before performing any test:

LifeKeeper and the SAP HANA database must be installed and configured according to the installation instructions provided by SIOS and SAP.

SAP HANA System Replication must be enabled and active on all servers in the cluster, with the secondary replication site registered using one of the valid replication modes (sync, syncmem, or async) and operation modes (delta_datashipping, logreplay, or logreplay_readaccess). See Configure SAP HANA System Replication for more details.

If the switchable IP address associated with the SAP HANA database is managed by a LifeKeeper IP resource, the SAP HANA resource must have a dependency on the IP resource. See Step 7 in Creating an SAP HANA Resource Hierarchy for more details.

Test Recovery of SAP Host Agent

Determine the status and the process IDs of the SAP Host Agent processes by using:

# /usr/sap/hostctrl/exe/saphostexec -status

saphostexec running (pid = 3818)

sapstartsrv running (pid = 3867)

saposcol running (pid = 3965)

Either manually kill one of the processes listed in the output or execute

/usr/sap/hostctrl/exe/saphostexec -stop

to impair the functionality of SAP Host Agent. The SAP HANA Recovery Kit will recognize that SAP Host Agent is not working properly and restart it on that node. The behavior can be observed by monitoring the LifeKeeper log with the following command:

tail -f /var/log/lifekeeper.log

During this recovery process, the SAP HANA resource does not change its state. After a successful recovery, SAP Host Agent is fully functional again. If the recovery kit is unable to restart SAP Host Agent, the HANA database and the resource remain in their current state. SAP Host Agent will be checked again later and restarted if possible.
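To confirm that a restart actually took place, the process IDs can be compared before and after the test. The following is a minimal sketch, assuming the status output has the "name running (pid = N)" form shown above; the function names host_agent_pids and host_agent_restarted are hypothetical helpers, not part of LifeKeeper or SAP Host Agent.

```shell
# Print "name pid" for every process reported as running.
# Reads saphostexec -status style output on standard input.
host_agent_pids() {
    sed -n 's/^\([A-Za-z0-9_]*\) running (pid = \([0-9]*\)).*$/\1 \2/p'
}

# Succeed when the snapshot taken after recovery is non-empty and
# differs from the one taken before the fault was injected.
# $1 = snapshot before the fault, $2 = snapshot after recovery
host_agent_restarted() {
    [ -n "$2" ] && [ "$1" != "$2" ]
}

# Example use (as root, on the node under test):
#   before=$(/usr/sap/hostctrl/exe/saphostexec -status | host_agent_pids)
#   ... kill a process, wait for the recovery kit to act ...
#   after=$(/usr/sap/hostctrl/exe/saphostexec -status | host_agent_pids)
#   host_agent_restarted "$before" "$after" && echo "SAP Host Agent was restarted"
```

A changed process ID for the killed process indicates that it was restarted rather than left running.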

Test Recovery of sapstartsrv for the SAP HANA Instance

To test the recovery of the SAP Start Service (sapstartsrv) for the SAP HANA instance, the service must be stopped. One method to stop sapstartsrv is by executing the sapcontrol StopService webmethod:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function StopService"

where <sid> is the lower-case SAP System ID for the HANA installation and <Inst#> is the HDB instance number. Another method is to kill the sapstartsrv process directly. In either case, sapstartsrv will be restarted by the SAP HANA Recovery Kit. The resource does not change its state as long as sapstartsrv can be restarted successfully.
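Because the same command pattern recurs in every test on this page, it can help to generate it from the System ID and instance number. This is only an illustrative sketch: sapcontrol_cmd is a hypothetical helper name, and the function prints the command line for review rather than executing it.

```shell
# Print the su command line for a sapcontrol web method.
# $1 = SAP System ID (any case), $2 = instance number, $3 = web method
sapcontrol_cmd() (
    # The <sid>adm user name uses the lower-case System ID.
    sid=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    printf 'su - %sadm -c "sapcontrol -nr %s -function %s"\n' "$sid" "$2" "$3"
)

# Example use:
#   sapcontrol_cmd HDB 00 StopService
```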

Test Recovery of the Secondary SAP HANA DB (Replication Target)

In the event of a failure of the secondary database instance (replication target) or if the secondary replication site is unregistered in SAP HANA System Replication, the recovery kit will re-register the secondary site with the previous replication and operation modes and restart the secondary database instance.

To induce such a failure, execute one of the following commands on the secondary replication site:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"

su - <sid>adm -c "hdbnsutil -sr_unregister"

The behavior can be observed by monitoring the log file /var/log/lifekeeper.log. After the recovery, the state of the database instance and SAP HANA System Replication can be tested by running the following commands on the secondary replication site:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList"

su - <sid>adm -c "hdbnsutil -sr_state"

In the event that the secondary database instance cannot be started by the recovery kit, the SAP HANA resource is flagged as Failed (OSF) on the corresponding node.

Once the cause of an unsuccessful start is fixed by an administrator, the SAP HANA Recovery Kit will start the database instance in the subsequent quickCheck cycle. Once started successfully, the resource state will be updated to Standby (OSU) on the corresponding node.
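The replication check after recovery can also be scripted. The sketch below assumes that the hdbnsutil -sr_state output contains an unindented line of the form "mode: <value>"; verify this against the output of your HANA version. The function names sr_mode and is_secondary_mode are hypothetical. The list of secondary modes is the one given in the prerequisites (sync, syncmem, async).

```shell
# Print the replication mode from hdbnsutil -sr_state style output
# read on standard input. Only the first "mode:" line is used, and
# "operation mode:" lines are ignored because their key differs.
sr_mode() {
    awk -F': *' '$1 == "mode" && !seen { print $2; seen = 1 }'
}

# Succeed when the given mode is one of the secondary replication modes.
is_secondary_mode() {
    case "$1" in
        sync|syncmem|async) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on the secondary replication site:
#   mode=$(su - <sid>adm -c "hdbnsutil -sr_state" | sr_mode)
#   is_secondary_mode "$mode" && echo "registered as secondary ($mode)"
```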

Test Recovery of the Primary SAP HANA DB

In the event of a failure of the primary database instance (replication source), the replication mode of the database instance on the primary node is determined. If the replication mode is set to primary, the database instance will be started again. If the mode is not set to primary, the recovery kit will log a warning stating that the replication mode has been changed outside of LifeKeeper and suspend all monitoring of the SAP HANA resource until the issue is resolved. In the latter case, manual intervention is required to bring the HANA resource hierarchy in-service on the correct primary system. The behavior in this case can be observed in the LifeKeeper GUI, which will show the state “Active – HSR Disabled”, “Active – Unknown Repl Mode”, or “Active – Secondary” for the resource on the primary node, or by monitoring the log file /var/log/lifekeeper.log.
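The decision described above can be summarized in a few lines. This is an illustration of the documented behavior only, not the recovery kit's actual code; primary_recovery_action and the action labels it prints are hypothetical.

```shell
# Map the replication mode found on the primary node to the action
# the text above describes.
# $1 = replication mode reported for the local database instance
primary_recovery_action() {
    case "$1" in
        primary) echo "restart-instance" ;;                  # mode unchanged: start the database again
        *)       echo "warn-and-suspend-monitoring" ;;       # mode changed outside of LifeKeeper
    esac
}

# Example use:
#   primary_recovery_action primary
#   primary_recovery_action sync
```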

A failure of the primary database instance can be induced by running the following command on the primary replication site:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"

After the recovery, the state of the database and the replication can be tested by using:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList"

su - <sid>adm -c "hdbnsutil -sr_state"

In the event that the primary database instance cannot be started by the recovery kit on that node, LifeKeeper will initiate a failover of the entire hierarchy to the secondary node. On this node, the SAP HANA Recovery Kit performs a takeover of SAP HANA System Replication and the previous secondary node becomes the new primary node for replication. LifeKeeper will then attempt to re-register the faulty node as the secondary replication site using the previous replication and operation modes. If this is successful, the database on that node is also restarted as the replication target. If the faulty node cannot be registered as the secondary replication site, or if its database cannot be restarted, the HANA resource will be flagged as Failed (OSF) on that node. At this point, manual intervention is typically necessary to eliminate the cause of the failure. If the failover of the primary database instance itself fails, the resource is flagged as Failed (OSF) and remains in this state until a manual in-service operation is performed by an administrator.
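When checking the database state after recovery or failover, the GetProcessList output can be reduced to the processes that are not yet healthy. The sketch below assumes the usual comma-separated layout with a header line "name, description, dispstatus, textstatus, starttime, elapsedtime, pid" and treats GREEN as the healthy status; not_green_processes is a hypothetical helper name.

```shell
# Print the name of every process whose dispstatus is not GREEN.
# Reads sapcontrol GetProcessList style output on standard input and
# ignores everything before the column header line.
not_green_processes() {
    awk -F', ' '
        header && NF >= 7 && $3 != "GREEN" { print $1 }
        $1 == "name" && $3 == "dispstatus"  { header = 1 }
    '
}

# Example use:
#   su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList" | not_green_processes
```

An empty result means that every listed process reports GREEN.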

Test Machine Failure of the Secondary Node (reboot -f, power off)

If an error causes the secondary node to fail, the resource remains Active (ISP) on the primary node but SAP HANA System Replication is disrupted. Once the secondary node is restarted and LifeKeeper is active, the secondary database instance is automatically restarted as a replication target.

Test Machine Failure of the Primary Node (reboot -f, power off)

If an error causes the primary node to fail, a failover of the HANA resource hierarchy to the secondary node is initiated. A takeover of SAP HANA System Replication is performed on the secondary node and the previous secondary replication site becomes the new primary replication site. Once the faulty node is restarted and LifeKeeper is active, the node is registered as a secondary replication site and the database instance is automatically restarted as a replication target.
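Because re-registration and restart after a reboot happen automatically but not instantly, a test script may need to poll until the expected state is reached. The following generic retry helper is a sketch; wait_for is a hypothetical name, and the attempt count and delay in the example are arbitrary values, not recommendations from SIOS or SAP.

```shell
# wait_for ATTEMPTS DELAY COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY seconds between tries. Succeeds as soon as COMMAND succeeds.
wait_for() {
    wf_attempts=$1
    wf_delay=$2
    shift 2
    while [ "$wf_attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        wf_attempts=$((wf_attempts - 1))
        if [ "$wf_attempts" -gt 0 ]; then
            sleep "$wf_delay"
        fi
    done
    return 1
}

# Example use on the restarted node (30 tries, 10 seconds apart):
#   wait_for 30 10 sh -c 'su - <sid>adm -c "hdbnsutil -sr_state" | grep -q "^mode: sync"'
```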
