Test Scenarios
To understand the behavior of the SAP HANA Recovery Kit, perform the following tests. The following prerequisites must be completed before performing any test:
- LifeKeeper and the SAP HANA database must be installed and configured according to the installation instructions provided by SIOS and SAP.
- SAP HANA System Replication must be enabled and active on all servers in the cluster, with the secondary replication site registered using one of the valid replication modes (sync, syncmem, or async) and operation modes (delta_datashipping, logreplay, or logreplay_readaccess). See Configure SAP HANA System Replication for more details.
- If the switchable IP address associated with the SAP HANA database is managed by a LifeKeeper IP resource, the SAP HANA resource must depend on the IP resource. See Step 7 in Creating an SAP HANA Resource Hierarchy for more details.
Test Recovery of SAP Host Agent
Determine the status and the process IDs of the SAP Host Agent processes by using:
# /usr/sap/hostctrl/exe/saphostexec -status
saphostexec running (pid = 3818)
sapstartsrv running (pid = 3867)
saposcol running (pid = 3965)
Either manually kill one of the processes listed in the output or execute
/usr/sap/hostctrl/exe/saphostexec -stop
to impair the functionality of SAP Host Agent. The SAP HANA Recovery Kit will recognize that SAP Host Agent is not working properly and restart it on that node. The behavior can be observed by monitoring the LifeKeeper log with the following command:
tail -f /var/log/lifekeeper.log
During this recovery process, the SAP HANA resource does not change its state. After a successful recovery, SAP Host Agent is fully functional again. If the recovery kit is unable to restart SAP Host Agent, the HANA database and the resource remain in their current state. SAP Host Agent will be checked again later and restarted if possible.
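To pick a process to kill for this test, it can help to pull the process names and IDs out of the status output. The following is a minimal sketch that assumes the output has the form shown above (`<name> running (pid = <n>)`); `host_agent_pids` is a hypothetical helper name and is not part of SAP Host Agent or LifeKeeper.

```shell
#!/bin/sh
# Print "<process> <pid>" for every SAP Host Agent process reported as running.
# Reads the text produced by `saphostexec -status` on standard input.
# Assumption: each status line looks like "saphostexec running (pid = 3818)".
host_agent_pids() {
    awk '$2 == "running" && $3 == "(pid" {
        pid = $5
        gsub(/[^0-9]/, "", pid)   # drop the trailing ")"
        print $1, pid
    }'
}

# Example (run as root on a cluster node):
#   /usr/sap/hostctrl/exe/saphostexec -status | host_agent_pids
```

Comparing the printed process IDs before and after the recovery shows which processes were restarted.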
Test Recovery of sapstartsrv for the SAP HANA Instance
To test the recovery of the SAP Start Service (sapstartsrv) for the SAP HANA instance, the service must be stopped. One method to stop sapstartsrv is by executing the sapcontrol StopService webmethod:
su - <sid>adm -c "sapcontrol -nr <Inst#> -function StopService"
where <sid> is the lower-case SAP System ID for the HANA installation and <Inst#> is the HDB instance number. Another method is to kill the sapstartsrv process directly. In either case, sapstartsrv will be restarted by the SAP HANA Recovery Kit. The resource does not change its state as long as sapstartsrv can be restarted successfully.
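The placeholder substitution described above can be sketched as a small helper that prints the resulting command line without running it. `sapcontrol_cmd` is a hypothetical name introduced for illustration; the SID `SPS` and instance number `00` in the usage comment are example values only.

```shell
#!/bin/sh
# Build the sapcontrol command line for a given SID, instance number and
# webmethod. The SID is lower-cased to form the <sid>adm user name.
# This helper only prints the command; it does not execute anything.
sapcontrol_cmd() {
    sid_lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    printf 'su - %sadm -c "sapcontrol -nr %s -function %s"\n' "$sid_lower" "$2" "$3"
}

# Example:
#   sapcontrol_cmd SPS 00 StopService
```

Printing the command first makes it easy to confirm the user name and instance number before running it as root.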
Test Recovery of the Secondary SAP HANA DB (Replication Target)
In the event of a failure of the secondary database instance (replication target) or if the secondary replication site is unregistered in SAP HANA System Replication, the recovery kit will re-register the secondary site with the previous replication and operation modes and restart the secondary database instance.
To induce such a failure, execute one of the following commands on the secondary replication site:
su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"
su - <sid>adm -c "hdbnsutil -sr_unregister"
The behavior can be observed by monitoring the log file /var/log/lifekeeper.log. After the recovery, the state of the database instance and SAP HANA System Replication can be tested by running the following commands on the secondary replication site:
su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList"
su - <sid>adm -c "hdbnsutil -sr_state"
In the event that the secondary database instance cannot be started by the recovery kit, the SAP HANA resource is flagged as Failed (OSF) on the corresponding node.
Once the cause of an unsuccessful start is fixed by an administrator, the SAP HANA Recovery Kit will start the database instance in the subsequent quickCheck cycle. Once started successfully, the resource state will be updated to Standby (OSU) on the corresponding node.
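When checking the result of this test, the GetProcessList output can be reduced to a single pass/fail answer. The sketch below assumes the usual comma-separated sapcontrol layout, in which a header line beginning with `name` is followed by one row per process with the display status (for example GREEN) in the third column. `hdb_all_green` is a hypothetical helper name, and the exact output layout should be confirmed on your system.

```shell
#!/bin/sh
# Succeed (exit 0) only if the GetProcessList output read from standard
# input contains at least one process row and every row reports GREEN.
# Assumption: rows look like
#   "hdbdaemon, HDB Daemon, GREEN, Running, <start>, <elapsed>, <pid>".
hdb_all_green() {
    awk -F', *' '
        NF >= 7 && $1 != "name" {      # skip banner lines and the header
            rows++
            if ($3 != "GREEN") bad++
        }
        END { exit !(rows > 0 && bad == 0) }
    '
}

# Example (run as root on the secondary replication site):
#   su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList" | hdb_all_green \
#       && echo "all HDB processes are GREEN"
```

The helper judges only the parsed rows, so it does not depend on the exit code of sapcontrol itself.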
Test Recovery of the Primary SAP HANA DB
In the event of a failure of the primary database instance (replication source), the replication mode of the database instance on the primary node is determined. If the replication mode is set to primary, the database instance will be started again. If the mode is not set to primary, the recovery kit will log a warning stating that the replication mode has been changed outside of LifeKeeper and suspend all monitoring of the SAP HANA resource until the issue is resolved. In the latter case, manual intervention is required to bring the HANA resource hierarchy in-service on the correct primary system. The behavior in this case can be observed in the LifeKeeper GUI, which will show the state “Active – HSR Disabled”, “Active – Unknown Repl Mode”, or “Active – Secondary” for the resource on the primary node, or by monitoring the log file /var/log/lifekeeper.log.
A failure of the primary database instance can be induced by running the following command on the primary replication site:
su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"
After the recovery, the state of the database and the replication can be tested by using:
su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList"
su - <sid>adm -c "hdbnsutil -sr_state"
In the event that the primary database instance cannot be started by the recovery kit on that node, LifeKeeper will initiate a failover of the entire hierarchy to the secondary node. On this node, the HANA Recovery Kit performs a takeover of SAP HANA System Replication and the previous secondary node becomes the new primary node for replication. LifeKeeper will attempt to re-register the faulty node as the secondary replication site using the previous replication and operation modes. If this is successful, the secondary database is also restarted. In the event that either the secondary node cannot be successfully registered as the secondary replication site or that the database cannot be successfully restarted on the secondary node, the HANA resource will be flagged as Failed (OSF) on the corresponding node. At this point, manual intervention is typically necessary to eliminate the cause of the failure. If the failover of the primary database instance fails, the resource is flagged as Failed (OSF) and remains in this state until a manual in-service operation is performed by an administrator.
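The local recovery decision described at the start of this section (restart only if the node is still the replication primary) can be paraphrased as a short sketch. It is an illustration of the documented behavior, not the recovery kit's actual implementation. It assumes that the `hdbnsutil -sr_state` output contains a line of the form `mode: <value>`; `sr_mode` and `primary_recovery_action` are hypothetical helper names.

```shell
#!/bin/sh
# Print the value of the "mode:" line from `hdbnsutil -sr_state` output
# read on standard input. The "operation mode:" line is deliberately ignored.
sr_mode() {
    awk -F': *' '
        { key = $1; sub(/^[ \t]+/, "", key) }       # tolerate indentation
        key == "mode" && !seen { print $2; seen = 1 }
    '
}

# Paraphrase of the documented decision: the database is restarted locally
# only when the node still reports itself as the replication primary.
primary_recovery_action() {
    if [ "$1" = "primary" ]; then
        echo "restart-database"
    else
        echo "suspend-monitoring"   # warning is logged; manual intervention needed
    fi
}

# Example (run as root on the primary node):
#   mode=$(su - <sid>adm -c "hdbnsutil -sr_state" | sr_mode)
#   primary_recovery_action "$mode"
```

Running the example before and after the test shows whether the replication mode was changed outside of LifeKeeper.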
Test Machine Failure of the Secondary Node (reboot -f, power off)
If an error causes the secondary node to fail, the resource remains Active (ISP) on the primary node but SAP HANA System Replication is disrupted. Once the secondary node is restarted and LifeKeeper is active, the secondary database instance is automatically restarted as a replication target.
Test Machine Failure of the Primary Node (reboot -f, power off)
If an error causes the primary node to fail, a failover of the HANA resource hierarchy to the secondary node is initiated. A takeover of SAP HANA System Replication is performed on the secondary node and the previous secondary replication site becomes the new primary replication site. Once the faulty node is restarted and LifeKeeper is active, the node is registered as a secondary replication site and the database instance is automatically restarted as a replication target.
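Because the rebooted node needs some time before LifeKeeper is active again and the database is re-registered, test scripts for the two machine failure scenarios usually poll rather than check once. The following generic retry helper is a sketch only; `wait_for` is a hypothetical name, and the attempt count and delay are values to tune for your environment.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt failed.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Sleep only if another attempt will follow.
        [ "$attempts" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Example: wait up to about five minutes for the rebooted node to answer
# (node1 is an example host name):
#   wait_for 30 10 ping -c 1 node1
```

The same helper can wrap any check that returns a meaningful exit status, such as a test of the replication state on the restarted node.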
Additional Test Cases
Before putting a highly-available SAP HANA cluster into production, it is very important that common failure and recovery scenarios have been thoroughly tested. The test cases provided on this page are meant to be used as a starting point when developing a comprehensive test plan for your highly-available SAP HANA cluster deployment. The following example values will be used throughout:
Parameter | Example Value |
---|---|
Primary Server Host Name | node1 |
Standby Server Host Name | node2 |
SAP SID | SPS |
SAP HANA Database Instance | HDB00 |
SAP HANA LifeKeeper Resource Tag Name | HANA-SPS_HDB00 |
HANA System Replication Primary Site Name | SiteA |
HANA System Replication Secondary Site Name | SiteB |
HANA System Replication Mode | sync |
HANA System Replication Operation Mode | logreplay |
When testing, these sample values must be adapted to fit the environment where the tests are being performed.
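One way to adapt the sample values is to keep the SID and instance name in one place and derive the other names from them. The sketch below assumes that the administrative user is `<sid>adm`, that the instance number is the instance name without the `HDB` prefix, and that the resource tag follows the `HANA-<SID>_<Instance>` pattern seen in the example table. The actual tag is chosen when the resource is created and may differ. `hana_names` is a hypothetical helper name.

```shell
#!/bin/sh
# Derive commonly needed names from the SAP SID and HDB instance name.
# Usage: hana_names <SID> <Instance>, for example: hana_names SPS HDB00
hana_names() {
    sid=$1
    instance=$2
    sid_lower=$(printf '%s' "$sid" | tr '[:upper:]' '[:lower:]')
    echo "ADMIN_USER=${sid_lower}adm"
    echo "INSTANCE_NUMBER=${instance#HDB}"      # HDB00 -> 00
    # Assumption: tag pattern taken from the example table above.
    echo "RESOURCE_TAG=HANA-${sid}_${instance}"
}

# Example:
#   hana_names SPS HDB00
```

With the example values, the helper prints the administrative user `spsadm`, the instance number `00`, and the resource tag `HANA-SPS_HDB00`.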
Manual Switchover
The test cases in this section ensure that manual switchovers can be performed successfully.
Manual Switchover Test |
---|
Description |
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server. |
Preconditions |
Before performing this test, ensure that the following conditions are met:
|
Test Steps |
|
Expected Results |
|
Handshake Takeover Test (Note: This test case requires LifeKeeper v9.5.2 or later.) |
---|
Description |
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server by using the SAP HANA “Takeover with Handshake” feature. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
|
Expected Results |
|
Graceful Shutdown
The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when each server is gracefully rebooted.
Primary Server Reboot Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the primary server is gracefully rebooted. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
|
Expected Results |
|
Standby Server Reboot Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the standby server is gracefully rebooted. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
|
Expected Results |
|
Machine Failover
The test case in this section verifies the expected behavior of the SAP HANA resource hierarchy when the primary server is forcefully rebooted.
Machine Failover Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the primary server is forcefully rebooted. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
|
Expected Results |
Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’. |
SAP Host Agent Failure
The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when supporting SAP Host Agent-related processes fail on each server.
Primary Server SAP Host Exec Failure Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the primary server. |
Preconditions |
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync. |
Test Steps |
Stop the saphostexec process on node1: [root@node1 ~]# /usr/sap/hostctrl/exe/saphostexec -stop |
Expected Results |
|
Standby Server SAP Host Exec Failure Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the standby server. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
|
Expected Results |
|
Primary Server SAP OS Collector Failure Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the primary server. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
Stop the saposcol process on node1: [root@node1 ~]# /usr/sap/hostctrl/exe/saposcol -k |
Expected Results |
|
Standby Server SAP OS Collector Failure Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the standby server. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
Stop the saposcol process on node2: [root@node2 ~]# /usr/sap/hostctrl/exe/saposcol -k |
Expected Results |
|
Database Failure
The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when processes within the protected HANA database instance fail.
Notes:
- Some of the expected results in this section depend on whether local recovery is enabled or disabled for the SAP HANA resource. Local recovery is enabled by default. See Setting Local and Temporal Recovery Policies for SAP HANA Resources for more details.
- If the database instance is stopped gracefully when performing these tests (e.g., with an HDB stop command), the background process which gracefully stops the database may conflict with the SAP HANA Recovery Kit’s attempt to restart the database locally and may lead to a failover of the SAP HANA resource hierarchy. For this reason, we recommend simulating a crash of the HDB instance by forcefully and immediately killing the database processes at the operating system level, for example with the HDB kill-9 command. See SAP HANA – Known Issues / Restrictions for more details.
Primary Server Database Instance Failure Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the primary server. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node1: [root@node1 ~]# su - spsadm -c "HDB kill-9" |
Expected Results |
|
Standby Server Database Instance Failure Test |
---|
Description |
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the standby server. |
Preconditions |
Before performing this test, ensure that:
|
Test Steps |
As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node2: [root@node2 ~]# su - spsadm -c "HDB kill-9" |
Expected Results |
|
Appendix: Useful SAP HANA Administrative Commands
While the status of the SAP HANA environment may be monitored through SAP-provided dashboards (e.g., HANA Studio or HANA Cockpit), the following commands may also be useful while testing. Throughout, <sid> denotes the lowercase SAP SID for the protected SAP HANA database installation and <InstNum> denotes the instance number of the protected HDB instance (e.g., for instance HDB00, <InstNum> is 00).
Command | Description |
---|---|
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StopService" | Stop the sapstartsrv process for the HDB instance. |
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StartService <SID>" | Start the sapstartsrv process for the HDB instance. |
su - <sid>adm -c "sapcontrol -nr <InstNum> -function GetProcessList" | View current status of HDB instance processes. |
su - <sid>adm -c "HDB stop" | Gracefully stop the HDB instance. |
su - <sid>adm -c "HDB kill-9" | Forcefully kill the HDB instance processes. |
su - <sid>adm -c "HDB start" | Start the HDB instance. |
su - <sid>adm -c "hdbnsutil -sr_state" | Check the HANA system replication state on the local server. |
su - <sid>adm -c "python /hana/shared/<SID>/HDB<InstNum>/exe/python_support/systemReplicationStatus.py" | Check the current HANA system replication status. This command must be executed on the server which is the primary HANA system replication site. |
su - <sid>adm -c "hdbsql -n <HANA virtual hostname> -i <InstNum> -u SYSTEM -p <SYSTEM user password> -d <SID> '\s'" | Test the connection to the <SID> tenant database through the associated virtual hostname. |
/usr/sap/hostctrl/exe/saphostexec -status | Check the status of saphostexec. |
/usr/sap/hostctrl/exe/saphostexec -stop | Stop saphostexec. |
/usr/sap/hostctrl/exe/saphostexec -restart | Restart saphostexec. |
/usr/sap/hostctrl/exe/saposcol -s | Check the status of saposcol. |
/usr/sap/hostctrl/exe/saposcol -k | Stop saposcol. |
/usr/sap/hostctrl/exe/saposcol -l | Start saposcol. |
top -U <sid>adm | View information for processes owned by the <sid>adm user. |
ps -ef \| grep <sid>adm | View information for processes owned by the <sid>adm user. |