Test Scenarios
To understand the behavior of the SAP HANA Recovery Kit, perform the following tests. The following prerequisites must be completed before performing any test:
- LifeKeeper and the SAP HANA database must be installed and configured according to the installation instructions provided by SIOS and SAP.
- SAP HANA System Replication must be enabled and active on all servers in the cluster, with the secondary replication site registered using one of the valid replication modes (sync, syncmem, or async) and operation modes (delta_datashipping, logreplay, or logreplay_readaccess). See Configure SAP HANA System Replication for more details.
- If managing the switchable IP address associated with the SAP HANA database with a LifeKeeper IP resource, there must exist a dependency of the SAP HANA resource on the IP resource. See Step 7 in Creating an SAP HANA Resource Hierarchy for more details.
Test Cases
Before putting a highly available SAP HANA cluster into production, thoroughly test common failure and recovery scenarios. The test cases on this page are a starting point for developing a comprehensive test plan for your highly available SAP HANA cluster deployment. The following example values are used throughout:
Primary Server Host Name | node1
Standby Server Host Name | node2
SAP SID | SPS
SAP HANA Database Instance | HDB00
SAP HANA LifeKeeper Resource Tag Name | HANA-SPS_HDB00
HANA System Replication Primary Site Name | SiteA
HANA System Replication Secondary Site Name | SiteB
HANA System Replication Mode | sync
HANA System Replication Operation Mode | logreplay
When testing, these sample values must be adapted to fit the environment where the tests are being performed.
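When scripting these tests, it can help to keep the sample values in one place. The following is a minimal sketch, not part of LifeKeeper: the variable names are illustrative, and the derived values assume the naming pattern seen in the examples on this page (a lowercase `<sid>adm` user and a `HANA-<SID>_<Instance>` tag). The resource tag is chosen when the resource is created, so confirm it against your own hierarchy.

```shell
# Example values from this page; adapt these to your environment.
PRIMARY_HOST="node1"
STANDBY_HOST="node2"
SAP_SID="SPS"
HDB_INSTANCE="HDB00"

# Derived values (naming pattern assumed from the examples on this page).
SIDADM="$(printf '%s' "$SAP_SID" | tr '[:upper:]' '[:lower:]')adm"
INST_NUM="${HDB_INSTANCE#HDB}"
HANA_TAG="HANA-${SAP_SID}_${HDB_INSTANCE}"

echo "tag=${HANA_TAG} user=${SIDADM} instance=${INST_NUM}"
```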
Note: The following test cases assume that the SAP HANA resource hierarchy is in-service on the primary server (node1). For full coverage, the same tests should also be performed while the resource hierarchy is in-service on node2. To test these scenarios with the server roles reversed, make the substitutions “primary” ↔ “standby”, “node1” ↔ “node2”, and “SiteA” ↔ “SiteB” in each of the test cases given below.
Manual Switchover
The test cases in this section ensure that manual switchovers can be performed successfully.
Manual Switchover Test |
Description |
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server. |
Preconditions |
Before performing this test, ensure that the following conditions are met:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Bring the SAP HANA resource hierarchy in-service on node2 by using either of the following methods:
- In the LifeKeeper GUI, right-click the HANA-SPS_HDB00 resource on node2 and select “In Service…” from the context menu. On the resulting confirmation dialog, click “In Service” to begin the switchover process.
- From a terminal window on node2, execute the following command as a user with lkadmin group permissions (e.g., as the root user):
[root@node2 ~]# sudo /opt/LifeKeeper/bin/lkcli resource restore --tag HANA-SPS_HDB00
|
Expected Results |
- The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, the database is stopped on node1 and any dependent resources (such as an associated virtual IP address, if applicable) are also removed on node1.
- The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
- Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
- The running database instance on node2 is promoted to the primary replication role in HANA system replication.
- The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
- The database instance is started on node1.
Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
|
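The precondition "HANA system replication is in-sync" can also be checked from a terminal on the primary site. The sketch below wraps the systemReplicationStatus.py command listed in the appendix and interprets its exit code. The function and variable names are illustrative, and the exit-code meanings (10 to 15) are those commonly documented by SAP for this script; verify them against your HANA revision before relying on them.

```shell
# Illustrative defaults taken from the example values on this page.
SIDADM="${SIDADM:-spsadm}"
SAP_SID="${SAP_SID:-SPS}"
INST_NUM="${INST_NUM:-00}"

# Map the exit code of systemReplicationStatus.py to a readable label.
hsr_status_label() {
    case "$1" in
        10) echo "NO_HSR" ;;
        11) echo "ERROR" ;;
        12) echo "UNKNOWN" ;;
        13) echo "INITIALIZING" ;;
        14) echo "SYNCING" ;;
        15) echo "ACTIVE" ;;
        *)  echo "UNEXPECTED" ;;
    esac
}

# Run the status script as <sid>adm (on the primary site, as root) and
# succeed only if replication is reported as active.
check_hsr_in_sync() {
    su - "$SIDADM" -c "python /hana/shared/${SAP_SID}/HDB${INST_NUM}/exe/python_support/systemReplicationStatus.py" >/dev/null 2>&1
    rc=$?
    echo "HSR status: $(hsr_status_label "$rc")"
    [ "$rc" -eq 15 ]
}
```

Note that the script signals "active" with a non-zero exit code, so its result cannot be used directly in an `if` statement; the wrapper converts it to a conventional success/failure status.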
Handshake Takeover Test
Note: This test case requires LifeKeeper v9.5.2 or later. |
Description |
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server by using the SAP HANA “Takeover with Handshake” feature. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Bring the SAP HANA resource hierarchy in-service on node2 by using either of the following methods:
- In the LifeKeeper GUI, right-click the HANA-SPS_HDB00 resource on node2 and select “In Service – Takeover with Handshake…” from the context menu. On the resulting confirmation dialog, click “Perform Takeover” to begin the switchover process.
- From a terminal window on either node1 or node2, execute the following command as a user with lkadmin group permissions (e.g., as the root user):
# sudo /opt/LifeKeeper/bin/lkcli resource config hana --tag HANA-SPS_HDB00 --takeover_with_handshake node2
|
Expected Results |
- The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process the database is not stopped on node1, but any dependent resources (such as an associated virtual IP address, if applicable) are removed on node1.
- The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
- Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
- The running database instance on node2 is promoted to the primary replication role in HANA system replication. During the HSR takeover process, the database instance on node1 is suspended.
- The suspended database instance on node1 is stopped.
- The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
- The database instance is started on node1.
Note: The process of stopping, registering, and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
|
Graceful Shutdown
The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when each server is gracefully rebooted.
Note: The expected results in the first test depend on whether the “Switchover on Shutdown” strategy is enabled or disabled for the primary server. See Setting Server Shutdown Strategy for more details.
Primary Server Reboot Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the primary server is gracefully rebooted. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Gracefully reboot node1:
[root@node1 ~]# reboot now
|
Expected Results |
- While node1 shuts down:
- The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, the database is stopped on node1 and any dependent resources (such as an associated virtual IP address, if applicable) are also removed on node1.
- [If “Switchover on Shutdown” is enabled on node1] The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
- Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
- The running database instance on node2 is promoted to the primary replication role in HANA system replication. Since node1 has been shut down, HANA system replication is currently inactive.
- After node1 is back online:
- [If “Switchover on Shutdown” is disabled on node1] The SAP HANA resource hierarchy automatically comes back in-service on node1. During this process:
- Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node1.
- The database instance on node1 is started and HANA system replication resumes to the secondary replication site on node2.
- [If “Switchover on Shutdown” is enabled on node1] The SAP HANA resource hierarchy remains in-service on node2.
- During the first quickCheck cycle on node2 after node1 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node1 and fires a ‘remoteregisterdb’ event.
- The ‘remoteregisterdb’ event script registers node1 as a secondary HANA system replication site using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’) and starts the database instance on node1.
Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
|
Standby Server Reboot Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the standby server is gracefully rebooted. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Gracefully reboot node2:
[root@node2 ~]# reboot now
|
Expected Results |
- While node2 shuts down, the database instance is stopped on node2. HANA system replication will be inactive until node2 reboots and the secondary database instance is restarted.
- After node2 is back online:
- During the first quickCheck cycle on node1 after node2 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node2 and fires a ‘remoteregisterdb’ event.
- The ‘remoteregisterdb’ event script starts the database instance on node2.
Note: The process of restarting the database on node2 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node2 as ‘Standby – In Sync’.
|
Machine Failover
The test case in this section verifies the expected behavior of the SAP HANA resource hierarchy when the primary server is forcefully rebooted.
Machine Failover Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the primary server is forcefully rebooted. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Forcefully reboot node1:
[root@node1 ~]# echo b > /proc/sysrq-trigger
|
Expected Results |
- Once LifeKeeper on node2 detects that node1 is down (the exact time will vary depending on the values being used for the LifeKeeper heartbeat parameters), the SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
- Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
- The running database instance on node2 is promoted to the primary replication role in HANA system replication. Since node1 has been shut down, HANA system replication is currently inactive.
- After node1 is back online, the SAP HANA resource hierarchy remains in-service on node2.
- During the first quickCheck cycle on node2 after node1 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node1 and fires a ‘remoteregisterdb’ event.
- The ‘remoteregisterdb’ event script registers node1 as a secondary HANA system replication site using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’) and starts the database instance on node1.
Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
|
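Because re-registering and restarting the database on the former primary may take several minutes, automated test runs usually need to poll rather than check once. The helper below is a generic sketch of such a polling loop; `wait_until` and the check function named in the usage comment are hypothetical and are not part of LifeKeeper or SAP HANA.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example (hypothetical check function supplied by you, returning 0 once
# the standby resource is reported as in sync):
#   wait_until 60 10 node1_is_in_sync || echo "node1 did not resync in time"
```

The attempt count and delay should be chosen with your LifeKeeper quickCheck interval and heartbeat settings in mind, since the recovery actions described above are triggered on those cycles.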
SAP Host Agent Failure
The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when supporting SAP Host Agent-related processes fail on each server.
Primary Server SAP Host Exec Failure Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the primary server. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Stop the saphostexec process on node1:
[root@node1 ~]# /usr/sap/hostctrl/exe/saphostexec -stop
|
Expected Results |
- During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saphostexec process is no longer running on node1 and fires a ‘recover’ event to restart it.
- The ‘recover’ event script restarts the saphostexec process on node1.
|
Standby Server SAP Host Exec Failure Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the standby server. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Stop the saphostexec process on node2:
[root@node2 ~]# /usr/sap/hostctrl/exe/saphostexec -stop
|
Expected Results |
- During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saphostexec process is no longer running on node2 and fires a ‘remoteregisterdb’ event to restart it.
- The ‘remoteregisterdb’ event script restarts the saphostexec process on node2.
|
Primary Server SAP OS Collector Failure Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the primary server. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Stop the saposcol process on node1:
[root@node1 ~]# /usr/sap/hostctrl/exe/saposcol -k
|
Expected Results |
- During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saposcol process is no longer running on node1 and fires a ‘recover’ event to restart it.
- The ‘recover’ event script restarts the saposcol process on node1.
|
Standby Server SAP OS Collector Failure Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the standby server. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- Stop the saposcol process on node2:
[root@node2 ~]# /usr/sap/hostctrl/exe/saposcol -k
|
Expected Results |
- During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saposcol process is no longer running on node2 and fires a ‘remoteregisterdb’ event to restart it.
- The ‘remoteregisterdb’ event script restarts the saposcol process on node2.
|
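To confirm that the SAP Host Agent processes have been restarted after each of the tests above, their presence can be checked at the operating system level. This is a minimal sketch using `pgrep`; it is an independent check for test purposes and is not necessarily how the SAP HANA Recovery Kit itself detects the processes. The `is_running` helper is an illustrative name.

```shell
# Return success if a process with exactly this name is running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Report the state of the two SAP Host Agent processes exercised above.
for proc in saphostexec saposcol; do
    if is_running "$proc"; then
        echo "$proc: running"
    else
        echo "$proc: not running"
    fi
done
```

Run this before stopping a process, immediately afterward, and again once the next quickCheck interval has elapsed to observe the recovery.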
Database Failure
The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when processes within the protected HANA database instance fail.
Notes:
- If the database instance is stopped gracefully when performing these tests (e.g., with an HDB stop command), the background process which gracefully stops the database may conflict with the SAP HANA Recovery Kit’s attempt to restart the database locally and may lead to a failover of the SAP HANA resource hierarchy. For this reason, we recommend simulating a crash of the HDB instance by forcefully and immediately killing the database processes at the operating system level, for example with the HDB kill-9 command. See SAP HANA – Known Issues / Restrictions for more details.
Primary Server Database Instance Failure Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the primary server. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node1:
[root@node1 ~]# su - spsadm -c "HDB kill-9"
|
Expected Results |
- [If local recovery is enabled for the HANA-SPS_HDB00 resource on node1] The SAP HANA Recovery Kit detects the failure and restarts the database instance on node1:
- During the next quickCheck interval, the SAP HANA Recovery Kit detects that the HDB instance processes are no longer running on node1 and fires a ‘recover’ event to restart them.
- The ‘recover’ event script restarts the HDB instance processes on node1.
- [If local recovery is disabled for the HANA-SPS_HDB00 resource on node1] LifeKeeper immediately initiates a failover of the SAP HANA resource hierarchy to node2:
- The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, any dependent resources (such as an associated virtual IP address, if applicable) are removed on node1.
- The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
- Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
- The running database instance on node2 is promoted to the primary replication role in HANA system replication.
- The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
- The database instance is started on node1.
Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
|
Standby Server Database Instance Failure Test |
Description |
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the standby server. |
Preconditions |
Before performing this test, ensure that:
- Both servers (node1 and node2) are operational,
- The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
- HANA system replication is in-sync.
|
Test Steps |
- As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node2:
[root@node2 ~]# su - spsadm -c "HDB kill-9"
|
Expected Results |
- During the next quickCheck interval, the SAP HANA Recovery Kit detects that the HDB instance processes are no longer running on node2 and fires a ‘remoteregisterdb’ event to restart them.
- The ‘remoteregisterdb’ event script restarts the HDB instance processes on node2.
|
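After killing the HDB instance processes, the sapcontrol GetProcessList command from the appendix can be used to observe whether the instance has been restarted. The sketch below interprets its exit code, assuming the convention documented by SAP for sapcontrol (3 means all processes are running, 4 means all processes are stopped). The function name is illustrative.

```shell
# Translate the exit code of "sapcontrol -function GetProcessList"
# into a short description of the instance state.
hdb_state_from_rc() {
    case "$1" in
        3) echo "all processes running" ;;
        4) echo "all processes stopped" ;;
        *) echo "indeterminate (rc=$1)" ;;
    esac
}

# Example, run as root on the server under test (values from this page):
#   su - spsadm -c "sapcontrol -nr 00 -function GetProcessList" >/dev/null
#   hdb_state_from_rc "$?"
```

Any other exit code is reported as indeterminate, which is the expected state while the instance is partway through a restart.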
Appendix: Useful SAP HANA Administrative Commands
While the status of the SAP HANA environment may be monitored through SAP-provided dashboards (e.g., HANA Studio or HANA Cockpit), the following commands may also be useful while testing. Throughout, <sid> denotes the lowercase SAP SID for the protected SAP HANA database installation, <SID> denotes the same SAP SID in uppercase, and <InstNum> denotes the instance number of the protected HDB instance (e.g., for instance HDB00, <InstNum> is 00).
Command |
Description |
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StopService" |
Stop the sapstartsrv process for the HDB instance. |
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StartService <SID>" |
Start the sapstartsrv process for the HDB instance. |
su - <sid>adm -c "sapcontrol -nr <InstNum> -function GetProcessList" |
View current status of HDB instance processes. |
su - <sid>adm -c "HDB stop" |
Gracefully stop the HDB instance. |
su - <sid>adm -c "HDB kill-9" |
Forcefully kill the HDB instance processes. |
su - <sid>adm -c "HDB start" |
Start the HDB instance. |
su - <sid>adm -c "hdbnsutil -sr_state" |
Check the HANA system replication state on the local server. |
su - <sid>adm -c "python /hana/shared/<SID>/HDB<InstNum>/exe/python_support/systemReplicationStatus.py" |
Check the current HANA system replication status. This command must be executed on the server which is the primary HANA system replication site. |
su - <sid>adm -c "hdbsql -n <HANA virtual hostname> -i <InstNum> -u SYSTEM -p <SYSTEM user password> -d <SID> '\s'" |
Test the connection to the <SID> tenant database through the associated virtual hostname. |
/usr/sap/hostctrl/exe/saphostexec -status |
Check the status of saphostexec. |
/usr/sap/hostctrl/exe/saphostexec -stop |
Stop saphostexec. |
/usr/sap/hostctrl/exe/saphostexec -restart |
Restart saphostexec. |
/usr/sap/hostctrl/exe/saposcol -s |
Check the status of saposcol. |
/usr/sap/hostctrl/exe/saposcol -k |
Stop saposcol. |
/usr/sap/hostctrl/exe/saposcol -l |
Start saposcol. |
top -U <sid>adm
ps -ef | grep <sid>adm |
View information for processes owned by the <sid>adm user. |