Testing your SAP HANA Resource Hierarchy

Test Scenarios

To understand the behavior of the SAP HANA Recovery Kit, perform the following tests. The following prerequisites must be completed before performing any test:

LifeKeeper and the SAP HANA database must be installed and configured according to the installation instructions provided by SIOS and SAP.

SAP HANA System Replication must be enabled and active on all servers in the cluster, with the secondary replication site registered using one of the valid replication modes (sync, syncmem, or async) and operation modes (delta_datashipping, logreplay, or logreplay_readaccess). See Configure SAP HANA System Replication for more details.

If a LifeKeeper IP resource manages the switchable IP address associated with the SAP HANA database, the SAP HANA resource must have a dependency on that IP resource. See Step 7 in Creating an SAP HANA Resource Hierarchy for more details.

Test Cases

Before putting a highly available SAP HANA cluster into production, thoroughly test common failure and recovery scenarios. The test cases on this page are a starting point for developing a comprehensive test plan for your highly available SAP HANA cluster deployment. The following example values are used throughout:

Primary Server Host Name	node1
Standby Server Host Name	node2
SAP SID	SPS
SAP HANA Database Instance	HDB00
SAP HANA LifeKeeper Resource Tag Name	HANA-SPS_HDB00
HANA System Replication Primary Site Name	SiteA
HANA System Replication Secondary Site Name	SiteB
HANA System Replication Mode	sync
HANA System Replication Operation Mode	logreplay

When testing, these sample values must be adapted to fit the environment where the tests are being performed.
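
When adapting the sample values, it can help to define them once and derive the related names from them. The following is a minimal sketch, not part of the SAP HANA Recovery Kit: the variable names are hypothetical, and it only applies the conventions stated on this page (the administrative user is the lowercase SID followed by adm, and the instance number of HDB00 is 00).

```shell
#!/bin/sh
# Example values from the table above; adapt these to your environment.
SID="SPS"
INSTANCE="HDB00"
HANA_TAG="HANA-SPS_HDB00"

# <sid>adm: the lowercase SID followed by "adm".
SIDADM="$(printf '%s' "$SID" | tr '[:upper:]' '[:lower:]')adm"

# <InstNum>: the instance name with its "HDB" prefix removed.
INST_NUM="${INSTANCE#HDB}"

echo "admin user: $SIDADM, instance number: $INST_NUM, resource tag: $HANA_TAG"
```

The derived values can then be substituted into the commands used in the test cases, for example su - "$SIDADM" -c "HDB kill-9".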

Manual Switchover

The test cases in this section ensure that manual switchovers can be performed successfully.

Manual Switchover Test
Description
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Bring the SAP HANA resource hierarchy in-service on node2 by using either of the following methods:
  In the LifeKeeper GUI, right-click the HANA-SPS_HDB00 resource on node2 and select “In Service…” from the context menu. On the resulting confirmation dialog, click “In Service” to begin the switchover process.
  From a terminal window on node2, execute the following command as a user with lkadmin group permissions (e.g., as the root user): [root@node2 ~]# sudo /opt/LifeKeeper/bin/lkcli resource restore --tag HANA-SPS_HDB00
Expected Results
The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, the database is stopped on node1 and any dependent resources (such as an associated virtual IP address, if applicable) are also removed on node1.
The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
  Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
  The running database instance on node2 is promoted to the primary replication role in HANA system replication.
  The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
  The database instance is started on node1.
  Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
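
The replication roles described above can also be checked from a terminal with hdbnsutil -sr_state (see the appendix). The following is a minimal sketch of a helper that extracts a single field from that output. The name sr_field is hypothetical, and the sketch assumes the output contains lines of the form "field name: value" (such as "mode: sync"), which should be confirmed against the output of your HANA version.

```shell
#!/bin/sh
# sr_field NAME
# Reads `hdbnsutil -sr_state` output on stdin and prints the value of the
# first line whose field name (the text before the colon) equals NAME.
# Assumption: fields appear as "name: value", one per line.
sr_field() {
    awk -F': *' -v key="$1" '
        { name = $1; sub(/^[ \t]+/, "", name) }   # ignore leading indentation
        name == key { print $2; exit }
    '
}
```

For example, after the switchover completes, running su - spsadm -c "hdbnsutil -sr_state" | sr_field mode on node1 would be expected to print sync, and the same command on node2 would be expected to print primary. Values that themselves contain a colon would be truncated by this simple parser.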

Handshake Takeover Test
Note: This test case requires LifeKeeper v9.5.2 or later.
Description
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server by using the SAP HANA “Takeover with Handshake” feature.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Bring the SAP HANA resource hierarchy in-service on node2 by using either of the following methods:
  In the LifeKeeper GUI, right-click the HANA-SPS_HDB00 resource on node2 and select “In Service – Takeover with Handshake…” from the context menu. On the resulting confirmation dialog, click “Perform Takeover” to begin the switchover process.
  From a terminal window on either node1 or node2, execute the following command as a user with lkadmin group permissions (e.g., as the root user): # sudo /opt/LifeKeeper/bin/lkcli resource config hana --tag HANA-SPS_HDB00 --takeover_with_handshake node2
Expected Results
The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process the database is not stopped on node1, but any dependent resources (such as an associated virtual IP address, if applicable) are removed on node1.
The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
  Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
  The running database instance on node2 is promoted to the primary replication role in HANA system replication. During the HSR takeover process, the database instance on node1 is suspended.
  The suspended database instance on node1 is stopped.
  The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
  The database instance is started on node1.
  Note: The process of stopping, registering, and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.

Graceful Shutdown

The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when each server is gracefully rebooted.

Primary Server Reboot Test
Description
The SAP HANA resource hierarchy behaves as expected when the primary server is gracefully rebooted.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Gracefully reboot node1: [root@node1 ~]# reboot now
Expected Results
While node1 shuts down:
  The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, the database is stopped on node1 and any dependent resources (such as an associated virtual IP address, if applicable) are also removed on node1.
  [If “Switchover on Shutdown” is enabled on node1] The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
    Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
    The running database instance on node2 is promoted to the primary replication role in HANA system replication.
    Since node1 has been shut down, HANA system replication is currently inactive.
After node1 is back online:
  [If “Switchover on Shutdown” is disabled on node1] The SAP HANA resource hierarchy automatically comes back in-service on node1. During this process:
    Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node1.
    The database instance on node1 is started and HANA system replication resumes to the secondary replication site on node2.
  [If “Switchover on Shutdown” is enabled on node1] The SAP HANA resource hierarchy remains in-service on node2.
    During the first quickCheck cycle on node2 after node1 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node1 and fires a ‘remoteregisterdb’ event.
    The ‘remoteregisterdb’ event script registers node1 as a secondary HANA system replication site using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’) and starts the database instance on node1.
    Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.

Standby Server Reboot Test
Description
The SAP HANA resource hierarchy behaves as expected when the standby server is gracefully rebooted.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Gracefully reboot node2: [root@node2 ~]# reboot now
Expected Results
While node2 shuts down, the database instance is stopped on node2. HANA system replication will be inactive until node2 reboots and the secondary database instance is restarted. After node2 is back online: During the first quickCheck cycle on node1 after node2 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node2 and fires a ‘remoteregisterdb’ event. The ‘remoteregisterdb’ event script starts the database instance on node2. Note: The process of restarting the database on node2 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node2 as ‘Standby – In Sync’.

Machine Failover

The test case in this section verifies the expected behavior of the SAP HANA resource hierarchy when the primary server is forcefully rebooted.

Machine Failover Test
Description
The SAP HANA resource hierarchy behaves as expected when the primary server is forcefully rebooted.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Forcefully reboot node1: [root@node1 ~]# echo b > /proc/sysrq-trigger
Expected Results
Once LifeKeeper on node2 detects that node1 is down (the exact time will vary depending on the values being used for the LifeKeeper heartbeat parameters), the SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
  Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
  The running database instance on node2 is promoted to the primary replication role in HANA system replication.
  Since node1 has been shut down, HANA system replication is currently inactive.
After node1 is back online, the SAP HANA resource hierarchy remains in-service on node2.
  During the first quickCheck cycle on node2 after node1 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node1 and fires a ‘remoteregisterdb’ event.
  The ‘remoteregisterdb’ event script registers node1 as a secondary HANA system replication site using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’) and starts the database instance on node1.
  Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
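
Because failure detection time varies and re-registration may take several minutes, a scripted version of this test would typically poll for the expected state rather than check once. The following is a minimal sketch of a generic polling helper. The name wait_until is hypothetical and is not a LifeKeeper or SAP command.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds (returns 0) or until roughly
# TIMEOUT_SECONDS have elapsed (returns 1). The time spent inside COMMAND
# itself is not counted, so the timeout is approximate.
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}
```

As an illustration, wait_until 600 15 sh -c 'su - spsadm -c "hdbnsutil -sr_state" | grep -q "^mode: sync"' run on node1 would wait up to about ten minutes for node1 to report itself as a sync secondary. The grep pattern assumes that the hdbnsutil output contains a line beginning with "mode: ", which should be confirmed for your HANA version.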

SAP Host Agent Failure

The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when supporting SAP Host Agent-related processes fail on each server.

Primary Server SAP Host Exec Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the primary server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Stop the saphostexec process on node1: [root@node1 ~]# /usr/sap/hostctrl/exe/saphostexec -stop
Expected Results
During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saphostexec process is no longer running on node1 and fires a ‘recover’ event to restart it. The ‘recover’ event script restarts the saphostexec process on node1.

Standby Server SAP Host Exec Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the standby server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Stop the saphostexec process on node2: [root@node2 ~]# /usr/sap/hostctrl/exe/saphostexec -stop
Expected Results
During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saphostexec process is no longer running on node2 and fires a ‘remoteregisterdb’ event to restart it. The ‘remoteregisterdb’ event script restarts the saphostexec process on node2.

Primary Server SAP OS Collector Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the primary server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Stop the saposcol process on node1: [root@node1 ~]# /usr/sap/hostctrl/exe/saposcol -k
Expected Results
During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saposcol process is no longer running on node1 and fires a ‘recover’ event to restart it. The ‘recover’ event script restarts the saposcol process on node1.

Standby Server SAP OS Collector Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the standby server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
Stop the saposcol process on node2: [root@node2 ~]# /usr/sap/hostctrl/exe/saposcol -k
Expected Results
During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saposcol process is no longer running on node2 and fires a ‘remoteregisterdb’ event to restart it. The ‘remoteregisterdb’ event script restarts the saposcol process on node2.

Database Failure

The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when processes within the protected HANA database instance fail.

Notes:

Some of the expected results in this section depend on whether local recovery is enabled or disabled for the SAP HANA resource. Local recovery is enabled by default. See Setting Local and Temporal Recovery Policies for SAP HANA Resources for more details.

If the database instance is stopped gracefully when performing these tests (e.g., with an HDB stop command), the background process which gracefully stops the database may conflict with the SAP HANA Recovery Kit’s attempt to restart the database locally and may lead to a failover of the SAP HANA resource hierarchy. For this reason, we recommend simulating a crash of the HDB instance by forcefully and immediately killing the database processes at the operating system level, for example with the HDB kill-9 command. See SAP HANA – Known Issues / Restrictions for more details.

Primary Server Database Instance Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the primary server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node1: [root@node1 ~]# su - spsadm -c "HDB kill-9"
Expected Results
[If local recovery is enabled for the HANA-SPS_HDB00 resource on node1] The SAP HANA Recovery Kit detects the failure and restarts the database instance on node1:
  During the next quickCheck interval, the SAP HANA Recovery Kit detects that the HDB instance processes are no longer running on node1 and fires a ‘recover’ event to restart them.
  The ‘recover’ event script restarts the HDB instance processes on node1.
[If local recovery is disabled for the HANA-SPS_HDB00 resource on node1] LifeKeeper immediately initiates a failover of the SAP HANA resource hierarchy to node2:
  The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, any dependent resources (such as an associated virtual IP address, if applicable) are removed on node1.
  The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
    Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
    The running database instance on node2 is promoted to the primary replication role in HANA system replication.
    The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
    The database instance is started on node1.
    Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.

Standby Server Database Instance Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the standby server.
Preconditions
Before performing this test, ensure that both servers (node1 and node2) are operational, the SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and HANA system replication is in-sync.
Test Steps
As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node2: [root@node2 ~]# su - spsadm -c "HDB kill-9"
Expected Results
During the next quickCheck interval, the SAP HANA Recovery Kit detects that the HDB instance processes are no longer running on node2 and fires a ‘remoteregisterdb’ event to restart them. The ‘remoteregisterdb’ event script restarts the HDB instance processes on node2.
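
To confirm that the HDB instance processes have been restarted, the output of sapcontrol -nr <InstNum> -function GetProcessList (see the appendix) can be inspected. The following is a minimal sketch of a helper that checks whether every listed process reports GREEN. The name all_green is hypothetical, and the sketch assumes the comma-separated output layout "name, description, dispstatus, textstatus, starttime, elapsedtime, pid", which should be confirmed against the output on your system.

```shell
#!/bin/sh
# all_green
# Reads `sapcontrol ... -function GetProcessList` output on stdin.
# Returns 0 only if at least one process row is present and every row has
# dispstatus GREEN; returns 1 otherwise (including when no rows are listed).
# Assumption: process rows have at least 7 comma-separated fields, with
# dispstatus in the third field.
all_green() {
    awk -F', *' '
        NF >= 7 && $1 != "name" {        # skip banner lines and the header row
            seen = 1
            if ($3 != "GREEN") bad = 1
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}
```

For example, su - spsadm -c "sapcontrol -nr 00 -function GetProcessList" | all_green && echo "HDB instance is running" could be run on the server under test after the quickCheck interval has passed.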

Appendix: Useful SAP HANA Administrative Commands

While the status of the SAP HANA environment may be monitored through SAP-provided dashboards (e.g., HANA Studio or HANA Cockpit), the following commands may also be useful while testing. Throughout, <sid> denotes the lowercase SAP SID for the protected SAP HANA database installation, <SID> denotes the same SID in uppercase, and <InstNum> denotes the instance number of the protected HDB instance (e.g., for instance HDB00, <InstNum> is 00).

Command	Description
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StopService"	Stop the sapstartsrv process for the HDB instance.
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StartService <SID>"	Start the sapstartsrv process for the HDB instance.
su - <sid>adm -c "sapcontrol -nr <InstNum> -function GetProcessList"	View current status of HDB instance processes.
su - <sid>adm -c "HDB stop"	Gracefully stop the HDB instance.
su - <sid>adm -c "HDB kill-9"	Forcefully kill the HDB instance processes.
su - <sid>adm -c "HDB start"	Start the HDB instance.
su - <sid>adm -c "hdbnsutil -sr_state"	Check the HANA system replication state on the local server.
su - <sid>adm -c "python /hana/shared/<SID>/HDB<InstNum>/exe/python_support/systemReplicationStatus.py"	Check the current HANA system replication status. This command must be executed on the server which is the primary HANA system replication site.
su - <sid>adm -c "hdbsql -n <HANA virtual hostname> -i <InstNum> -u SYSTEM -p <SYSTEM user password> -d <SID> '\s'"	Test the connection to the <SID> tenant database through the associated virtual hostname.
/usr/sap/hostctrl/exe/saphostexec -status	Check the status of saphostexec.
/usr/sap/hostctrl/exe/saphostexec -stop	Stop saphostexec.
/usr/sap/hostctrl/exe/saphostexec -restart	Restart saphostexec.
/usr/sap/hostctrl/exe/saposcol -s	Check the status of saposcol.
/usr/sap/hostctrl/exe/saposcol -k	Stop saposcol.
/usr/sap/hostctrl/exe/saposcol -l	Start saposcol.
top -U <sid>adm or ps -ef | grep <sid>adm	View information for processes owned by the <sid>adm user.
