Test Scenarios

To understand the behavior of the SAP HANA Recovery Kit, perform the following tests. The following prerequisites must be completed before performing any test:

  • LifeKeeper and the SAP HANA database must be installed and configured according to the installation instructions provided by SIOS and SAP.
  • SAP HANA System Replication must be enabled and active on all servers in the cluster, with the secondary replication site registered using one of the valid replication modes (sync, syncmem, or async) and operation modes (delta_datashipping, logreplay, or logreplay_readaccess). See Configure SAP HANA System Replication for more details.
  • If the switchable IP address associated with the SAP HANA database is managed by a LifeKeeper IP resource, the SAP HANA resource must have a dependency on that IP resource. See Step 7 in Creating an SAP HANA Resource Hierarchy for more details.

Test Recovery of SAP Host Agent

Determine the status and the process IDs of the SAP Host Agent processes by using:

# /usr/sap/hostctrl/exe/saphostexec -status

saphostexec running (pid = 3818)

sapstartsrv running (pid = 3867)

saposcol running (pid = 3965)

Either manually kill one of the processes listed in the output or execute

/usr/sap/hostctrl/exe/saphostexec -stop

to impair the functionality of SAP Host Agent. The SAP HANA Recovery Kit will recognize that SAP Host Agent is not working properly and restart it on that node. The behavior can be observed by monitoring the LifeKeeper log with the following command:

tail -f /var/log/lifekeeper.log

During this recovery process, the SAP HANA resource does not change its state. After a successful recovery, SAP Host Agent is fully functional again. If the recovery kit is unable to restart SAP Host Agent, the HANA database and the resource remain in their current state. SAP Host Agent will be checked again later and restarted if possible.
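To confirm that a process was actually restarted, it helps to compare its PID before and after the recovery. The following is a minimal sketch, assuming the status output has exactly the form shown in the sample above (`<process> running (pid = <n>)`); the function names `hostagent_pids` and `hostagent_pid_of` are hypothetical helpers, not part of LifeKeeper or SAP Host Agent.

```shell
# Hypothetical helper: reads "saphostexec -status" output on stdin and
# prints "<process> <pid>" for every process reported as running.
# Assumes the line format shown in the sample output above.
hostagent_pids() {
    sed -n 's/^\([A-Za-z_]*\) running (pid = \([0-9]*\)).*$/\1 \2/p'
}

# Hypothetical helper: prints the PID of one named process,
# or nothing if that process is not listed as running.
hostagent_pid_of() {
    hostagent_pids | awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, piping `/usr/sap/hostctrl/exe/saphostexec -status` into `hostagent_pid_of saposcol` before killing the process and again after the next quickCheck cycle should yield two different PIDs if the recovery succeeded.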

Test Recovery of sapstartsrv for the SAP HANA Instance

To test the recovery of the SAP Start Service (sapstartsrv) for the SAP HANA instance, the service must be stopped. One method to stop sapstartsrv is by executing the sapcontrol StopService webmethod:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function StopService"

where <sid> is the lower-case SAP System ID for the HANA installation and <Inst#> is the HDB instance number. Another method is to kill the sapstartsrv process directly. In either case, sapstartsrv will be restarted by the SAP HANA Recovery Kit. The resource does not change its state as long as sapstartsrv can be restarted successfully.
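Because the same sapcontrol pattern recurs throughout these tests with different placeholders, a small dry-run helper can reduce substitution mistakes. This is only a sketch: `sapcontrol_cmd` is a hypothetical name, and it prints the command for review rather than running it.

```shell
# Hypothetical helper: prints (does not execute) the sapcontrol call for a
# given lower-case SID, instance number, and webmethod name.
sapcontrol_cmd() {
    sid=$1
    inst=$2
    func=$3
    printf 'su - %sadm -c "sapcontrol -nr %s -function %s"\n' "$sid" "$inst" "$func"
}

# Example: sapcontrol_cmd sps 00 StopService
```

Printing instead of executing is a deliberate choice here, since stopping services on the wrong node during a test run is easy to do and hard to undo.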

Test Recovery of the Secondary SAP HANA DB (Replication Target)

In the event of a failure of the secondary database instance (replication target) or if the secondary replication site is unregistered in SAP HANA System Replication, the recovery kit will re-register the secondary site with the previous replication and operation modes and restart the secondary database instance.

To induce such a failure, execute one of the following commands on the secondary replication site:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"

su - <sid>adm -c "hdbnsutil -sr_unregister"

The behavior can be observed by monitoring the log file /var/log/lifekeeper.log. After the recovery, the state of the database instance and SAP HANA System Replication can be tested by running the following commands on the secondary replication site:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList"

su - <sid>adm -c "hdbnsutil -sr_state"

In the event that the secondary database instance cannot be started by the recovery kit, the SAP HANA resource is flagged as Failed (OSF) on the corresponding node.

Once the cause of an unsuccessful start is fixed by an administrator, the SAP HANA Recovery Kit will start the database instance in the subsequent quickCheck cycle. Once started successfully, the resource state will be updated to Standby (OSU) on the corresponding node.
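When checking whether the secondary site was re-registered with the previous modes, it can be convenient to pull individual fields out of the `hdbnsutil -sr_state` output. The sketch below assumes the output contains `key: value` lines such as `mode: sync`, `operation mode: logreplay`, and `site name: SiteB`; the exact layout may differ between HANA revisions, and `sr_state_field` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "hdbnsutil -sr_state" output on stdin and prints
# the value of the first line whose key exactly matches the argument.
# Assumes "key: value" lines; leading whitespace is tolerated.
sr_state_field() {
    awk -F': ' -v key="$1" '
        { sub(/^[ \t]+/, "") }                      # strip indentation
        !found && $1 == key { print $2; found = 1 } # exact key match only
    '
}

# Example (on the secondary site):
#   su - <sid>adm -c "hdbnsutil -sr_state" | sr_state_field "operation mode"
```

Matching the key exactly keeps `mode` from also matching `operation mode`.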

Test Recovery of the Primary SAP HANA DB

In the event of a failure of the primary database instance (replication source), the replication mode of the database instance on the primary node is determined. If the replication mode is set to primary, the database instance will be started again. If the mode is not set to primary, the recovery kit will log a warning stating that the replication mode has been changed outside of LifeKeeper and suspend all monitoring of the SAP HANA resource until the issue is resolved. In the latter case, manual intervention is required to bring the HANA resource hierarchy in-service on the correct primary system. The behavior in this case can be observed in the LifeKeeper GUI, which will show the state “Active – HSR Disabled”, “Active – Unknown Repl Mode”, or “Active – Secondary” for the resource on the primary node, or by monitoring the log file /var/log/lifekeeper.log.

A failure of the primary database instance can be induced by running the following command on the primary replication site:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"

After the recovery, the state of the database and the replication can be tested by using:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList"

su - <sid>adm -c "hdbnsutil -sr_state"

In the event that the primary database instance cannot be started by the recovery kit on that node, LifeKeeper will initiate a failover of the entire hierarchy to the secondary node. On this node, the HANA Recovery Kit performs a takeover of SAP HANA System Replication and the previous secondary node becomes the new primary node for replication. LifeKeeper will attempt to re-register the faulty node as the secondary replication site using the previous replication and operation modes. If this is successful, the secondary database is also restarted. If the new secondary node cannot be registered as the secondary replication site, or if the database cannot be restarted on it, the HANA resource will be flagged as Failed (OSF) on the corresponding node. At this point, manual intervention is typically necessary to eliminate the cause of the failure. If the failover of the primary database instance itself fails, the resource is flagged as Failed (OSF) and remains in this state until a manual in-service operation is performed by an administrator.
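The GetProcessList check mentioned above can be turned into a pass/fail test. The following sketch assumes the usual comma-separated sapcontrol layout with seven columns (name, description, dispstatus, textstatus, starttime, elapsedtime, pid) and treats the instance as healthy only if every listed process reports GREEN; `hdb_all_green` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "sapcontrol ... -function GetProcessList" output
# on stdin. Returns 0 if at least one process row is present and every row
# has dispstatus GREEN; returns 1 otherwise.
# Assumes rows of the form:
#   name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdb_all_green() {
    awk -F', ' '
        NF == 7 && $1 != "name" {       # skip banner lines and the header row
            total++
            if ($3 != "GREEN") notgreen++
        }
        END { exit (total > 0 && notgreen == 0) ? 0 : 1 }
    '
}

# Example:
#   su - <sid>adm -c "sapcontrol -nr <Inst#> -function GetProcessList" | hdb_all_green \
#       && echo "all HDB processes are GREEN"
```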

Test Machine Failure of the Secondary Node (reboot -f, power off)

If an error causes the secondary node to fail, the resource remains Active (ISP) on the primary node but SAP HANA System Replication is disrupted. Once the secondary node is restarted and LifeKeeper is active, the secondary database instance is automatically restarted as a replication target.

Test Machine Failure of the Primary Node (reboot -f, power off)

If an error causes the primary node to fail, a failover of the HANA resource hierarchy to the secondary node is initiated. A takeover of SAP HANA System Replication is performed on the secondary node and the previous secondary replication site becomes the new primary replication site. Once the faulty node is restarted and LifeKeeper is active, the node is registered as a secondary replication site and the database instance is automatically restarted as a replication target.

Additional Test Cases

Before putting a highly-available SAP HANA cluster into production, it is very important that common failure and recovery scenarios have been thoroughly tested. The test cases provided on this page are meant to be used as a starting point when developing a comprehensive test plan for your highly-available SAP HANA cluster deployment. The following example values will be used throughout:

Primary Server Host Name                        node1
Standby Server Host Name                        node2
SAP SID                                         SPS
SAP HANA Database Instance                      HDB00
SAP HANA LifeKeeper Resource Tag Name           HANA-SPS_HDB00
HANA System Replication Primary Site Name       SiteA
HANA System Replication Secondary Site Name     SiteB
HANA System Replication Mode                    sync
HANA System Replication Operation Mode          logreplay

When testing, these sample values must be adapted to fit the environment where the tests are being performed.
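One way to adapt the sample values is to keep them in shell variables and derive the dependent names from them. This is a sketch only: the variable names are arbitrary, and the derived tag simply mirrors the example tag in the table above. The real tag is whatever was chosen when the resource was created.

```shell
# Example values from the table above; adjust to the environment under test.
PRIMARY_NODE="node1"
STANDBY_NODE="node2"
SID="SPS"                 # SAP SID (upper case)
INSTANCE="HDB00"          # SAP HANA database instance

# Derived values (assumptions that follow the naming used in the examples).
SIDADM="$(printf '%s' "$SID" | tr '[:upper:]' '[:lower:]')adm"   # spsadm
INST_NUM="${INSTANCE#HDB}"                                        # 00
LK_TAG="HANA-${SID}_${INSTANCE}"                                  # HANA-SPS_HDB00

echo "admin user: $SIDADM, instance number: $INST_NUM, resource tag: $LK_TAG"
```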

Manual Switchover

The test cases in this section ensure that manual switchovers can be performed successfully.


Manual Switchover Test
Description
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server.
Preconditions
Before performing this test, ensure that the following conditions are met:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Bring the SAP HANA resource hierarchy in-service on node2 by using either of the following methods:

    1. In the LifeKeeper GUI, right-click the HANA-SPS_HDB00 resource on node2 and select “In Service…” from the context menu. On the resulting confirmation dialog, click “In Service” to begin the switchover process.

    2. From a terminal window on node2, execute the following command as a user with lkadmin group permissions (e.g., as the root user):

      [root@node2 ~]# sudo /opt/LifeKeeper/bin/lkcli resource restore --tag HANA-SPS_HDB00
Expected Results
  1. The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, the database is stopped on node1 and any dependent resources (such as an associated virtual IP address, if applicable) are also removed on node1.

  2. The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
    1. Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
    2. The running database instance on node2 is promoted to the primary replication role in HANA system replication.
    3. The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
    4. The database instance is started on node1.

Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
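Since registration and restart on node1 may take several minutes, a scripted test usually needs to poll rather than check once. The following generic retry loop is a minimal sketch; `wait_until` is a hypothetical helper, and the condition command in the example comment assumes that `hdbnsutil -sr_state` on node1 prints a `mode: sync` line once the node is registered as secondary.

```shell
# Hypothetical helper: wait_until <timeout_s> <interval_s> <command...>
# Re-runs the command every <interval_s> seconds until it succeeds (returns 0)
# or roughly <timeout_s> seconds have elapsed (returns 1).
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}

# Example (on node1, up to 10 minutes, checking every 15 seconds):
#   wait_until 600 15 sh -c 'su - spsadm -c "hdbnsutil -sr_state" | grep -q "mode: sync"'
```

The elapsed time counts only the sleeps, not the run time of the condition command, so the effective timeout is approximate.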

Handshake Takeover Test
Note: This test case requires LifeKeeper v9.5.2 or later.
Description
The SAP HANA resource hierarchy can be manually switched over from the primary server to the standby server by using the SAP HANA “Takeover with Handshake” feature.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Bring the SAP HANA resource hierarchy in-service on node2 by using either of the following methods:

    1. In the LifeKeeper GUI, right-click the HANA-SPS_HDB00 resource on node2 and select “In Service – Takeover with Handshake…” from the context menu. On the resulting confirmation dialog, click “Perform Takeover” to begin the switchover process.

    2. From a terminal window on either node1 or node2, execute the following command as a user with lkadmin group permissions (e.g., as the root user):

      # sudo /opt/LifeKeeper/bin/lkcli resource config hana --tag HANA-SPS_HDB00 --takeover_with_handshake node2
Expected Results
  1. The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process the database is not stopped on node1, but any dependent resources (such as an associated virtual IP address, if applicable) are removed on node1.

  2. The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
    1. Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
    2. The running database instance on node2 is promoted to the primary replication role in HANA system replication. During the HSR takeover process, the database instance on node1 is suspended.
    3. The suspended database instance on node1 is stopped.
    4. The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
    5. The database instance is started on node1.

Note: The process of stopping, registering, and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.
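The two command-line switchover variants described in the Manual Switchover Test and the Handshake Takeover Test differ only in the lkcli subcommand and arguments. The sketch below prints the appropriate command for review; `switchover_cmd` is a hypothetical wrapper, and only the two lkcli invocations shown in the test steps above are used.

```shell
# Hypothetical helper: switchover_cmd restore|handshake <tag> [target-node]
# Prints (does not execute) the lkcli command for the chosen switchover type.
switchover_cmd() {
    mode=$1
    tag=$2
    target=${3:-}
    case "$mode" in
        restore)
            # Must be run on the node where the hierarchy should come in-service.
            printf '/opt/LifeKeeper/bin/lkcli resource restore --tag %s\n' "$tag"
            ;;
        handshake)
            # May be run on either node; the target node is passed explicitly.
            printf '/opt/LifeKeeper/bin/lkcli resource config hana --tag %s --takeover_with_handshake %s\n' "$tag" "$target"
            ;;
        *)
            echo "usage: switchover_cmd restore|handshake <tag> [target-node]" >&2
            return 2
            ;;
    esac
}
```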

Graceful Shutdown

The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when each server is gracefully rebooted.


Primary Server Reboot Test
Description
The SAP HANA resource hierarchy behaves as expected when the primary server is gracefully rebooted.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Gracefully reboot node1:

    [root@node1 ~]# reboot now
Expected Results
  1. While node1 shuts down:
    1. The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, the database is stopped on node1 and any dependent resources (such as an associated virtual IP address, if applicable) are also removed on node1.
    2. [If “Switchover on Shutdown” is enabled on node1] The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
      1. Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
      2. The running database instance on node2 is promoted to the primary replication role in HANA system replication. Since node1 has been shut down, HANA system replication is currently inactive.

  2. After node1 is back online:
    1. [If “Switchover on Shutdown” is disabled on node1] The SAP HANA resource hierarchy automatically comes back in-service on node1. During this process:
      1. Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node1.
      2. The database instance on node1 is started and HANA system replication resumes to the secondary replication site on node2.
    2. [If “Switchover on Shutdown” is enabled on node1] The SAP HANA resource hierarchy remains in-service on node2.
      1. During the first quickCheck cycle on node2 after node1 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node1 and fires a ‘remoteregisterdb’ event.
      2. The ‘remoteregisterdb’ event script registers node1 as a secondary HANA system replication site using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’) and starts the database instance on node1.

Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.

Standby Server Reboot Test
Description
The SAP HANA resource hierarchy behaves as expected when the standby server is gracefully rebooted.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Gracefully reboot node2:

    [root@node2 ~]# reboot now
Expected Results
  1. While node2 shuts down, the database instance is stopped on node2. HANA system replication will be inactive until node2 reboots and the secondary database instance is restarted.

  2. After node2 is back online:
    1. During the first quickCheck cycle on node1 after node2 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node2 and fires a ‘remoteregisterdb’ event.
    2. The ‘remoteregisterdb’ event script starts the database instance on node2.

Note: The process of restarting the database on node2 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node2 as ‘Standby – In Sync’.

Machine Failover

The test case in this section verifies the expected behavior of the SAP HANA resource hierarchy when the primary server is forcefully rebooted.


Machine Failover Test
Description
The SAP HANA resource hierarchy behaves as expected when the primary server is forcefully rebooted.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Forcefully reboot node1:

    [root@node1 ~]# echo b > /proc/sysrq-trigger
Expected Results
  1. Once LifeKeeper on node2 detects that node1 is down (the exact time will vary depending on the values being used for the LifeKeeper heartbeat parameters), the SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
    1. Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
    2. The running database instance on node2 is promoted to the primary replication role in HANA system replication. Since node1 has been shut down, HANA system replication is currently inactive.

  2. After node1 is back online, the SAP HANA resource hierarchy remains in-service on node2.
    1. During the first quickCheck cycle on node2 after node1 is back online, the SAP HANA Recovery Kit detects that the database instance is not running on node1 and fires a ‘remoteregisterdb’ event.
    2. The ‘remoteregisterdb’ event script registers node1 as a secondary HANA system replication site using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’) and starts the database instance on node1.


Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.

SAP Host Agent Failure

The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when supporting SAP Host Agent-related processes fail on each server.


Primary Server SAP Host Exec Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the primary server.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Stop the saphostexec process on node1:

    [root@node1 ~]# /usr/sap/hostctrl/exe/saphostexec -stop
Expected Results
  1. During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saphostexec process is no longer running on node1 and fires a ‘recover’ event to restart it.
  2. The ‘recover’ event script restarts the saphostexec process on node1.

Standby Server SAP Host Exec Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saphostexec process is killed on the standby server.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Stop the saphostexec process on node2:

    [root@node2 ~]# /usr/sap/hostctrl/exe/saphostexec -stop
Expected Results
  1. During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saphostexec process is no longer running on node2 and fires a ‘remoteregisterdb’ event to restart it.
  2. The ‘remoteregisterdb’ event script restarts the saphostexec process on node2.

Primary Server SAP OS Collector Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the primary server.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Stop the saposcol process on node1:

    [root@node1 ~]# /usr/sap/hostctrl/exe/saposcol -k
Expected Results
  1. During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saposcol process is no longer running on node1 and fires a ‘recover’ event to restart it.
  2. The ‘recover’ event script restarts the saposcol process on node1.

Standby Server SAP OS Collector Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the saposcol process is killed on the standby server.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. Stop the saposcol process on node2:

    [root@node2 ~]# /usr/sap/hostctrl/exe/saposcol -k
Expected Results
  1. During the next quickCheck interval, the SAP HANA Recovery Kit detects that the saposcol process is no longer running on node2 and fires a ‘remoteregisterdb’ event to restart it.
  2. The ‘remoteregisterdb’ event script restarts the saposcol process on node2.

Database Failure

The test cases in this section verify the expected behavior of the SAP HANA resource hierarchy when processes within the protected HANA database instance fail.

Notes:

  • If the database instance is stopped gracefully when performing these tests (e.g., with an HDB stop command), the background process which gracefully stops the database may conflict with the SAP HANA Recovery Kit’s attempt to restart the database locally and may lead to a failover of the SAP HANA resource hierarchy. For this reason, we recommend simulating a crash of the HDB instance by forcefully and immediately killing the database processes at the operating system level, for example with the HDB kill-9 command. See SAP HANA – Known Issues / Restrictions for more details.

Primary Server Database Instance Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the primary server.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node1:

    [root@node1 ~]# su - spsadm -c "HDB kill-9"
Expected Results
  1. [If local recovery is enabled for the HANA-SPS_HDB00 resource on node1] The SAP HANA Recovery Kit detects the failure and restarts the database instance on node1:
    1. During the next quickCheck interval, the SAP HANA Recovery Kit detects that the HDB instance processes are no longer running on node1 and fires a ‘recover’ event to restart them.
    2. The ‘recover’ event script restarts the HDB instance processes on node1.

  2. [If local recovery is disabled for the HANA-SPS_HDB00 resource on node1] LifeKeeper immediately initiates a failover of the SAP HANA resource hierarchy to node2:
    1. The SAP HANA resource hierarchy is successfully taken out of service on node1. During this process, any dependent resources (such as an associated virtual IP address, if applicable) are removed on node1.
    2. The SAP HANA resource hierarchy is successfully brought in-service on node2. During this process:
      1. Any dependent resources (such as an associated virtual IP resource, if applicable) are brought in-service on node2.
      2. The running database instance on node2 is promoted to the primary replication role in HANA system replication.
      3. The database instance on node1 is registered as a secondary replication site in HANA system replication using the appropriate HSR parameters (e.g., site name ‘SiteA’, replication mode ‘sync’, and operation mode ‘logreplay’).
      4. The database instance is started on node1.

Note: The process of registering and restarting the database on node1 may take several minutes. Once this process is complete, the LifeKeeper GUI will show the state of the HANA-SPS_HDB00 resource on node1 as ‘Standby – In Sync’.

Standby Server Database Instance Failure Test
Description
The SAP HANA resource hierarchy behaves as expected when the HDB instance processes are killed on the standby server.
Preconditions
Before performing this test, ensure that:

  • Both servers (node1 and node2) are operational,
  • The SAP HANA resource hierarchy is in-service (ISP) on node1 and out-of-service (OSU) on node2, and
  • HANA system replication is in-sync.
Test Steps
  1. As the HANA administrative user (<sid>adm), forcefully kill the HDB instance processes (at the operating system level) on node2:

    [root@node2 ~]# su - spsadm -c "HDB kill-9"
Expected Results
  1. During the next quickCheck interval, the SAP HANA Recovery Kit detects that the HDB instance processes are no longer running on node2 and fires a ‘remoteregisterdb’ event to restart them.
  2. The ‘remoteregisterdb’ event script restarts the HDB instance processes on node2.

Appendix: Useful SAP HANA Administrative Commands

While the status of the SAP HANA environment may be monitored through SAP-provided dashboards (e.g., HANA Studio or HANA Cockpit), the following commands may also be useful while testing. Throughout, <sid> denotes the lowercase SAP SID for the protected SAP HANA database installation and <InstNum> denotes the instance number of the protected HDB instance (e.g., for instance HDB00, <InstNum> is 00).

Command
    Description

su - <sid>adm -c "sapcontrol -nr <InstNum> -function StopService"
    Stop the sapstartsrv process for the HDB instance.
su - <sid>adm -c "sapcontrol -nr <InstNum> -function StartService <SID>"
    Start the sapstartsrv process for the HDB instance.
su - <sid>adm -c "sapcontrol -nr <InstNum> -function GetProcessList"
    View current status of HDB instance processes.
su - <sid>adm -c "HDB stop"
    Gracefully stop the HDB instance.
su - <sid>adm -c "HDB kill-9"
    Forcefully kill the HDB instance processes.
su - <sid>adm -c "HDB start"
    Start the HDB instance.
su - <sid>adm -c "hdbnsutil -sr_state"
    Check the HANA system replication state on the local server.
su - <sid>adm -c "python /hana/shared/<SID>/HDB<InstNum>/exe/python_support/systemReplicationStatus.py"
    Check the current HANA system replication status. This command must be executed on the server which is the primary HANA system replication site.
su - <sid>adm -c "hdbsql -n <HANA virtual hostname> -i <InstNum> -u SYSTEM -p <SYSTEM user password> -d <SID> '\s'"
    Test the connection to the <SID> tenant database through the associated virtual hostname.
/usr/sap/hostctrl/exe/saphostexec -status
    Check the status of saphostexec.
/usr/sap/hostctrl/exe/saphostexec -stop
    Stop saphostexec.
/usr/sap/hostctrl/exe/saphostexec -restart
    Restart saphostexec.
/usr/sap/hostctrl/exe/saposcol -s
    Check the status of saposcol.
/usr/sap/hostctrl/exe/saposcol -k
    Stop saposcol.
/usr/sap/hostctrl/exe/saposcol -l
    Start saposcol.
top -U <sid>adm
ps -ef | grep <sid>adm
    View information for processes owned by the <sid>adm user.
