A resource will not come in-service on a secondary node if it is running LifeKeeper for Linux older than v9.7.0

ISSUE: When a secondary node is updated to v9.7.0 and a failover or switchover makes it the primary node in the cluster, it updates the configurations of all the other nodes. If another node is still running a version of LifeKeeper for Linux earlier than v9.7.0, the resource will fail to come in-service on that node, because earlier versions do not support the new configuration format.
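
Before a planned switchover, the version rule above can be checked mechanically. The following is a minimal Python sketch, assuming you have already collected each node's LifeKeeper version string yourself as plain dotted numbers (e.g. "9.6.2"). The node names, the input format, and the `nodes_needing_upgrade` helper are hypothetical; the sketch does not call any LifeKeeper command.

```python
# Minimal sketch: flag cluster nodes whose LifeKeeper for Linux release predates
# the new configuration format. Version strings are supplied by the caller
# (hypothetical input); nothing here queries LifeKeeper itself.

MIN_VERSION = (9, 7, 0)  # first release using the new configuration format


def parse_version(text: str) -> tuple[int, ...]:
    """Convert a plain dotted version such as 'v9.6.2' into a comparable tuple.

    Assumes numeric components only; build suffixes would need extra parsing.
    """
    parts = [int(p) for p in text.strip().lstrip("vV").split(".")]
    parts += [0] * (3 - len(parts))  # treat '9.7' as '9.7.0'
    return tuple(parts)


def nodes_needing_upgrade(node_versions: dict[str, str]) -> list[str]:
    """Return, sorted by name, the nodes running a release older than MIN_VERSION."""
    return sorted(
        node
        for node, version in node_versions.items()
        if parse_version(version) < MIN_VERSION
    )


if __name__ == "__main__":
    cluster = {"node1": "9.7.0", "node2": "9.6.2", "node3": "v9.7.1"}
    print(nodes_needing_upgrade(cluster))  # ['node2']
```

Any node the check reports is one on which resources would fail to come in-service after the configuration is rewritten.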

WORKAROUND/SOLUTION: Upgrade the secondary node that is still running the older release to v9.7.0 to allow normal failover/switchover operations. To bring the resource in-service on an older release, restore the configuration from a backup; see lkbackup for more information.

Local recovery may fail if the local database is stopped with a sapcontrol Stop/StopWait command, resulting in a failover of the SAP HANA resource hierarchy

ISSUE: When a user issues a Stop/StopWait request for the HDB instance using the sapcontrol utility (the same mechanism used internally when a user issues an ‘HDB stop’ command), sapstartsrv begins an asynchronous process that gracefully stops all of the HANA database processes. That process continues until either the database is completely shut down or the stop request times out. Therefore, any other action issued via sapcontrol while sapstartsrv is still gracefully shutting down the database will compete with the in-progress stop action, and will ultimately fail and time out.

In particular, the following sequence of events may lead to a failover of the SAP HANA resource hierarchy, even when local recovery is enabled for the protected database:

  1. A user initiates a graceful shutdown of the HANA database while it is running on the primary server by issuing a ‘sapcontrol Stop/StopWait’ or ‘HDB stop’ command.
  2. The ‘quickCheck’ script in the SAP HANA Recovery Kit detects that at least one database process is no longer running, which results in an attempt to locally restart the database.
  3. The ‘recover’ script in the SAP HANA Recovery Kit issues a ‘sapcontrol StartWait’ command to attempt to restart the protected HDB instance.
  4. Because the ‘sapcontrol Stop/StopWait’ command issued in step 1 is still actively stopping the HANA database processes, the ‘sapcontrol StartWait’ command issued by the SAP HANA Recovery Kit fails and times out.
  5. Since the SAP HANA Recovery Kit is unable to restart the database locally, local recovery fails and the SAP HANA resource hierarchy fails over to the standby server.
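
The sequence above is essentially a race between an asynchronous stop and a restart attempt. The following Python sketch models it with a single lock standing in for sapstartsrv's in-progress control action. This is an illustration under stated assumptions, not the Recovery Kit's actual logic: `SimulatedHanaInstance`, its methods, and `monitor_cycle` are hypothetical toy analogues of `sapcontrol`, `quickCheck`, and `recover`, and the timings are arbitrary.

```python
import threading
import time


class SimulatedHanaInstance:
    """Toy model of an HDB instance controlled through sapstartsrv.

    Assumption for illustration only: one control action (stop or start)
    runs at a time, modeled here as a single lock.
    """

    def __init__(self) -> None:
        self._action_lock = threading.Lock()  # held while a stop/start runs
        self.healthy = True                   # True while all processes are up

    def stop_async(self, duration: float) -> threading.Thread:
        """Model 'sapcontrol Stop' / 'HDB stop': returns at once, stops in background."""
        self._action_lock.acquire()           # the graceful stop owns the instance
        self.healthy = False                  # the first process has already exited

        def _finish_stop() -> None:
            time.sleep(duration)              # remaining processes shut down slowly
            self._action_lock.release()       # stop action completes

        worker = threading.Thread(target=_finish_stop, daemon=True)
        worker.start()
        return worker

    def crash(self) -> None:
        """Model 'HDB kill-9': processes die, but no control action is in progress."""
        self.healthy = False

    def start_wait(self, timeout: float) -> bool:
        """Model 'sapcontrol StartWait': fails if another action outlasts `timeout`."""
        if not self._action_lock.acquire(timeout=timeout):
            return False                      # competing stop still running: time out
        try:
            self.healthy = True               # processes restarted
            return True
        finally:
            self._action_lock.release()


def monitor_cycle(instance: SimulatedHanaInstance, start_timeout: float) -> str:
    """Hypothetical analogue of one quickCheck pass followed by recover."""
    if instance.healthy:                      # quickCheck: everything still running
        return "in service"
    if instance.start_wait(start_timeout):    # recover: try to restart locally
        return "recovered locally"
    return "failover to standby"              # local recovery failed


if __name__ == "__main__":
    # Steps 1-5: graceful stop still in progress when recovery tries to start.
    stopped = SimulatedHanaInstance()
    stopped.stop_async(duration=0.5)
    print(monitor_cycle(stopped, start_timeout=0.1))  # failover to standby

    # A forced kill leaves no competing stop action for the restart to wait on.
    killed = SimulatedHanaInstance()
    killed.crash()
    print(monitor_cycle(killed, start_timeout=0.1))   # recovered locally
```

The contrast between the two runs is the reason a forced kill is the better way to simulate a primary database crash: the restart attempt has no in-progress stop action to compete with.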

WORKAROUND/SOLUTION: If the database is being stopped manually as part of pre-production cluster testing to simulate local recovery after a failure of the primary database, consider forcefully killing the database processes (e.g., with ‘HDB kill-9’) to more accurately simulate a primary database crash. See Testing Your SAP HANA Resource Hierarchy for sample test cases.