Thread is Hung Messages on Shared Storage

When the device-checking threads do not complete quickly enough, LifeKeeper writes messages to its log stating that a thread is hung. This condition can cause resources to be moved from one server to another and, in the worst case, cause a server to be killed.

Explanation

The FAILFASTTIMER tunable (in /etc/default/LifeKeeper) defines the interval, in seconds, at which each device is checked to ensure that it is functioning properly and that all resources owned by a particular system are still accessible by that system and still owned by it. FAILFASTTIMER should be as small as possible to guarantee this ownership and to provide the highest data reliability. However, a busy device may not be able to respond within the specified time at peak load.

When a device takes longer than FAILFASTTIMER to respond, LifeKeeper considers the device possibly hung. If the device has still not responded after 3 loops of the FAILFASTTIMER period, LifeKeeper attempts recovery as if the device had failed. The recovery process is defined by the tunable SCSIERROR. Depending on the setting of SCSIERROR, the action is either a sendevent that performs local recovery, followed by a switchover if local recovery fails, or a system halt.
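The escalation described above can be sketched as a small model. This is illustrative only and is not LifeKeeper code: the function name, the action labels, and the SCSIERROR values "event" and "halt" are assumptions chosen to mirror the two outcomes described in the text.

```python
import math


def classify_device_check(elapsed_seconds, failfasttimer, scsierror="event", max_loops=3):
    """Model how an unanswered device check escalates (hypothetical helper).

    Returns (loops, action), where loops is the number of FAILFASTTIMER
    periods the response time has exceeded, capped at max_loops.
    """
    if failfasttimer <= 0:
        raise ValueError("failfasttimer must be positive")
    if scsierror not in ("event", "halt"):  # assumed setting names
        raise ValueError("scsierror must be 'event' or 'halt'")

    # A response within one timer period is healthy; "longer than" is strict.
    loops = max(0, math.ceil(elapsed_seconds / failfasttimer) - 1)
    loops = min(loops, max_loops)

    if loops == 0:
        return loops, "ok"
    if loops < max_loops:
        return loops, "possibly hung"
    # After max_loops periods with no response, recovery follows SCSIERROR.
    return loops, "halt" if scsierror == "halt" else "sendevent"


# Example with an assumed 5-second timer:
print(classify_device_check(7, 5))           # one period exceeded
print(classify_device_check(16, 5, "halt"))  # three periods exceeded
```

The model makes the trade-off visible: with a smaller FAILFASTTIMER, the same slow response from a busy device crosses more timer periods and reaches the recovery action sooner.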

Suggested Action

If a device only occasionally has a hung message printed to the error log, followed by a message that it is no longer hung, and the number in parentheses is always 1, there is no reason for alarm. However, if this message appears frequently in the log, or the number is 2 or 3, then two actions may be necessary:

Note: When the FAILFASTTIMER value is modified, LifeKeeper must be stopped and restarted before the new value takes effect.
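The rule of thumb in this section for deciding whether hung messages warrant attention can be expressed as a short check. This is a sketch under stated assumptions: the function name is hypothetical, the input is simply the numbers in parentheses collected from the hung messages over whatever review period you choose, and the threshold for "frequently" is an arbitrary placeholder because this page does not define one.

```python
def hung_messages_need_action(hung_counts, frequent_threshold=5):
    """Decide whether observed hung messages call for follow-up.

    hung_counts: the number in parentheses from each hung message seen
    during the review period (hypothetical input format).
    frequent_threshold: assumed cutoff for "frequently"; not from the source.
    """
    # Any message with a count of 2 or 3 is a concern on its own.
    if any(count >= 2 for count in hung_counts):
        return True
    # Otherwise only the sheer number of messages matters.
    return len(hung_counts) >= frequent_threshold


# Occasional messages with a count of 1 are not a reason for alarm.
print(hung_messages_need_action([1, 1]))     # False
print(hung_messages_need_action([1, 2, 1]))  # True
```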

© 2012 SIOS Technology Corp., the industry's leading provider of business continuity solutions, data replication for continuous data protection.