Watchdog is a mechanism for monitoring a server so that, if the server stops working properly, corrective action (a reboot) is taken before the failure causes further problems. Watchdog can be implemented using special watchdog hardware or using a software-only option.
Components
- Watchdog timer – software driver or an external hardware component
- Watchdog daemon – RPM package available through the Linux distribution
- LifeKeeper core daemon – installed with the LifeKeeper installation
- Health check script – script that checks the status of the LifeKeeper SSP core
Read the next section carefully. The watchdog daemon is designed to recover from errors, and it will reset the system if it is not configured carefully. Plan the installation and configuration with care. This section does not explain watchdog itself or how to configure it in general; it covers only how LifeKeeper SSP interoperates with a watchdog configuration.
Configuration
The following steps should be carried out by an administrator with root user privileges. The administrator should already be familiar with the risks and issues associated with watchdog.
The health check script (the LifeKeeper monitoring script, /opt/LifeKeeper/samples/watchdog/LifeKeeper-watchdog) is the component that ties the LifeKeeper configuration to the watchdog configuration. This script monitors the basic LifeKeeper core components.
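As an illustration of the general shape of a watchdog test binary, the sketch below exits with status 0 when every process it is asked about is running and with a non-zero status otherwise, which the watchdog daemon treats as a failed test. This is not the shipped LifeKeeper-watchdog script: the function name `all_running` and the process names in the usage comment are placeholders, and the sketch assumes `pgrep` is available.

```shell
#!/bin/sh
# Hypothetical sketch of a watchdog test binary; NOT the shipped
# LifeKeeper-watchdog script. The caller supplies the process names.

# Return 0 only if every named process has at least one running instance.
all_running() {
    for name in "$@"; do
        if ! pgrep -x "$name" >/dev/null 2>&1; then
            echo "process not running: $name" >&2
            return 1
        fi
    done
    return 0
}

# Example (placeholder names): all_running procA procB || exit 1
```

A script built this way keeps the check itself short, which matters because the test must finish within the configured test-timeout.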
- If watchdog has been previously configured, enter the following command to stop it. If not, skip to the next step.
systemctl stop watchdog
- Edit the watchdog configuration file (/etc/watchdog.conf) that was installed with the watchdog software.
- Modify test-binary:
test-binary = /opt/LifeKeeper/samples/watchdog/LifeKeeper-watchdog
- Modify test-timeout:
test-timeout = 5
- Modify interval:
interval = 7
The interval must be greater than or equal to the test-timeout value. A value between 5 and 10 is recommended because too long an interval delays failure detection.
- Make sure LifeKeeper SSP has been started. If it has not, refer to the Starting LifeKeeper topic.
- Start watchdog by entering the following command:
systemctl start watchdog
- To start watchdog automatically on future restarts, enter the following command:
systemctl enable watchdog
Note: Configuring watchdog may cause occasional unexpected reboots. This is inherent in how watchdog works: if processes are not responding correctly, watchdog assumes that LifeKeeper (or the operating system) is hung and reboots the system without warning.
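The timing rule from the configuration steps (interval greater than or equal to test-timeout) can be checked before watchdog is started. The following is a minimal sketch, assuming the simple `key = value` layout used in the steps above; `conf_value` and `check_timing` are illustrative helper names, not part of watchdog or LifeKeeper.

```shell
# Print the value of an uncommented "key = value" entry (last one wins).
# $1 = configuration file, $2 = key
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$1" |
        tail -n 1
}

# Return 0 if interval >= test-timeout; return 1 if it is smaller
# or if either entry is missing. $1 = configuration file
check_timing() {
    interval=$(conf_value "$1" interval)
    test_timeout=$(conf_value "$1" test-timeout)
    if [ -z "$interval" ] || [ -z "$test_timeout" ]; then
        echo "interval or test-timeout is not set in $1" >&2
        return 1
    fi
    if [ "$interval" -lt "$test_timeout" ]; then
        echo "interval ($interval) is less than test-timeout ($test_timeout)" >&2
        return 1
    fi
    return 0
}
```

One possible use is to gate the start command on the check, for example `check_timing /etc/watchdog.conf && systemctl start watchdog`.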
Uninstall
Care should be taken when uninstalling LifeKeeper. Reverse the configuration steps above, in the order listed below.
- Stop watchdog by entering the following command:
systemctl stop watchdog
- Edit the watchdog configuration file (/etc/watchdog.conf) that was installed with the watchdog software.
- Modify test-binary and interval by commenting out those entries (add # at the beginning of each line):
#test-binary =
#interval =
- Uninstall LifeKeeper. See the Removing LifeKeeper topic.
- Watchdog can now be started again. If watchdog was used only by LifeKeeper, it can be permanently disabled by entering the following command:
systemctl disable watchdog
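The step that comments out entries in the configuration file can also be scripted. The following is a minimal sketch, assuming the entries use the `key = value` form and that leaving a backup copy next to the file is acceptable; `comment_out` is an illustrative helper, not a watchdog or LifeKeeper command.

```shell
# Prefix uncommented "key = value" entries with '#'.
# $1 = configuration file, remaining arguments = keys to comment out
comment_out() {
    file=$1
    shift
    cp "$file" "$file.bak"    # keep a backup of the original file
    for key in "$@"; do
        # Write through a temporary file, then copy the content back so
        # the original file keeps its ownership and permissions.
        sed "s/^\([[:space:]]*$key[[:space:]]*=\)/#\1/" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    done
}

# Example: comment_out /etc/watchdog.conf test-binary interval
```

Entries that are already commented out are left unchanged, so running the helper twice does not add a second `#`.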