lkscsid - LifeKeeper SCSI Reservation Daemon
lkscsid is a daemon process that maintains SCSI reservations and performs health monitoring for shared SCSI devices that are under LifeKeeper protection. Periodically, lkscsid reads the list of SCSI devices for which LifeKeeper has obtained a SCSI reservation. (LifeKeeper uses SCSI reservations to ensure that only one system in a cluster is allowed access to a given shared SCSI device at a time.)
lkscsid will take the following actions if an error condition is detected:
If a SCSI device does not respond to an inquiry, the LifeKeeper core is informed of this event in order to switch control of the SCSI device to another LifeKeeper system.
If a SCSI device cannot be re-reserved (i.e., another system in the cluster has "stolen" the reservation), then lkscsid will halt the system and reboot. This is done to avoid corruption of data on the shared SCSI device.
It is intended that lkscsid be run via init(8) from a line in /etc/inittab (see inittab(5)). This entry is controlled by LifeKeeper and should not be modified manually.
Operation of lkscsid may be customized. All configuration constants have default values, but they may be set to other values in the /etc/default/LifeKeeper configuration file. This file contains lines with the syntax name=value. A comment may end any line and is introduced with the '#' character. Empty lines are ignored.
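The file syntax described above can be illustrated with a minimal Python sketch. This is not LifeKeeper's own parser; the function name parse_lifekeeper_defaults and the sample excerpt are hypothetical, and the sketch assumes that values never contain a '#' character and are not quoted.

```python
def parse_lifekeeper_defaults(text):
    """Parse name=value lines; '#' starts a comment; empty lines are ignored."""
    settings = {}
    for raw_line in text.splitlines():
        # Drop everything after '#', then trim surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


# Hypothetical excerpt in the style of /etc/default/LifeKeeper.
sample = """\
FAILFASTTIMER=5      # seconds between checking cycles
SCSIHANGMAX=24

SCSIERROR=event
"""
print(parse_lifekeeper_defaults(sample))
# {'FAILFASTTIMER': '5', 'SCSIHANGMAX': '24', 'SCSIERROR': 'event'}
```

All values are returned as strings; converting them to integers or booleans is left to the caller.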
The configuration constants (with their default values) are:
FAILFASTTIMER (Default: 5)
The length of time in seconds between device checking cycles. This number must be specified as an integer (e.g., 5, not 5.0 or 0x5).
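As an illustration of the integer requirement, the following Python sketch accepts only plain decimal digits. The helper name is_valid_failfasttimer is hypothetical, and the check covers the format only; whether lkscsid enforces a minimum or maximum value is not stated here.

```python
import re


def is_valid_failfasttimer(value):
    """Return True only for a plain decimal integer such as '5'."""
    # fullmatch rejects decimal points ('5.0') and hex prefixes ('0x5').
    return re.fullmatch(r"[0-9]+", value) is not None


print(is_valid_failfasttimer("5"))    # True
print(is_valid_failfasttimer("5.0"))  # False
print(is_valid_failfasttimer("0x5"))  # False
```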
SCSIHALT (Default: TRUE, halt the system)
Specify whether or not to halt the system and reboot for critical failures such as the loss of SCSI reservation. When this is set to 'FALSE' the system will print a message that LifeKeeper intended to halt the system but was prevented due to this setting. Operator intervention will be required when this occurs.
Note: The default setting of 'TRUE' is recommended on production systems. Setting this parameter to 'FALSE' is primarily intended for debugging only.
SCSIEVENT (Default: TRUE, send an event)
Specify whether or not to send an event to the LifeKeeper core when a SCSI device fails or other SCSI error occurs.
SCSIERROR (Default: event)
When a SCSI device cannot be opened or accessed, or another SCSI error occurs (e.g., a timeout), this flag determines the action to take:
event: The LifeKeeper core should be informed that a device needs to be switched over to a backup system. If the switch over of the device fails then the resource will not be brought in-service on the backup system. The MAXSENDEVENTRETRY parameter will determine how many attempts will be made to switch over the hierarchy the device is in. When this retry count is exceeded the system may be halted depending on the SCSIHALT setting.
halt: The system should immediately be halted (if SCSIHALT is set to TRUE) and rebooted to avoid data corruption. This is more reliable than the 'event' setting but can cause transient (non-committed) data to be lost.
Note: SCSIERROR does not override the action taken in the case of a lost SCSI reservation (see RESERVATIONCONFLICT parameter). The SCSIEVENT and SCSIHALT flags determine whether the action (event or halt) is actually carried out.
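One way to read the interaction between SCSIERROR, SCSIEVENT, and SCSIHALT is sketched below in Python. This is an interpretation of the descriptions above, not lkscsid's actual logic: the function name and the return labels "no-event" and "message-only" are hypothetical, and the sketch ignores ERRORRETRY, MAXSENDEVENTRETRY escalation, and reservation conflicts.

```python
def effective_scsi_error_action(scsierror="event", scsievent=True, scsihalt=True):
    """Return the action that would actually be carried out for a SCSI error."""
    if scsierror == "event":
        # SCSIEVENT gates whether the event reaches the LifeKeeper core.
        return "event" if scsievent else "no-event"
    if scsierror == "halt":
        # With SCSIHALT=FALSE only a message about the intended halt is printed.
        return "halt" if scsihalt else "message-only"
    raise ValueError(f"unknown SCSIERROR setting: {scsierror!r}")


print(effective_scsi_error_action())                        # event
print(effective_scsi_error_action("halt", scsihalt=False))  # message-only
```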
SCSIHANGMAX (Default: 24)
When a device does not respond within the FAILFASTTIMER period, the device is considered "hung". On a busy system or during heavy I/O spikes, a device may not respond within the FAILFASTTIMER. In many cases it is not desirable to initiate a failover for this situation. This parameter allows the FAILFASTTIMER to be set at a short interval to ensure a RESERVATION CONFLICT is detected quickly while avoiding a failure due to a busy system.
The parameter defines the number of FAILFASTTIMER loops that a device must respond within before it is considered hung and recovery is initiated. For example, with the default setting of 24 and the default FAILFASTTIMER set to 5 a device will not be considered "hung" unless it is not responding for 120 seconds.
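The arithmetic in the example above can be written as a short Python sketch. The function name hang_timeout_seconds is hypothetical; the defaults are the ones listed in this topic.

```python
def hang_timeout_seconds(failfasttimer=5, scsihangmax=24):
    """Seconds a device may stay unresponsive before it is treated as hung."""
    return failfasttimer * scsihangmax


print(hang_timeout_seconds())       # 120
print(hang_timeout_seconds(5, 12))  # 60
```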
Setting this parameter too low may cause unnecessary recovery of devices, while a setting too high will cause LifeKeeper to delay in recovering a truly hung device.
RESERVATIONCONFLICT (Default: halt)
The RESERVATIONCONFLICT parameter defines the action to take when a device returns a reservation conflict status.
event: The LifeKeeper core should be informed that a device needs to be switched over to a backup system. If the switch over of the device fails then the resource will not be brought in-service on the backup system. The MAXSENDEVENTRETRY parameter will determine how long the sendevent is given to take the resource out-of-service. When this retry count is exceeded the system may be halted depending on the SCSIHALT setting.
halt: The system should immediately be halted (if SCSIHALT is set to TRUE) and rebooted to avoid data corruption. This is more reliable than the 'event' setting but can cause transient (non-committed) data to be lost.
Note: The default setting of 'halt' is recommended on production systems. Setting this parameter to 'event' is primarily intended for debugging only.
FILTERHUNGMSG (Default: 10)
The FILTERHUNGMSG parameter determines how quickly a message will be logged when a device is not responding. The setting is the percentage of SCSIHANGMAX that must elapse before a message is logged. For example, with the default SCSIHANGMAX and FAILFASTTIMER values, a warning that a device is not responding will not be logged unless the device has not responded for more than 12 seconds (10% of 120 seconds). If a device responds within 12 seconds, no message is logged.
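The 12-second figure follows from the three defaults, as this Python sketch shows. The function name is hypothetical, and the sketch does not attempt to model how lkscsid rounds the result to whole checking cycles.

```python
def hung_message_threshold_seconds(filterhungmsg=10, scsihangmax=24, failfasttimer=5):
    """Seconds of no response before a 'not responding' warning is logged."""
    hang_timeout = scsihangmax * failfasttimer  # 120 seconds with the defaults
    return hang_timeout * filterhungmsg / 100


print(hung_message_threshold_seconds())  # 12.0
```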
RESERVATIONS (Default: ioctl)
The RESERVATIONS parameter determines the method used to issue SCSI reserve and release commands. SCSI reservations are used to perform I/O fencing to ensure that one and only one system can access the data in a cluster. The valid settings are:
ioctl: Use ioctl commands in the SCSI mid-layer driver to issue reserve and release commands.
none: The lkscsid daemon will only perform health monitoring for shared SCSI devices. It will not maintain the SCSI reservation. This requires another method of I/O fencing to be configured to assure the reliability of the data in a cluster. This setting is only supported by LifeKeeper on a restricted set of configurations and devices.
SKIPNOTREADY (Default: 10)
The SKIPNOTREADY parameter determines how long to stop checking a device after a Not Ready status is returned. In particular, if a device returns SCSI sense data of Not Ready with an ASC of 0x04 and an ASCQ of 0x09, the lkscsid daemon will stop polling the device for SKIPNOTREADY iterations of the FAILFASTTIMER. When checking resumes, the hung count will be reset to 0.
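A minimal Python sketch of this rule follows. The function names are hypothetical, and the sketch assumes the sense data has already been decoded into a sense key, ASC, and ASCQ; 0x02 is the standard SCSI sense key for Not Ready.

```python
NOT_READY_SENSE_KEY = 0x02  # SCSI sense key "NOT READY"


def should_pause_polling(sense_key, asc, ascq):
    """True when the sense data matches Not Ready with ASC 0x04 / ASCQ 0x09."""
    return (sense_key, asc, ascq) == (NOT_READY_SENSE_KEY, 0x04, 0x09)


def pause_seconds(skipnotready=10, failfasttimer=5):
    """Length of the polling pause: SKIPNOTREADY iterations of FAILFASTTIMER."""
    return skipnotready * failfasttimer


print(should_pause_polling(0x02, 0x04, 0x09))  # True
print(should_pause_polling(0x02, 0x04, 0x01))  # False
print(pause_seconds())                         # 50
```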
ERRORRETRY (Default: 3)
The ERRORRETRY parameter determines the number of consecutive failures on a device before initiating recovery defined by SCSIERROR.
MAXSENDEVENTRETRY (Default: 10)
The MAXSENDEVENTRETRY parameter determines how long lkscsid will give the sendevent to switch over the device. When this value is exceeded, the 'halt' action will be attempted instead of the 'event' action. For example, with the default setting of 10 and the default FAILFASTTIMER of 5, a sendevent will be given 50 seconds to remove the device. If the device is still in-service after 50 seconds, the 'halt' action is called.
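The escalation rule can be sketched in Python as follows. Both function names are hypothetical, and treating the boundary as "strictly more than the window" is an assumption; the exact comparison lkscsid uses is not documented here.

```python
def sendevent_window_seconds(maxsendeventretry=10, failfasttimer=5):
    """Seconds the sendevent is given to take the device out of service."""
    return maxsendeventretry * failfasttimer


def should_escalate_to_halt(elapsed_seconds, still_in_service,
                            maxsendeventretry=10, failfasttimer=5):
    """True when the window has passed and the device is still in-service."""
    window = sendevent_window_seconds(maxsendeventretry, failfasttimer)
    return still_in_service and elapsed_seconds > window


print(sendevent_window_seconds())         # 50
print(should_escalate_to_halt(30, True))  # False
print(should_escalate_to_halt(60, True))  # True
```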
DAEMONKEEPDEVICEOPEN (Default: TRUE, keep the device open)
The DAEMONKEEPDEVICEOPEN parameter defines how lkscsid opens and closes devices. A 'TRUE' setting will cause each checking thread to open the device when it is started and to keep it open until the device is no longer in-service. A 'FALSE' setting will cause the checking thread to open and close the device each time it is activated to check the device (with the default FAILFASTTIMER this would be every 5 seconds). A 'TRUE' setting may cause problems for other applications that need exclusive access to the device, such as when the partition table needs to be re-read. A 'FALSE' setting will require each thread to do more work when checking a device.
DAEMONSTACKSIZE (Default: 262144)
The DAEMONSTACKSIZE parameter defines the stack size allocated for the lkscsid threads (each device that is in-service has its own thread). This parameter should only be modified if threads are failing due to stack overflow. Increasing the size of the stack will increase the amount of memory reserved for the daemons and thus not available for user applications.
DAEMONHALTONCE (Default: FALSE, call halt each time a failure occurs)
The DAEMONHALTONCE parameter determines how persistent LifeKeeper will be when an error occurs that requires the system to be halted. The default behavior allows LifeKeeper to try to halt the system for each failure. A setting of 'TRUE' will cause LifeKeeper to issue the halt only once. If the halt command fails or hangs, user intervention will be required, because another halt will not be issued until LifeKeeper is restarted.
© 2012 SIOS Technology Corp., the industry's leading provider of business continuity solutions, data replication for continuous data protection.