If one leg of a mirror fails, that leg can be repaired.
If a problem occurs, the resource will be marked OSF. (Note: An email notification will be sent if notifications are enabled.)
The mdComponent could be marked OSF while the disk itself is okay but the component is marked “faulty” in the mirror. This can happen when mdadm detects an issue as the device is brought online (check the error log for details), or when the mdadm utility was used manually to “break” the mirror.
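As a rough illustration of the second case, the commands below sketch how a leg might be manually marked faulty and how that state appears afterward. The array name `/dev/md0` and the device names are hypothetical, and the `/proc/mdstat` excerpt is a sample, not output captured from a real system:

```shell
# Manually marking a leg faulty ("breaking" the mirror) requires root and
# would look like this on a hypothetical array /dev/md0 with leg /dev/sdb1:
#   mdadm --manage /dev/md0 --fail /dev/sdb1
#
# The faulty leg is then flagged with (F) in /proc/mdstat. Sample excerpt:
mdstat='md0 : active raid1 sdb1[1](F) sda1[0]
      1048512 blocks [2/1] [U_]'

# Extract the faulty component name from the sample:
echo "$mdstat" | grep -oE '[a-z0-9]+\[[0-9]+\]\(F\)'
```

The `[2/1] [U_]` summary in the sample shows that only one of the two configured members is active.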
Both the mdComponent and the underlying disk/device can be marked OSF if they fail during the in-service operation, for example, if the disk was broken or physically disconnected when the virtual device was started.
The following screen shots walk through an array failure: the state before the failure, the initial handling of the failure, the update of the state to “failed,” and the return to service. (These screen shots include an example that uses a “terminal resource” to tie the bottom of each hierarchy to a single resource.)
When the failure of the array is initially handled, all resources will be marked OSF. During this failure, I/Os continue to the good component (leg) of the mirror.
If the failed component was successfully removed from the mirror configuration during error handling, the resource transitions to OSU when the MD quickCheck runs after the failure. If the failed component could not be removed from the mirror configuration, the resource remains in the OSF state.
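To make the two outcomes concrete, the sketch below distinguishes them from the array state. This is a sample `/proc/mdstat` excerpt with hypothetical device names, not output from a real system; after a successful removal the faulty `(F)` entry is gone but the array still reports itself degraded:

```shell
# Degraded-but-running state, as it might appear after the failed leg was
# removed from the mirror configuration (device names are hypothetical):
mdstat='md0 : active raid1 sda1[0]
      1048512 blocks [2/1] [U_]'

# [2/1] means 2 configured members with only 1 active; [U_] shows which
# slot is down. A degraded-but-clean array corresponds to the OSU case.
if echo "$mdstat" | grep -q '\[U_\]'; then
  echo "degraded"
fi
```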
If the server must reboot while in the failed state, perhaps to repair the storage, the storage resources under the failed component will be restored (if the storage was properly repaired), but the failed component will not automatically be re-added to the mirror. Bringing the failed component in service (from the GUI or with perform_action(1M)) re-adds it and resumes I/O to that leg. The mirror then performs a partial resync if an internal bitmap is configured; otherwise it performs a full resync.
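The re-add and resync behavior can be sketched as follows. The `mdadm` command is shown as a comment because it requires root and a real array; the `/proc/mdstat` excerpt is a hypothetical sample showing what a partial resync with an internal bitmap might look like:

```shell
# Re-adding the repaired leg (requires root; device names are hypothetical):
#   mdadm --manage /dev/md0 --add /dev/sdb1
#
# With an internal write-intent bitmap, only the regions written while the
# leg was out resync. Sample /proc/mdstat excerpt during recovery:
mdstat='md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/1] [U_]
      [==>..................]  recovery = 12.5% (131064/1048512)
      bitmap: 0/8 pages [0KB], 64KB chunk'

# A "bitmap:" line indicates an internal bitmap is configured, so the
# resync is partial rather than full:
echo "$mdstat" | grep -c 'bitmap:'
```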
If the failed leg is repaired manually in the virtual device, LifeKeeper automatically detects the change when quickCheck runs, and the state of the resource changes to reflect the repair. However, if the resources below the component (that is, the device and/or disk) are failed, those states will not be updated automatically. To update them, use the GUI or perform_action(1M) to bring the resource(s) in service.