SIOS DataKeeper creates and protects NetRAID devices. A NetRAID device is a RAID1 device that consists of a local disk or partition and a Network Block Device (NBD) as shown in the diagram below.
A LifeKeeper supported file system can be mounted on a NetRAID device like any other storage device. In this case, the file system is called a replicated file system. LifeKeeper protects both the NetRAID device and the replicated file system.
The NetRAID device is created by building the DataKeeper resource hierarchy. Extending the NetRAID device to another server will create the NBD device and make the network connection between the two servers. SIOS DataKeeper starts replicating data as soon as the NBD connection is made.
The nbd-client process executes on the primary server and connects to the nbd-server process running on the backup server.
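The same building blocks exist as standard Linux components, which can help in understanding what the DataKeeper hierarchy manages. The sketch below is purely conceptual and is not the sequence of commands LifeKeeper itself runs; the host name, port, device names and mount point are illustrative assumptions.

    # On the backup (target) server: export the target partition over NBD.
    nbd-server 10809 /dev/sdb1

    # On the primary (source) server: attach the exported partition ...
    nbd-client backup-server 10809 /dev/nbd0

    # ... and mirror the local partition with it as a RAID1 device.
    # DataKeeper keeps its intent log in a bitmap file; an internal bitmap
    # is used here only to keep the sketch short.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/sdb1 /dev/nbd0

    # A supported file system can then be created on the mirror and mounted
    # like any other storage device (a replicated file system).
    mkfs.ext3 /dev/md0
    mkdir -p /mnt/replicated
    mount /dev/md0 /mnt/replicated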
Synchronization (and Resynchronization)
After the DataKeeper resource hierarchy is created and before it is extended, it is in a degraded mode; that is, data will be written to the local disk or partition only. Once the hierarchy is extended to the backup (target) system, SIOS DataKeeper synchronizes the data between the two systems, and all subsequent writes are replicated to the target. If at any time the data gets “out-of-sync” (for example, because of a system or network failure), SIOS DataKeeper will automatically resynchronize the data on the source and target systems. If the mirror was configured to use an intent log (bitmap file), SIOS DataKeeper uses it to determine which data is out-of-sync so that a full resynchronization is not required. If the mirror was not configured to use a bitmap file, then a full resync is performed after any interruption of data replication.
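If the NetRAID device is exposed as a standard md device (assumed here to be /dev/md0; the actual device name is assigned when the hierarchy is created), the progress of a synchronization or resynchronization can typically be followed through the standard md status interface:

    cat /proc/mdstat              # [UU] indicates both halves are in sync
    watch -n 5 cat /proc/mdstat   # follow a resync while it is running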
Standard Mirror Configuration
The most common mirror configuration involves two servers with a mirror established between local disks or partitions on each server, as shown below. Server1 is considered the primary server containing the mirror source. Server2 is the backup server containing the mirror target.
N+1 Configuration
A commonly used variation of the standard mirror configuration above is a cluster in which two or more servers replicate data to a common backup server. In this case, each mirror source must replicate to a separate disk or partition on the backup server, as shown below.
Multiple Target Configuration
When used with an appropriate Linux distribution and kernel version 2.6.7 or higher, SIOS DataKeeper can also replicate data from a single disk or partition on the primary server to multiple backup systems, as shown below.
A given source disk or partition can be replicated to a maximum of 7 mirror targets, and each mirror target must be on a separate system (i.e. a source disk or partition cannot be mirrored to more than one disk or partition on the same target system).
This type of configuration allows the use of LifeKeeper’s cascading failover feature, providing multiple backup systems for a protected application and its associated data.
To avoid a full resync to all targets when a mirror is started, the bitmap from the previous source must first be merged before the remaining targets in the cluster can be reconnected. Prior to v9.3.2, if the previous source was not available when the mirror was started, a full resync was automatically performed to each target. Starting with v9.3.2, when the mirror is started on a system, it waits for the previous source to join the cluster before connecting targets. When the previous source joins the cluster, its bitmap is merged so that all targets can join with a partial resync. If the mirror was stopped while the targets were in-sync, no previous source is needed to start the mirror and replicate to the targets. If the previous source is not available to rejoin the cluster, the targets can be manually resynchronized with a full resync using the “mirror_action fullresync” command. The variable LKDR_WAIT_FOR_PREVIOUS_SOURCE_TIMEOUT in /etc/default/LifeKeeper determines the resync behavior (refer to the DataKeeper Parameters List for more information).
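As an illustration, the timeout is set as a shell-style variable in /etc/default/LifeKeeper, and a manual full resync is requested with mirror_action. The value and the mirror tag below are illustrative assumptions, and the exact mirror_action arguments should be confirmed against its documentation.

    # /etc/default/LifeKeeper (excerpt): how long a newly started mirror
    # waits for the previous source before targets are connected.
    LKDR_WAIT_FOR_PREVIOUS_SOURCE_TIMEOUT=300

    # If the previous source cannot rejoin the cluster, force a full resync
    # of the targets instead (tag name is illustrative):
    mirror_action datarep-ext3-sdr fullresync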
SIOS DataKeeper Resource Hierarchy
The following example shows a typical DataKeeper resource hierarchy as it appears in the LifeKeeper GUI:
The resource datarep-ext3-sdr is the NetRAID resource, and the parent resource ext3-sdr is the file system resource. Note that subsequent references to the DataKeeper resource in this documentation refer to both resources together. Because the file system resource is dependent on the NetRAID resource, performing an action on the NetRAID resource will also affect the file system resource above it.
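The same hierarchy can also be viewed from the command line; a minimal sketch, assuming the lcdstatus utility in $LKROOT/bin (output formatting varies by release):

    # Brief resource status; datarep-ext3-sdr appears as a child of ext3-sdr
    # on the server where the hierarchy is in service.
    $LKROOT/bin/lcdstatus -q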
Failover Scenarios
Failover Scenarios – 2 nodes
The following four examples show what happens during a failover using SIOS DataKeeper. In these examples, the LifeKeeper for Linux cluster consists of two servers, Server 1 (primary server) and Server 2 (backup server).
Scenario 1
Server 1 has successfully completed its replication to Server 2 after which Server 1 becomes inoperable.
Result: Failover occurs. Server 2 now takes on the role of primary server and operates in a degraded mode (with no backup) until Server 1 is again operational. SIOS DataKeeper will then initiate a resynchronization from Server 2 to Server 1. This will be a full resynchronization on kernel 2.6.18 and lower. On kernels 2.6.19 and later or with Red Hat Enterprise Linux 5.4 kernels 2.6.18-164 or later (or a supported derivative of Red Hat 5.4 or later), the resynchronization will be partial, meaning only the changed blocks recorded in the bitmap files on the source and target will need to be synchronized.
SIOS DataKeeper sets the following flag on the server that is currently acting as the mirror source:
$LKROOT/subsys/scsi/resources/netraid/$TAG_last_owner
When Server 1 fails over to Server 2, this flag is set on Server 2. Thus, when Server 1 comes back up, SIOS DataKeeper removes the last owner flag from Server 1. It then begins resynchronizing the data from Server 2 to Server 1.
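A quick way to see which server currently holds this flag is to look for the flag file itself; the tag name below is illustrative, and $LKROOT is commonly /opt/LifeKeeper.

    # Run on each server; the file exists only on the node that most
    # recently acted as the mirror source.
    ls -l $LKROOT/subsys/scsi/resources/netraid/datarep-ext3-sdr_last_owner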
Scenario 2
Continuing from Scenario 1, Server 2 (still the primary server) becomes inoperable during the resynchronization with Server 1 (now the backup server).
Result: Because the resynchronization process did not complete successfully, there is potential for data corruption. As a result, LifeKeeper will not attempt to fail over the DataKeeper resource to Server 1. Only when Server 2 becomes operable will LifeKeeper attempt to bring the DataKeeper resource in-service (ISP) on Server 2.
Scenario 3
Both Server 1 (primary) and Server 2 (target) become inoperable. Server 1 (primary) comes back up first.
Result: Server 1 will not bring the DataKeeper resource in-service. The reason is that when a source server goes down and then cannot communicate with the target after it comes back online, it sets the following flag:
$LKROOT/subsys/scsi/resources/netraid/$TAG_data_corrupt
This is a safeguard to avoid resynchronizing data in the wrong direction. In this case, you will need to force the mirror online on Server 1, which will delete the data_corrupt flag and bring the resource into service on Server 1. See Force Mirror Online.
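Before forcing the mirror online, the presence of the flag can be confirmed on Server 1; the tag name below is illustrative, and the flag itself is removed by the Force Mirror Online action rather than by hand.

    TAG=datarep-ext3-sdr
    test -f $LKROOT/subsys/scsi/resources/netraid/${TAG}_data_corrupt \
      && echo "${TAG}: data_corrupt flag is present"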
Scenario 4
Both Server 1 (primary) and Server 2 (target) become inoperable. Server 2 (target) comes back up first.
Result: LifeKeeper will not bring the DataKeeper resource ISP on Server 2. When Server 1 comes back up, LifeKeeper will automatically bring the DataKeeper resource ISP on Server 1.
Failover Scenario – 3 nodes
The following example shows what happens during a failover using SIOS DataKeeper. In this example the LifeKeeper for Linux cluster consists of three servers, Server 1 (primary server), Server 2 (backup server) and Server 3 (backup server).
Server 1 (priority 1) has successfully completed its replication to Server 2 (priority 10) and Server 3 (priority 20) after which Server 1 becomes inoperable.
Result: Failover occurs to the next highest priority server, Server 2. Server 2 now takes on the role of primary server. Prior to release v9.3.2, Server 3 will be added to the mirror with a full resynchronization. With v9.3.2, Server 2 waits for Server 1 (the previous source) to return to the cluster before resuming replication to Server 3. This allows the bitmap from Server 1 to be merged with the bitmap from Server 2, allowing a partial resync to both Server 1 and Server 3. While waiting for Server 1 to reconnect to the cluster, the LifeKeeper GUI will show the status of Server 3 as “Out of Sync (Wait for Previous Source)”. The status of Server 1 will be “Unknown” while the server is not connected. When it initially connects, the GUI status will show “Out of Sync”, and the properties page for the mirror will identify it as “Out of Sync (Previous Source)”.
Once Server 1 reconnects, its bitmap is merged and resynchronization to Server 1 begins, at which point its status is shown as “Resyncing”.
When that resynchronization completes, Server 1’s status will update to “Target”, and Server 3 will then begin resynchronization with its status set to “Resyncing”.
Note: SIOS DataKeeper sets the following flags to track the mirror source:
$LKROOT/subsys/scsi/resources/netraid/$TAG_last_owner
$LKROOT/subsys/scsi/resources/netraid/$TAG_source
The $TAG_last_owner flag is set on the system that is currently acting as the mirror source, while the $TAG_source flag contains the name of the system that was the source at the last point in time that the local node was part of the mirror.
When Server 1 fails over to Server 2, the $TAG_last_owner flag is set on Server 2. The $TAG_source flag on Server 2 identifies Server 1 as the previous source (which has the bitmap needed to do a partial resync to Server 1 and Server 3). When Server 1 comes back up, SIOS DataKeeper removes the $TAG_last_owner flag from Server 1. Server 2 then merges the bitmap from Server 1 and begins resynchronizing the data from Server 2 to Server 1. When the resynchronization to Server 1 is complete, the $TAG_source flag on Server 1 is updated with the name of Server 2. After Server 1 is synchronized, Server 2 will perform the same resynchronization to Server 3. When the resynchronization to Server 3 is complete, the $TAG_source flag on Server 3 is updated with the name of Server 2.
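The two flags can be inspected on any node with ordinary file commands; the tag name below is illustrative.

    TAG=datarep-ext3-sdr
    DIR=$LKROOT/subsys/scsi/resources/netraid

    # Present only on the node that is currently acting as the mirror source:
    ls $DIR/${TAG}_last_owner 2>/dev/null && echo "this node is the mirror source"

    # Name of the system this node last saw as the mirror source:
    cat $DIR/${TAG}_source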