Part VI. Learning more about DRBD

Chapter 18. DRBD Internals

This chapter gives some background information about some of DRBD's internal algorithms and structures. It is intended for interested users wishing to gain a certain degree of background knowledge about DRBD. It does not dive into DRBD's inner workings deep enough to be a reference for DRBD developers. For that purpose, please refer to the papers listed in the section called “Publications”, and of course to the comments in the DRBD source code.

DRBD meta data

DRBD stores various pieces of information about the data it keeps in a dedicated area. This meta data includes:

  • the size of the DRBD device,

  • the generation identifiers (GI, described in detail in the section called “Generation Identifiers”),

  • the activity log (AL, described in detail in the section called “The Activity Log”),

  • the quick-sync bitmap (described in detail in the section called “The quick-sync bitmap”).

This meta data may be stored internally or externally. Which method is used is configurable on a per-resource basis.

Internal meta data

Configuring a resource to use internal meta data means that DRBD stores its meta data on the same physical lower-level device as the actual production data. It does so by setting aside an area at the end of the device for the specific purpose of storing metadata.

Advantage. Since the meta data are inextricably linked with the actual data, no special action is required from the administrator in case of a hard disk failure. The meta data are lost together with the actual data and are also restored together.

Disadvantage. In case of the lower-level device being a single physical hard disk (as opposed to a RAID set), internal meta data may negatively affect write throughput. A write request issued by the application may trigger an update of the meta data in DRBD. If the meta data are stored on the same magnetic disk as the actual data, the write operation may result in two additional movements of the hard disk's read/write head.

[Caution]Caution

If you are planning to use internal meta data in conjunction with an existing lower-level device that already has data which you wish to preserve, you must account for the space required by DRBD's meta data. Otherwise, upon DRBD resource creation, the newly created metadata would overwrite data at the end of the lower-level device, potentially destroying existing files in the process. To avoid that, you must do one of the following things:

  • Enlarge your lower-level device. This is possible with any logical volume management facility (such as LVM or EVMS) as long as you have free space available in the corresponding volume group or container. It may also be supported by hardware storage solutions.

  • Shrink your existing file system on your lower-level device. This may or may not be supported by your file system.

  • If neither of the two is possible, use external meta data instead.

To estimate the amount by which you must enlarge your lower-level device or shrink your file system, see the section called “Estimating meta data size”.

External meta data

External meta data is simply stored on a separate, dedicated block device distinct from that which holds your production data.

Advantage. For some write operations, using external meta data produces a somewhat improved latency behavior.

Disadvantage. Meta data are not inextricably linked with the actual production data. This means that, should a hardware failure destroy just the production data (but not the DRBD meta data), manual intervention is required to effect a full data sync from the surviving node onto the subsequently replaced disk.

Use of external meta data is also the only viable option if all of the following apply:

  • You are using DRBD to duplicate an existing device that already contains data you wish to preserve, and

  • that existing device does not support enlargement, and

  • the existing file system on the device does not support shrinking.

To estimate the required size of the block device dedicated to hold your device meta data, see the section called “Estimating meta data size”.

Estimating meta data size

You may calculate the exact space requirements for DRBD's meta data using the following formula:

Equation 18.1. Calculating DRBD meta data size (exactly)

  Ms = ⌈Cs / 2^18⌉ × 8 + 72

Cs is the data device size in sectors.

[Note]Note

You may retrieve the device size by issuing blockdev --getsz device.

The result, Ms, is also expressed in sectors. To convert to MB, divide by 2048 (for a 512-byte sector size, which is the default on all Linux platforms except s390).

In practice, you may use a reasonably good approximation, given below. Note that in this formula, the unit is megabytes, not sectors:

Equation 18.2. Estimating DRBD meta data size (approximately)

  Ms ≈ Cs / 32768 + 1
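
For illustration, both calculations can be put into a few lines of Python; the helper names here are made up for this example and are not part of DRBD:

  import math

  SECTOR = 512   # bytes; default sector size on all Linux platforms except s390

  def md_size_exact_sectors(cs):
      """Equation 18.1: internal meta data size, in sectors, for a data device of cs sectors."""
      # One bitmap bit covers 4 KiB of data, so every 2^18 data sectors require 8 sectors
      # of bitmap, plus a fixed 72 sectors for the activity log and the meta data superblock.
      return math.ceil(cs / 2**18) * 8 + 72

  def md_size_approx_mb(cs_mb):
      """Equation 18.2: approximate meta data size, in MB, for a data device of cs_mb MB."""
      return cs_mb / 32768 + 1

  # Example: a 400 GiB lower-level device (size in sectors as reported by blockdev --getsz)
  cs = 400 * 1024 ** 3 // SECTOR
  ms = md_size_exact_sectors(cs)
  print("%d sectors, about %.1f MB" % (ms, ms / 2048))

For the 400 GiB example this yields 25672 sectors (about 12.5 MB), while the approximation gives 13.5 MB, erring slightly on the safe side.
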
Generation Identifiers

DRBD uses generation identifiers (GI's) to identify generations of replicated data. This is DRBD's internal mechanism used for

  • determining whether the two nodes are in fact members of the same cluster (as opposed to two nodes that were connected accidentally),

  • determining the direction of background re-synchronization (if necessary),

  • determining whether full re-synchronization is necessary or whether partial re-synchronization is sufficient,

  • identifying split brain.

Data generations

DRBD marks the start of a new data generation at each of the following occurrences:

  • The initial device full sync,

  • a disconnected resource switching to the primary role,

  • a resource in the primary role disconnecting.

Thus, we can summarize that whenever a resource is in the Connected connection state, and both nodes' disk state is UpToDate, the current data generation on both nodes is the same. The inverse is also true.

Every new data generation is identified by an 8-byte, universally unique identifier (UUID).

The generation identifier tuple

DRBD keeps four pieces of information about current and historical data generations in the local resource meta data:

  • Current UUID. This is the generation identifier for the current data generation, as seen from the local node's perspective. When a resource is Connected and fully synchronized, the current UUID is identical between nodes.

  • Bitmap UUID. This is the UUID of the generation against which the on-disk sync bitmap is tracking changes. Like the on-disk sync bitmap itself, this identifier is only relevant while in disconnected mode. If the resource is Connected, this UUID is always empty (zero).

  • Two Historical UUID's. These are the identifiers of the two data generations preceding the current one.

Collectively, these four items are referred to as the generation identifier tuple, or GI tuple for short.
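
For illustration only (this mirrors the description above, not DRBD's on-disk format), the tuple can be modeled as a small Python structure:

  from dataclasses import dataclass
  from typing import Tuple

  @dataclass
  class GITuple:
      """Simplified model of the four GI tuple members kept in the resource meta data."""
      current: int = 0                    # current UUID; 0 means "empty"
      bitmap: int = 0                     # UUID the on-disk sync bitmap tracks changes against
      history: Tuple[int, int] = (0, 0)   # the two historical UUIDs, most recent first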

How generation identifiers change

Start of a new data generation

When a node loses connection to its peer (either by network failure or manual intervention), DRBD modifies its local generation identifiers in the following manner:

Figure 18.1. GI tuple changes at start of a new data generation

Note: only changes on primary node shown (on a secondary node, no changes apply).


  1. A new UUID is created for the new data generation. This becomes the new current UUID for the primary node.

  2. The previous UUID now refers to the generation the bitmap is tracking changes against, so it becomes the new bitmap UUID for the primary node.

  3. On the secondary node, the GI tuple remains unchanged.
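
Building on the GITuple sketch above, this rotation on the primary could be written as follows; new_uuid() is a hypothetical stand-in for DRBD's UUID generation:

  import secrets

  def new_uuid() -> int:
      # hypothetical stand-in for DRBD's UUID generation: a non-zero 8-byte identifier
      return secrets.randbits(64) | 1

  def start_new_data_generation(primary: "GITuple") -> None:
      """GI tuple changes on the primary at the start of a new data generation."""
      primary.bitmap = primary.current    # step 2: the previous current UUID becomes the bitmap UUID
      primary.current = new_uuid()        # step 1: a fresh UUID identifies the new data generation
      # step 3: the secondary's GI tuple is left untouched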

Start of re-synchronization

Upon the initiation of re-synchronization, DRBD performs these modifications on the local generation identifiers:

Figure 18.2. GI tuple changes at start of re-synchronization

Note: only changes on synchronization source shown.


  1. The current UUID on the synchronization source remains unchanged.

  2. The bitmap UUID on the synchronization source is rotated out to the first historical UUID.

  3. A new bitmap UUID is generated on the synchronization source.

  4. This UUID becomes the new current UUID on the synchronization target.

  5. The bitmap and historical UUID's on the synchronization target remain unchanged.

Completion of re-synchronization

When re-synchronization concludes, the following changes are performed:

Figure 18.3. GI tuple changes at completion of re-synchronization

Note: only changes on synchronization source shown.


  1. The current UUID on the synchronization source remains unchanged.

  2. The bitmap UUID on the synchronization source is rotated out to the first historical UUID, with that UUID moving to the second historical entry (any existing second historical entry is discarded).

  3. The bitmap UUID on the synchronization source is then emptied (zeroed).

  4. The synchronization target adopts the entire GI tuple from the synchronization source.
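
Both resynchronization-related transitions can be sketched in the same style, again building on the GITuple and new_uuid() sketches above (an illustration, not DRBD's implementation):

  def start_resync(source: "GITuple", target: "GITuple") -> None:
      """GI tuple changes at the start of re-synchronization (Figure 18.2)."""
      # 1. the current UUID on the source stays as it is
      source.history = (source.bitmap, source.history[0])   # 2. bitmap UUID rotated into history
      source.bitmap = new_uuid()                             # 3. fresh bitmap UUID on the source ...
      target.current = source.bitmap                         # 4. ... which becomes the target's current UUID
      # 5. bitmap and historical UUIDs on the target stay unchanged

  def finish_resync(source: "GITuple", target: "GITuple") -> None:
      """GI tuple changes at the completion of re-synchronization (Figure 18.3)."""
      # 1. the current UUID on the source stays as it is
      source.history = (source.bitmap, source.history[0])   # 2. rotate; the oldest entry is discarded
      source.bitmap = 0                                      # 3. the bitmap UUID is emptied (zeroed)
      # 4. the target adopts the source's entire GI tuple
      target.current, target.bitmap, target.history = source.current, source.bitmap, source.history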

How DRBD uses generation identifiers

When a connection between nodes is established, the two nodes exchange their currently available generation identifiers, and proceed accordingly. A number of possible outcomes exist (a simplified sketch of this decision logic follows the list):

  1. Current UUID's empty on both nodes. The local node detects that both its current UUID and the peer's current UUID are empty. This is the normal occurrence for a freshly configured resource that has not had the initial full sync initiated. No synchronization takes place; it has to be started manually.

  2. Current UUID's empty on one node. The local node detects that the peer's current UUID is empty, and its own is not. This is the normal case for a freshly configured resource on which the initial full sync has just been initiated, the local node having been selected as the initial synchronization source. DRBD now sets all bits in the on-disk sync bitmap (meaning it considers the entire device out-of-sync), and starts synchronizing as a synchronization source.

    In the opposite case (local current UUID empty, peer's non-empty), DRBD performs the same steps, except that the local node becomes the synchronization target.

  3. Equal current UUID's. The local node detects that its current UUID and the peer's current UUID are non-empty and equal. This is the normal occurrence for a resource that went into disconnected mode at a time when it was in the secondary role, and was not promoted on either node while disconnected. No synchronization takes place, as none is necessary.

  4. Bitmap UUID matches peer's current UUID. The local node detects that its bitmap UUID matches the peer's current UUID, and that the peer's bitmap UUID is empty. This is the normal and expected occurrence after a secondary node failure, with the local node being in the primary role. It means that the peer never became primary in the meantime and worked on the basis of the same data generation all along. DRBD now initiates a normal, background re-synchronization, with the local node becoming the synchronization source.

    If, conversely, the local node detects that its bitmap UUID is empty, and that the peer's bitmap UUID matches the local node's current UUID, then that is the normal and expected occurrence after a failure of the local node. Again, DRBD now initiates a normal, background re-synchronization, with the local node becoming the synchronization target.

  5. Current UUID matches peer's historical UUID. The local node detects that its current UUID matches one of the peer's historical UUID's. This implies that while the two data sets share a common ancestor and the peer node has the up-to-date data, the information kept in the peer node's bitmap is outdated and not usable. Thus, a normal synchronization would be insufficient. DRBD now marks the entire device as out-of-sync and initiates a full background re-synchronization, with the local node becoming the synchronization target.

    In the opposite case (one of the local node's historical UUID's matches the peer's current UUID), DRBD performs the same steps, except that the local node becomes the synchronization source.

  6. Bitmap UUID's match, current UUID's do not.  The local node detects that its current UUID differs from the peer's current UUID, and that the bitmap UUID's match. This is split brain, but one where the data generations have the same parent. This means that DRBD invokes split brain auto-recovery strategies, if configured. Otherwise, DRBD disconnects and waits for manual split brain resolution.

  7. Neither current nor bitmap UUID's match. The local node detects that its current UUID differs from the peer's current UUID, and that the bitmap UUID's do not match. This is split brain with unrelated ancestor generations, thus auto-recovery strategies, even if configured, are moot. DRBD disconnects and waits for manual split brain resolution.

  8. No UUID's match. Finally, in case DRBD fails to detect even a single matching element in the two nodes' GI tuples, it logs a warning about unrelated data and disconnects. This is DRBD's safeguard against accidental connection of two cluster nodes that have never heard of each other before.
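
To make the case distinction above more concrete, here is a deliberately simplified decision sketch built on the GITuple model from earlier in this chapter; DRBD's real algorithm lives in the kernel module and handles additional corner cases:

  def decide_sync(local: "GITuple", peer: "GITuple") -> str:
      """Grossly simplified outcome selection for two exchanged GI tuples."""
      if local.current == 0 and peer.current == 0:
          return "no sync; initial full sync must be started manually"      # case 1
      if peer.current == 0:
          return "initial full sync, local node is source"                  # case 2
      if local.current == 0:
          return "initial full sync, local node is target"                  # case 2, reversed
      if local.current == peer.current:
          return "in sync; nothing to do"                                   # case 3
      if local.bitmap == peer.current and peer.bitmap == 0:
          return "bitmap-based resync, local node is source"                # case 4
      if peer.bitmap == local.current and local.bitmap == 0:
          return "bitmap-based resync, local node is target"                # case 4, reversed
      if local.current in peer.history:
          return "full resync, local node is target"                        # case 5
      if peer.current in local.history:
          return "full resync, local node is source"                        # case 5, reversed
      if local.bitmap != 0 and local.bitmap == peer.bitmap:
          return "split brain, related generations; try auto-recovery"      # case 6
      local_ids = {local.current, local.bitmap, *local.history} - {0}
      peer_ids = {peer.current, peer.bitmap, *peer.history} - {0}
      if local_ids & peer_ids:
          return "split brain, unrelated recent generations; manual recovery"  # case 7
      return "unrelated data; log a warning and disconnect"                 # case 8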

The Activity Log

Purpose

During a write operation, DRBD forwards the write to the local backing block device while also sending the data block over the network. For all practical purposes, these two actions occur simultaneously. Random timing behavior may cause a situation where the write operation has been completed, but the transmission over the network has not yet taken place.

If, at this moment, the active node fails and a fail-over is initiated, then this data block is out of sync between the nodes: it was written on the failed node prior to the crash, but replication has not yet completed. Thus, when the node eventually recovers, this block must be removed from its data set during the subsequent synchronization. Otherwise, the crashed node would be "one write ahead" of the surviving node, which would violate the all-or-nothing principle of replicated storage. This issue is not limited to DRBD; in fact, it exists in practically all replicated storage configurations. Many other storage solutions (just as DRBD itself, prior to version 0.7) therefore require that after a failure of the active node, that node must be fully synchronized anew after its recovery.

DRBD's approach, since version 0.7, is a different one. The activity log (AL), stored in the meta data area, keeps track of those blocks that have recently been written to. Colloquially, these areas are referred to as hot extents.

If a temporarily failed node that was in active mode at the time of failure is synchronized, only those hot extents highlighted in the AL need to be synchronized, rather than the full device. This drastically reduces synchronization time after an active node crash.

Active extents

The activity log has a configurable parameter, the number of active extents. Every active extent adds 4MiB to the amount of data being retransmitted after a Primary crash. This parameter must be understood as a compromise between the following opposites:

  • Many active extents. Keeping a large activity log improves write throughput. Every time a new extent is activated, an old extent is reset to inactive. This transition requires a write operation to the meta data area. If the number of active extents is high, old active extents are swapped out fairly rarely, reducing meta data write operations and thereby improving performance.

  • Few active extents. Keeping a small activity log reduces synchronization time after active node failure and subsequent recovery.

Selecting a suitable Activity Log size

The definition of the number of extents should be based on the desired synchronization time at a given synchronization rate. The number of active extents can be calculated as follows:

Equation 18.3. Active extents calculation based on sync rate and target sync time

  E = R × tsync / 4

R is the synchronization rate, given in MB/s. tsync is the target synchronization time, in seconds. E is the resulting number of active extents.

To provide an example, suppose our cluster has an I/O subsystem with a throughput rate of 90 MiByte/s that was configured to a synchronization rate of 30 MiByte/s (R=30), and we want to keep our target synchronization time at 4 minutes or 240 seconds (tsync=240):

Equation 18.4. Active extents calculation based on sync rate and target sync time (example)

  E = 30 × 240 / 4 = 1800

The exact result is 1800, but since DRBD's hash function for the implementation of the AL works best if the number of extents is set to a prime number, we select 1801.
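
A small helper along these lines (hypothetical, not part of the DRBD tools) computes the number of extents and picks the next prime at or above the exact result:

  def is_prime(n: int) -> bool:
      return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

  def active_extents(sync_rate_mb_s: float, target_sync_time_s: float) -> int:
      """Equation 18.3, rounded up to the next prime (each active extent covers 4 MiB)."""
      e = int(sync_rate_mb_s * target_sync_time_s / 4)
      while not is_prime(e):
          e += 1
      return e

  print(active_extents(30, 240))   # prints 1801, matching the example above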

The quick-sync bitmap

The quick-sync bitmap is the internal data structure which DRBD uses, on a per-resource basis, to keep track of blocks being in sync (identical on both nodes) or out-of-sync. It is only relevant when a resource is in disconnected mode.

In the quick-sync bitmap, one bit represents a 4-KiB chunk of on-disk data. If the bit is cleared, it means that the corresponding block is still in sync with the peer node. That implies that the block has not been written to since the time of disconnection. Conversely, if the bit is set, it means that the block has been modified and needs to be re-synchronized whenever the connection becomes available again.

As DRBD detects write I/O on a disconnected device, and hence starts setting bits in the quick-sync bitmap, it does so in RAM, thus avoiding expensive synchronous meta data I/O operations. Only when the corresponding blocks turn cold (that is, expire from the activity log) does DRBD make the appropriate modifications in an on-disk representation of the quick-sync bitmap. Likewise, if the resource happens to be manually shut down on the remaining node while disconnected, DRBD flushes the complete quick-sync bitmap out to persistent storage.

When the peer node recovers or the connection is re-established, DRBD combines the bitmap information from both nodes to determine the total data set that it must re-synchronize. Simultaneously, DRBD examines the generation identifiers to determine the direction of synchronization.

The node acting as the synchronization source then transmits the agreed-upon blocks to the peer node, clearing sync bits in the bitmap as the synchronization target acknowledges the modifications. If the re-synchronization is now interrupted (by another network outage, for example) and subsequently resumed it will continue where it left off — with any additional blocks modified in the meantime being added to the re-synchronization data set, of course.
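
The sizes involved are easy to work out; the following sketch (with hypothetical helper names) computes the bitmap footprint for a given device and shows the combination of both nodes' out-of-sync information on reconnect, which effectively amounts to a union (bitwise OR) of the two bitmaps:

  def bitmap_bytes(device_size_bytes: int) -> int:
      """Approximate size of the quick-sync bitmap: one bit per 4-KiB chunk of data."""
      chunks = -(-device_size_bytes // 4096)   # number of 4-KiB chunks (ceiling division)
      return -(-chunks // 8)                   # eight chunks are tracked per bitmap byte

  def combine_bitmaps(local: bytes, peer: bytes) -> bytes:
      """Union of both nodes' out-of-sync information, expressed as a bitwise OR."""
      return bytes(a | b for a, b in zip(local, peer))

  print(bitmap_bytes(1 << 40) // (1 << 20), "MiB")   # a 1-TiB device: about 32 MiB of bitmap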

[Note]Note

Re-synchronization may also be paused and resumed manually with the drbdadm pause-sync and drbdadm resume-sync commands. You should, however, not do so light-heartedly: interrupting re-synchronization leaves your secondary node's disk Inconsistent longer than necessary.

The peer fencing interface

DRBD has a defined interface for the mechanism that fences the peer node in case of the replication link being interrupted. The drbd-peer-outdater helper, bundled with Heartbeat, is the reference implementation for this interface. However, you may easily implement your own peer fencing helper program.

The fencing helper is invoked only in case

  1. a fence-peer handler has been defined in the resource's (or common) handlers section, and

  2. the fencing option for the resource is set to either resource-only or resource-and-stonith, and

  3. the replication link is interrupted long enough for DRBD to detect a network failure.

The program or script specified as the fence-peer handler, when it is invoked, has the DRBD_RESOURCE and DRBD_PEER environment variables available. They contain the name of the affected DRBD resource and the peer's hostname, respectively.

Any peer fencing helper program (or script) must return one of the following exit codes (a skeleton handler is sketched after the table):

Table 18.1. fence-peer handler exit codes

Exit code   Implication
3           Peer's disk state was already Inconsistent.
4           Peer's disk state was successfully set to Outdated (or was Outdated to begin with).
5           Connection to the peer node failed, peer could not be reached.
6           Peer refused to be outdated because the affected resource was in the primary role.
7           Peer node was successfully fenced off the cluster. This should never occur unless fencing is set to resource-and-stonith for the affected resource.
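
For illustration, a minimal skeleton honoring this interface might look as follows; outdate_peer() is a placeholder, since how you actually reach and outdate the peer depends entirely on your environment:

  #!/usr/bin/env python3
  """Skeleton fence-peer handler; point the fence-peer option in the handlers
  section at this script. Exit codes follow Table 18.1."""
  import os
  import sys

  def outdate_peer(resource: str, peer: str) -> int:
      # Placeholder: contact the peer (management network, ssh, a STONITH device, ...)
      # and mark the resource Outdated there; return the matching code from Table 18.1.
      return 5   # example: peer could not be reached

  def main() -> int:
      resource = os.environ.get("DRBD_RESOURCE", "")
      peer = os.environ.get("DRBD_PEER", "")
      if not resource:
          return 5   # nothing sensible to do without a resource name
      return outdate_peer(resource, peer)

  if __name__ == "__main__":
      sys.exit(main())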


Chapter 19. Getting more information

Commercial DRBD support

Commercial DRBD support, consultancy, and training services are available from the project's sponsor company, LINBIT.

Public mailing list

The public mailing list for general usage questions regarding DRBD is drbd-user@lists.linbit.com. This is a subscribers-only mailing list; you may subscribe at http://lists.linbit.com/drbd-user. A complete list archive is available at http://lists.linbit.com/pipermail/drbd-user.

Public IRC Channels

Some of the DRBD developers can occasionally be found on the irc.freenode.net public IRC server, particularly in the following channels:

  • #drbd,

  • #linux-ha,

  • #linux-cluster.

Getting in touch on IRC is a good way of discussing suggestions for improvements in DRBD, and of having developer-level discussions.

Official Twitter account

LINBIT maintains an official Twitter account, linbit_drbd.

If you tweet about DRBD, please include the #drbd hashtag.

Publications

DRBD's authors have written and published a number of papers on DRBD in general, or a specific aspect of DRBD. Here is a short selection:

Lars Ellenberg. DRBD v8.0.x and beyond. 2007. Available at http://www.drbd.org/fileadmin/drbd/publications/drbd8.linux-conf.eu.2007.pdf.

Philipp Reisner. DRBD v8 - Replicated Storage with Shared Disk Semantics. 2007. Available at http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf.

Philipp Reisner. Rapid resynchronization for replicated storage. 2006. Available at http://www.drbd.org/fileadmin/drbd/publications/drbd-activity-logging_v6.pdf.

Other useful resources