Part I. Introduction to DRBD

Chapter 1. DRBD Fundamentals

The Distributed Replicated Block Device (DRBD) is a software-based, shared-nothing, replicated storage solution mirroring the content of block devices (hard disks, partitions, logical volumes etc.) between servers.

DRBD mirrors data

  • In real time. Replication occurs continuously, while applications modify the data on the device.

  • Transparently. The applications that store their data on the mirrored device are oblivious of the fact that the data is in fact stored on several computers.

  • Synchronously or asynchronously. With synchronous mirroring, a writing application is notified of write completion only after the write has been carried out on both computer systems. Asynchronous mirroring means the writing application is notified of write completion when the write has completed locally, but before the write has propagated to the peer system.

Kernel module

DRBD's core functionality is implemented by way of a Linux kernel module. Specifically, DRBD constitutes a driver for a virtual block device, so DRBD is situated right near the bottom of a system's I/O stack. Because of this, DRBD is extremely flexible and versatile, which makes it a replication solution suitable for adding high availability to just about any application.

[Important]Important

DRBD is, by definition and as mandated by the Linux kernel architecture, agnostic of the layers above it. Thus, it is impossible for DRBD to miraculously add features to upper layers that these do not possess. For example, DRBD cannot auto-detect file system corruption or add active-active clustering capability to file systems like ext3 or XFS.

Figure 1.1. DRBD's position within the Linux I/O stack

User space administration tools

DRBD comes with a handful of administration tools which communicate with the kernel module in order to configure and administer DRBD resources.

  • drbdadm. The high-level administration tool of the DRBD program suite. It obtains all DRBD configuration parameters from the configuration file /etc/drbd.conf. drbdadm acts as a front-end application for both drbdsetup and drbdmeta and hands off instructions to either of the two for actual command execution. drbdadm has a dry-run mode, invoked with the -d option, which exposes the commands issued by the back-end programs without executing them (see the example following this list).

  • drbdsetup. The program that allows users to configure the DRBD module that has been loaded into the running kernel. It is the low-level tool within the DRBD program suite. When using this program, all configuration parameters have to be directly handed over on the command line. This allows for maximum flexibility, albeit at the price of reduced ease of use. Most users will use drbdsetup very rarely.

  • drbdmeta. The program which allows users to create, dump, restore, and modify DRBD's meta data structures. This, too, is a command that most users will use only very rarely.
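
A brief sketch of how these tools relate to each other in practice; the resource name r0 and the device /dev/drbd0 are placeholders for your own configuration:

    # Dry-run mode: show the drbdsetup/drbdmeta commands drbdadm would issue,
    # without executing anything.
    drbdadm -d adjust r0

    # Apply the configuration found in /etc/drbd.conf to the kernel module.
    drbdadm adjust r0

    # Inspect the currently active configuration of one device at the drbdsetup level.
    drbdsetup /dev/drbd0 show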

Resources

In DRBD, resource is the collective term that refers to all aspects of a particular replicated storage device. These include (see the configuration sketch following this list):

  • Resource name. This can be any arbitrary, US-ASCII name not containing whitespace by which the resource is referred to.

  • DRBD device. This is the virtual block device managed by DRBD. It has a device major number of 147, and its minor numbers are numbered from 0 onwards, as is customary. The associated block device is always named /dev/drbdm, where m is the device minor number.

    [Note]Note

    Very early DRBD versions hijacked NBD's device major number 43. This is long obsolete; 147 is the LANANA-registered DRBD device major.

  • Disk configuration. This entails the local copy of the data, and meta data for DRBD's internal use.

  • Network configuration. This entails all aspects of DRBD's communication with the peer node.
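
To make these terms concrete, here is a minimal sketch of a resource definition as it might appear in /etc/drbd.conf. The resource name r0, the host names alice and bob, the backing disk /dev/sda7, and the IP addresses are purely illustrative placeholders:

    resource r0 {                 # resource name: arbitrary US-ASCII, no whitespace
      device    /dev/drbd0;       # the DRBD device (major 147, minor 0)
      disk      /dev/sda7;        # disk configuration: the local backing device
      meta-disk internal;         # DRBD meta data kept on the backing device itself
      on alice {                  # network configuration, one section per node
        address 10.1.1.31:7789;
      }
      on bob {
        address 10.1.1.32:7789;
      }
    }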

Resource roles

In DRBD, every resource has a role, which may be Primary or Secondary.

[Note]Note

The choice of terms here is not arbitrary. These roles were deliberately not named "Active" and "Passive" by DRBD's creators. Primary vs. secondary refers to a concept related to availability of storage, whereas active vs. passive refers to the availability of an application. It is usually the case in a high-availability environment that the primary node is also the active one, but this is by no means necessary.

  • A DRBD device in the primary role can be used unrestrictedly for read and write operations. It may be used for creating and mounting file systems, raw or direct I/O to the block device, etc.

  • A DRBD device in the secondary role receives all updates from the peer node's device, but otherwise disallows access completely. It cannot be used by applications, for neither read nor write access. The reason for disallowing even read-only access to the device is the necessity to maintain cache coherency, which would be impossible if a secondary resource were made accessible in any way.

The resource's role can, of course, be changed, either by manual intervention or by way of some automated algorithm by a cluster management application. Changing the resource role from secondary to primary is referred to as promotion, whereas the reverse operation is termed demotion.
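
For example, assuming a resource named r0 (a placeholder), manual promotion and demotion look like this; a cluster manager issues the equivalent operations automatically:

    drbdadm primary r0      # promotion: Secondary -> Primary, the device becomes usable
    drbdadm secondary r0    # demotion: Primary -> Secondary, access is disallowed again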

Chapter 2. DRBD Features

This chapter discusses various useful DRBD features, and gives some background information about them. Some of these features will be important to most users, some will only be relevant in very specific deployment scenarios.

Chapter 6, Common administrative tasks and Chapter 7, Troubleshooting and error recovery contain instructions on how to enable and use these features during day-to-day operation.

Single-primary mode

In single-primary mode, any resource is, at any given time, in the primary role on only one cluster member. Since it is thus guaranteed that only one cluster node manipulates the data at any moment, this mode can be used with any conventional file system (ext3, ext4, XFS etc.).

Deploying DRBD in single-primary mode is the canonical approach for high availability (fail-over capable) clusters.

Dual-primary mode

This feature is available in DRBD 8.0 and later.

In dual-primary mode, any resource is, at any given time, in the primary role on both cluster nodes. Since concurrent access to the data is thus possible, this mode requires the use of a shared cluster file system that utilizes a distributed lock manager. Examples include GFS and OCFS2.

Deploying DRBD in dual-primary mode is the preferred approach for load-balancing clusters which require concurrent data access from two nodes. This mode is disabled by default, and must be enabled explicitly in DRBD's configuration file.
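
In configuration terms, this amounts to setting the allow-two-primaries option in the resource's net section, roughly as sketched below; the resource name r0 is a placeholder, and a cluster file system plus proper fencing remain mandatory:

    resource r0 {
      net {
        allow-two-primaries;    # permit the Primary role on both nodes at once
      }
      ...
    }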

See the section called “Enabling dual-primary mode” for information on enabling dual-primary mode for specific resources.

Replication modes

DRBD supports three distinct replication modes, allowing three degrees of replication synchronicity.

Protocol A. Asynchronous replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has been placed in the local TCP send buffer. In the event of forced fail-over, data loss may occur. The data on the standby node is consistent after fail-over, however, the most recent updates performed prior to the crash could be lost.

Protocol B. Memory synchronous (semi-synchronous) replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Normally, no writes are lost in case of forced fail-over. However, in the event of simultaneous power failure on both nodes and concurrent, irreversible destruction of the primary's data store, the most recent writes completed on the primary may be lost.

Protocol C. Synchronous replication protocol. Local write operations on the primary node are considered completed only after both the local and the remote disk write have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course, inevitable even with this replication protocol if both nodes (or their storage subsystems) are irreversibly destroyed at the same time.

By far, the most commonly used replication protocol in DRBD setups is protocol C.
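
The replication protocol is selected per resource in the configuration file. A minimal sketch, with the resource name r0 as a placeholder:

    resource r0 {
      protocol C;    # fully synchronous; use A or B for the other two modes
      ...
    }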

[Note]Note

The choice of replication protocol influences two factors of your deployment: protection and latency. Throughput, by contrast, is largely independent of the replication protocol selected.

See the section called “Configuring your resource” for an example resource configuration which demonstrates replication protocol configuration.

Multiple replication transports

This feature is available in DRBD 8.2.7 and later.

DRBD's replication and synchronization framework socket layer supports multiple low-level transports (see the sketch following this list):

  • TCP over IPv4. This is the canonical implementation, and DRBD's default. It may be used on any system that has IPv4 enabled.

  • TCP over IPv6. When configured to use standard TCP sockets for replication and synchronization, DRBD can also use IPv6 as its network protocol. This is equivalent in semantics and performance to IPv4, albeit using a different addressing scheme.

  • SuperSockets. SuperSockets replace the TCP/IP portions of the stack with a single, monolithic, highly efficient and RDMA capable socket implementation. DRBD can use this socket type for very low latency replication. SuperSockets must run on specific hardware which is currently available from a single vendor, Dolphin Interconnect Solutions.
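
The transport is selected implicitly by the address family used in the resource's network configuration. As a sketch, assuming the address-family syntax documented in drbd.conf(5) for your DRBD version, and with all addresses as placeholders (both peers must use the same family):

    # TCP over IPv4 (the default address family)
    on alice { address 10.1.1.31:7789; }

    # TCP over IPv6: note the address family keyword and the bracketed address
    on alice { address ipv6 [2001:db8::1]:7789; }

SuperSockets require the vendor's hardware and software stack and are configured according to the vendor's documentation.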

Efficient synchronization

(Re-)synchronization is distinct from device replication. While replication occurs on any write event to a resource in the primary role, synchronization is decoupled from incoming writes. Rather, it affects the device as a whole.

Synchronization is necessary if replication has been interrupted for any reason, be it due to a failure of the primary node, a failure of the secondary node, or an interruption of the replication link. Synchronization is efficient in the sense that DRBD does not synchronize modified blocks in the order they were originally written, but in linear order, which has the following consequences:

  • Synchronization is fast, since blocks in which several successive write operations occurred are only synchronized once.

  • Synchronization is also associated with few disk seeks, as blocks are synchronized according to the natural on-disk block layout.

  • During synchronization, the data set on the standby node is partly obsolete and partly already updated. This state of data is called inconsistent.

A node with inconsistent data generally cannot be put into operation; it is therefore desirable to keep the time period during which a node is inconsistent as short as possible. The service continues to run uninterrupted on the active node while background synchronization is in progress.

You may estimate the expected sync time based on the following simple formula:

Equation 2.1. Synchronization time

    tsync = D/R

tsync is the expected sync time. D is the amount of data to be synchronized, which you are unlikely to have any influence over (this is the amount of data that was modified by your application while the replication link was broken). R is the rate of synchronization, which is configurable — bounded by the throughput limitations of the replication network and I/O subsystem.
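
For example, if 10 GiB of data were modified while the replication link was down (D) and the synchronization rate (R) is limited to 40 MiB/s, the expected resynchronization time is tsync = 10240 MiB / 40 MiB/s = 256 seconds, or a little over four minutes; the figures are purely illustrative.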

The efficiency of DRBD's synchronization algorithm may be further enhanced by using data digests, also known as checksums. When using checksum-based synchronization, rather than performing a brute-force overwrite of blocks marked out of sync, DRBD reads blocks before synchronizing them and computes a hash of the contents currently found on disk. It then compares this hash with one computed from the same sector on the peer, and omits re-writing the block if the hashes match. This can dramatically cut down synchronization times in situations where a filesystem re-writes a sector with identical contents while DRBD is in disconnected mode.

See the section called “Configuring the rate of synchronization” and the section called “Configuring checksum-based synchronization” for configuration suggestions with regard to synchronization.
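
As a sketch, both aspects are configured in the resource's syncer section in DRBD 8.3-style configurations; the resource name and the values below are placeholders to be adapted to your hardware:

    resource r0 {
      syncer {
        rate 40M;          # cap background resynchronization at 40 MiB/s
        csums-alg sha1;    # enable checksum-based synchronization using SHA-1
      }
      ...
    }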

On-line device verification

This feature is available in DRBD 8.2.5 and later.

On-line device verification enables users to do a block-by-block data integrity check between nodes in a very efficient manner.

[Note]Note

Note that efficient refers to efficient use of network bandwidth here, and to the fact that verification does not break redundancy in any way. On-line verification is still a resource-intensive operation, with a noticeable impact on CPU utilization and load average.

It works by one node (the verification source) sequentially calculating a cryptographic digest of every block stored on the lower-level storage device of a particular resource. DRBD then transmits that digest to the peer node (the verification target), where it is checked against a digest of the local copy of the affected block. If the digests do not match, the block is marked out-of-sync and may later be synchronized. Because DRBD transmits just the digests, not the full blocks, on-line verification uses network bandwidth very efficiently.

The process is termed on-line verification because it does not require that the DRBD resource being verified is unused at the time of verification. Thus, though it does carry a slight performance penalty while it is running, on-line verification does not cause service interruption or system down time — neither during the verification run nor during subsequent synchronization.

It is a common use case to have on-line verification managed by the local cron daemon, running it, for example, once a week or once a month.
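
As a sketch, assuming a resource named r0: a verification algorithm is defined in the resource's syncer section, a run is started with drbdadm, and a cron entry automates it.

    # in /etc/drbd.conf (DRBD 8.3-style), inside resource r0:
    #   syncer { verify-alg sha1; }

    # start a verification run manually
    drbdadm verify r0

    # or automate it, e.g. every Sunday at 00:42, via an /etc/cron.d entry:
    42 0 * * 0    root    /sbin/drbdadm verify r0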

See the section called “Using on-line device verification” for information on how to enable, invoke, and automate on-line verification.

Replication traffic integrity checking

This feature is available in DRBD 8.2.0 and later.

DRBD optionally performs end-to-end message integrity checking using cryptographic message digest algorithms such as MD5, SHA-1 or CRC-32C.

[Note]Note

These message digest algorithms are not provided by DRBD. The Linux kernel crypto API provides these; DRBD merely uses them. Thus, DRBD is capable of utilizing any message digest algorithm available in a particular system's kernel configuration.

With this feature enabled, DRBD generates a message digest of every data block it replicates to the peer, which the peer then uses to verify the integrity of the replication packet. If the replicated block can not be verified against the digest, the peer requests retransmission. Thus, DRBD replication is protected against several error sources, all of which, if unchecked, would potentially lead to data corruption during the replication process:

  • Bitwise errors ("bit flips") occurring on data in transit between main memory and the network interface on the sending node (which go undetected by TCP checksumming if it is offloaded to the network card, as is common in recent implementations);

  • bit flips occurring on data in transit from the network interface to main memory on the receiving node (the same considerations apply for TCP checksum offloading);

  • any form of corruption due to race conditions or bugs in network interface firmware or drivers;

  • bit flips or random corruption injected by some reassembling network component between nodes (if not using direct, back-to-back connections).

See the section called “Configuring replication traffic integrity checking” for information on how to enable replication traffic integrity checking.
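
As a sketch, the digest algorithm is selected with the data-integrity-alg option in the resource's net section; any digest known to the kernel crypto API may be named, sha1 being just one example:

    resource r0 {
      net {
        data-integrity-alg sha1;    # digest every replicated data block with SHA-1
      }
      ...
    }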

Split brain notification and automatic recovery

Automatic split brain recovery, in its current incarnation, is available in DRBD 8.0 and later. DRBD 0.7 already offered automatic split brain recovery, albeit using only the discard modifications on the younger primary strategy, which was not configurable. From DRBD 8 onwards, automatic split brain recovery is disabled by default.

Split brain notification is available since DRBD 8.2.1.

Split brain is a situation where, due to temporary failure of all network links between cluster nodes, and possibly due to intervention by cluster management software or human error, both nodes switched to the primary role while disconnected. This is a potentially harmful state, as it implies that modifications to the data might have been made on either node, without having been replicated to the peer. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.

[Note]Note

DRBD split brain is distinct from cluster split brain, which is the loss of all connectivity between hosts managed by a distributed cluster management application such as Heartbeat. To avoid confusion, this guide uses the following convention:

  • Split brain refers to DRBD split brain as described in the paragraph above.

  • Loss of all cluster connectivity is referred to as a cluster partition, an alternative term for cluster split brain.

DRBD allows for automatic operator notification (by email or other means) when it detects split brain. See the section called “Split brain notification” for details on how to configure this feature.

While the recommended course of action in this scenario is to manually resolve the split brain and then eliminate its root cause, it may be desirable, in some cases, to automate the process. DRBD has several resolution algorithms available for doing so:

  • Discarding modifications made on the younger primary. In this mode, when the network connection is re-established and split brain is discovered, DRBD will discard modifications made, in the meantime, on the node which switched to the primary role last.

  • Discarding modifications made on the older primary. In this mode, DRBD will discard modifications made, in the meantime, on the node which switched to the primary role first.

  • Discarding modifications on the primary with fewer changes. In this mode, DRBD will check which of the two nodes has recorded fewer modifications, and will then discard all modifications made on that host.

  • Graceful recovery from split brain if one host has had no intermediate changes. In this mode, if one of the hosts has made no modifications at all during split brain, DRBD will simply recover gracefully and declare the split brain resolved. Note that this is a fairly unlikely scenario. Even if both hosts only mounted the file system on the DRBD block device (even read-only), the device contents would be modified, ruling out the possibility of automatic recovery.

[Caution]Caution

Whether or not automatic split brain recovery is acceptable depends largely on the individual application. Consider the example of DRBD hosting a database. The discard modifications from host with fewer changes approach may be fine for a web application click-through database. By contrast, it may be totally unacceptable to automatically discard any modifications made to a financial database, requiring manual recovery in any split brain event. Consider your application's requirements carefully before enabling automatic split brain recovery.

Refer to the section called “Automatic split brain recovery policies” for details on configuring DRBD's automatic split brain recovery policies.
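
As an illustrative sketch, split brain notification is hooked in through a handler, and the automatic recovery policies map to after-sb-* options in the net section; the resource name and the policy choices below are placeholders:

    resource r0 {
      handlers {
        # notify the operator (here: local root) when split brain is detected
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      }
      net {
        after-sb-0pri discard-zero-changes;  # neither node Primary at detection: recover only if one side has no changes
        after-sb-1pri discard-secondary;     # one node Primary: discard the Secondary's modifications
        after-sb-2pri disconnect;            # both nodes Primary: no automatic recovery, drop the connection
      }
      ...
    }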

Support for disk flushes

When local block devices such as hard drives or RAID logical disks have write caching enabled, writes to these devices are considered completed as soon as they have reached the volatile cache. Controller manufacturers typically refer to this as write-back mode, the opposite being write-through. If a power outage occurs on a controller in write-back mode, the most recent writes are never committed to the disk, potentially causing data loss.

To counteract this, DRBD makes use of disk flushes. A disk flush is a write operation that completes only when the associated data has been committed to stable (non-volatile) storage — that is to say, it has effectively been written to disk, rather than to the cache. DRBD uses disk flushes for write operations both to its replicated data set and to its meta data. In effect, DRBD circumvents the write cache in situations it deems necessary, as in activity log updates or enforcement of implicit write-after-write dependencies. This means additional reliability even in the face of power failure.

It is important to understand that DRBD can use disk flushes only when layered on top of backing devices that support them. Most reasonably recent kernels support disk flushes for most SCSI and SATA devices. Linux software RAID (md) supports disk flushes for RAID-1, provided all component devices support them too. The same is true for device-mapper devices (LVM2, dm-raid, multipath).

Controllers with battery-backed write cache (BBWC) use a battery to back up their volatile storage. On such devices, when power is restored after an outage, the controller flushes the most recent pending writes out to disk from the battery-backed cache, ensuring all writes committed to the volatile cache are actually transferred to stable storage. When running DRBD on top of such devices, it may be acceptable to disable disk flushes, thereby improving DRBD's write performance. See the section called “Disabling backing device flushes” for details.
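
As a sketch, assuming battery-backed controllers on both nodes, flushes toward both the replicated data set and the meta data can be disabled in the resource's disk section; never do this on top of volatile, non-battery-backed write caches:

    resource r0 {
      disk {
        no-disk-flushes;    # do not issue flushes for the replicated data set
        no-md-flushes;      # do not issue flushes for DRBD's meta data
      }
      ...
    }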

Disk error handling strategies

If a hard drive that is used as a backing block device for DRBD on one of the nodes fails, DRBD may either pass on the I/O error to the upper layer (usually the file system) or it can mask I/O errors from upper layers.

Passing on I/O errors. If DRBD is configured to pass on I/O errors, any such errors occurring on the lower-level device are transparently passed to upper I/O layers. Thus, it is left to upper layers to deal with such errors (this may result in a file system being remounted read-only, for example). This strategy does not ensure service continuity, and is hence not recommended for most users.

Masking I/O errors.  If DRBD is configured to detach on lower-level I/O error, DRBD will do so, automatically, upon occurrence of the first lower-level I/O error. The I/O error is masked from upper layers while DRBD transparently fetches the affected block from the peer node, over the network. From then onwards, DRBD is said to operate in diskless mode, and carries out all subsequent I/O operations, read and write, on the peer node. Performance in this mode is inevitably expected to suffer, but the service continues without interruption, and can be moved to the peer node in a deliberate fashion at a convenient time.

See the section called “Configuring I/O error handling strategies” for information on configuring I/O error handling strategies for DRBD.
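
As a sketch, the strategy is selected with the on-io-error option in the resource's disk section; detach selects the masking behavior described above, while pass_on selects the first strategy:

    resource r0 {
      disk {
        on-io-error detach;    # mask lower-level I/O errors and continue diskless on this node
      }
      ...
    }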

Strategies for dealing with outdated data

DRBD distinguishes between inconsistent and outdated data. Inconsistent data is data that cannot be expected to be accessible and useful in any manner. The prime example for this is data on a node that is currently the target of an on-going synchronization. Data on such a node is part obsolete, part up to date, and impossible to identify as either. Thus, for example, if the device holds a filesystem (as is commonly the case), that filesystem could not be expected to mount, nor even to pass an automatic filesystem check.

Outdated data, by contrast, is data on a secondary node that is consistent, but no longer in sync with the primary node. This would occur in any interruption of the replication link, whether temporary or permanent. Data on an outdated, disconnected secondary node is expected to be clean, but it reflects a state of the peer node some time past. In order to avoid services using outdated data, DRBD disallows promoting a resource that is in the outdated state.

DRBD has interfaces that allow an external application to outdate a secondary node as soon as a network interruption occurs. DRBD will then refuse to switch the node to the primary role, preventing applications from using the outdated data. A complete implementation of this functionality exists for the Heartbeat cluster management framework (where it uses a communication channel separate from the DRBD replication link). However, the interfaces are generic and may be easily used by any other cluster management application.

Whenever an outdated resource has its replication link re-established, its outdated flag is automatically cleared. A background synchronization then follows.

See the section about the DRBD outdate-peer daemon (dopd) for an example DRBD/Heartbeat configuration enabling protection against inadvertent use of outdated data.
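
Manually, the same effect can be achieved with drbdadm; an integration such as dopd merely automates the call. A sketch, with the resource name r0 as a placeholder:

    drbdadm outdate r0     # mark the data on a disconnected Secondary as outdated
    drbdadm primary r0     # a subsequent promotion attempt on this node is refused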

Three-way replication

This feature is available in DRBD 8.3.0 and later.

When using three-way replication, DRBD adds a third node to an existing 2-node cluster and replicates data to that node, where it can be used for backup and disaster recovery purposes.

Three-way replication works by adding another, stacked DRBD resource on top of the existing resource holding your production data, as seen in this illustration:

Figure 2.1. DRBD resource stacking

The stacked resource is replicated using asynchronous replication (DRBD protocol A), whereas the production data would usually make use of synchronous replication (DRBD protocol C).

Three-way replication can be used permanently, where the third node is continuously updated with data from the production cluster. Alternatively, it may be employed on demand, where the production cluster is normally disconnected from the backup site, and site-to-site synchronization is performed on a regular basis, for example by running a nightly cron job.
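
A hedged sketch of what the stacked resource definition might look like in a DRBD 8.3-style configuration. Here r0 is assumed to be the existing two-node production resource, while the stacked resource name r0-U, the third host charlie, the device numbers, disks, and addresses are illustrative placeholders:

    resource r0-U {
      protocol A;                      # asynchronous replication to the third node

      stacked-on-top-of r0 {           # the upper DRBD device sits on top of resource r0
        device     /dev/drbd10;
        address    192.168.42.1:7788;  # address presented by whichever node is Primary for r0
      }

      on charlie {                     # the backup / disaster recovery node
        device     /dev/drbd10;
        disk       /dev/sdb1;
        address    192.168.42.2:7788;
        meta-disk  internal;
      }
    }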

Long-distance replication with DRBD Proxy

DRBD Proxy requires DRBD version 8.2.7 or above.

DRBD's protocol A is asynchronous, but the writing application will block as soon as the socket output buffer is full (see the sndbuf-size option in drbd.conf(5)). In that event, the writing application has to wait until some of the written data has drained through a possibly narrow-bandwidth network link.
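
For reference, the send buffer mentioned here is tuned per resource in the net section; the value below is merely an example:

    resource r0 {
      net {
        sndbuf-size 512k;    # TCP socket send buffer; with protocol A, writers block once it is full
      }
      ...
    }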

The average write bandwidth is limited by the available bandwidth of the network link. Write bursts can only be handled gracefully if they fit into the limited socket output buffer.

You can mitigate this with DRBD Proxy's buffering mechanism. DRBD Proxy absorbs all outgoing data from DRBD on the primary node into its buffers. DRBD Proxy's buffer size is freely configurable, limited only by the address space size and the available physical RAM.

Optionally DRBD Proxy can be configured to compress and decompress the data it forwards. Compression and decompression of DRBD's data packets might slightly increase latency. But when the bandwidth of the network link is the limiting factor, the gain in shortening transmit time outweighs the compression and decompression overhead by far.

Compression and decompression were implemented with multi core SMP systems in mind, and can utilize multiple CPU cores.

Since most block I/O data compresses very well, the resulting increase in effective bandwidth justifies the use of DRBD Proxy even with DRBD protocols B and C.

See the section called “Using DRBD Proxy” for information on configuring DRBD Proxy.

[Note]Note

DRBD Proxy is the only part of the DRBD product family that is not published under an open source license. Please contact LINBIT for an evaluation license.

Truck based replication

Truck based replication, also known as disk shipping, is a means of preseeding a remote site with data to be replicated, by physically shipping storage media to the remote site. This is particularly suited for situations where

  • the total amount of data to be replicated is fairly large (more than a few hundred gigabytes);

  • the expected rate of change of the data to be replicated is less than enormous;

  • the available network bandwidth between sites is limited.

In such situations, without truck based replication, DRBD would require a very long initial device synchronization (on the order of days or weeks). Truck based replication allows you to ship a data seed to the remote site, drastically reducing the initial synchronization time.

See the section called “Using truck based replication” for details on this use case.

Floating peers

This feature is available in DRBD versions 8.3.2 and above.

A somewhat special use case for DRBD is the floating peers configuration. In floating peer setups, DRBD peers are not tied to specific named hosts (as in conventional configurations), but instead have the ability to float between several hosts. In such a configuration, DRBD identifies peers by IP address, rather than by host name.
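
In a hedged configuration sketch, this is expressed with floating sections keyed by address in place of the usual on <host> sections; the device, disk, and addresses below are placeholders:

    resource r0 {
      device    /dev/drbd0;
      disk      /dev/sda7;
      meta-disk internal;

      floating 10.1.1.31:7789;    # peer identified by IP address rather than by host name
      floating 10.1.1.32:7789;
    }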

For more information about managing floating peer configurations, see the section called “Configuring DRBD to replicate between two SAN-backed Pacemaker clusters”.