Part III. Working with DRBD

Chapter 6. Common administrative tasks

This chapter outlines typical administrative tasks encountered during day-to-day operations. It does not cover troubleshooting tasks; these are covered in detail in Chapter 7, Troubleshooting and error recovery.

Checking DRBD status

Retrieving status with drbd-overview

The most convenient way to look at DRBD's status is the drbd-overview utility. It is available since DRBD 8.0.15 and 8.3.0; in 8.3.0, it is installed as drbd-overview.pl.

drbd-overview
  0:home                 Connected Primary/Secondary   UpToDate/UpToDate C r--- /home        xfs  200G 158G 43G  79%
  1:data                 Connected Primary/Secondary   UpToDate/UpToDate C r--- /mnt/ha1     ext3 9.9G 618M 8.8G 7%
  2:nfs-root             Connected Primary/Secondary   UpToDate/UpToDate C r--- /mnt/netboot ext3 79G  57G  19G  76%

Status information in /proc/drbd

/proc/drbd is a virtual file displaying real-time status information about all DRBD resources currently configured. You may interrogate this file's contents using this command:

cat /proc/drbd
version: 8.3.0 (api:88/proto:86-89)
GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by buildsystem@linbit, 2008-12-18 16:02:26
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:8 dw:8 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:12 dw:12 dr:0 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

The first line, prefixed with version:, shows the DRBD version used on your system. The second line contains information about this specific build.

The other four lines in this example form a block that is repeated for every DRBD device configured, prefixed by the device minor number. In this case, this is 0, corresponding to the device /dev/drbd0.

The resource-specific output from /proc/drbd contains various pieces of information about the resource:

  • cs (connection state). Status of the network connection. See the section called “Connection states” for details about the various connection states.

  • ro (roles). Roles of the nodes. The role of the local node is displayed first, followed by the role of the partner node shown after the slash. See the section called “Resource roles” for details about the possible resource roles.

    [Note]Note

    Prior to DRBD 8.3, /proc/drbd used the st field (referring to the ambiguous term state) when referring to resource roles.

  • ds (disk states). State of the hard disks. Prior to the slash the state of the local node is displayed, after the slash the state of the hard disk of the partner node is shown. See the section called “Disk states” for details about the various disk states.

  • ns (network send). Volume of net data sent to the partner via the network connection; in Kibibytes.

  • nr (network receive). Volume of net data received from the partner via the network connection; in Kibibytes.

  • dw (disk write). Net data written to the local hard disk; in Kibibytes.

  • dr (disk read). Net data read from the local hard disk; in Kibibytes.

  • al (activity log). Number of updates of the activity log area of the meta data.

  • bm (bit map).  Number of updates of the bitmap area of the meta data.

  • lo (local count). Number of open requests to the local I/O sub-system issued by DRBD.

  • pe (pending). Number of requests sent to the partner, but not yet answered by the latter.

  • ua (unacknowledged). Number of requests received by the partner via the network connection, but not yet answered.

  • ap (application pending). Number of block I/O requests forwarded to DRBD, but not yet answered by DRBD.

  • ep (epochs). Number of epoch objects. Usually 1. Might increase under I/O load when using either the barrier or the none write ordering method. Since 8.2.7.

  • wo (write order). Currently used write ordering method: b (barrier), f (flush), d (drain) or n (none). Since 8.2.7.

  • oos (out of sync). Amount of storage currently out of sync; in Kibibytes. Since 8.2.6.
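
If you want to watch these counters update in real time, you may combine this with the watch utility (a standard system tool, not part of DRBD):

watch -n1 cat /proc/drbd

This refreshes the full status output every second until interrupted.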

Connection states

A resource's connection state can be observed either by monitoring /proc/drbd, or by issuing the drbdadm cstate command:

drbdadm cstate resource
Connected

A resource may have one of the following connection states:

  • StandAlone. No network configuration available. The resource has not yet been connected, or has been administratively disconnected (using drbdadm disconnect), or has dropped its connection due to failed authentication or split brain.

  • Disconnecting. Temporary state during disconnection. The next state is StandAlone.

  • Unconnected. Temporary state, prior to a connection attempt. Possible next states: WFConnection and WFReportParams.

  • Timeout. Temporary state following a timeout in the communication with the peer. Next state: Unconnected.

  • BrokenPipe. Temporary state after the connection to the peer was lost. Next state: Unconnected.

  • NetworkFailure. Temporary state after the connection to the partner was lost. Next state: Unconnected.

  • ProtocolError. Temporary state after the connection to the partner was lost. Next state: Unconnected.

  • TearDown. Temporary state. The peer is closing the connection. Next state: Unconnected.

  • WFConnection. This node is waiting until the peer node becomes visible on the network.

  • WFReportParams. TCP connection has been established; this node waits for the first network packet from the peer.

  • Connected. A DRBD connection has been established, data mirroring is now active. This is the normal state.

  • StartingSyncS. Full synchronization, initiated by the administrator, is just starting. The next possible states are: SyncSource or PausedSyncS.

  • StartingSyncT. Full synchronization, initiated by the administrator, is just starting. Next state: WFSyncUUID.

  • WFBitMapS. Partial synchronization is just starting. Next possible states: SyncSource or PausedSyncS.

  • WFBitMapT. Partial synchronization is just starting. Next possible state: WFSyncUUID.

  • WFSyncUUID. Synchronization is about to begin. Next possible states: SyncTarget or PausedSyncT.

  • SyncSource. Synchronization is currently running, with the local node being the source of synchronization.

  • SyncTarget. Synchronization is currently running, with the local node being the target of synchronization.

  • PausedSyncS. The local node is the source of an ongoing synchronization, but synchronization is currently paused. This may be due to a dependency on the completion of another synchronization process, or due to synchronization having been manually interrupted by drbdadm pause-sync.

  • PausedSyncT. The local node is the target of an ongoing synchronization, but synchronization is currently paused. This may be due to a dependency on the completion of another synchronization process, or due to synchronization having been manually interrupted by drbdadm pause-sync.

  • VerifyS. On-line device verification is currently running, with the local node being the source of verification.

  • VerifyT. On-line device verification is currently running, with the local node being the target of verification.

Resource roles

A resource's role can be observed either by monitoring /proc/drbd, or by issuing the drbdadm role command:

drbdadm role resource
Primary/Secondary

The local resource role is always displayed first, the remote resource role last.

[Note]Note

Prior to DRBD 8.3, the drbdadm state command provided the same information. Since state is an ambiguous term, DRBD uses role in its stead from version 8.3.0 forward. drbdadm state is also still available, albeit only for compatibility reasons. You should use drbdadm role.

You may see one of the following resource roles:

  • Primary. The resource is currently in the primary role, and may be read from and written to. This role only occurs on one of the two nodes, unless dual-primary mode is enabled.

  • Secondary. The resource is currently in the secondary role. It normally receives updates from its peer (unless running in disconnected mode), but may neither be read from nor written to. This role may occur on one node or both nodes.

  • Unknown. The resource's role is currently unknown. The local resource role never has this status. It is only displayed for the peer's resource role, and only in disconnected mode.

Disk states

A resource's disk state can be observed either by monitoring /proc/drbd, or by issuing the drbdadm dstate command:

drbdadm dstate resource
UpToDate/UpToDate

The local disk state is always displayed first, the remote disk state last.

Both the local and the remote disk state may be one of the following:

  • Diskless. No local block device has been assigned to the DRBD driver. This may mean that the resource has never attached to its backing device, that it has been manually detached using drbdadm detach, or that it automatically detached after a lower-level I/O error.

  • Attaching. Transient state while reading meta data.

  • Failed. Transient state following an I/O failure report by the local block device. Next state: Diskless.

  • Negotiating. Transient state when an Attach is carried out on an already-connected DRBD device.

  • Inconsistent. The data is inconsistent. This status occurs immediately upon creation of a new resource, on both nodes (before the initial full sync). Also, this status is found on one node (the synchronization target) during synchronization.

  • Outdated. Resource data is consistent, but outdated.

  • DUnknown. This state is used for the peer disk if no network connection is available.

  • Consistent. Consistent data of a node without connection. When the connection is established, it is decided whether the data is UpToDate or Outdated.

  • UpToDate. Consistent, up-to-date state of the data. This is the normal state.

Enabling and disabling resources

Enabling resources

Normally, all resources configured in /etc/drbd.conf are automatically enabled upon system startup by the /etc/init.d/drbd init script. If you choose to disable this startup script (as may be required by some applications), you may enable specific resources by issuing the commands

drbdadm attach resource
drbdadm syncer resource
drbdadm connect resource

or the shorthand version of the three commands above,

drbdadm up resource

As always, you may use the keyword all instead of a specific resource name if you want to enable all resources configured in /etc/drbd.conf at once.

Disabling resources

You may temporarily disable specific resources by issuing the commands

drbdadm disconnect resource
drbdadm detach resource

or the shorthand version of the above,

drbdadm down resource
[Note]Note

There is, in fact, a subtle difference between these two methods. While drbdadm down implies a preceding resource demotion, drbdadm disconnect/detach does not. So while you can run drbdadm down on a resource that is currently in the primary role, drbdadm disconnect/detach in the same situation will be refused by DRBD's internal state engine.

Here, too, you may use the keyword all in place of a resource name if you wish to temporarily disable all resources listed in /etc/drbd.conf at once.

Reconfiguring resources

DRBD allows you to reconfigure resources while they are operational. To that end,

  • make any necessary changes to the resource configuration in /etc/drbd.conf,

  • synchronize your /etc/drbd.conf file between both nodes,

  • issue the drbdadm adjust resource command on both nodes.

drbdadm adjust then hands off to drbdsetup to make the necessary adjustments to the configuration. As always, you are able to review the pending drbdsetup invocations by running drbdadm with the -d (dry-run) option.
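
For example, to preview the changes for a single resource without applying anything, you might run:

drbdadm -d adjust resource

This prints the drbdsetup commands that would be executed, but leaves the running configuration untouched.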

[Note]Note

When making changes to the common section in /etc/drbd.conf, you can adjust the configuration for all resources in one run, by issuing drbdadm adjust all.

Promoting and demoting resources

Manually switching a resource's role from secondary to primary (promotion) or vice versa (demotion) is done using the following commands:

drbdadm primary resource
drbdadm secondary resource

In single-primary mode (DRBD's default), any resource can be in the primary role on only one node at any given time while the connection state is Connected. Thus, issuing drbdadm primary resource on one node while resource is still in the primary role on the peer will result in an error.

A resource configured to allow dual-primary mode can be switched to the primary role on both nodes.
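
As an illustration, a manual switch-over of a resource carrying a mounted file system might look like the following sketch. The mount point /mnt/data and the device /dev/drbd0 are assumptions made for this example only; if a cluster manager controls the resource, it should be placed in maintenance mode first.

umount /mnt/data                # on the current primary
drbdadm secondary resource      # demote on the current primary
drbdadm primary resource        # promote on the peer
mount /dev/drbd0 /mnt/data      # mount on the new primary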

Enabling dual-primary mode

This feature is available in DRBD 8.0 and later.

To enable dual-primary mode, add the allow-two-primaries option to the net section of your resource configuration:

resource resource {
  net {
    allow-two-primaries;
  }
  ...
}

When a resource is configured to support dual-primary mode, it may also be desirable to automatically switch the resource into the primary role upon system (or DRBD) startup. To do this, add the become-primary-on option, available in DRBD 8.2.0 and above, to the startup section of your resource configuration:

resource resource {
  startup {
    become-primary-on both;
  }
  ...
}

After you have made these changes to /etc/drbd.conf, do not forget to synchronize the configuration between nodes. Then, proceed as follows:

  • Run drbdadm disconnect resource on both nodes.

  • Execute drbdadm connect resource on both nodes.

  • Finally, you may now execute drbdadm primary resource on both nodes, instead of on just one.
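
Taken together, the sequence above amounts to the following commands, shown here merely as a consolidated sketch of the steps just listed (run on both nodes):

drbdadm disconnect resource
drbdadm connect resource
drbdadm primary resource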

Using on-line device verification

This feature is available in DRBD 8.2.5 and later.

Enabling on-line verification

On-line device verification is not enabled for resources by default. To enable it, add the following lines to your resource configuration in /etc/drbd.conf:

resource resource {
  syncer {
    verify-alg algorithm;
  }
  ...
}

algorithm may be any message digest algorithm supported by the kernel crypto API in your system's kernel configuration. Normally, you should be able to choose at least from sha1, md5, and crc32c.

If you make this change to an existing resource, as always, synchronize your drbd.conf to the peer, and run drbdadm adjust resource on both nodes.
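
For example, assuming the peer node is named bob and is reachable over SSH (the host name is purely illustrative), the change could be rolled out like this:

scp /etc/drbd.conf bob:/etc/drbd.conf
drbdadm adjust resource         # run this on both nodes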

Invoking on-line verification

After you have enabled on-line verification, you will be able to initiate a verification run using the following command:

drbdadm verify resource

When you do so, DRBD starts an on-line verification run for resource, and if it detects any blocks not in sync, will mark those blocks as such and write a message to the kernel log. Any applications using the device at that time can continue to do so unimpeded, and you may also switch resource roles at will.

If out-of-sync blocks were detected during the verification run, you may resynchronize them using the following commands after verification has completed:

drbdadm disconnect resource
drbdadm connect resource

Automating on-line verification

Most users will want to automate on-line device verification. This can be easily accomplished. Create a file with the following contents, named /etc/cron.d/drbd-verify on one of your nodes:

42 0 * * 0    root    /sbin/drbdadm verify resource

This will have cron invoke a device verification every Sunday at 42 minutes past midnight.

If you have enabled on-line verification for all your resources (for example, by adding verify-alg algorithm to the common section in /etc/drbd.conf), you may also use:

42 0 * * 0    root    /sbin/drbdadm verify all

Configuring the rate of synchronization

Normally, one tries to ensure that background synchronization (which makes the data on the synchronization target temporarily inconsistent) completes as quickly as possible. However, it is also necessary to keep background synchronization from hogging all bandwidth otherwise available for foreground replication, which would be detrimental to application performance. Thus, you must configure the synchronization bandwidth to match your hardware; this can be done either permanently or on the fly.

[Important]Important

It does not make sense to set a synchronization rate that is higher than the maximum write throughput on your secondary node. You must not expect your secondary node to miraculously be able to write faster than its I/O subsystem allows, just because it happens to be the target of an ongoing device synchronization.

Likewise, and for the same reasons, it does not make sense to set a synchronization rate that is higher than the bandwidth available on the replication network.

Permanent syncer rate configuration

The maximum bandwidth a resource uses for background re-synchronization is permanently configured using the rate option for a resource. This must be included in the resource configuration's syncer section in /etc/drbd.conf:

resource resource {
  syncer {
    rate 40M;
    ...
  }
  ...
}

Note that the rate setting is given in bytes per second, not bits per second.

[Tip]Tip

A good rule of thumb for this value is to use about 30% of the available replication bandwidth. Thus, if you had an I/O subsystem capable of sustaining write throughput of 180MB/s, and a Gigabit Ethernet network capable of sustaining 110 MB/s network throughput (the network being the bottleneck), you would calculate:

Equation 6.1. Syncer rate example, 110MB/s effective available bandwidth

110 MB/s × 0.3 = 33 MB/s

Thus, the recommended value for the rate option would be 33M.

By contrast, if you had an I/O subsystem with a maximum throughput of 80MB/s and a Gigabit Ethernet connection (the I/O subsystem being the bottleneck), you would calculate:

Equation 6.2. Syncer rate example, 80MB/s effective available bandwidth

80 MB/s × 0.3 = 24 MB/s

In this case, the recommended value for the rate option would be 24M.

Temporary syncer rate configuration

It is sometimes desirable to temporarily adjust the syncer rate. For example, you might want to speed up background re-synchronization after having performed scheduled maintenance on one of your cluster nodes. Or, you might want to throttle background re-synchronization if it happens to occur at a time when your application is extremely busy with write operations, and you want to make sure that a large portion of the existing bandwidth is available to replication.

For example, in order to make most bandwidth of a Gigabit Ethernet link available to re-synchronization, issue the following command:

drbdsetup /dev/drbdnum syncer -r 110M

As always, replace num with the device minor number of your DRBD device. You need to issue this command on only one of your nodes.

To revert this temporary setting and re-enable the syncer rate set in /etc/drbd.conf, issue this command:

drbdadm adjust resource

Configuring checksum-based synchronization

Checksum-based synchronization is not enabled for resources by default. To enable it, add the following lines to your resource configuration in /etc/drbd.conf:

resource resource {
  syncer {
    csums-alg algorithm;
  }
  ...
}

algorithm may be any message digest algorithm supported by the kernel crypto API in your system's kernel configuration. Normally, you should be able to choose at least from sha1, md5, and crc32c.

If you make this change to an existing resource, as always, synchronize your drbd.conf to the peer, and run drbdadm adjust resource on both nodes.

Configuring I/O error handling strategies

DRBD's strategy for handling lower-level I/O errors is determined by the on-io-error option, included in the resource disk configuration in /etc/drbd.conf:

resource resource {
  disk {
    on-io-error strategy;
    ...
  }
  ...
}

You may, of course, set this in the common section too, if you want to define a global I/O error handling policy for all resources.
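
Such a global policy might look like the following sketch in /etc/drbd.conf:

common {
  disk {
    on-io-error detach;
    ...
  }
  ...
}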

strategy may be one of the following options:

  • detach. This is the recommended option. On the occurrence of a lower-level I/O error, the node drops its backing device and continues in diskless mode.

  • pass_on. This causes DRBD to report the I/O error to the upper layers. On the primary node, it is reported to the mounted file system. On the secondary node, it is ignored (because the secondary has no upper layer to report to). This is the default for historical reasons, but it is no longer recommended for new installations unless you have a very compelling reason to use this strategy instead of detach.

  • call-local-io-error. Invokes the command defined as the local I/O error handler. This requires that a corresponding local-io-error command invocation is defined in the resource's handlers section. It is entirely left to the administrator's discretion to implement I/O error handling using the command (or script) invoked by local-io-error.

    [Note]Note

    Early DRBD versions (prior to 8.0) included another option, panic, which would forcibly remove the node from the cluster by way of a kernel panic whenever a local I/O error occurred. While that option is no longer available, the same behavior may be mimicked via the local-io-error/call-local-io-error interface. You should do so only if you fully understand the implications of such behavior.

You may reconfigure a running resource's I/O error handling strategy by following this process:

  • Edit the resource configuration in /etc/drbd.conf.

  • Copy the configuration to the peer node.

  • Issue drbdadm adjust resource on both nodes.

[Note]Note

DRBD versions prior to 8.3.1 will incur a full resync after running drbdadm adjust on a node that is in the Primary role. On such systems, the affected resource must be demoted prior to running drbdadm adjust after its disk configuration section has been changed.

Configuring replication traffic integrity checking

Replication traffic integrity checking is not enabled for resources by default. To enable it, add the following lines to your resource configuration in /etc/drbd.conf:

resource resource {
  net {
    data-integrity-alg algorithm;
  }
  ...
}

algorithm may be any message digest algorithm supported by the kernel crypto API in your system's kernel configuration. Normally, you should be able to choose at least from sha1, md5, and crc32c.

If you make this change to an existing resource, as always, synchronize your drbd.conf to the peer, and run drbdadm adjust resource on both nodes.

Resizing resources

Growing on-line

If the backing block devices can be grown while in operation (online), it is also possible to increase the size of a DRBD device based on these devices during operation. To do so, two criteria must be fulfilled:

  1. The affected resource's backing device must be one managed by a logical volume management subsystem, such as LVM or EVMS.

  2. The resource must currently be in the Connected connection state.

Having grown the backing block devices on both nodes, ensure that only one node is in primary state. Then enter on one node:

drbdadm resize resource

This triggers a synchronization of the new section. The synchronization is done from the primary node to the secondary node.
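
Once synchronization of the new section has completed, you will usually also want to grow the file system on top of the DRBD device, on the node currently in the primary role. As a sketch, assuming an ext3 file system on /dev/drbd0 (XFS users would run xfs_growfs against the mount point instead):

resize2fs /dev/drbd0

Invoked without an explicit size, resize2fs grows the file system to fill the enlarged device.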

Growing off-line

When the backing block devices on both nodes are grown while DRBD is inactive, and the DRBD resource is using external meta data, then the new size is recognized automatically. No administrative intervention is necessary. The DRBD device will have the new size after the next activation of DRBD on both nodes and a successful establishment of a network connection.

If however the DRBD resource is configured to use internal meta data, then this meta data must be moved to the end of the grown device before the new size becomes available. To do so, complete the following steps:

[Warning]Warning

This is an advanced procedure. Use at your own discretion.

  1. Unconfigure your DRBD resource:

    drbdadm down resource
  2. Save the meta data in a text file prior to shrinking:

    drbdadm dump-md resource > /tmp/metadata

    You must do this on both nodes, using a separate dump file for every node. Do not dump the meta data on one node, and simply copy the dump file to the peer. This will not work.

  3. Grow the backing block device on both nodes.

  4. Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly, on both nodes. Remember that la-size-sect must be specified in sectors (512-byte units); for example, a 100 GiB device corresponds to 100 × 1024 × 1024 × 2 = 209715200 sectors.

  5. Re-initialize the metadata area:

    drbdadm create-md resource
  6. Re-import the corrected meta data, on both nodes:

    drbdmeta_cmd=$(drbdadm -d dump-md resource)
    ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
     Valid meta-data in place, overwrite? [need to type 'yes' to confirm] yes
     Successfully restored meta data
    [Note]Note

    This example uses bash parameter substitution. It may or may not work in other shells. Check your SHELL environment variable if you are unsure which shell you are currently using.

  7. Re-enable your DRBD resource:

    drbdadm up resource
  8. On one node, promote the DRBD resource:

    drbdadm primary resource
  9. Finally, grow the file system so it fills the extended size of the DRBD device.

Shrinking on-line

[Warning]Warning

Online shrinking is only supported with external metadata.

Before shrinking a DRBD device, you must shrink the layers above DRBD, usually the file system. Since DRBD cannot ask the file system how much space it actually uses, you have to be careful in order not to cause data loss.

[Note]Note

Whether or not the filesystem can be shrunk on-line depends on the filesystem being used. Most filesystems do not support on-line shrinking. XFS does not support shrinking at all.

To shrink DRBD on-line, issue the following command after you have shrunk the file system residing on top of it:

drbdadm -- --size=new-size resize resource

You may use the usual multiplier suffixes for new-size (K, M, G etc.). After you have shrunk DRBD, you may also shrink the containing block device (if it supports shrinking).
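
As a worked sketch, assume a resource backed by external meta data, with an ext3 file system on /dev/drbd0 mounted at /mnt/data that currently uses well under 90G. Because ext3 cannot be shrunk while mounted, the file system is briefly taken offline; the DRBD resource itself stays connected throughout. All sizes are illustrative and deliberately leave a safety margin:

umount /mnt/data
e2fsck -f /dev/drbd0                    # required before an offline shrink
resize2fs /dev/drbd0 90G                # shrink the file system first
mount /dev/drbd0 /mnt/data
drbdadm -- --size=95G resize resource   # then shrink DRBD, staying above the file system size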

Shrinking off-line

If you were to shrink a backing block device while DRBD is inactive, DRBD would refuse to attach to this block device during the next attach attempt, since it is now too small (in case external meta data is used), or it would be unable to find its meta data (in case internal meta data are used). To work around these issues, use this procedure (if you cannot use on-line shrinking):

[Warning]Warning

This is an advanced procedure. Use at your own discretion.

  1. Shrink the file system from one node, while DRBD is still configured.

  2. Unconfigure your DRBD resource:

    drbdadm down resource
  3. Save the meta data in a text file prior to shrinking:

    drbdadm dump-md resource > /tmp/metadata

    You must do this on both nodes, using a separate dump file for every node. Do not dump the meta data on one node, and simply copy the dump file to the peer. This will not work.

  4. Shrink the backing block device on both nodes.

  5. Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly, on both nodes. Remember that la-size-sect must be specified in sectors.

  6. Only if you are using internal metadata (which at this time have probably been lost due to the shrinking process), re-initialize the metadata area:

    drbdadm create-md resource
  7. Re-import the corrected meta data, on both nodes:

    drbdmeta_cmd=$(drbdadm -d dump-md resource)
    ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
       Valid meta-data in place, overwrite? [need to type 'yes' to confirm] yes
       Successfully restored meta data
    [Note]Note

    This example uses bash parameter substitution. It may or may not work in other shells. Check your SHELL environment variable if you are unsure which shell you are currently using.

  8. Re-enable your DRBD resource:

    drbdadm up resource

Disabling backing device flushes

[Caution]Caution

You should only disable device flushes when running DRBD on devices with a battery-backed write cache (BBWC). Most storage controllers allow the write cache to be disabled automatically when the battery is depleted, switching to write-through mode when the battery dies. Enabling such a feature is strongly recommended.

Disabling DRBD's flushes when running without BBWC, or on BBWC with a depleted battery, is likely to cause data loss and should not be attempted.

DRBD allows you to enable and disable backing device flushes separately for the replicated data set and DRBD's own meta data. Both of these options are enabled by default. If you wish to disable either (or both), set the corresponding option in the disk section of the DRBD configuration file, /etc/drbd.conf.

To disable disk flushes for the replicated data set, include the following line in your configuration:

resource resource {
  disk {
    no-disk-flushes;
    ...
  }
  ...
}

To disable disk flushes on DRBD's meta data, include the following line:

resource resource {
  disk {
    no-md-flushes;
    ...
  }
  ...
}

After you have modified your resource configuration (and synchronized your /etc/drbd.conf between nodes, of course), you may enable these settings by issuing these commands on both nodes:

drbdadm down resource
drbdadm up resource

Configuring split brain behavior

Split brain notification

DRBD invokes the split-brain handler, if configured, whenever split brain is detected. To configure this handler, add the following item to your resource configuration:

resource resource {
  handlers {
    split-brain handler;
    ...
  }
  ...
}

handler may be any executable present on the system.
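
For illustration, a minimal custom handler could be as simple as the following shell script, which merely records the event via syslog. The DRBD_RESOURCE environment variable is assumed to be set by DRBD when invoking handlers; consult drbd.conf(5) for the variables available in your version:

#!/bin/sh
# minimal example split-brain handler: log the affected resource via syslog
logger -t drbd-split-brain "split brain detected on resource ${DRBD_RESOURCE:-unknown}"
exit 0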

Since DRBD version 8.2.6, the DRBD distribution contains a split brain handler script that installs as /usr/lib/drbd/notify-split-brain.sh. It simply sends a notification e-mail message to a specified address. To configure the handler to send a message to root@localhost (which is expected to be an email address that forwards the notification to a real system administrator), configure the split-brain handler as follows:

resource resource {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }
  ...
}

After you have made this modification on a running resource (and synchronized the configuration file between nodes), no additional intervention is needed to enable the handler. DRBD will simply invoke the newly configured handler on the next occurrence of split brain.

Automatic split brain recovery policies

To enable and configure DRBD's automatic split brain recovery policies, you must understand the configuration options DRBD offers for this purpose. DRBD applies its split brain recovery procedures based on the number of nodes in the Primary role at the time split brain is detected. To that end, DRBD examines the following keywords, all found in the resource's net configuration section:

  • after-sb-0pri. Split brain has just been detected, but at this time the resource is not in the Primary role on any host. For this option, DRBD understands the following keywords:

    • disconnect. Do not recover automatically; simply invoke the split-brain handler script (if configured), drop the connection and continue in disconnected mode.

    • discard-younger-primary. Discard and roll back the modifications made on the host which assumed the Primary role last.

    • discard-least-changes. Discard and roll back the modifications on the host where fewer changes occurred.

    • discard-zero-changes. If there is any host on which no changes occurred at all, simply apply all modifications made on the other and continue.

  • after-sb-1pri. Split brain has just been detected, and at this time the resource is in the Primary role on one host. For this option, DRBD understands the following keywords:

    • disconnect. As with after-sb-0pri, simply invoke the split-brain handler script (if configured), drop the connection and continue in disconnected mode.

    • consensus. Apply the same recovery policies as specified in after-sb-0pri. If a split brain victim can be selected after applying these policies, automatically resolve. Otherwise, behave exactly as if disconnect were specified.

    • call-pri-lost-after-sb. Apply the recovery policies as specified in after-sb-0pri. If a split brain victim can be selected after applying these policies, invoke the pri-lost-after-sb handler on the victim node. This handler must be configured in the handlers section and is expected to forcibly remove the node from the cluster.

    • discard-secondary. Whichever host is currently in the Secondary role, make that host the split brain victim.

  • after-sb-2pri. Split brain has just been detected, and at this time the resource is in the Primary role on both hosts. This option accepts the same keywords as after-sb-1pri except discard-secondary and consensus.

[Note]Note

DRBD understands additional keywords for these three options, which have been omitted here because they are very rarely used. Refer to drbd.conf(5) for details on split brain recovery keywords not discussed here.

For example, a resource which serves as the block device for a GFS or OCFS2 file system in dual-Primary mode may have its recovery policy defined as follows:

resource resource {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
  ...
}

Creating a three-node setup

This feature is available in DRBD 8.3.0 and later.

A three-node setup involves one DRBD device stacked atop another.

Device stacking considerations

The following considerations apply to this type of setup:

  • The stacked device is the active one. Assume you have configured one DRBD device /dev/drbd0, and the stacked device atop it is /dev/drbd10; then /dev/drbd10 will be the device that you mount and use.

  • Device meta data will be stored twice, on the underlying DRBD device and the stacked DRBD device. On the stacked device, you must always use internal meta data. This means that the effectively available storage area on a stacked device is slightly smaller, compared to an unstacked device.

  • To get the stacked upper level device running, the underlying device must be in the primary role.

  • To be able to synchronize the backup node, the stacked device on the active node must be up and in the primary role.

Configuring a stacked resource

In the following example, nodes are named alice, bob, and charlie, with alice and bob forming a two-node cluster, and charlie being the backup node.

resource r0 {
  protocol C;

  on alice {
    device     /dev/drbd0;
    disk       /dev/sda6;
    address    10.0.0.1:7788;
    meta-disk internal;
  }

  on bob {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

resource r0-U {
  protocol A;

  stacked-on-top-of r0 {
    device     /dev/drbd10;
    address    192.168.42.1:7788;
  }

  on charlie {
    device     /dev/drbd10;
    disk       /dev/hda6;
    address    192.168.42.2:7788; # Public IP of the backup node
    meta-disk  internal;
  }
}

As with any drbd.conf configuration file, this must be distributed across all nodes in the cluster — in this case, three nodes. Notice the following extra keyword not found in an unstacked resource configuration:

  • stacked-on-top-of. This option informs DRBD that the resource which contains it is a stacked resource. It replaces one of the on sections normally found in any resource configuration. Do not use stacked-on-top-of in a lower-level resource.

[Note]Note

It is not a requirement to use Protocol A for stacked resources. You may select any of DRBD's replication protocols depending on your application.

Enabling stacked resources

To enable a stacked resource, you first enable its lower-level resource and promote it:

drbdadm up r0
drbdadm primary r0

As with unstacked resources, you must create DRBD meta data on the stacked resources. This is done using the following command:

drbdadm --stacked create-md r0-U

Then, you may enable the stacked resource:

drbdadm --stacked up r0-U
drbdadm --stacked primary r0-U

After this, you may bring up the resource on the backup node, enabling three-node replication:

drbdadm create-md r0-U
drbdadm up r0-U

In order to automate stacked resource management, you may integrate stacked resources in your cluster manager configuration. See the section called “Using stacked DRBD resources in Pacemaker clusters” for information on doing this in a cluster managed by the Pacemaker cluster management framework.

Using DRBD Proxy

DRBD Proxy deployment considerations

The DRBD Proxy processes can either be located directly on the machines where DRBD is set up, or they can be placed on distinct dedicated servers. A DRBD Proxy instance can serve as a proxy for multiple DRBD devices distributed across multiple nodes.

DRBD Proxy is completely transparent to DRBD. Typically you will expect a high number of data packets in flight, so the activity log should be reasonably large. Since this may cause longer re-sync runs after the crash of a primary node, it is recommended to enable DRBD's csums-alg setting.
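
A sketch of what such tuning might look like in the resource's syncer section (the values shown are illustrative only and should be sized for your workload):

resource resource {
  syncer {
    al-extents 3389;    # larger activity log for many packets in flight
    csums-alg md5;      # checksum-based resynchronization, as recommended above
  }
  ...
}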

Installation

To obtain DRBD Proxy, please contact your Linbit sales representative. Unless instructed otherwise, please always use the most recent DRBD Proxy release.

To install DRBD Proxy on Debian and Debian-based systems, use the dpkg tool as follows (replace version with your DRBD Proxy version, and architecture with your target architecture):

dpkg -i drbd-proxy_1.0.16_i386.deb

To install DRBD Proxy on RPM-based systems (like SLES or Red Hat), use the rpm tool as follows (replace version with your DRBD Proxy version, and architecture with your target architecture):

rpm -i drbd-proxy-1.0.16-1.i386.rpm

Also install the DRBD administration program drbdadm since it is required to configure DRBD Proxy.

This will install the DRBD Proxy binaries as well as an init script which usually goes into /etc/init.d. Please always use the init script to start/stop DRBD Proxy, since it also configures DRBD Proxy using the drbdadm tool.

License file

When obtaining a license from Linbit, you will be sent a DRBD Proxy license file which is required to run DRBD Proxy. The file is called drbd-proxy.license and must be copied into the /etc directory of the target machines.

cp drbd-proxy.license /etc

Configuration

DRBD Proxy is configured in DRBD's main configuration file. It is configured by an additional options section called proxy and additional proxy on sections within the host sections.

Below is a DRBD configuration example for proxies running directly on the DRBD nodes:

resource r0 {
        protocol A;
        device     minor 0;
        disk       /dev/sdb1;
        flexible-meta-disk  /dev/sdb2;

        proxy {
                compression on;
                memlimit 100M;
        }

        on alice {
                address 127.0.0.1:7789;
                proxy on alice {
                        inside 127.0.0.1:7788;
                        outside 192.168.23.1:7788;
                }
        }

        on bob {
                address 127.0.0.1:7789;
                proxy on bob {
                        inside 127.0.0.1:7788;
                        outside 192.168.23.2:7788;
                }
        }
}

The inside IP address is used for communication between DRBD and the DRBD Proxy, whereas the outside IP address is used for communication between the proxies.

Controlling DRBD Proxy

drbdadm offers the proxy-up and proxy-down subcommands to configure or delete the connection to the local DRBD Proxy process of the named DRBD resource(s). These commands are used by the start and stop actions which /etc/init.d/drbdproxy implements.

The DRBD Proxy has a low-level configuration tool, called drbd-proxy-ctl. When called without any option, it operates in interactive mode. The available commands are displayed by the help command.

Help for drbd-proxy.
--------------------

add connection <name> <ip-listen1>:<port> <ip-connect1>:<port>
   <ip-listen2>:<port> <ip-connect2>:<port>
   Creates a communication path between two DRBD instances.

set memlimit <name> <memlimit-in-bytes>
   Sets memlimit for connection <name>

del connection <name>
   Deletes communication path named name.

show
   Shows currently configured communication paths.

show memusage
   Shows memory usage of each connection.

list [h]subconnections
   Shows currently established individual connections
   together with some stats. With h outputs bytes in human
   readable format.

list [h]connections
   Shows currently configured connections and their states
   With h outputs bytes in human readable format.

list details
   Shows currently established individual connections with
   counters for each DRBD packet type.

quit
   Exits the client program (closes control connection).

shutdown
   Shuts down the drbd-proxy program. Attention: this
   unconditionally terminates any DRBD connections running.

Troubleshooting

DRBD proxy logs via syslog using the LOG_DAEMON facility. Usually you will find DRBD Proxy messages in /var/log/daemon.log.

For example, if the proxy fails to connect it will log something like Rejecting connection because I can't connect on the other side. In that case, please check if DRBD is running (not in StandAlone mode) on both nodes and if both proxies are running. Also double-check your configuration.
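
A quick way to pull recent proxy-related messages out of the daemon log, assuming the syslog layout described above:

tail -n 200 /var/log/daemon.log | grep -i proxy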

Chapter 7. Troubleshooting and error recovery

This chapter describes tasks to be performed in the event of hardware or system failures.

Dealing with hard drive failure

How to deal with hard drive failure depends on the way DRBD is configured to handle disk I/O errors (see the section called “Disk error handling strategies”), and on the type of meta data configured (see the section called “DRBD meta data”).

[Note]Note

For the most part, the steps described here apply only if you run DRBD directly on top of physical hard drives. They generally do not apply in case you are running DRBD layered on top of

  • an MD software RAID set (in this case, use mdadm to manage drive replacement),

  • device-mapper RAID (use dmraid),

  • a hardware RAID appliance (follow the vendor's instructions on how to deal with failed drives),

  • some non-standard device-mapper virtual block devices (see the device mapper documentation),

  • EVMS volumes (see the EVMS documentation).

Manually detaching DRBD from your hard drive

If DRBD is configured to pass on I/O errors (not recommended), you must first detach the DRBD resource, that is, disassociate it from its backing storage:

drbdadm detach resource

By running the drbdadm dstate command, you will now be able to verify that the resource is now in diskless mode:

drbdadm dstate resource
Diskless/UpToDate

If the disk failure has occurred on your primary node, you may combine this step with a switch-over operation.

Automatic detach on I/O error

If DRBD is configured to automatically detach upon I/O error (the recommended option), DRBD should have automatically detached the resource from its backing storage already, without manual intervention. You may still use the drbdadm dstate command to verify that the resource is in fact running in diskless mode.

Replacing a failed disk when using internal meta data

If using internal meta data, it is sufficient to bind the DRBD device to the new hard disk. If the new hard disk has to be addressed by a different Linux device name than the defective disk, this has to be modified accordingly in the DRBD configuration file.
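
For example, if the replacement disk shows up under a new device name, the disk line in the corresponding on section would be updated accordingly; the host and device names below are purely hypothetical:

on alice {
  device    /dev/drbd0;
  disk      /dev/sdc1;    # was /dev/sdb1 on the defective drive
  address   10.0.0.1:7788;
  meta-disk internal;
}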

This process involves creating a new meta data set, then re-attaching the resource:

drbdadm create-md resource
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

drbdadm attach resource

Full synchronization of the new hard disk starts instantaneously and automatically. You will be able to monitor the synchronization's progress via /proc/drbd, as with any background synchronization.

Replacing a failed disk when using external meta data

When using external meta data, the procedure is basically the same. However, DRBD is not able to recognize independently that the hard drive was swapped; thus, an additional step is required.

drbdadm create-md resource
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

drbdadm attach resource
drbdadm invalidate resource

Here, the drbdadm invalidate command triggers synchronization. Again, sync progress may be observed via /proc/drbd.

Dealing with node failure

When DRBD detects that its peer node is down (either by true hardware failure or manual intervention), DRBD changes its connection state from Connected to WFConnection and waits for the peer node to re-appear. The DRBD resource is then said to operate in disconnected mode. In disconnected mode, the resource and its associated block device are fully usable, and may be promoted and demoted as necessary, but no block modifications are being replicated to the peer node. Instead, DRBD stores internal information on which blocks are being modified while disconnected.

Dealing with temporary secondary node failure

If a node that currently has a resource in the secondary role fails temporarily (due to, for example, a memory problem that is subsequently rectified by replacing RAM), no further intervention is necessary — besides the obvious necessity to repair the failed node and bring it back on line. When that happens, the two nodes will simply re-establish connectivity upon system start-up. After this, DRBD replicates all modifications made on the primary node in the meantime, to the secondary node.

[Important]Important

At this point, due to the nature of DRBD's re-synchronization algorithm, the resource is briefly inconsistent on the secondary node. During that short time window, the secondary node can not switch to the Primary role if the peer is unavailable. Thus, the period in which your cluster is not redundant consists of the actual secondary node down time, plus the subsequent re-synchronization.

Dealing with temporary primary node failure

From DRBD's standpoint, failure of the primary node is almost identical to a failure of the secondary node. The surviving node detects the peer node's failure, and switches to disconnected mode. DRBD does not promote the surviving node to the primary role; it is the cluster management application's responsibility to do so.

When the failed node is repaired and returns to the cluster, it does so in the secondary role, thus, as outlined in the previous section, no further manual intervention is necessary. Again, DRBD does not change the resource role back, it is up to the cluster manager to do so (if so configured).

DRBD ensures block device consistency in case of a primary node failure by way of a special mechanism. For a detailed discussion, refer to the section called “The Activity Log”.

Dealing with permanent node failure

If a node suffers an unrecoverable problem or permanent destruction, you must take the following steps:

  • Replace the failed hardware with one with similar performance and disk capacity.

    [Note]Note

    Replacing a failed node with one with worse performance characteristics is possible, but not recommended. Replacing a failed node with one with less disk capacity is not supported, and will cause DRBD to refuse to connect to the replaced node.

  • Install the base system and applications.

  • Install DRBD and copy /etc/drbd.conf from the surviving node.

  • Follow the steps outlined in Chapter 5, Configuring DRBD, but stop short of the section called “The initial device synchronization”.

Manually starting a full device synchronization is not necessary at this point; it will commence automatically upon connection to the surviving primary node.

Manual split brain recovery

DRBD detects split brain at the time connectivity becomes available again and the peer nodes exchange the initial DRBD protocol handshake. If DRBD detects that both nodes are (or were at some point, while disconnected) in the primary role, it immediately tears down the replication connection. The tell-tale sign of this is a message like the following appearing in the system log:

Split-Brain detected, dropping connection!

After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).

At this point, unless you configured DRBD to automatically recover from split brain, you must manually intervene by selecting one node whose modifications will be discarded (this node is referred to as the split brain victim). This intervention is made with the following commands, issued on the split brain victim:

drbdadm secondary resource
drbdadm -- --discard-my-data connect resource

On the other node (the split brain survivor), if its connection state is also StandAlone, you would enter:

drbdadm connect resource

You may omit this step if the node is already in the WFConnection state; it will then reconnect automatically.

If the resource affected by the split brain is a stacked resource, use drbdadm --stacked instead of just drbdadm.

Upon connection, your split brain victim immediately changes its connection state to SyncTarget, and has its modifications overwritten by the remaining primary node.

[Note]Note

The split brain victim is not subjected to a full device synchronization. Instead, it has its local modifications rolled back, and any modifications made on the split brain survivor propagate to the victim.

After re-synchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again.