Chapter 5. Common administrative tasks

Table of Contents

5.1. Configuring DRBD
5.1.1. Preparing your lower-level storage
5.1.2. Preparing your network configuration
5.1.3. Configuring your resource
5.1.4. Defining network connections
5.1.5. Configuring transport implementations
5.1.6. Enabling your resource for the first time
5.1.7. The initial device synchronization
5.1.8. Using truck based replication
5.1.9. Example configuration for four nodes
5.2. Checking DRBD status
5.2.1. Retrieving status with drbd-overview
5.2.2. Status information in /proc/drbd
5.2.3. Status information via drbdadm
5.2.4. One-shot or realtime monitoring via drbdsetup events2
5.2.5. Connection states
5.2.6. Resource roles
5.2.7. Disk states
5.2.8. Connection information data
5.2.9. Performance indicators
5.3. Enabling and disabling resources
5.3.1. Enabling resources
5.3.2. Disabling resources
5.4. Reconfiguring resources
5.5. Promoting and demoting resources
5.6. Basic Manual Fail-over
5.7. Upgrading DRBD
5.7.1. General overview
5.7.2. Updating your repository
5.7.3. Checking the DRBD state
5.7.4. Pausing the cluster
5.7.5. Upgrading the packages
5.7.6. Loading the new Kernel module
5.7.7. Migrating your configuration files
5.7.8. Changing the meta-data
5.7.9. Starting DRBD again
5.7.10. From DRBD 9 to DRBD 9
5.8. Enabling dual-primary mode
5.8.1. Permanent dual-primary mode
5.8.2. Temporary dual-primary mode
5.9. Using on-line device verification
5.9.1. Enabling on-line verification
5.9.2. Invoking on-line verification
5.9.3. Automating on-line verification
5.10. Configuring the rate of synchronization
5.10.1. Estimating a synchronization speed
5.10.2. Variable sync rate configuration
5.10.3. Permanent fixed sync rate configuration
5.10.4. Some more hints about synchronization
5.11. Configuring checksum-based synchronization
5.12. Configuring congestion policies and suspended replication
5.13. Configuring I/O error handling strategies
5.14. Configuring replication traffic integrity checking
5.15. Resizing resources
5.15.1. Growing on-line
5.15.2. Growing off-line
5.15.3. Shrinking on-line
5.15.4. Shrinking off-line
5.16. Disabling backing device flushes
5.17. Configuring split brain behavior
5.17.1. Split brain notification
5.17.2. Automatic split brain recovery policies
5.18. Creating a stacked three-node setup
5.18.1. Device stacking considerations
5.18.2. Configuring a stacked resource
5.18.3. Enabling stacked resources
5.19. Data rebalancing
5.19.1. Prepare a bitmap slot
5.19.2. Preparing and activating the new node
5.19.3. Starting the initial sync
5.19.4. Check connectivity
5.19.5. After the initial sync
5.19.6. Cleaning up
5.19.7. Conclusion and further steps

This chapter outlines typical administrative tasks encountered during day-to-day operations. It does not cover troubleshooting tasks; these are covered in detail in Chapter 7, Troubleshooting and error recovery.

5.1. Configuring DRBD

5.1.1. Preparing your lower-level storage

After you have installed DRBD, you must set aside a roughly identically sized storage area on both cluster nodes. This will become the lower-level device for your DRBD resource. You may use any type of block device found on your system for this purpose. Typical examples include:

  • A hard drive partition (or a full physical hard drive),
  • a software RAID device,
  • an LVM Logical Volume or any other block device configured by the Linux device-mapper infrastructure,
  • any other block device type found on your system.

You may also use resource stacking, meaning you can use one DRBD device as a lower-level device for another. Some specific considerations apply to stacked resources; their configuration is covered in detail in Section 5.18, “Creating a stacked three-node setup”.

[Note]Note

While it is possible to use loop devices as lower-level devices for DRBD, doing so is not recommended due to deadlock issues.

It is not necessary for this storage area to be empty before you create a DRBD resource from it. In fact, it is a common use case to create a two-node cluster from a previously non-redundant single-server system using DRBD (some caveats apply; please refer to Section 21.1, “DRBD meta data” if you are planning to do this).

For the purposes of this guide, we assume a very simple setup:

  • Both hosts have a free (currently unused) partition named /dev/sda7.
  • We are using internal meta data.

5.1.2. Preparing your network configuration

It is recommended, though not strictly required, that you run your DRBD replication over a dedicated connection. At the time of this writing, the most reasonable choice for this is a direct, back-to-back, Gigabit Ethernet connection. When DRBD is run over switches, use of redundant components and the bonding driver (in active-backup mode) is recommended.

It is generally not recommended to run DRBD replication via routers, for reasons of fairly obvious performance drawbacks (adversely affecting both throughput and latency).

In terms of local firewall considerations, it is important to understand that DRBD (by convention) uses TCP ports from 7788 upwards, with every resource listening on a separate port. DRBD uses two TCP connections for every resource configured. For proper DRBD functionality, it is required that these connections are allowed by your firewall configuration.
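
As a sketch, on a system using plain iptables, rules along the following lines would admit DRBD's traffic; the peer address 10.1.1.32 and the port range are the examples used throughout this guide, so adapt both to your setup and your local firewall framework:

# Allow DRBD replication traffic to and from the example peer;
# the port range must match the ports your resources actually use
iptables -A INPUT  -p tcp -s 10.1.1.32 --dport 7788:7799 -j ACCEPT
iptables -A OUTPUT -p tcp -d 10.1.1.32 --dport 7788:7799 -j ACCEPT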

Security considerations other than firewalling may also apply if a Mandatory Access Control (MAC) scheme such as SELinux or AppArmor is enabled. You may have to adjust your local security policy so it does not keep DRBD from functioning properly.

You must, of course, also ensure that the TCP ports for DRBD are not already used by another application.
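
For example, to check that nothing is already listening on the ports used in this guide (a quick sketch using ss; any output indicates a conflict):

# List any listener already occupying TCP ports 7788 through 7799
ss -tln | awk '$4 ~ /:(778[89]|779[0-9])$/'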

It is not (yet) possible to configure a DRBD resource to support more than one TCP connection pair. If you want to provide for DRBD connection load-balancing or redundancy, you can easily do so at the Ethernet level (again, using the bonding driver).

For the purposes of this guide, we assume a very simple setup:

  • Our two DRBD hosts each have a currently unused network interface, eth1, with IP addresses 10.1.1.31 and 10.1.1.32 assigned to it, respectively.
  • No other services are using TCP ports 7788 through 7799 on either host.
  • The local firewall configuration allows both inbound and outbound TCP connections between the hosts over these ports.

5.1.3. Configuring your resource

All aspects of DRBD are controlled in its configuration file, /etc/drbd.conf. Normally, this configuration file is just a skeleton with the following contents:

include "/etc/drbd.d/global_common.conf";
include "/etc/drbd.d/*.res";

By convention, /etc/drbd.d/global_common.conf contains the global and common sections of the DRBD configuration, whereas the .res files contain one resource section each.

It is also possible to use drbd.conf as a flat configuration file without any include statements at all. Such a configuration, however, quickly becomes cluttered and hard to manage, which is why the multiple-file approach is the preferred one.

Regardless of which approach you employ, you should always make sure that drbd.conf, and any other files it includes, are exactly identical on all participating cluster nodes.
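
One simple way to do this, shown here as a sketch using the host names from this guide, is to push the files to the peer after every change (csync2 or a configuration management system is a more robust choice for larger setups):

# Run on alice after every configuration change
rsync -av /etc/drbd.conf /etc/drbd.d bob:/etc/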

The DRBD source tarball contains an example configuration file in the scripts subdirectory. Binary installation packages will either install this example configuration directly in /etc, or in a package-specific documentation directory such as /usr/share/doc/packages/drbd.

This section describes only those few aspects of the configuration file which are absolutely necessary to understand in order to get DRBD up and running. The configuration file’s syntax and contents are documented in great detail in drbd.conf(5).

5.1.3.1. Example configuration

For the purposes of this guide, we assume a minimal setup in line with the examples given in the previous sections:

Simple DRBD configuration (/etc/drbd.d/global_common.conf). 

global {
  usage-count yes;
}
common {
  net {
    protocol C;
  }
}

Simple DRBD resource configuration (/etc/drbd.d/r0.res). 

resource r0 {
  on alice {
    device    /dev/drbd1;
    disk      /dev/sda7;
    address   10.1.1.31:7789;
    meta-disk internal;
  }
  on bob {
    device    /dev/drbd1;
    disk      /dev/sda7;
    address   10.1.1.32:7789;
    meta-disk internal;
  }
}

This example configures DRBD in the following fashion:

  • You "opt in" to be included in DRBD’s usage statistics (see usage-count).
  • Resources are configured to use fully synchronous replication (Protocol C) unless explicitly specified otherwise.
  • Our cluster consists of two nodes, alice and bob.
  • We have a resource arbitrarily named r0 which uses /dev/sda7 as the lower-level device, and is configured with internal meta data.
  • The resource uses TCP port 7789 for its network connections, and binds to the IP addresses 10.1.1.31 and 10.1.1.32, respectively. (This implicitly defines the network connections that are used.)

The configuration above implicitly creates one volume in the resource, numbered zero (0). For multiple volumes in one resource, modify the syntax as follows (assuming that the same lower-level storage block devices are used on both nodes):

Multi-volume DRBD resource configuration (/etc/drbd.d/r0.res). 

resource r0 {
  volume 0 {
    device    /dev/drbd1;
    disk      /dev/sda7;
    meta-disk internal;
  }
  volume 1 {
    device    /dev/drbd2;
    disk      /dev/sda8;
    meta-disk internal;
  }
  on alice {
    address   10.1.1.31:7789;
  }
  on bob {
    address   10.1.1.32:7789;
  }
}

[Note]Note

Volumes may also be added to existing resources on the fly. For an example see Section 10.5, “Adding a new DRBD volume to an existing Volume Group”.

5.1.3.2. The global section

This section is allowed only once in the configuration. It is normally in the /etc/drbd.d/global_common.conf file. In a single-file configuration, it should go to the very top of the configuration file. Of the few options available in this section, only one is of relevance to most users:

usage-count. The DRBD project keeps statistics about the usage of various DRBD versions. This is done by contacting an HTTP server every time a new DRBD version is installed on a system. This can be disabled by setting usage-count no;. The default is usage-count ask;, which will prompt you every time you upgrade DRBD.

DRBD’s usage statistics are, of course, publicly available: see http://usage.drbd.org.

5.1.3.3. The common section

This section provides a shorthand method to define configuration settings inherited by every resource. It is normally found in /etc/drbd.d/global_common.conf. You may define any option you can also define on a per-resource basis.

Including a common section is not strictly required, but strongly recommended if you are using more than one resource. Otherwise, the configuration quickly becomes cluttered with repeatedly used options.

In the example above, we included net { protocol C; } in the common section, so every resource configured (including r0) inherits this option unless it has another protocol option configured explicitly. For other synchronization protocols available, see Section 2.3, “Replication modes”.
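
For illustration, a hypothetical resource r1 could override the inherited setting like this:

resource r1 {
  net {
    protocol A;   # overrides the protocol C inherited from common
  }
  ...
}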

5.1.3.4. The resource sections

A per-resource configuration file is usually named /etc/drbd.d/resource.res. Any DRBD resource you define must be named by specifying a resource name in the configuration. The convention is to use only letters, digits, and underscores; while it is technically possible to use other characters as well, you will not like the result if you ever need the more specific peer@resource/volume syntax.

Every resource configuration must also have at least two on host sub-sections, one for every cluster node. All other configuration settings are either inherited from the common section (if it exists), or derived from DRBD’s default settings.

In addition, options with equal values on all hosts can be specified directly in the resource section. Thus, we can further condense our example configuration as follows:

resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda7;
  meta-disk internal;
  on alice {
    address   10.1.1.31:7789;
  }
  on bob {
    address   10.1.1.32:7789;
  }
}

5.1.4. Defining network connections

Currently, the communication links in DRBD 9 must form a full mesh, i.e., in every resource every node must have a direct connection to every other node (excluding itself, of course).

For the simple case of two hosts, drbdadm will insert the (single) network connection by itself, for ease of use and backwards compatibility.

The net effect is that the number of network connections grows quadratically with the number of hosts: N hosts require N*(N-1)/2 connections. For the "traditional" two nodes, one connection is needed; for three hosts, there are three node pairs; for four, six pairs; for five hosts, ten connections, and so on. For the current maximum of 16 nodes, there are 120 host pairs to connect.

Figure 5.1. Number of connections for N hosts (diagram omitted)

An example configuration file for three hosts would be this:

resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda7;
  meta-disk internal;
  on alice {
    address   10.1.1.31:7000;
    node-id   0;
  }
  on bob {
    address   10.1.1.32:7001;
    node-id   1;
  }
  on charlie {
    address   10.1.1.33:7002;
    node-id   2;
  }
  connection {
    host alice   port 7010;
    host bob     port 7001;
  }
  connection {
    host alice   port 7020;
    host charlie port 7002;
  }
  connection {
    host bob     port 7012;
    host charlie port 7021;
  }

}

Specifying a port as part of the address value within the on host sections is optional; if you omit it, however, you have to specify distinct port numbers for each connection.

[Note]Note

For this pre-release the whole connection mesh must be defined.

In the final release it will be sufficient to give each node a single port, and DRBD will figure out during the handshake which peer node it is talking to.

Alternatively, the same three-node configuration can be written using the connection-mesh option:

resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda7;
  meta-disk internal;
  on alice {
    address   10.1.1.31:7000;
    node-id   0;
  }
  on bob {
    address   10.1.1.32:7001;
    node-id   1;
  }
  on charlie {
    address   10.1.1.33:7002;
    node-id   2;
  }
  connection-mesh {
    hosts alice bob charlie;
    net {
        use-rle no;
    }
  }

}

If your servers have enough network cards, you can create direct crossover links between server pairs. A single four-port Ethernet card allows you to keep a single management interface and to connect three other servers, giving a full mesh for four cluster nodes.

In this case you can specify a different IP address to use the direct link:

resource r0 {
...
  connection {
    host alice   address 10.1.2.1 port 7010;
    host bob     address 10.1.2.2 port 7001;
  }
  connection {
    host alice   address 10.1.3.1 port 7020;
    host charlie address 10.1.3.2 port 7002;
  }
  connection {
    host bob     address 10.1.4.1 port 7021;
    host charlie address 10.1.4.2 port 7012;
  }
}

For easier maintenance and debugging, it is recommended to use a different port for each endpoint; this way, packets in a tcpdump trace can easily be associated with their connection.
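
For example, with the port scheme above, a capture filter isolates the alice-charlie connection at a glance (the interface name here is a placeholder for whichever link carries that connection):

# Show only the alice<->charlie traffic (ports 7020 and 7002 above)
tcpdump -ni eth1 'tcp and (port 7020 or port 7002)'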

The examples below will still be using two servers only; please see Section 5.1.9, “Example configuration for four nodes” for a four-node example.

5.1.5. Configuring transport implementations

DRBD supports multiple network transports. A transport implementation can be configured for each connection of a resource.

5.1.5.1. TCP/IP

resource <resource> {
  net {
    transport "tcp";
  }
  ...
}

tcp is the default transport; that is, each connection that lacks a transport option uses the tcp transport.

The tcp transport can be configured with the net options: sndbuf-size, rcvbuf-size, connect-int, sock-check-timeo, ping-timeo, timeout.
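
As a sketch, such options are placed alongside the transport selection; the values below are placeholders, not tuning recommendations:

resource <resource> {
  net {
    transport   "tcp";
    sndbuf-size 512k;   # placeholder value
    rcvbuf-size 512k;   # placeholder value
    ping-timeo  10;     # in tenths of a second
  }
  ...
}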

5.1.5.2. RDMA

resource <resource> {
  net {
    transport "rdma";
  }
  ...
}

The rdma transport can be configured with the net options: sndbuf-size, rcvbuf-size, max_buffers, connect-int, sock-check-timeo, ping-timeo, timeout.

The rdma transport is a zero-copy-receive transport. One implication of that is that the max_buffers configuration option must be set to a value big enough to hold all of rcvbuf-size.

[Note]Note

rcvbuf-size is configured in bytes, while max_buffers is configured in pages. For optimal performance max_buffers should be big enough to hold all of rcvbuf-size and the amount of data that might be in flight to the backend device at any point in time.
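
As a worked example, with 4 KiB pages an rcvbuf-size of 2 MiB corresponds to 512 pages, so max_buffers needs to be at least 512, plus headroom for data in flight to the backend device (the values shown are placeholders):

resource <resource> {
  net {
    transport   "rdma";
    rcvbuf-size 2M;      # bytes: 2 MiB = 512 pages of 4 KiB
    max_buffers 1024;    # pages: at least 512, with headroom
  }
  ...
}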

[Tip]Tip

In case you are using InfiniBand HCAs with the rdma transport, you need to configure IPoIB as well. The IP address is not used for data transfer, but it is used to find the right adapters and ports while establishing the connection.

[Caution]Caution

The configuration options sndbuf-size and rcvbuf-size are only considered at the time a connection is established. That is, you can change them while the connection is up, but your changes will only take effect when the connection is re-established.

5.1.5.3. Performance considerations for RDMA

By looking at the pseudo file /sys/kernel/debug/drbd/<resource>/connections/<peer>/transport, you can monitor the counts of available receive descriptors (rx_desc) and transmit descriptors (tx_desc). If one of the descriptor kinds becomes depleted, you should increase sndbuf-size or rcvbuf-size.
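
For example, to keep an eye on these counters (resource and peer names are placeholders):

# Refresh the transport statistics once per second
watch -n1 cat /sys/kernel/debug/drbd/<resource>/connections/<peer>/transport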

5.1.6. Enabling your resource for the first time

After you have completed initial resource configuration as outlined in the previous sections, you can bring up your resource.

Each of the following steps must be completed on both nodes.

Please note that with our example config snippets (resource r0 { … }), <resource> would be r0.

Create device metadata. This step must be completed only on initial device creation. It initializes DRBD’s metadata:

# drbdadm create-md <resource>
v09 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block successfully created.

Please note that the number of bitmap slots allocated in the meta data depends on the number of hosts for this resource; by default, the hosts in the resource configuration are counted. If all hosts are specified before creating the meta data, this will "just work"; adding bitmap slots for further nodes is possible later, but incurs some manual work.
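
If you already know that more nodes will join later, you can reserve additional bitmap slots when creating the meta data; as a sketch, with a reasonably recent DRBD 9 userland:

# Reserve bitmap slots for up to 3 peers, even if fewer are configured yet
drbdadm create-md --max-peers=3 <resource>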

Enable the resource. This step associates the resource with its backing device (or devices, in case of a multi-volume resource), sets replication parameters, and connects the resource to its peer:

# drbdadm up <resource>

Observe the status. The output of drbdadm status should now contain information similar to the following:

# drbdadm status r0
r0 role:Secondary
  disk:Inconsistent
  bob role:Secondary
    disk:Inconsistent

[Note]Note

The Inconsistent/Inconsistent disk state is expected at this point.

By now, DRBD has successfully allocated both disk and network resources and is ready for operation. What it does not know yet is which of your nodes should be used as the source of the initial device synchronization.

5.1.7. The initial device synchronization

There are two more steps required for DRBD to become fully operational:

Select an initial sync source. If you are dealing with newly-initialized, empty disks, this choice is entirely arbitrary. If one of your nodes already has valuable data that you need to preserve, however, it is of crucial importance that you select that node as your synchronization source. If you do initial device synchronization in the wrong direction, you will lose that data. Exercise caution.

Start the initial full synchronization. This step must be performed on only one node, only on initial resource configuration, and only on the node you selected as the synchronization source. To perform this step, issue this command:

# drbdadm primary --force <resource>

After issuing this command, the initial full synchronization will commence. You will be able to monitor its progress via drbdadm status. It may take some time depending on the size of the device.
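
For example, to follow the progress continuously, using the resource name from this guide:

# Refresh the resource status every two seconds
watch -n2 drbdadm status r0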

By now, your DRBD device is fully operational, even before the initial synchronization has completed (albeit with slightly reduced performance). If you started with empty disks, you may already create a filesystem on the device, use it as a raw block device, mount it, and perform any other operation you would with an accessible block device.
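
For example, on the node that is currently Primary (using the device name from the example configuration; the filesystem type is an arbitrary choice):

# Create a filesystem on the DRBD device and mount it
mkfs.ext4 /dev/drbd1
mount /dev/drbd1 /mnt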

You will now probably want to continue with Part III, “Working with DRBD”, which describes common administrative tasks to perform on your resource.

5.1.8. Using truck based replication

In order to preseed a remote node with data which is then to be kept synchronized, and to skip the initial full device synchronization, follow these steps.

This assumes that your local node has a configured, but disconnected DRBD resource in the Primary role. That is to say, device configuration is completed, identical drbd.conf copies exist on both nodes, and you have issued the commands for initial resource promotion on your local node — but the remote node is not connected yet.

  • On the local node, issue the following command:

    # drbdadm new-current-uuid --clear-bitmap <resource>/<volume>

    or

    # drbdsetup new-current-uuid --clear-bitmap <minor>
  • Create a consistent, verbatim copy of the resource’s data and its metadata. You may do so, for example, by removing a hot-swappable drive from a RAID-1 mirror. You would, of course, replace it with a fresh drive, and rebuild the RAID set, to ensure continued redundancy. But the removed drive is a verbatim copy that can now be shipped off site. If your local block device supports snapshot copies (such as when using DRBD on top of LVM), you may also create a bitwise copy of that snapshot using dd (see the sketch after this list).
  • On the local node, issue:

    # drbdadm new-current-uuid <resource>

    or the matching drbdsetup command.

    Note the absence of the --clear-bitmap option in this second invocation.

  • Physically transport the copies to the remote peer location.
  • Add the copies to the remote node. This may again be a matter of plugging a physical disk, or grafting a bitwise copy of your shipped data onto existing storage on the remote node. Be sure to restore or copy not only your replicated data, but also the associated DRBD metadata. If you fail to do so, the disk shipping process is moot.
  • On the new node, the node ID in the meta data must be fixed, and the peer-node information must be exchanged between the two nodes. The following lines show an example of changing the node ID from 2 to 1, on a resource r0 volume 0.

    This must be done while the volume is not in use.

    V=r0/0
    NODE_FROM=2
    NODE_TO=1
    
    drbdadm -- --force dump-md $V > /tmp/md_orig.txt
    sed -e "s/node-id $NODE_FROM/node-id $NODE_TO/" \
            -e "s/^peer.$NODE_FROM. /peer-NEW /" \
            -e "s/^peer.$NODE_TO. /peer[$NODE_FROM] /" \
            -e "s/^peer-NEW /peer[$NODE_TO] /" \
            < /tmp/md_orig.txt > /tmp/md.txt
    
    drbdmeta --force $(drbdadm sh-minor $V) v09 $(drbdadm sh-ll-dev $V) internal restore-md /tmp/md.txt

    Note: drbdmeta before 8.9.7 cannot cope with out-of-order peer sections; you will need to exchange the blocks using an editor.

  • Bring up the resource on the remote node:

    # drbdadm up <resource>
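
As a sketch of the snapshot-based copy mentioned in the second step above (all device and volume names here are hypothetical; they assume DRBD on top of an LVM volume /dev/vg/r0):

# Snapshot the backing LV, then copy it bitwise to a transport disk
lvcreate --snapshot --name r0_snap --size 10G /dev/vg/r0
dd if=/dev/vg/r0_snap of=/dev/sdx bs=64M conv=fsync status=progress
lvremove /dev/vg/r0_snap   # remove the snapshot once the copy is done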

After the two peers connect, they will not initiate a full device synchronization. Instead, the automatic synchronization that now commences only covers those blocks that changed since the invocation of drbdadm new-current-uuid --clear-bitmap.

Even if there were no changes whatsoever since then, there may still be a brief synchronization period due to areas covered by the Activity Log being rolled back on the new Secondary. This may be mitigated by the use of checksum-based synchronization.

You may use this same procedure regardless of whether the resource is a regular DRBD resource, or a stacked resource. For stacked resources, simply add the -S or --stacked option to drbdadm.

5.1.9. Example configuration for four nodes

Here is an example for a four-node cluster.

Please note that the connection sections (and distinct ports) won’t be necessary for the DRBD 9.0.0 release; we will find some nice short-hand syntax.

resource r0 {
  device      /dev/drbd0;
  disk        /dev/vg/r0;
  meta-disk   internal;

  on store1 {
    address   10.1.10.1:7100;
    node-id   1;
  }
  on store2 {
    address   10.1.10.2:7100;
    node-id   2;
  }
  on store3 {
    address   10.1.10.3:7100;
    node-id   3;
  }
  on store4 {
    address   10.1.10.4:7100;
    node-id   4;
  }

  # All connections involving store1
  connection {
    host store1  port 7012;
    host store2  port 7021;
  }
  connection {
    host store1  port 7013;
    host store3  port 7031;
  }
  connection {
    host store1  port 7014;
    host store4  port 7041;
  }

  # All remaining connections involving store2
  connection {
    host store2  port 7023;
    host store3  port 7032;
  }
  connection {
    host store2  port 7024;
    host store4  port 7042;
  }

  # All remaining connections involving store3
  connection {
    host store3  port 7034;
    host store4  port 7043;
  }

  # store4 already done.
}

In contrast, the same configuration can be written much more compactly using connection-mesh:

resource r0 {
  device      /dev/drbd0;
  disk        /dev/vg/r0;
  meta-disk   internal;

  on store1 {
    address   10.1.10.1:7100;
    node-id   1;
  }
  on store2 {
    address   10.1.10.2:7100;
    node-id   2;
  }
  on store3 {
    address   10.1.10.3:7100;
    node-id   3;
  }
  on store4 {
    address   10.1.10.4:7100;
    node-id   4;
  }

  connection-mesh {
        hosts     store1 store2 store3 store4;
  }
}

If you want to see the connection-mesh configuration expanded, try drbdadm dump <resource> -v.

As another example, if the four nodes have enough interfaces to provide a complete mesh via direct links[5], you can specify the IP addresses of the interfaces:

resource r0 {
  ...

  # store1 has crossover links like 10.99.1x.y
  connection {
    host store1  address 10.99.12.1 port 7012;
    host store2  address 10.99.12.2 port 7021;
  }
  connection {
    host store1  address 10.99.13.1  port 7013;
    host store3  address 10.99.13.3  port 7031;
  }
  connection {
    host store1  address 10.99.14.1  port 7014;
    host store4  address 10.99.14.4  port 7041;
  }

  # store2 has crossover links like 10.99.2x.y
  connection {
    host store2  address 10.99.23.2  port 7023;
    host store3  address 10.99.23.3  port 7032;
  }
  connection {
    host store2  address 10.99.24.2  port 7024;
    host store4  address 10.99.24.4  port 7042;
  }

  # store3 has crossover links like 10.99.3x.y
  connection {
    host store3  address 10.99.34.3  port 7034;
    host store4  address 10.99.34.4  port 7043;
  }
}

Please note the numbering scheme used for the IP addresses and ports. Another resource could use the same IP addresses, but ports 71xy, the next one 72xy, and so on.



[5] i.e., three crossover links and at least one outgoing/management interface