Part V. Optimizing DRBD performance

Chapter 15. Measuring block device performance

Measuring throughput

When measuring the impact of using DRBD on a system's I/O throughput, the absolute throughput the system is capable of is of little relevance. What is much more interesting is the relative impact DRBD has on I/O performance. Thus it is always necessary to measure I/O throughput both with and without DRBD.

Caution

The tests described in this section are intrusive; they overwrite data and bring DRBD devices out of sync. It is thus vital that you perform them only on scratch volumes which can be discarded after testing has completed.

I/O throughput estimation works by writing reasonably large chunks of data to a block device and measuring the amount of time the system takes to complete the write operation. This can easily be done using the fairly ubiquitous dd utility, reasonably recent versions of which include a built-in throughput estimation.

A simple dd-based throughput benchmark, assuming you have a scratch resource named test which is currently connected and in the secondary role on both nodes, looks like this:

TEST_RESOURCE=test
TEST_DEVICE=$(drbdadm sh-dev $TEST_RESOURCE)
TEST_LL_DEVICE=$(drbdadm sh-ll-dev $TEST_RESOURCE)
drbdadm primary $TEST_RESOURCE
for i in $(seq 5); do
  dd if=/dev/zero of=$TEST_DEVICE bs=512M count=1 oflag=direct
done
drbdadm down $TEST_RESOURCE
for i in $(seq 5); do
  dd if=/dev/zero of=$TEST_LL_DEVICE bs=512M count=1 oflag=direct
done

This test writes a 512M chunk of data to your DRBD device, and then to its backing device for comparison. Both tests are repeated five times each to allow for some statistical averaging. The relevant results are the throughput measurements generated by dd.

Note

For freshly enabled DRBD devices, it is normal to see significantly reduced performance on the first dd run. This is due to the Activity Log being cold, and is no cause for concern.

Measuring latency

Latency measurements have objectives completely different from throughput benchmarks: in I/O latency tests, one writes a very small chunk of data (ideally the smallest chunk of data that the system can deal with), and observes the time it takes to complete that write. The process is usually repeated several times to account for normal statistical fluctuations.

Just as with throughput measurements, I/O latency measurements may be performed using the ubiquitous dd utility, albeit with different settings and an entirely different focus of observation.

Provided below is a simple dd-based latency micro-benchmark, assuming you have a scratch resource named test which is currently connected and in the secondary role on both nodes:

TEST_RESOURCE=test
TEST_DEVICE=$(drbdadm sh-dev $TEST_RESOURCE)
TEST_LL_DEVICE=$(drbdadm sh-ll-dev $TEST_RESOURCE)
drbdadm primary $TEST_RESOURCE
dd if=/dev/zero of=$TEST_DEVICE bs=512 count=1000 oflag=direct
drbdadm down $TEST_RESOURCE
dd if=/dev/zero of=$TEST_LL_DEVICE bs=512 count=1000 oflag=direct

This test writes 1,000 512-byte chunks of data to your DRBD device, and then to its backing device for comparison. 512 bytes is the smallest block size a Linux system (on all architectures except s390) is expected to handle.

It is important to understand that throughput measurements generated by dd are completely irrelevant for this test; what is important is the time elapsed during the completion of said 1,000 writes. Dividing this time by 1,000 gives the average latency of a single sector write.

Chapter 16. Optimizing DRBD throughput

This chapter deals with optimizing DRBD throughput. It examines some hardware considerations with regard to throughput optimization, and details tuning recommendations for that purpose.

Hardware considerations

DRBD throughput is affected by both the bandwidth of the underlying I/O subsystem (disks, controllers, and corresponding caches), and the bandwidth of the replication network.

  • I/O subsystem throughput. I/O subsystem throughput is determined, largely, by the number of disks that can be written to in parallel. A single, reasonably recent SCSI or SAS disk will typically allow streaming writes of roughly 40MB/s. When deployed in a striping configuration, the I/O subsystem will parallelize writes across disks, effectively multiplying a single disk's throughput by the number of stripes in the configuration. Thus the same 40MB/s disks will allow effective throughput of 120MB/s in a RAID-0 or RAID-1+0 configuration with three stripes, or 200MB/s with five stripes.

    Note

    Disk mirroring (RAID-1) in hardware typically has little, if any, effect on throughput. Disk striping with parity (RAID-5) does have an effect on throughput, usually an adverse one when compared to striping.

  • Network throughput. Network throughput is usually determined by the amount of traffic present on the network, and by the throughput of any routing or switching infrastructure present. These concerns are, however, largely irrelevant for DRBD replication links, which are normally dedicated, back-to-back network connections.

    Thus, network throughput may be improved either by switching to a higher-throughput protocol (such as 10 Gigabit Ethernet), or by using link aggregation over several network links, as one may do using the Linux bonding network driver.
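
    A minimal sketch of such a bonded replication link on a Debian-style system is shown below. This is an illustration, not a definitive setup: it assumes the ifenslave package is installed, and the interface names and address are placeholders only.

    # /etc/network/interfaces excerpt: round-robin bonding over two
    # back-to-back links (eth1, eth2) dedicated to DRBD replication
    auto bond0
    iface bond0 inet static
        address 10.1.1.31
        netmask 255.255.255.0
        bond-slaves eth1 eth2
        bond-mode balance-rr
        bond-miimon 100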

Throughput overhead expectations

When estimating the throughput overhead associated with DRBD, it is important to consider the following natural limitations:

  • DRBD throughput is limited by that of the raw I/O subsystem.

  • DRBD throughput is limited by the available network bandwidth.

The lesser of the two establishes the theoretical maximum throughput available to DRBD. DRBD then reduces that maximum by its additional throughput overhead, which can be expected to be less than 3 percent.

  • Consider the example of two cluster nodes containing I/O subsystems capable of 200 MB/s throughput, with a Gigabit Ethernet link available between them. Gigabit Ethernet can be expected to produce 110 MB/s throughput for TCP connections, thus the network connection would be the bottleneck in this configuration and one would expect about 107 MB/s maximum DRBD throughput.

  • By contrast, if the I/O subsystem is capable of only 100 MB/s for sustained writes, then it constitutes the bottleneck, and you would be able to expect only 97 MB/s maximum DRBD throughput.

Tuning recommendations

DRBD offers a number of configuration options which may have an effect on your system's throughput. This section lists some recommendations for tuning for throughput. However, since throughput is largely hardware dependent, the effects of tweaking the options described here may vary greatly from system to system. It is important to understand that these recommendations should not be interpreted as silver bullets which will magically remove any and all throughput bottlenecks.

Setting max-buffers and max-epoch-size

These options affect write performance on the secondary node. max-buffers is the maximum number of buffers DRBD allocates for writing data to disk, while max-epoch-size is the maximum number of write requests permitted between two write barriers. Under most circumstances, these two options should be set in parallel, and to identical values. The default for both is 2048; setting both to around 8000 should be fine for most reasonably high-performance hardware RAID controllers.

resource resource {
  net {
    max-buffers 8000;
    max-epoch-size 8000;
    ...
  }
  ...
}
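
As with the other configuration changes discussed in this chapter, the new settings can typically be applied to a running resource, on both nodes, using drbdadm adjust:

# re-read the configuration and apply it to the running resource
drbdadm adjust resource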

Tweaking the I/O unplug watermark

The I/O unplug watermark is a tunable which affects how often the I/O subsystem's controller is kicked (forced to process pending I/O requests) during normal operation. There is no universally recommended setting for this option; it is greatly hardware dependent.

Some controllers perform best when kicked frequently, so for these controllers it makes sense to set this fairly low, perhaps even as low as DRBD's allowable minimum (16). Others perform best when left alone; for these controllers a setting as high as max-buffers is advisable.

resource resource {
  net {
    unplug-watermark 16;
    ...
  }
  ...
}

Tuning the TCP send buffer size

The TCP send buffer is a memory buffer for outgoing TCP traffic. By default, it is set to a size of 128 KiB. For use in high-throughput networks (such as dedicated Gigabit Ethernet or load-balanced bonded connections), it may make sense to increase this to a size of 512 KiB, or perhaps even more. Send buffer sizes of more than 2 MiB are generally not recommended (and are also unlikely to produce any throughput improvement).

resource resource {
  net {
    sndbuf-size 512k;
    ...
  }
  ...
}

From DRBD 8.2.7 forward, DRBD also supports TCP send buffer auto-tuning. With this feature enabled, DRBD will dynamically select an appropriate TCP send buffer size. TCP send buffer auto-tuning is enabled by simply setting the buffer size to zero:

resource resource {
  net {
    sndbuf-size 0;
    ...
  }
  ...
}

Tuning the Activity Log size

If the application using DRBD is write intensive in the sense that it frequently issues small writes scattered across the device, it is usually advisable to use a fairly large activity log. Otherwise, frequent metadata updates may be detrimental to write performance.

resource resource {
  syncer {
    al-extents 3389;
    ...
  }
  ...
}

Disabling barriers and disk flushes

Warning

The recommendations outlined in this section should be applied only to systems with non-volatile (battery-backed) controller caches.

Systems equipped with a battery-backed write cache come with built-in means of protecting data in the face of power failure. In that case, it is permissible to disable some of DRBD's own safeguards created for the same purpose. This may be beneficial in terms of throughput:

resource resource {
  disk {
    no-disk-barrier;
    no-disk-flushes;
    ...
  }
  ...
}

Chapter 17. Optimizing DRBD latency

This chapter deals with optimizing DRBD latency. It examines some hardware considerations with regard to latency minimization, and details tuning recommendations for that purpose.

Hardware considerations

DRBD latency is affected by both the latency of the underlying I/O subsystem (disks, controllers, and corresponding caches), and the latency of the replication network.

  • I/O subsystem latency. I/O subsystem latency is primarily a function of disk rotation speed. Thus, using fast-spinning disks is a valid approach for reducing I/O subsystem latency.

    Likewise, the use of a battery-backed write cache (BBWC) reduces write completion times, also reducing write latency. Most reasonable storage subsystems come with some form of battery-backed cache, and allow the administrator to configure which portion of this cache is used for read and write operations. The recommended approach is to disable the disk read cache completely and use all cache memory available for the disk write cache.

  • Network latency. Network latency is, in essence, the packet round-trip time (RTT) between hosts. It is influenced by a number of factors, most of which are irrelevant on the dedicated, back-to-back network connections recommended for use as DRBD replication links. Thus, it is sufficient to accept that a certain amount of latency always exists in Gigabit Ethernet links, which typically is on the order of 100 to 200 microseconds (μs) packet RTT.

    Network latency may typically be pushed below this limit only by using lower-latency network protocols, such as running DRBD over Dolphin Express using Dolphin SuperSockets.
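
    To get a rough idea of the RTT actually observed on your replication link, a simple ping against the peer's replication address is usually sufficient (the address below is merely an example):

    ping -c 10 10.1.1.32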

Latency overhead expectations

As for throughput, when estimating the latency overhead associated with DRBD, there are some important natural limitations to consider:

  • DRBD latency is bound by that of the raw I/O subsystem.

  • DRBD latency is bound by the available network latency.

The sum of the two establishes the theoretical minimum latency incurred by DRBD. DRBD then adds to that a slight additional latency overhead, which can be expected to be less than 1 percent.

  • Consider the example of a local disk subsystem with a write latency of 3ms and a network link with one of 0.2ms. Then the expected DRBD latency would be 3.2 ms or a roughly 7-percent latency increase over just writing to a local disk.

Note

Latency may be influenced by a number of other factors, including CPU cache misses, context switches, and others.

Tuning recommendations

First and foremost, for optimal DRBD performance in environments where low latency — rather than high throughput — is crucial, it is strongly recommended to use DRBD 8.2.5 and above. The changeset for DRBD 8.2.5 contained a number of improvements particularly targeted at optimizing DRBD latency.

Setting DRBD's CPU mask

DRBD allows for setting an explicit CPU mask for its kernel threads. This is particularly beneficial for applications which would otherwise compete with DRBD for CPU cycles.

The CPU mask is a number in whose binary representation the least significant bit represents the first CPU, the second-least significant bit the second, and so forth. A set bit in the bitmask means that the corresponding CPU may be used by DRBD, whereas a cleared bit means it must not be used. Thus, for example, a CPU mask of 1 (00000001) means DRBD may use the first CPU only. A mask of 12 (00001100) means DRBD may use the third and fourth CPU.

An example CPU mask configuration for a resource may look like this:

resource resource {
  syncer {
    cpu-mask 2;
    ...
  }
  ...
}

Important

Of course, in order to minimize CPU competition between DRBD and the application using it, you need to configure your application to use only those CPUs which DRBD does not use.

Some applications may provide for this via an entry in a configuration file, just like DRBD itself. Others include an invocation of the taskset command in an application init script.
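
For example (a sketch only; the binary path and CPU list are placeholders), with DRBD's threads confined to the first CPU via cpu-mask 1, an init script on a four-CPU machine might start the application on the remaining CPUs like this:

# keep CPU 0 free for DRBD; run the application on CPUs 1-3
taskset -c 1-3 /usr/sbin/application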

Modifying the network MTU

When a block-based (as opposed to extent-based) filesystem is layered above DRBD, it may be beneficial to change the replication network's maximum transmission unit (MTU) size to a value higher than the default of 1500 bytes. Colloquially, this is referred to as enabling Jumbo frames.

Note

Block-based file systems include ext3, ReiserFS (version 3), and GFS. Extent-based file systems, in contrast, include XFS, Lustre, and OCFS2. Extent-based file systems are expected to benefit from enabling Jumbo frames only if they hold a small number of large files.

The MTU may be changed using the following commands:

ifconfig interface mtu size

or

ip link set interface mtu size

interface refers to the network interface used for DRBD replication. A typical value for size would be 9000 (bytes).
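
For example, assuming eth1 is the dedicated replication interface (the interface name is merely an example):

ip link set eth1 mtu 9000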

Enabling the deadline I/O scheduler

When used in conjunction with high-performance, write-back-enabled hardware RAID controllers, DRBD latency may benefit greatly from using the simple deadline I/O scheduler rather than the CFQ scheduler. The latter is typically enabled by default in reasonably recent kernel configurations (post-2.6.18 for most distributions).

Modifications to the I/O scheduler configuration may be performed via the sysfs virtual file system, mounted at /sys. The scheduler configuration is in /sys/block/device, where device is the backing device DRBD uses.
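
To check which scheduler is currently active for the backing device, you can read the queue/scheduler attribute; the active scheduler is shown in square brackets (sda is an example device name):

cat /sys/block/sda/queue/scheduler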

Enabling the deadline scheduler works via the following command:

echo deadline > /sys/block/device/queue/scheduler

You may then also set the following values, which may provide additional latency benefits:

  • Disable front merges:

    echo 0 > /sys/block/device/queue/iosched/front_merges
  • Reduce read I/O deadline to 150 milliseconds (the default is 500ms):

    echo 150 > /sys/block/device/queue/iosched/read_expire
  • Reduce write I/O deadline to 1500 milliseconds (the default is 3000ms):

    echo 1500 > /sys/block/device/queue/iosched/write_expire

If these values effect a significant latency improvement, you may want to make them permanent so they are automatically set at system startup. Debian and Ubuntu systems provide this functionality via the sysfsutils package and the /etc/sysfs.conf configuration file.
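
For example (a sketch only, assuming sda is the backing device; adjust the device name to your system), the corresponding /etc/sysfs.conf entries would be:

block/sda/queue/scheduler = deadline
block/sda/queue/iosched/front_merges = 0
block/sda/queue/iosched/read_expire = 150
block/sda/queue/iosched/write_expire = 1500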

You may also make a global I/O scheduler selection by passing the elevator option via your kernel command line. To do so, edit your boot loader configuration (normally found in /boot/grub/menu.lst if you are using the GRUB bootloader) and add elevator=deadline to your list of kernel boot options.
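
The resulting kernel line might then look similar to this sketch (the kernel image, root device, and other options are placeholders and will differ on your system):

kernel /vmlinuz-2.6.26-2-amd64 root=/dev/sda1 ro elevator=deadline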