During a write operation, DRBD submits the write to the local backing block device and also sends the data block over the network. These two actions occur, for all practical purposes, simultaneously. Random timing behavior may cause a situation where the local write has completed, but the transmission over the network has not yet taken place.
If, at this moment, the active node fails and fail-over is initiated, this data block is out of sync between the nodes — it was written on the failed node shortly before the crash, but replication had not yet completed. When that node eventually recovers, this block must be removed from its data set during subsequent synchronization. Otherwise, the crashed node would be "one write ahead" of the surviving node, which would violate the "all or nothing" principle of replicated storage. This issue is not limited to DRBD; in fact, it exists in practically all replicated storage configurations. Many other storage solutions (just as DRBD itself, prior to version 0.7) therefore require that after a failure of the active node, that node must be fully synchronized anew after its recovery.
DRBD's approach, since version 0.7, is a different one. The activity log (AL), stored in the meta data area, keeps track of those blocks that have “recently” been written to. Colloquially, these areas are referred to as hot extents.
If a temporarily failed node that was in active mode at the time of failure is synchronized, only those hot extents highlighted in the AL need to be synchronized, rather than the full device. This drastically reduces synchronization time after an active node crash.
The activity log has a configurable parameter, the number of active extents. Every active extent adds 4 MiB to the amount of data that must be retransmitted after a Primary crash. This parameter must be understood as a compromise between the following opposites:
Many active extents. Keeping a large activity log improves write throughput. Every time a new extent is activated, an old extent is reset to inactive. This transition requires a write operation to the meta data area. If the number of active extents is high, old active extents are swapped out fairly rarely, reducing meta data write operations and thereby improving performance.
Few active extents. Keeping a small activity log reduces synchronization time after active node failure and subsequent recovery.
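As an illustration of where this parameter is set: in drbd.conf the extent count is controlled by the al-extents option (depending on the DRBD version, it belongs in the syncer or disk section of a resource). The resource name and value below are illustrative, not a recommendation:

```
resource r0 {
  disk {
    al-extents 1801;   # number of activity-log extents (illustrative value)
  }
  # ... other resource settings (devices, networking) omitted
}
```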
The appropriate number of active extents can be derived from the desired synchronization time:

E = R × tsync / 4

where R is the synchronization rate, given in MiB/s; tsync is the target synchronization time, in seconds; and E is the resulting number of active extents, each covering 4 MiB.

To provide an example, suppose our cluster has an I/O subsystem with a throughput rate of 90 MiByte/s that was configured to a synchronization rate of 30 MiByte/s (R=30), and we want to keep our target synchronization time at 4 minutes or 240 seconds (tsync=240):

E = 30 × 240 / 4 = 1800
The exact result is 1800, but since the hash function in DRBD's AL implementation works best when the number of extents is a prime number, we select 1801.
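The arithmetic above, including rounding up to the next prime, can be sketched in Python. The helper names are ours for illustration and are not part of DRBD:

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for small extent counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def al_extents(rate_mib_s: float, t_sync_s: float) -> int:
    """Smallest prime >= R * tsync / 4, per the formula in the text
    (each activity-log extent covers 4 MiB)."""
    e = max(2, round(rate_mib_s * t_sync_s / 4))
    while not is_prime(e):
        e += 1
    return e

# R = 30 MiB/s, tsync = 240 s  ->  30 * 240 / 4 = 1800, next prime is 1801
print(al_extents(30, 240))  # 1801
```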