Part IV. DRBD-enabled applications

Chapter 8. Integrating DRBD with Pacemaker clusters

Using DRBD in conjunction with the Pacemaker cluster stack is arguably DRBD's most common use case. Pacemaker is also one of the applications that make DRBD extremely powerful in a wide variety of usage scenarios.

[Important]Important

This chapter is relevant for Pacemaker versions 1.0.3 and above, and DRBD version 8.3.2 and above. It does not touch upon DRBD configuration in Pacemaker clusters of earlier versions.

Pacemaker is the direct, logical successor to the Heartbeat 2 cluster stack and, as far as the cluster resource manager infrastructure is concerned, a direct continuation of the Heartbeat 2 codebase. Since the initial stable release of Pacemaker, Heartbeat 2 can be considered obsolete and Pacemaker should be used instead.

For legacy configurations where the legacy Heartbeat 2 cluster manager must still be used, see Chapter 9, Integrating DRBD with Heartbeat clusters.

Pacemaker primer

Pacemaker is a sophisticated, feature-rich, and widely deployed cluster resource manager for the Linux platform. It comes with a rich set of documentation. In order to understand this chapter, reading Pacemaker's own documentation, in particular the Pacemaker Configuration Explained reference, is highly recommended.

Adding a DRBD-backed service to the cluster configuration

This section explains how to enable a DRBD-backed service in a Pacemaker cluster.

[Note]Note

If you are employing the DRBD OCF resource agent, it is recommended that you defer DRBD startup, shutdown, promotion, and demotion exclusively to the OCF resource agent. That means that you should disable the DRBD init script:

chkconfig drbd off

The drbd OCF resource agent provides Master/Slave capability, allowing Pacemaker to start and monitor the DRBD resource on multiple nodes, and to promote and demote it as needed. You must, however, understand that the drbd RA disconnects and detaches all DRBD resources it manages on Pacemaker shutdown, and also upon enabling standby mode for a node.
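For example, placing a node in standby mode using the crm shell will cause the drbd RA to demote, disconnect, and detach any DRBD resources on that node; bringing the node back online reverses this (alice serves as an example node name here):

crm node standby alice
crm node online alice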

[Important]Important

The OCF resource agent which ships with DRBD belongs to the linbit provider, and hence installs as /usr/lib/ocf/resource.d/linbit/drbd. This resource agent was bundled with DRBD in version 8.3.2 as a beta feature, and became fully supported in 8.3.4.

There is a legacy resource agent that shipped with Heartbeat 2, which uses the heartbeat provider and installs into /usr/lib/ocf/resource.d/heartbeat/drbd. Using the legacy OCF RA is not recommended.

In order to enable a DRBD-backed configuration for a MySQL database in a Pacemaker CRM cluster with the drbd OCF resource agent, you must create both the necessary resources and the Pacemaker constraints to ensure your service only starts on a previously promoted DRBD resource. You may do so using the crm shell, as outlined in the following example:

crm configure
crm(live)configure# primitive drbd_mysql ocf:linbit:drbd \
                    params drbd_resource="mysql" \
                    op monitor interval="15s"
crm(live)configure# ms ms_drbd_mysql drbd_mysql \
                    meta master-max="1" master-node-max="1" \
                         clone-max="2" clone-node-max="1" \
                         notify="true"
crm(live)configure# primitive fs_mysql ocf:heartbeat:Filesystem \
                    params device="/dev/drbd/by-res/mysql" directory="/var/lib/mysql" fstype="ext3"
crm(live)configure# primitive ip_mysql ocf:heartbeat:IPaddr2 \
                    params ip="10.9.42.1" nic="eth0"
crm(live)configure# primitive mysqld lsb:mysqld
crm(live)configure# group mysql fs_mysql ip_mysql mysqld
crm(live)configure# colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
crm(live)configure# order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
crm(live)configure# commit
crm(live)configure# exit
bye

After this, your configuration should be enabled. Pacemaker now selects a node on which it promotes the DRBD resource, and then starts the DRBD-backed resource group on that same node.

Using resource-level fencing in Pacemaker clusters

This section outlines the steps necessary to prevent Pacemaker from promoting a drbd Master/Slave resource when its DRBD replication link has been interrupted. This keeps Pacemaker from starting a service with outdated data and causing an unwanted time warp in the process.

[Important]Important

It is absolutely vital to configure at least two independent OpenAIS communication channels for this functionality to work correctly.

Furthermore, as mentioned in the section called “Adding a DRBD-backed service to the cluster configuration”, you should make sure the DRBD init script is disabled.
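Two independent communication channels typically translate to two totem rings in the OpenAIS configuration. The following excerpt sketches the general shape of such a configuration; the rrp_mode setting and both network addresses are examples that must be adapted to your environment:

totem {
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.9.9.0
    mcastaddr: 239.0.0.42
    mcastport: 5405
  }
  interface {
    ringnumber: 1
    bindnetaddr: 192.168.42.0
    mcastaddr: 239.0.0.43
    mcastport: 5405
  }
}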

In order to enable resource-level fencing for Pacemaker, you will have to set two options in drbd.conf:

resource resource {
  disk {
    fencing resource-only;
    ...
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    ...
  }
  ...
}

Thus, if the DRBD replication link becomes disconnected, the crm-fence-peer.sh script contacts the cluster manager, determines the Pacemaker Master/Slave resource associated with this DRBD resource, and ensures that the Master/Slave resource no longer gets promoted on any node other than the currently active one. Conversely, when the connection is re-established and DRBD completes its synchronization process, then that constraint is removed and the cluster manager is free to promote the resource on any node again.
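The constraint placed by crm-fence-peer.sh is a location rule that bans the Master role from every node except the current Primary. For a Master/Slave resource named ms_drbd_mysql, with alice currently in the Primary role, it would look roughly like this (the exact constraint name is generated by the script; this is an illustration only):

location drbd-fence-by-handler-ms_drbd_mysql ms_drbd_mysql \
        rule $role="Master" -inf: #uname ne alice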

Using stacked DRBD resources in Pacemaker clusters

Stacked resources allow DRBD to be used for multi-level redundancy in multiple-node clusters, or to establish off-site disaster recovery capability. This section describes how to configure DRBD and Pacemaker in such configurations.

Adding off-site disaster recovery to Pacemaker clusters

This configuration scenario involves a two-node high availability cluster at one site, plus a separate node, presumably housed off-site, which acts as a standalone disaster recovery server. Consider the following illustration to describe the concept.

Figure 8.1. DRBD resource stacking in Pacemaker clusters

3-node resource stacking in a Pacemaker cluster. Storage in light blue, connections to storage in red. Primary role in orange, Secondary in gray. Direction of replication indicated by arrows.

In this example, alice and bob form a two-node Pacemaker cluster, whereas charlie is an off-site node not managed by Pacemaker.
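For reference, the stacked resource r0-U in drbd.conf would use a stacked-on-top-of section in place of one of the usual on host sections, roughly as follows (a sketch; device and disk paths, as well as charlie's address, are assumptions):

resource r0-U {
  protocol A;
  stacked-on-top-of r0 {
    device     /dev/drbd10;
    address    192.168.42.1:7788;
  }
  on charlie {
    device     /dev/drbd10;
    disk       /dev/sdb1;
    address    192.168.42.2:7788;
    meta-disk  internal;
  }
}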

To create such a configuration, you would first configure and initialize DRBD resources as described in the section called “Creating a three-node setup”. Then, configure Pacemaker with the following CRM configuration:

primitive p_drbd_r0 ocf:linbit:drbd \
	params drbd_resource="r0"

primitive p_drbd_r0-U ocf:linbit:drbd \
	params drbd_resource="r0-U"

primitive p_ip_stacked ocf:heartbeat:IPaddr2 \
	params ip="192.168.42.1" nic="eth0"

ms ms_drbd_r0 p_drbd_r0 \
	meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" globally-unique="false"

ms ms_drbd_r0-U p_drbd_r0-U \
	meta master-max="1" clone-max="1" \
        clone-node-max="1" master-node-max="1" \
        notify="true" globally-unique="false"

colocation c_drbd_r0-U_on_drbd_r0 \
        inf: ms_drbd_r0-U ms_drbd_r0:Master

colocation c_drbd_r0-U_on_ip \
        inf: ms_drbd_r0-U p_ip_stacked

colocation c_ip_on_r0_master \
        inf: p_ip_stacked ms_drbd_r0:Master

order o_ip_before_r0-U \
        inf: p_ip_stacked ms_drbd_r0-U:start

order o_drbd_r0_before_r0-U \
        inf: ms_drbd_r0:promote ms_drbd_r0-U:start

Assuming you created this configuration in a temporary file named /tmp/crm.txt, you may import it into the live cluster configuration with the following command:

crm configure < /tmp/crm.txt

This configuration will ensure that the following actions occur in the correct order on the alice/bob cluster:

  1. Pacemaker starts the DRBD resource r0 on both cluster nodes, and promotes one node to the Master (DRBD Primary) role.

  2. Pacemaker then starts the IP address 192.168.42.1, which the stacked resource is to use for replication to the third node. It does so on the node it has previously promoted to the Master role for the r0 DRBD resource.

  3. On the node which now has the Primary role for r0 and also the replication IP address for r0-U, Pacemaker now starts the r0-U DRBD resource, which connects and replicates to the off-site node.

  4. Pacemaker then promotes the r0-U resource to the Primary role too, so it can be used by an application.

Thus, this Pacemaker configuration ensures full data redundancy not only between the cluster nodes, but also to the third, off-site node.

[Note]Note

This type of setup is usually deployed together with DRBD Proxy.

Using stacked resources to achieve 4-way redundancy in Pacemaker clusters

In this configuration, a total of three DRBD resources (two unstacked, one stacked) are used to achieve 4-way storage redundancy. This means that in a four-node cluster, up to three nodes can fail while service availability is maintained.

Consider the following illustration to explain the concept.

Figure 8.2. DRBD resource stacking in Pacemaker clusters

4-node resource stacking in a Pacemaker cluster. Storage in light blue, connections to storage in red. Primary role in orange, Secondary in gray. Direction of replication indicated by arrows.

In this example, alice, bob, charlie, and daisy form two two-node Pacemaker clusters. alice and bob form the cluster named left and replicate data using a DRBD resource between them, while charlie and daisy do the same with a separate DRBD resource, in a cluster named right. A third, stacked DRBD resource connects the two clusters.
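In drbd.conf terms, this stacked resource is defined with two stacked-on-top-of sections, one per cluster, connecting via the cluster IP addresses that Pacemaker manages (a sketch; protocol, device, and port are assumptions):

resource stacked {
  protocol A;
  stacked-on-top-of left {
    device     /dev/drbd10;
    address    10.9.9.100:7788;
  }
  stacked-on-top-of right {
    device     /dev/drbd10;
    address    10.9.10.101:7788;
  }
}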

[Note]Note

Due to limitations in the Pacemaker cluster manager as of Pacemaker version 1.0.5, it is not possible to create this setup in a single four-node cluster without disabling CIB validation, which is an advanced process not recommended for general-purpose use. This limitation is expected to be addressed in future Pacemaker releases.

To create such a configuration, you would first configure and initialize DRBD resources as described in the section called “Creating a three-node setup” (except that the remote half of the DRBD configuration is also stacked, not just the local cluster). Then, configure Pacemaker with the following CRM configuration, starting with the cluster left:

primitive p_drbd_left ocf:linbit:drbd \
	params drbd_resource="left"

primitive p_drbd_stacked ocf:linbit:drbd \
	params drbd_resource="stacked"

primitive p_ip_stacked_left ocf:heartbeat:IPaddr2 \
	params ip="10.9.9.100" nic="eth0"

ms ms_drbd_left p_drbd_left \
	meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true"

ms ms_drbd_stacked p_drbd_stacked \
	meta master-max="1" clone-max="1" \
        clone-node-max="1" master-node-max="1" \
        notify="true" target-role="Master"

colocation c_ip_on_left_master \
        inf: p_ip_stacked_left ms_drbd_left:Master

colocation c_drbd_stacked_on_ip_left \
        inf: ms_drbd_stacked p_ip_stacked_left

order o_ip_before_stacked_left \
        inf: p_ip_stacked_left ms_drbd_stacked:start

order o_drbd_left_before_stacked_left \
        inf: ms_drbd_left:promote ms_drbd_stacked:start

Assuming you created this configuration in a temporary file named /tmp/crm.txt, you may import it into the live cluster configuration with the following command:

crm configure < /tmp/crm.txt

After adding this configuration to the CIB, Pacemaker will execute the following actions:

  1. Bring up the DRBD resource left, replicating between alice and bob, promoting the resource to the Master role on one of these nodes.

  2. Bring up the IP address 10.9.9.100 (on either alice or bob, depending on which of these holds the Master role for the resource left).

  3. Bring up the DRBD resource stacked on the same node that holds the just-configured IP address.

  4. Promote the stacked DRBD resource to the Primary role.

Now, proceed on the cluster right by creating the following configuration:

primitive p_drbd_right ocf:linbit:drbd \
	params drbd_resource="right"

primitive p_drbd_stacked ocf:linbit:drbd \
	params drbd_resource="stacked"

primitive p_ip_stacked_right ocf:heartbeat:IPaddr2 \
	params ip="10.9.10.101" nic="eth0"

ms ms_drbd_right p_drbd_right \
	meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true"

ms ms_drbd_stacked p_drbd_stacked \
	meta master-max="1" clone-max="1" \
        clone-node-max="1" master-node-max="1" \
        notify="true" target-role="Slave"

colocation c_drbd_stacked_on_ip_right \
        inf: ms_drbd_stacked p_ip_stacked_right

colocation c_ip_on_right_master \
        inf: p_ip_stacked_right ms_drbd_right:Master

order o_ip_before_stacked_right \
        inf: p_ip_stacked_right ms_drbd_stacked:start

order o_drbd_right_before_stacked_right \
        inf: ms_drbd_right:promote ms_drbd_stacked:start

After adding this configuration to the CIB, Pacemaker will execute the following actions:

  1. Bring up the DRBD resource right replicating between charlie and daisy, promoting the resource to the Master role on one of these nodes.

  2. Bring up the IP address 10.9.10.101 (on either charlie or daisy, depending on which of these holds the Master role for the resource right).

  3. Bring up the DRBD resource stacked on the same node that holds the just-configured IP address.

  4. Leave the stacked DRBD resource in the Secondary role (due to target-role="Slave").

Configuring DRBD to replicate between two SAN-backed Pacemaker clusters

This is a somewhat advanced setup usually employed in split-site configurations. It involves two separate Pacemaker clusters, where each cluster has access to a separate Storage Area Network (SAN). DRBD is then used to replicate data stored on that SAN, across an IP link between sites.

Consider the following illustration to describe the concept.

Figure 8.3. Using DRBD to replicate between SAN-based clusters

DRBD floating peer configuration in a Pacemaker cluster. Storage in light blue, active connections to storage in red, inactive connections to storage in red dashed. Primary role in orange, Secondary in gray. Direction of replication indicated by arrows.


Which of the individual nodes in each site currently acts as the DRBD peer is not explicitly defined — the DRBD peers are said to float; that is, DRBD binds to virtual IP addresses not tied to a specific physical machine.

[Note]Note

This type of setup is usually deployed together with DRBD Proxy and/or truck-based replication.

Since this type of setup deals with shared storage, configuring and testing STONITH is absolutely vital for it to work properly.

DRBD resource configuration

To enable your DRBD resource to float, configure it in drbd.conf in the following fashion:

resource resource {
  ...
  device /dev/drbd0;
  disk /dev/sda1;
  meta-disk internal;
  floating 10.9.9.100:7788;
  floating 10.9.10.101:7788;
}

The floating keyword replaces the on host sections normally found in the resource configuration. In this mode, DRBD identifies peers by IP address and TCP port, rather than by host name. It is important to note that the addresses specified must be virtual cluster IP addresses, rather than physical node IP addresses, for floating to function properly. As shown in the example, in split-site configurations the two floating addresses can be expected to belong to two separate IP networks — it is thus vital for routers and firewalls to properly allow DRBD replication traffic between the nodes.

Pacemaker resource configuration

A DRBD floating peers setup, in terms of Pacemaker configuration, involves the following items (in each of the two Pacemaker clusters involved):

  • A virtual cluster IP address.

  • A master/slave DRBD resource (using the DRBD OCF resource agent).

  • Pacemaker constraints ensuring that resources are started on the correct nodes, and in the correct order.

To configure a resource named mysql in a floating peers configuration in a 2-node cluster, using the replication address 10.9.9.100, configure Pacemaker with the following crm commands:

crm configure
crm(live)configure# primitive p_ip_float_left ocf:heartbeat:IPaddr2 \
                    params ip=10.9.9.100
crm(live)configure# primitive p_drbd_mysql ocf:linbit:drbd \
                    params drbd_resource=mysql
crm(live)configure# ms ms_drbd_mysql p_drbd_mysql \
                    meta master-max="1" master-node-max="1" \
                         clone-max="1" clone-node-max="1" \
                         notify="true" target-role="Master"
crm(live)configure# order drbd_after_left inf: p_ip_float_left ms_drbd_mysql
crm(live)configure# colocation drbd_on_left inf: ms_drbd_mysql p_ip_float_left
crm(live)configure# commit
crm(live)configure# exit
bye

After adding this configuration to the CIB, Pacemaker will execute the following actions:

  1. Bring up the IP address 10.9.9.100 (on either alice or bob).

  2. Bring up the DRBD resource according to the IP address configured.

  3. Promote the DRBD resource to the Primary role.

Then, in order to create the matching configuration in the other cluster, configure that Pacemaker instance with the following commands:

crm configure
crm(live)configure# primitive p_ip_float_right ocf:heartbeat:IPaddr2 \
                    params ip=10.9.10.101
crm(live)configure# primitive p_drbd_mysql ocf:linbit:drbd \
                    params drbd_resource=mysql
crm(live)configure# ms ms_drbd_mysql p_drbd_mysql \
                    meta master-max="1" master-node-max="1" \
                         clone-max="1" clone-node-max="1" \
                         notify="true" target-role="Slave"
crm(live)configure# order drbd_after_right inf: p_ip_float_right ms_drbd_mysql
crm(live)configure# colocation drbd_on_right inf: ms_drbd_mysql p_ip_float_right
crm(live)configure# commit
crm(live)configure# exit
bye

After adding this configuration to the CIB, Pacemaker will execute the following actions:

  1. Bring up the IP address 10.9.10.101 (on either charlie or daisy).

  2. Bring up the DRBD resource according to the IP address configured.

  3. Leave the DRBD resource in the Secondary role (due to target-role="Slave").

Site fail-over

In split-site configurations, it may be necessary to transfer services from one site to another. This may be a consequence of a scheduled transition, or of a disastrous event. In case the transition is a normal, anticipated event, the recommended course of action is this:

  • Connect to the cluster on the site about to relinquish resources, and change the affected DRBD resource's target-role attribute from Master to Slave. This will shut down any resources depending on the Primary role of the DRBD resource, and then demote it; the DRBD resource continues to run in the Secondary role, ready to receive updates from a new Primary.

  • Connect to the cluster on the site about to take over resources, and change the affected DRBD resource's target-role attribute from Slave to Master. This will promote the DRBD resources, start any other Pacemaker resources depending on the Primary role of the DRBD resource, and replicate updates to the remote site.

  • To fail back, simply reverse the procedure.
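Using the crm shell, and assuming the ms_drbd_mysql resource from the earlier examples, such a planned switch-over might be performed as follows. On the site relinquishing resources:

crm resource meta ms_drbd_mysql set target-role Slave

Then, on the site taking over:

crm resource meta ms_drbd_mysql set target-role Master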

In the event of a catastrophic outage on the active site, it can be expected that the site is offline and no longer replicating to the backup site. In such an event:

  • Connect to the cluster on the still-functioning site, and change the affected DRBD resource's target-role attribute from Slave to Master. This will promote the DRBD resources, and start any other Pacemaker resources depending on the Primary role of the DRBD resource.

  • When the original site is restored or rebuilt, you may connect the DRBD resources again, and subsequently fail back using the reverse procedure.

Chapter 9. Integrating DRBD with Heartbeat clusters

[Important]Important

This chapter talks about DRBD in combination with the legacy Linux-HA cluster manager found in Heartbeat 2.0 and 2.1. That cluster manager has been superseded by Pacemaker and the latter should be used whenever possible — please see Chapter 8, Integrating DRBD with Pacemaker clusters for more information. This chapter outlines legacy Heartbeat configurations and is intended for users who must maintain existing legacy Heartbeat systems for policy reasons.

The Heartbeat cluster messaging layer, a distinct part of the Linux-HA project that continues to be supported as of Heartbeat version 3, is fine to use in conjunction with the Pacemaker cluster manager. More information about configuring Heartbeat can be found as part of the Linux-HA User's Guide at http://www.linux-ha.org/doc/.

Heartbeat primer

The Heartbeat cluster manager

Heartbeat's purpose as a cluster manager is to ensure that the cluster maintains its services to clients, even if individual machines in the cluster fail. Applications that may be managed by Heartbeat as cluster services include, for example,

  • a web server such as Apache,

  • a database server such as MySQL, Oracle, or PostgreSQL,

  • a file server such as NFS or Samba, and many others.

In essence, any server application may be managed by Heartbeat as a cluster service.

Services managed by Heartbeat are typically removed from the system startup configuration; rather than being started at boot time, the cluster manager starts and stops them as required by the cluster configuration and status. If a machine (a physical cluster node) fails while running a particular set of services, Heartbeat will start the failed services on another machine in the cluster. These operations performed by Heartbeat are commonly referred to as (automatic) fail-over.

A migration of cluster services from one cluster node to another, by manual intervention, is commonly termed "manual fail-over". This being a slightly self-contradictory term, we use the alternative term switch-over for the purposes of this guide.

Heartbeat is also capable of automatically migrating resources back to a previously failed node, as soon as the latter recovers. This process is called fail-back.

Heartbeat resources

Usually, there will be certain requirements in order to be able to start a cluster service managed by Heartbeat on a node. Consider the example of a typical database-driven web application:

  • Both the web server and the database server assume that their designated IP addresses are available (i.e. configured) on the node.

  • The database will require a file system to retrieve data files from.

  • That file system will require its underlying block device to read from and write to (this is where DRBD comes in, as we will see later).

  • The web server will also depend on the database being started, assuming it cannot serve dynamic content without an available database.

The services Heartbeat controls, and any additional requirements those services depend on, are referred to as resources in Heartbeat terminology. Where resources form a co-dependent collection, that collection is called a resource group.

Heartbeat resource agents

Heartbeat manages resources by way of invoking standardized shell scripts known as resource agents (RAs). In Heartbeat clusters, the following resource agent types are available:

  • Heartbeat resource agents. These agents are found in the /etc/ha.d/resource.d directory. They may take zero or more positional, unnamed parameters, and one operation argument (start, stop, or status). Heartbeat translates resource parameters it finds for a matching resource in /etc/ha.d/haresources into positional parameters for the RA, which then uses these to configure the resource.

  • LSB resource agents. These are conventional, Linux Standard Base-compliant init scripts found in /etc/init.d, which Heartbeat simply invokes with the start, stop, or status argument. They take no positional parameters. Thus, the corresponding resources' configuration cannot be managed by Heartbeat; these services are expected to be configured by conventional configuration files.

  • OCF resource agents. These are resource agents that conform to the guidelines of the Open Cluster Framework, and they only work with clusters in CRM mode. They are usually found in either /usr/lib/ocf/resource.d/heartbeat or /usr/lib64/ocf/resource.d/heartbeat, depending on system architecture and distribution. They take no positional parameters, but may be extensively configured via environment variables that the cluster management process derives from the cluster configuration, and passes in to the resource agent upon invocation.

Heartbeat communication channels

Heartbeat uses a UDP-based communication protocol to periodically check for node availability (the "heartbeat" proper). For this purpose, Heartbeat can use several communication methods, including:

  • IP multicast,

  • IP broadcast,

  • IP unicast,

  • serial line.

Of these, IP multicast and IP broadcast are the most relevant in practice. The absolute minimum requirement for stable cluster operation is two independent communication channels.

[Important]Important

A bonded network interface (a virtual aggregation of physical interfaces using the bonding driver) constitutes one Heartbeat communication channel.

Bonded links are not protected against bugs, known or as-yet-unknown, in the bonding driver. Also, bonded links are typically formed using identical network interface models, thus they are vulnerable to bugs in the NIC driver as well. Any such issue could lead to a cluster partition if no independent second Heartbeat communication channel were available.

It is thus not acceptable to omit the inclusion of a second Heartbeat link in the cluster configuration just because the first uses a bonded interface.

Heartbeat configuration

For any Heartbeat cluster, the following configuration files must be available:

  • /etc/ha.d/ha.cf — global cluster configuration.

  • /etc/ha.d/authkeys — keys for mutual node authentication.

Depending on whether Heartbeat is running in R1-compatible or in CRM mode, additional configuration files are required. These are covered in the section called “Using DRBD in Heartbeat R1-style clusters” and the section called “Using DRBD in Heartbeat CRM-enabled clusters”.

The ha.cf file

The following example is a small and simple ha.cf file:

autojoin none
mcast bond0 239.0.0.43 694 1 0
bcast eth2
warntime 5
deadtime 15
initdead 60
keepalive 2
node alice
node bob

Setting autojoin to none disables cluster node auto-discovery and requires that cluster nodes be listed explicitly, using the node options. This speeds up cluster start-up in clusters with a fixed number of nodes (which is always the case in R1-style Heartbeat clusters).

This example assumes that bond0 is the cluster's interface to the shared network, and that eth2 is the interface dedicated to DRBD replication between both nodes. Thus, bond0 can be used for multicast heartbeat, whereas broadcast is acceptable on eth2, since it is not connected to a shared network.

The next options configure node failure detection. They set the time after which Heartbeat issues a warning that a no longer available peer node may be dead (warntime), the time after which Heartbeat considers a node confirmed dead (deadtime), and the maximum time it waits for other nodes to check in at cluster startup (initdead). keepalive sets the interval at which Heartbeat keep-alive packets are sent. All these options are given in seconds.

The node option identifies cluster members. The option values listed here must match the exact host names of cluster nodes as given by uname -n.

Not adding a crm option implies that the cluster is operating in R1-compatible mode with CRM disabled. If crm yes were included in the configuration, Heartbeat would be running in CRM mode.

The authkeys file

/etc/ha.d/authkeys contains pre-shared secrets used for mutual cluster node authentication. It should only be readable by root and follows this format:

auth num
num algorithm secret

num is a simple key index, starting with 1. Usually, you will only have one key in your authkeys file.

algorithm is the signature algorithm being used. You may use either md5 or sha1; the use of crc (a simple cyclic redundancy check, not secure) is not recommended.

secret is the actual authentication key.
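A complete authkeys file using SHA1 might thus look like this (the secret shown is merely an example and should never be reused):

auth 1
1 sha1 fb2a35e2cd3ee63436b92da25865c6cc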

You may create an authkeys file, using a generated secret, with the following shell hack:

( echo -ne "auth 1\n1 sha1 "; \
  dd if=/dev/urandom bs=512 count=1 | openssl md5 ) \
  > /etc/ha.d/authkeys
chmod 0600 /etc/ha.d/authkeys

Propagating the cluster configuration to cluster nodes

In order to propagate the contents of the ha.cf and authkeys configuration files, you may use the ha_propagate command, which you would invoke using either

/usr/lib/heartbeat/ha_propagate

or

/usr/lib64/heartbeat/ha_propagate

This utility will copy the configuration files over to any node listed in /etc/ha.d/ha.cf using scp. It will afterwards also connect to the nodes using ssh and issue chkconfig heartbeat on in order to enable Heartbeat services on system startup.

Using DRBD in Heartbeat R1-style clusters

Running Heartbeat clusters in release 1 compatible configuration is now considered obsolete by the Linux-HA development team. However, it is still widely used in the field, which is why it is documented here in this section.

Advantages. Configuring Heartbeat in R1 compatible mode has some advantages over using CRM configuration. In particular,

  • Heartbeat R1 compatible clusters are simple and easy to configure;

  • it is fairly straightforward to extend Heartbeat's functionality with custom, R1-style resource agents.

Disadvantages. Disadvantages of R1 compatible configuration, as opposed to CRM configurations, include:

  • Cluster configuration must be kept in sync between cluster nodes manually; it is not propagated automatically.

  • While node monitoring is available, resource-level monitoring is not. Individual resources must be monitored by an external monitoring system.

  • Resource group support is limited to two resource groups. CRM clusters, by contrast, support any number, and also come with a complex resource-level constraint framework.

Another disadvantage, namely that R1-style configuration limits cluster size to two nodes (whereas CRM clusters support up to 255), is largely irrelevant for setups involving DRBD, since DRBD itself is limited to two nodes.

Heartbeat R1-style configuration

In R1-style clusters, Heartbeat keeps its complete configuration in three simple configuration files: ha.cf, authkeys, and haresources. The first two were covered in the section called “Heartbeat configuration”; the haresources file is described below.

The haresources file

The following is an example of a Heartbeat R1-compatible resource configuration involving a MySQL database backed by DRBD:

bob drbddisk::mysql Filesystem::/dev/drbd0::/var/lib/mysql::ext3 \
    10.9.42.1 mysql

This resource configuration contains one resource group whose home node (the node where its resources are expected to run under normal circumstances) is named bob. Consequently, this resource group would be considered the local resource group on host bob, whereas it would be the foreign resource group on its peer host.

The resource group includes a DRBD resource named mysql, which will be promoted to the Primary role by the cluster manager (specifically, the drbddisk resource agent) on whichever node is currently the active node. Of course, a corresponding resource must exist and be configured in /etc/drbd.conf for this to work.

That DRBD resource translates to the block device named /dev/drbd0, which contains an ext3 filesystem that is to be mounted at /var/lib/mysql (the default location for MySQL data files).

The resource group also contains a service IP address, 10.9.42.1. Heartbeat will make sure that this IP address is configured and available on whichever node is currently active.

Finally, Heartbeat will use the LSB resource agent named mysql in order to start the MySQL daemon, which will then find its data files at /var/lib/mysql and be able to listen on the service IP address, 10.9.42.1.

It is important to understand that the resources listed in the haresources file are always evaluated from left to right when resources are being started, and from right to left when they are being stopped.
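This ordering rule can be illustrated with a toy shell sketch (this is purely illustrative, not a Heartbeat command; the resource names are abbreviated from the example above):

```shell
# Heartbeat starts haresources entries left to right; stopping simply
# reverses the list. Abbreviated resource names from the example above:
start_order="drbddisk Filesystem IPaddr mysql"

# Build the stop order by reversing the start order:
stop_order=""
for res in $start_order; do
    stop_order="$res${stop_order:+ }$stop_order"
done

echo "start: $start_order"   # drbddisk Filesystem IPaddr mysql
echo "stop:  $stop_order"    # mysql IPaddr Filesystem drbddisk
```

Thus the DRBD device is promoted before the filesystem is mounted, and conversely, the filesystem is unmounted before the device is demoted.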

Stacked resources in Heartbeat R1-style configurations

Available in DRBD version 8.3.0 and above

In three-way replication with stacked resources, it is usually desirable to have the stacked resource managed by Heartbeat just as other cluster resources. Then, your two-node cluster will manage the stacked resource as a floating resource that runs on whichever node is currently the active one in the cluster. The third node, which is set aside from the Heartbeat cluster, will have the other half of the stacked resource available permanently.

[Note]Note

To have a stacked resource managed by Heartbeat, you must first configure it as outlined in the section called “Configuring a stacked resource”.

The stacked resource is managed by Heartbeat by way of the drbdupper resource agent. That resource agent is distributed, like all other Heartbeat R1 resource agents, in /etc/ha.d/resource.d. It is to stacked resources what the drbddisk resource agent is to conventional, unstacked resources.

drbdupper takes care of managing both the lower-level resource and the stacked resource. Consider the following haresources example, which would replace the one given in the previous section:

bob 192.168.42.1 \
  drbdupper::mysql-U Filesystem::/dev/drbd1::/var/lib/mysql::ext3 \
  mysql

Note the following differences to the earlier example:

  • You start the cluster IP address before all other resources. This is necessary because stacked resource replication uses a connection from the cluster IP address to the node IP address of the third node. Lower-level resource replication, by contrast, uses a connection between the physical node IP addresses of the two cluster nodes.

  • You pass the stacked resource name to drbdupper (in this example, mysql-U).

  • You configure the Filesystem resource agent to mount the DRBD device associated with the stacked resource (in this example, /dev/drbd1), not the lower-level one.
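For reference, the stacked resource referred to above might be declared in drbd.conf roughly as follows. This is a hedged sketch only: the third node's host name (charlie), the disk paths, and the port numbers are illustrative assumptions, and the lower-level mysql resource is not shown. See the section called “Configuring a stacked resource” for authoritative details.

```
resource mysql-U {
  stacked-on-top-of mysql {
    device    /dev/drbd1;
    address   192.168.42.1:7789;    # the floating cluster IP address
  }

  on charlie {                      # the standalone third node
    device    /dev/drbd1;
    disk      /dev/sda6;
    address   192.168.42.3:7789;    # the third node's physical IP
    meta-disk internal;
  }
}
```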

Managing Heartbeat R1-style clusters

Assuming control of cluster resources

A Heartbeat R1-style cluster node may assume control of cluster resources in the following way:

Manual resource takeover. This is the approach normally taken if one simply wishes to test resource migration, or assume control of resources for any reason other than the peer having to leave the cluster. This operation is performed using the following command:

/usr/lib/heartbeat/hb_takeover

On some distributions and architectures, you may be required to enter:

/usr/lib64/heartbeat/hb_takeover

Relinquishing cluster resources

A Heartbeat R1-style cluster node may be forced to give up its resources in several ways.

  • Switching a cluster node to standby mode. This is the approach normally taken if one simply wishes to test resource migration, or perform some other activity that does not require the node to leave the cluster. This operation is performed using the following command:

    /usr/lib/heartbeat/hb_standby

    On some distributions and architectures, you may be required to enter:

    /usr/lib64/heartbeat/hb_standby

  • Shutting down the local cluster manager instance. This approach is suited for local maintenance operations such as software updates which require that the node be temporarily removed from the cluster, but which do not necessitate a system reboot. It involves shutting down all processes associated with the local cluster manager instance:

    /etc/init.d/heartbeat stop

    Prior to stopping its services, Heartbeat will gracefully migrate any currently running resources to the peer node. This is the approach to be followed, for example, if you are upgrading DRBD to a new release, without also upgrading your kernel.

  • Shutting down the local node. For hardware maintenance or other interventions that require a system shutdown or reboot, use a simple graceful shutdown command, such as

    reboot

    or

    poweroff

    Since Heartbeat services will be shut down gracefully in the process of a normal system shutdown, the previous paragraph applies to this situation, too. This is also the approach you would use in case of a kernel upgrade (which also requires the installation of a matching DRBD version).

Using DRBD in Heartbeat CRM-enabled clusters

Running Heartbeat clusters in CRM configuration mode is the recommended approach as of Heartbeat release 2 (per the Linux-HA development team).

Advantages. Advantages of using CRM configuration mode, as opposed to R1 compatible configuration, include:

  • Cluster configuration is distributed cluster-wide and automatically, by the Cluster Resource Manager. It need not be propagated manually.

  • CRM mode supports both node-level and resource-level monitoring, and configurable responses to both node and resource failure. It is still advisable to also monitor cluster resources using an external monitoring system.

  • CRM clusters support any number of resource groups, as opposed to Heartbeat R1-style clusters which only support two.

  • CRM clusters support a powerful (if complex) constraints framework. This enables you to ensure correct resource startup and shutdown order, resource co-location (forcing resources to always run on the same physical node), and to set preferred nodes for particular resources.

Another advantage, namely the fact that CRM clusters support up to 255 nodes in a single cluster, is somewhat irrelevant for setups involving DRBD (DRBD itself being limited to two nodes).

Disadvantages. Configuring Heartbeat in CRM mode also has some disadvantages in comparison to using R1-compatible configuration. In particular,

  • Heartbeat CRM clusters are comparatively complex to configure and administer;

  • Extending Heartbeat's functionality with custom OCF resource agents is non-trivial.

    [Note]Note

    This disadvantage is somewhat mitigated by the fact that you do have the option of using custom (or legacy) R1-style resource agents in CRM clusters.

Heartbeat CRM configuration

In CRM clusters, Heartbeat keeps part of its configuration in two familiar configuration files, /etc/ha.d/ha.cf and /etc/ha.d/authkeys.

The remainder of the cluster configuration is maintained in the Cluster Information Base (CIB), covered in detail in the following section. Contrary to the two relevant configuration files, the CIB need not be manually distributed among cluster nodes; the Heartbeat services take care of that automatically.

The Cluster Information Base

The Cluster Information Base (CIB) is kept in one XML file, /var/lib/heartbeat/crm/cib.xml. It is, however, not recommended to edit the contents of this file directly, except in the case of creating a new cluster configuration from scratch. Instead, Heartbeat comes with both command-line applications and a GUI to modify the CIB.

The CIB actually contains both the cluster configuration (which is persistent and is kept in the cib.xml file), and information about the current cluster status (which is volatile). Status information, too, may be queried either using Heartbeat command-line tools or the Heartbeat GUI.

After creating a new Heartbeat CRM cluster — that is, creating the ha.cf and authkeys files, distributing them among cluster nodes, starting Heartbeat services, and waiting for nodes to establish intra-cluster communications — a new, empty CIB is created automatically. Its contents will be similar to this:

<cib>
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes/>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node uname="alice" type="normal"
             id="f11899c3-ed6e-4e63-abae-b9af90c62283"/>
       <node uname="bob" type="normal"
             id="663bae4d-44a0-407f-ac14-389150407159"/>
     </nodes>
     <resources/>
     <constraints/>
   </configuration>
 </cib>

The exact format and contents of this file are documented at length on the Linux-HA web site, but for practical purposes it is important to understand that this cluster has two nodes named alice and bob, and that neither any resources nor any resource constraints have been configured at this point.

Adding a DRBD-backed service to the cluster configuration

This section explains how to enable a DRBD-backed service in a Heartbeat CRM cluster. The examples used in this section mimic, in functionality, those described in the section called “Heartbeat resources”, dealing with R1-style Heartbeat clusters.

The complexity of the configuration steps described in this section may seem overwhelming to some, particularly those having previously dealt only with R1-style Heartbeat configurations. While the configuration of Heartbeat CRM clusters is indeed complex (and sometimes not very user-friendly), the CRM's advantages may outweigh those of R1-style clusters. Which approach to follow is entirely up to the administrator's discretion.

Using the drbddisk resource agent in a Heartbeat CRM configuration

Even though you are using Heartbeat in CRM mode, you may still utilize R1-compatible resource agents such as drbddisk. This resource agent provides no monitoring of the resource on the secondary node; it merely ensures resource promotion and demotion.

In order to enable a DRBD-backed configuration for a MySQL database in a Heartbeat CRM cluster with drbddisk, you would use a configuration like this:

<group ordered="true" collocated="true" id="rg_mysql">
  <primitive class="heartbeat" type="drbddisk"
             provider="heartbeat" id="drbddisk_mysql">
    <meta_attributes>
      <attributes>
        <nvpair name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <instance_attributes>
      <attributes>
        <nvpair name="1" value="mysql"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="ocf" type="Filesystem"
             provider="heartbeat" id="fs_mysql">
    <instance_attributes>
      <attributes>
        <nvpair name="device" value="/dev/drbd0"/>
        <nvpair name="directory" value="/var/lib/mysql"/>
        <nvpair name="type" value="ext3"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="ocf" type="IPaddr2"
             provider="heartbeat" id="ip_mysql">
    <instance_attributes>
      <attributes>
        <nvpair name="ip" value="192.168.42.1"/>
        <nvpair name="cidr_netmask" value="24"/>
        <nvpair name="nic" value="eth0"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="lsb" type="mysqld"
             provider="heartbeat" id="mysqld"/>
</group>

Assuming you created this configuration in a temporary file named /tmp/hb_mysql.xml, you would add this resource group to the cluster configuration using the following command (on any cluster node):

cibadmin -o resources -C -x /tmp/hb_mysql.xml

After this, Heartbeat will automatically propagate the newly-configured resource group to all cluster nodes.

Using the drbd OCF resource agent in a Heartbeat CRM configuration

The drbd resource agent is a pure-bred OCF RA which provides Master/Slave capability, allowing Heartbeat to start and monitor the DRBD resource on multiple nodes and promoting and demoting as needed. You must, however, understand that the drbd RA disconnects and detaches all DRBD resources it manages on Heartbeat shutdown, and also upon enabling standby mode for a node.

In order to enable a DRBD-backed configuration for a MySQL database in a Heartbeat CRM cluster with the drbd OCF resource agent, you must create both the necessary resources, and Heartbeat constraints to ensure your service only starts on a previously promoted DRBD resource. It is recommended that you start with the constraints, such as shown in this example:

<constraints>
  <rsc_order id="mysql_after_drbd" from="rg_mysql" action="start"
             to="ms_drbd_mysql" to_action="promote" type="after"/>
  <rsc_colocation id="mysql_on_drbd" to="ms_drbd_mysql"
                  to_role="master" from="rg_mysql" score="INFINITY"/>
</constraints>

Assuming you put these settings in a file named /tmp/constraints.xml, here is how you would enable them:

cibadmin -U -x /tmp/constraints.xml

Subsequently, you would create your relevant resources:

<resources>
  <master_slave id="ms_drbd_mysql">
    <meta_attributes id="ms_drbd_mysql-meta_attributes">
      <attributes>
        <nvpair name="notify" value="yes"/>
        <nvpair name="globally_unique" value="false"/>
      </attributes>
    </meta_attributes>
    <primitive id="drbd_mysql" class="ocf" provider="heartbeat"
        type="drbd">
      <instance_attributes id="ms_drbd_mysql-instance_attributes">
        <attributes>
          <nvpair name="drbd_resource" value="mysql"/>
        </attributes>
      </instance_attributes>
      <operations id="ms_drbd_mysql-operations">
        <op id="ms_drbd_mysql-monitor-master"
            name="monitor" interval="29s"
            timeout="10s" role="Master"/>
        <op id="ms_drbd_mysql-monitor-slave"
            name="monitor" interval="30s"
            timeout="10s" role="Slave"/>
      </operations>
    </primitive>
  </master_slave>
  <group id="rg_mysql">
    <primitive class="ocf" type="Filesystem"
               provider="heartbeat" id="fs_mysql">
      <instance_attributes id="fs_mysql-instance_attributes">
        <attributes>
          <nvpair name="device" value="/dev/drbd0"/>
          <nvpair name="directory" value="/var/lib/mysql"/>
          <nvpair name="type" value="ext3"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive class="ocf" type="IPaddr2"
               provider="heartbeat" id="ip_mysql">
      <instance_attributes id="ip_mysql-instance_attributes">
        <attributes>
          <nvpair name="ip" value="10.9.42.1"/>
          <nvpair name="nic" value="eth0"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive class="lsb" type="mysqld"
               provider="heartbeat" id="mysqld"/>
  </group>
</resources>

Assuming you put these settings in a file named /tmp/resources.xml, here is how you would enable them:

cibadmin -U -x /tmp/resources.xml

After this, your configuration should be enabled. Heartbeat now selects a node on which it promotes the DRBD resource, and then starts the DRBD-backed resource group on that same node.
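To verify the result, you may, for example, invoke Heartbeat's cluster monitor in one-shot mode and check that the master/slave set and the resource group have started on the same node:

```shell
# Display the cluster status once and exit
# (as opposed to continuously updating the display):
crm_mon -1
```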

Managing Heartbeat CRM clusters

Assuming control of cluster resources

A Heartbeat CRM cluster node may assume control of cluster resources in the following ways:

  • Manual takeover of a single cluster resource. This is the approach normally taken if one simply wishes to test resource migration, or move a resource to the local node as a means of manual load balancing. This operation is performed using the following command:

    crm_resource -r resource -M -H `uname -n`

    [Note]Note

    The -M (or --migrate) option for the crm_resource command, when used without the -H option, implies a resource migration away from the local host. You must initiate a migration to the local host by specifying the -H option, giving the local host name as the option argument.

    It is also important to understand that the migration is permanent, that is, unless told otherwise, Heartbeat will not move the resource back to a node it was previously migrated away from — even if that node happens to be the only surviving node in a near-cluster-wide system failure. This is undesirable under most circumstances. So, it is prudent to immediately un-migrate resources after successful migration, using the following command:

    crm_resource -r resource -U

    Finally, it is important to know that during resource migration, Heartbeat may simultaneously migrate resources other than the one explicitly specified (as required by existing resource groups or colocation and order constraints).

  • Manual takeover of all cluster resources. This procedure involves switching the peer node to standby mode (where hostname is the peer node's host name):

    crm_standby -U hostname -v on

Relinquishing cluster resources

A Heartbeat CRM cluster node may be forced to give up one or all of its resources in several ways.

  • Giving up a single cluster resource. A node gives up control of a single resource when issued the following command (note that the considerations outlined in the previous section apply here, too):

    crm_resource -r resource -M 

    If you want to migrate to a specific host, use this variant:

    crm_resource -r resource -M -H hostname

    However, the latter syntax is usually of little relevance to CRM clusters using DRBD, DRBD being limited to two nodes (so the two variants are, essentially, identical in meaning).

  • Switching a cluster node to standby mode. This is the approach normally taken if one simply wishes to test resource migration, or perform some other activity that does not require the node to leave the cluster. This operation is performed using the following command:

    crm_standby -U `uname -n` -v on

  • Shutting down the local cluster manager instance. This approach is suited for local maintenance operations such as software updates which require that the node be temporarily removed from the cluster, but which do not necessitate a system reboot. The procedure is the same as for Heartbeat R1 style clusters.

  • Shutting down the local node. For hardware maintenance or other interventions that require a system shutdown or reboot, use a simple graceful shutdown command, just as previously outlined for Heartbeat R1 style clusters.
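Conversely, once maintenance is complete, a node that was switched to standby mode can be made eligible to run resources again (run on the node itself):

```shell
# Take the local node out of standby mode:
crm_standby -U `uname -n` -v off
```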

Using Heartbeat with dopd

The steps outlined in this section enable DRBD to deny services access to outdated data. The Heartbeat component that implements this functionality is the DRBD outdate-peer daemon, or dopd for short. It works, and uses identical configuration, on both R1-compatible and CRM clusters.

[Important]Important

It is absolutely vital to configure at least two independent Heartbeat communication channels for dopd to work correctly.

Heartbeat configuration

To enable dopd, you must add these lines to your /etc/ha.d/ha.cf file:

respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

You may have to adjust dopd's path according to your preferred distribution. On some distributions and architectures, the correct path is /usr/lib64/heartbeat/dopd.

After you have made this change and copied ha.cf to the peer node, you must run /etc/init.d/heartbeat reload to have Heartbeat re-read its configuration file. Afterwards, you should be able to verify that you now have a running dopd process.

[Note]Note

You can check for this process either by running ps ax | grep dopd or by issuing killall -0 dopd.

DRBD Configuration

Then, add these items to your DRBD resource configuration:

resource resource {
    handlers {
        fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        ...
    }
    disk {
        fencing resource-only;
        ...
    }
    ...
}

As with dopd, your distribution may place the drbd-peer-outdater binary in /usr/lib64/heartbeat depending on your system architecture.

Finally, copy your drbd.conf to the peer node and issue drbdadm adjust resource to reconfigure your resource and reflect your changes.

Testing dopd functionality

To test whether your dopd setup is working correctly, interrupt the replication link of a configured and connected resource while Heartbeat services are running normally. You may do so simply by physically unplugging the network link, but that is fairly invasive. Instead, you may insert a temporary iptables rule to drop incoming DRBD traffic to the TCP port used by your resource.
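For example, assuming your resource replicates over TCP port 7788 (an assumption for illustration; verify the port against the address lines in your drbd.conf), such a temporary rule might look like this:

```shell
# Requires root. Drop incoming DRBD traffic to simulate a failed
# replication link (7788 is an assumed port, check your drbd.conf):
iptables -I INPUT -p tcp --dport 7788 -j DROP

# Later, remove the rule again to restore connectivity:
iptables -D INPUT -p tcp --dport 7788 -j DROP
```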

After this, you will be able to observe the resource connection state change from Connected to WFConnection. Allow a few seconds to pass, and you should see the disk state become Outdated/DUnknown. That is what dopd is responsible for.

Any attempt to switch the outdated resource to the primary role will fail after this.

When network connectivity is restored (either by plugging the physical link back in, or by removing the temporary iptables rule you inserted previously), the connection state will change to Connected, and then promptly to SyncTarget (assuming changes occurred on the primary node during the network interruption). Then you will be able to observe a brief synchronization period, and finally, the previously outdated resource will be marked as UpToDate again.

Chapter 10. Integrating DRBD with Red Hat Cluster Suite

This chapter describes using DRBD as replicated storage for Red Hat Cluster Suite high availability clusters.

[Note]Note

This guide deals primarily with Red Hat Cluster Suite as found in Red Hat Enterprise Linux (RHEL 5). If you are deploying DRBD on earlier versions such as RHEL 4, configuration details and semantics may vary.

Red Hat Cluster Suite primer

OpenAIS and CMAN

The Service Availability Forum is an industry consortium with the purpose of developing high availability interface definitions and software specifications. The Application Interface Specification (AIS) is one of these specifications, and OpenAIS is an open source AIS implementation maintained by a team staffed (primarily) by Red Hat employees. OpenAIS serves as Red Hat Cluster Suite's principal cluster communications infrastructure.

Specifically, Red Hat Cluster Suite makes use of the Totem group communication algorithm for reliable group messaging among cluster members.

Red Hat Cluster Suite in Red Hat Enterprise Linux (RHEL) version 5 adds an abstraction and convenience interface layer above OpenAIS named cman. cman also serves as a compatibility layer to RHEL 4, in which cman behaved similarly, albeit without utilizing OpenAIS.

CCS

The Cluster Configuration System (CCS) and its associated daemon, ccsd, maintain and update the cluster configuration. Management applications utilize ccsd and the CCS libraries to query and update cluster configuration items.

Fencing

Red Hat Cluster Suite, originally designed primarily for shared storage clusters, relies on node fencing to prevent concurrent, uncoordinated access to shared resources. The Red Hat Cluster Suite fencing infrastructure relies on the fencing daemon fenced, and fencing agents implemented as shell scripts.

Even though DRBD-based clusters utilize no shared storage resources and thus fencing is not strictly required from DRBD's standpoint, Red Hat Cluster Suite still requires fencing even in DRBD-based configurations.

The Resource Group Manager

The resource group manager (rgmanager, alternatively clurgmgr) is akin to the Cluster Resource Manager in Heartbeat. It serves as the cluster management suite's primary interface with the applications it is configured to manage.

Red Hat Cluster Suite resources

A single highly available application, filesystem, IP address and the like is referred to as a resource in Red Hat Cluster Suite terminology.

Where resources depend on each other — such as, for example, an NFS export depending on a filesystem being mounted — they form a resource tree, a form of nesting resources inside one another. Resources in inner levels of nesting may inherit parameters from resources in outer nesting levels. The concept of resource trees is absent in Heartbeat.

Red Hat Cluster Suite services

Where resources form a co-dependent collection, that collection is called a service. This is different from Heartbeat, where such a collection is referred to as a resource group.

rgmanager resource agents

The resource agents invoked by rgmanager are similar to those used by the Heartbeat CRM, in the sense that they utilize the same shell-based API as defined in the Open Cluster Framework (OCF), although Heartbeat utilizes some extensions not defined in the framework. Thus, in theory, the resource agents are largely interchangeable between Red Hat Cluster Suite and Heartbeat — in practice, however, the two cluster management suites use different resource agents even for similar or identical tasks.

Red Hat Cluster Suite resource agents install into the /usr/share/cluster directory. Unlike Heartbeat OCF resource agents which are by convention self-contained, some RHCS resource agents are split into a .sh file containing the actual shell code, and a .metadata file containing XML resource agent metadata.

Starting with version 8.3, DRBD includes a Red Hat Cluster Suite resource agent. It installs into the customary directory as drbd.sh and drbd.metadata.

Red Hat Cluster Suite configuration

This section outlines the configuration steps necessary to get Red Hat Cluster Suite running. Preparing your cluster configuration is fairly straightforward; all a DRBD-based RHCS cluster requires are two participating nodes (referred to as Cluster Members in Red Hat's documentation) and a fencing device.

[Note]Note

For more information about configuring Red Hat clusters, see Red Hat's documentation on the Red Hat Cluster Suite and GFS.

The cluster.conf file

RHEL clusters keep their configuration in a single configuration file, /etc/cluster/cluster.conf. You may manage your cluster configuration in the following ways:

  • Editing the configuration file directly. This is the most straightforward method. It has no prerequisites other than having a text editor available.

  • Using the system-config-cluster GUI. This is a GUI application written in Python using Glade. It requires the availability of an X display (either directly on a server console, or tunneled via SSH).

  • Using the Conga web-based management infrastructure.  The Conga infrastructure consists of a node agent (ricci) communicating with the local cluster manager, cluster resource manager, and cluster LVM daemon, and an administration web application (luci) which may be used to configure the cluster infrastructure using a simple web browser.
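A minimal two-node skeleton for such a configuration might look roughly like the following. This is a hedged sketch: the cluster and node names are placeholders, and fence_manual is used here purely as a stand-in — a production cluster requires a real fencing device and its agent-specific parameters.

```xml
<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <clusternodes>
    <clusternode name="alice" nodeid="1">
      <fence>
        <method name="single">
          <device name="fencedev"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="bob" nodeid="2">
      <fence>
        <method name="single">
          <device name="fencedev"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- placeholder only; substitute a real fencing agent -->
    <fencedevice name="fencedev" agent="fence_manual"/>
  </fencedevices>
  <rm/>
</cluster>
```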

Using DRBD in RHCS fail-over clusters

[Note]Note

This section deals exclusively with setting up DRBD for RHCS fail-over clusters not involving GFS. For GFS (and GFS2) configurations, please see Chapter 12, Using GFS with DRBD.

This section, like the corresponding section in the chapter on Heartbeat clusters, assumes you are about to configure a highly available MySQL database with the following configuration parameters:

  • The DRBD resource to be used as your database storage area is named mysql, and it manages the device /dev/drbd0.

  • The DRBD device holds an ext3 filesystem which is to be mounted to /var/lib/mysql (the default MySQL data directory).

  • The MySQL database is to utilize that filesystem, and listen on a dedicated cluster IP address, 10.9.9.180.

Setting up your cluster configuration

To configure your highly available MySQL database, create or modify your /etc/cluster/cluster.conf file to contain the following configuration items.

To do that, open /etc/cluster/cluster.conf with your preferred text editing application. Then, include the following items in your resource configuration:

<rm>
  <resources />
  <service autostart="1" name="mysql">
    <drbd name="drbd-mysql" resource="mysql">
      <fs device="/dev/drbd/by-res/mysql"
          mountpoint="/var/lib/mysql"
          fstype="ext3"
          name="mysql"
          options="noatime"/>
    </drbd>
    <ip address="10.9.9.180" monitor_link="1"/>
    <mysql config_file="/etc/my.cnf"
           listen_address="10.9.9.180"
           name="mysqld"/>
  </service>
</rm>

Nesting resource references inside one another in <service/> is the Red Hat Cluster way of expressing resource dependencies.

Be sure to increment the config_version attribute, found on the root <cluster> element, after you have completed your configuration. Then, issue the following commands to commit your changes to the running cluster configuration:

ccs_tool update /etc/cluster/cluster.conf
cman_tool version -r version

In the second command, be sure to replace version with the new cluster configuration version number.

[Note]Note

Both the system-config-cluster GUI configuration utility and the Conga web-based cluster management infrastructure will complain about your cluster configuration after you include the drbd resource agent in your cluster.conf file. This is due to the design of the Python cluster management wrappers provided by these two applications, which does not expect third-party extensions to the cluster infrastructure.

Thus, when you utilize the drbd resource agent in cluster configurations, it is not recommended to use either system-config-cluster or Conga for cluster configuration purposes. Using either of these tools only to monitor the cluster's status, however, is expected to work fine.

Chapter 11. Using LVM with DRBD

This chapter deals with managing DRBD in conjunction with LVM2. In particular, it covers

  • using LVM Logical Volumes as backing devices for DRBD;

  • using DRBD devices as Physical Volumes for LVM;

  • combining these two concepts to implement a layered LVM approach using DRBD.

If you happen to be unfamiliar with these terms to begin with, the section called “LVM primer” may serve as your LVM starting point — although you are always encouraged, of course, to familiarize yourself with LVM in some more detail than this section provides.

LVM primer

LVM2 is an implementation of logical volume management in the context of the Linux device mapper framework. It has practically nothing in common, other than the name and acronym, with the original LVM implementation. The old implementation (now retroactively named "LVM1") is considered obsolete; it is not covered in this section.

When working with LVM, it is important to understand its most basic concepts:

  • Physical Volume (PV). A PV is an underlying block device exclusively managed by LVM. PVs can either be entire hard disks or individual partitions. It is common practice to create a partition table on the hard disk where one partition is dedicated to use by Linux LVM.

    [Note]Note

    The partition type "Linux LVM" (signature 0x8E) can be used to identify partitions for exclusive use by LVM. This, however, is not required — LVM recognizes PVs by way of a signature written to the device upon PV initialization.

  • Volume Group (VG). A VG is the basic administrative unit of LVM. A VG may include one or more PVs. Every VG has a unique name. A VG may be extended during runtime by adding additional PVs, or by enlarging an existing PV.

  • Logical Volume (LV). LVs may be created during runtime within VGs and are available to the other parts of the kernel as regular block devices. As such, they may be used to hold a file system, or for any other purpose block devices may be used for. LVs may be resized while they are online, and they may also be moved from one PV to another (as long as the PVs are part of the same VG).

  • Snapshot Logical Volume (SLV). Snapshots are temporary point-in-time copies of LVs. Creating snapshots is an operation that completes almost instantly, even if the original LV (the origin volume) has a size of several hundred GiB. Usually, a snapshot requires significantly less space than the original LV.
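As an illustration, a snapshot of a Logical Volume named bar in a Volume Group named foo might be created and removed as follows (the names and the snapshot size are examples only):

```shell
# Create a snapshot named "bar-snap", reserving 2G for changed blocks:
lvcreate --snapshot --name bar-snap --size 2G /dev/foo/bar

# Discard the snapshot once it is no longer needed:
lvremove -f /dev/foo/bar-snap
```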

Figure 11.1. LVM overview


Using a Logical Volume as a DRBD backing device

Since an existing Logical Volume is simply a block device in Linux terms, you may of course use it as a DRBD backing device. To use LVs in this manner, you simply create them, and then initialize them for DRBD as you normally would.

This example assumes that a Volume Group named foo already exists on both nodes of your LVM-enabled cluster, and that you wish to create a DRBD resource named r0 using a Logical Volume in that Volume Group.

First, you create the Logical Volume:

lvcreate --name bar --size 10G foo
  Logical volume "bar" created

Of course, you must complete this command on both nodes of your DRBD cluster. After this, you should have a block device named /dev/foo/bar on either node.

Then, you can simply enter the newly-created volumes in your resource configuration:

resource r0 {
  ...
  on alice {
    device /dev/drbd0;
    disk   /dev/foo/bar;
    ...
  }
  on bob {
    device /dev/drbd0;
    disk   /dev/foo/bar;
    ...
  }
}

Now you can continue to bring your resource up, just as you would if you were using non-LVM block devices.
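As a reminder, the remaining steps follow the usual pattern for any new resource. A hedged sketch (assuming the resource r0 from the example above; see the section called “Enabling your resource for the first time” for the authoritative procedure):

```shell
drbdadm create-md r0   # on both nodes: write DRBD metadata to /dev/foo/bar
drbdadm up r0          # on both nodes: attach the backing LV and connect
drbdadm -- --overwrite-data-of-peer primary r0   # on one node only: start initial sync
```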

Configuring a DRBD resource as a Physical Volume

In order to prepare a DRBD resource for use as a Physical Volume, it is necessary to create a PV signature on the DRBD device. In order to do so, issue one of the following commands on the node where the resource is currently in the primary role:

pvcreate /dev/drbdnum

or

pvcreate /dev/drbd/by-res/resource

Now, it is necessary to include this device in the list of devices LVM scans for PV signatures. In order to do this, you must edit the LVM configuration file, normally named /etc/lvm/lvm.conf. Find the line in the devices section that contains the filter keyword and edit it accordingly. If all your PVs are to be stored on DRBD devices, the following is an appropriate filter option:

filter = [ "a|drbd.*|", "r|.*|" ]

This filter expression accepts PV signatures found on any DRBD devices, while rejecting (ignoring) all others.

[Note]Note

By default, LVM scans all block devices found in /dev for PV signatures. This is equivalent to filter = [ "a|.*|" ].

If you want to use stacked resources as LVM PVs, then you will need a more explicit filter configuration. You need to make sure that LVM detects PV signatures on stacked resources, while ignoring them on the corresponding lower-level resources and backing devices. This example assumes that your lower-level DRBD resources use device minors 0 through 9, whereas your stacked resources are using device minors from 10 upwards:[1]

filter = [ "a|drbd1[0-9]|", "r|.*|" ]

This filter expression accepts PV signatures found only on the DRBD devices /dev/drbd10 through /dev/drbd19, while rejecting (ignoring) all others.
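To see how this filter behaves, the following sketch emulates the accept/reject decision in plain sh for a few device names (the matches_filter function is purely illustrative; real matching is done by LVM's regular expression engine, and like that regex, the pattern here is unanchored):

```shell
# Illustrative only: emulate filter = [ "a|drbd1[0-9]|", "r|.*|" ]
matches_filter() {
  case "$1" in
    *drbd1[0-9]*) echo accept ;;   # first pattern:   "a|drbd1[0-9]|"
    *)            echo reject ;;   # catch-all rule:  "r|.*|"
  esac
}
matches_filter /dev/drbd10   # stacked resource:     accept
matches_filter /dev/drbd0    # lower-level resource: reject
matches_filter /dev/sda1     # backing device:       reject
```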

After modifying the lvm.conf file, you must run the vgscan command so LVM discards its configuration cache and re-scans devices for PV signatures.

You may of course use a different filter configuration to match your particular system configuration. What is important to remember, however, is that you need to

  • Accept (include) the DRBD devices you wish to use as PVs;

  • Reject (exclude) the corresponding lower-level devices, so as to avoid LVM finding duplicate PV signatures.

In addition, you should disable the LVM cache by setting:

write_cache_state = 0

After disabling the LVM cache, make sure you remove any stale cache entries by deleting /etc/lvm/cache/.cache.

You must repeat the above steps on the peer node.
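Taken together, the relevant lvm.conf fragment might look like this (a sketch assuming all PVs are to live on DRBD devices; adapt the filter to your own configuration):

```
devices {
    # accept PV signatures on DRBD devices only
    filter = [ "a|drbd.*|", "r|.*|" ]
    # disable the persistent LVM cache
    write_cache_state = 0
}
```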

When you have configured your new PV, you may proceed to add it to a Volume Group, or create a new Volume Group from it. The DRBD resource must, of course, be in the primary role while doing so.

vgcreate name /dev/drbdnum
[Note]Note

While it is possible to mix DRBD and non-DRBD Physical Volumes within the same Volume Group, doing so is not recommended and unlikely to be of any practical value.

When you have created your VG, you may start carving Logical Volumes out of it using the lvcreate command, just as you would with a non-DRBD-backed Volume Group.



[1] This is an emerging convention for stacked resources.

Nested LVM configuration with DRBD

It is possible, if slightly advanced, to both use Logical Volumes as backing devices for DRBD and at the same time use a DRBD device itself as a Physical Volume. To provide an example, consider the following configuration:

  • We have two partitions, named /dev/sda1, and /dev/sdb1, which we intend to use as Physical Volumes.

  • Both of these PVs are to become part of a Volume Group named local.

  • We want to create a 10-GiB Logical Volume in this VG, to be named r0.

  • This LV will become the local backing device for our DRBD resource, also named r0, which corresponds to the device /dev/drbd0.

  • This device will be the sole PV for another Volume Group, named replicated.

  • This VG is to contain two more logical volumes named foo (4 GiB) and bar (6 GiB).

In order to enable this configuration, follow these steps:

  1. Set an appropriate filter option in your /etc/lvm/lvm.conf:

    filter = ["a|sd.*|", "a|drbd.*|", "r|.*|"]

    This filter expression accepts PV signatures found on any SCSI and DRBD devices, while rejecting (ignoring) all others.

    After modifying the lvm.conf file, you must run the vgscan command so LVM discards its configuration cache and re-scans devices for PV signatures.

  2. Disable the LVM cache by setting:

    write_cache_state = 0

    After disabling the LVM cache, make sure you remove any stale cache entries by deleting /etc/lvm/cache/.cache.

  3. Now, you may initialize your two SCSI partitions as PVs:

    pvcreate /dev/sda1
      Physical volume "/dev/sda1" successfully created
    pvcreate /dev/sdb1
      Physical volume "/dev/sdb1" successfully created
  4. The next step is creating your low-level VG named local, consisting of the two PVs you just initialized:

    vgcreate local /dev/sda1 /dev/sdb1
      Volume group "local" successfully created
  5. Now you may create your Logical Volume to be used as DRBD's backing device:

    lvcreate --name r0 --size 10G local
      Logical volume "r0" created
  6. Repeat all steps, up to this point, on the peer node.

  7. Then, edit your /etc/drbd.conf to create a new resource named r0:

    resource r0 {
      device /dev/drbd0;
      disk /dev/local/r0;
      meta-disk internal;
      on host {
        address address:port;
      }
      on host {
        address address:port;
      }
    }

    After you have created your new resource configuration, be sure to copy your drbd.conf contents to the peer node.

  8. After this, initialize your resource as described in the section called “Enabling your resource for the first time” (on both nodes).

  9. Then, promote your resource (on one node):

    drbdadm primary r0
  10. Now, on the node where you just promoted your resource, initialize your DRBD device as a new Physical Volume:

    pvcreate /dev/drbd0
      Physical volume "/dev/drbd0" successfully created
  11. Create your VG named replicated, using the PV you just initialized, on the same node:

    vgcreate replicated /dev/drbd0
      Volume group "replicated" successfully created
  12. Finally, create your new Logical Volumes within this newly-created VG:

    lvcreate --name foo --size 4G replicated
      Logical volume "foo" created
    lvcreate --name bar --size 6G replicated
      Logical volume "bar" created

The Logical Volumes foo and bar will now be available as /dev/replicated/foo and /dev/replicated/bar on the local node.

To make them available on the peer node, first issue the following sequence of commands on the local node:

vgchange -a n replicated
  0 logical volume(s) in volume group "replicated" now active
drbdadm secondary r0

Then, issue these commands on the peer node:

drbdadm primary r0
vgchange -a y replicated
  2 logical volume(s) in volume group "replicated" now active

After this, the block devices /dev/replicated/foo and /dev/replicated/bar will be available on the peer node.

Of course, the process of transferring volume groups between peers and making the corresponding logical volumes available can be automated. The Heartbeat LVM resource agent is designed for exactly that purpose.

Chapter 12. Using GFS with DRBD

This chapter outlines the steps necessary to set up a DRBD resource as a block device holding a shared Global File System (GFS). It covers both GFS and GFS2.

In order to use GFS on top of DRBD, you must configure DRBD in dual-primary mode, which is available in DRBD 8.0 and later.

[Important]Important

All cluster file systems require fencing - not only through the DRBD resource, but also via STONITH! A faulty member must be killed.

You’ll want these settings:

disk {
        fencing resource-and-stonith;
}
handlers {
        outdate-peer "/sbin/make-sure-the-other-node-is-confirmed-dead.sh";
}

There must be no volatile caches! Please see https://fedorahosted.org/cluster/wiki/DRBD_Cookbook for some more information.

GFS primer

The Red Hat Global File System (GFS) is Red Hat's implementation of a concurrent-access shared storage file system. As any such filesystem, GFS allows multiple nodes to access the same storage device, in read/write fashion, simultaneously without risking data corruption. It does so by using a Distributed Lock Manager (DLM) which manages concurrent access from cluster members.

GFS was designed, from the outset, for use with conventional shared storage devices. Nevertheless, it is perfectly possible to use DRBD, in dual-primary mode, as a replicated storage device for GFS. Applications may benefit from reduced read/write latency due to the fact that DRBD normally reads from and writes to local storage, as opposed to the SAN devices GFS is normally configured to run from. Also, of course, DRBD adds an additional physical copy of every GFS filesystem, thus adding redundancy.

GFS makes use of a cluster-aware variant of LVM, termed Cluster Logical Volume Manager or CLVM. As such, some parallelism exists between using DRBD as the data storage for GFS, and using DRBD as a Physical Volume for conventional LVM.

GFS file systems are usually tightly integrated with Red Hat's own cluster management framework, the Red Hat Cluster Suite (RHCS). This chapter explains the use of DRBD in conjunction with GFS in the RHCS context.

GFS, CLVM, and the Red Hat Cluster Suite are available in Red Hat Enterprise Linux (RHEL) and distributions derived from it, such as CentOS. Packages built from the same sources are also available in Debian GNU/Linux. This chapter assumes running GFS on a Red Hat Enterprise Linux system.

Creating a DRBD resource suitable for GFS

Since GFS is a shared cluster file system expecting concurrent read/write storage access from all cluster nodes, any DRBD resource to be used for storing a GFS filesystem must be configured in dual-primary mode. It is also recommended to enable some of DRBD's features for automatic recovery from split brain, and the resource must switch to the primary role immediately after startup. To do all this, include the following lines in the resource configuration:

resource resource {
  startup {
    become-primary-on both;
    ...
  }
  net {
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
  ...
}

Once you have added these options to your freshly-configured resource, you may initialize your resource as you normally would. Since the allow-two-primaries option is set for this resource, you will be able to promote the resource to the primary role on both nodes.
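For example, once the resource is initialized and the initial synchronization has completed, you can promote it on both nodes (the resource name resource is a placeholder, as above):

```shell
# run on the first node:
drbdadm primary resource
# then on the second node -- this succeeds only because
# allow-two-primaries is set:
drbdadm primary resource
```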

Configuring LVM to recognize the DRBD resource

GFS uses CLVM, the cluster-aware version of LVM, to manage block devices to be used by GFS. In order to use CLVM with DRBD, ensure that your LVM configuration uses cluster-wide locking, by setting the following option in /etc/lvm/lvm.conf:

locking_type = 3

Also make sure that LVM scans your DRBD devices for PV signatures, as described in the section called “Configuring a DRBD resource as a Physical Volume”.

Configuring your cluster to support GFS

After you have created your new DRBD resource and completed your initial cluster configuration, you must enable and start the following system services on both nodes of your GFS cluster:

  • cman (this also starts ccsd and fenced),

  • clvmd.

Creating a GFS filesystem

In order to create a GFS filesystem on your dual-primary DRBD resource, you must first initialize it as a Logical Volume for LVM.

In contrast to conventional, non-cluster-aware LVM configurations, the following steps must be completed on only one node due to the cluster-aware nature of CLVM:

pvcreate /dev/drbd/by-res/resource
  Physical volume "/dev/drbdnum" successfully created
vgcreate vg-name /dev/drbd/by-res/resource
  Volume group "vg-name" successfully created
lvcreate --size size --name lv-name vg-name
  Logical volume "lv-name" created

CLVM will immediately notify the peer node of these changes; issuing lvs (or lvdisplay) on the peer node will list the newly created logical volume.

Now, you may proceed by creating the actual filesystem:

mkfs -t gfs -p lock_dlm -j 2 /dev/vg-name/lv-name

Or, for a GFS2 filesystem:

mkfs -t gfs2 -p lock_dlm -j 2 -t cluster:name /dev/vg-name/lv-name

The -j option in this command refers to the number of journals to keep for GFS. This must be identical to the number of nodes in the GFS cluster; since DRBD does not support more than two nodes, the value to set here is always 2.

The -t option, applicable only for GFS2 filesystems, defines the lock table name. This follows the format cluster:name, where cluster must match your cluster name as defined in /etc/cluster/cluster.conf. Thus, only members of that cluster will be permitted to use the filesystem. By contrast, name is an arbitrary file system name unique in the cluster.
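For example, with a hypothetical cluster named mycluster (as it would be defined in /etc/cluster/cluster.conf) and a Volume Group vg0 holding a Logical Volume lv0 (all of these names are assumptions), the invocation would be:

```shell
mkfs -t gfs2 -p lock_dlm -j 2 -t mycluster:gfs2_drbd /dev/vg0/lv0
```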

Using your GFS filesystem

After you have created your filesystem, you may add it to /etc/fstab:

/dev/vg-name/lv-name mountpoint gfs defaults 0 0

For a GFS2 filesystem, simply change the defined filesystem type to:

/dev/vg-name/lv-name mountpoint gfs2 defaults 0 0

Do not forget to make this change on both cluster nodes.

After this, you may mount your new filesystem by starting the gfs service (on both nodes):

service gfs start

From then onwards, as long as you have DRBD configured to start automatically on system startup, before the RHCS services and the gfs service, you will be able to use this GFS file system as you would use one that is configured on traditional shared storage.

Chapter 13. Using OCFS2 with DRBD

This chapter outlines the steps necessary to set up a DRBD resource as a block device holding a shared Oracle Cluster File System, version 2 (OCFS2).

[Important]Important

All cluster file systems require fencing - not only through the DRBD resource, but also via STONITH! A faulty member must be killed.

You’ll want these settings:

disk {
        fencing resource-and-stonith;
}
handlers {
        outdate-peer "/sbin/make-sure-the-other-node-is-confirmed-dead.sh";
}

There must be no volatile caches! You may take a few hints from the page at https://fedorahosted.org/cluster/wiki/DRBD_Cookbook, although that page is about GFS2, not OCFS2.

OCFS2 primer

The Oracle Cluster File System, version 2 (OCFS2) is a concurrent access shared storage file system developed by Oracle Corporation. Unlike its predecessor OCFS, which was specifically designed and only suitable for Oracle database payloads, OCFS2 is a general-purpose filesystem that implements most POSIX semantics. The most common use case for OCFS2 is arguably Oracle Real Application Cluster (RAC), but OCFS2 may also be used for load-balanced NFS clusters, for example.

Although originally designed for use with conventional shared storage devices, OCFS2 is equally well suited to be deployed on dual-Primary DRBD. Applications reading from the filesystem may benefit from reduced read latency due to the fact that DRBD reads from and writes to local storage, as opposed to the SAN devices OCFS2 otherwise normally runs on. In addition, DRBD adds redundancy to OCFS2 by adding an additional copy to every filesystem image, as opposed to just a single filesystem image that is merely shared.

Like other shared cluster file systems such as GFS, OCFS2 allows multiple nodes to access the same storage device, in read/write mode, simultaneously without risking data corruption. It does so by using a Distributed Lock Manager (DLM) which manages concurrent access from cluster nodes. The DLM itself uses a virtual file system (ocfs2_dlmfs) which is separate from the actual OCFS2 file systems present on the system.

OCFS2 may either use an intrinsic cluster communication layer to manage cluster membership and filesystem mount and unmount operations, or alternatively defer those tasks to the Pacemaker cluster infrastructure.

OCFS2 is available in SUSE Linux Enterprise Server (where it is the primarily supported shared cluster file system), CentOS, Debian GNU/Linux, and Ubuntu Server Edition. Oracle also provides packages for Red Hat Enterprise Linux (RHEL). This chapter assumes running OCFS2 on a SUSE Linux Enterprise Server system.

Creating a DRBD resource suitable for OCFS2

Since OCFS2 is a shared cluster file system expecting concurrent read/write storage access from all cluster nodes, any DRBD resource to be used for storing an OCFS2 filesystem must be configured in dual-primary mode. It is also recommended to enable some of DRBD's features for automatic recovery from split brain, and the resource must switch to the primary role immediately after startup. To do all this, include the following lines in the resource configuration:

resource resource {
  startup {
    become-primary-on both;
    ...
  }
  net {
    # allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
  ...
}

It is not recommended to enable the allow-two-primaries option upon initial configuration. You should do so after the initial resource synchronization has completed.

Once you have added these options to your freshly-configured resource, you may initialize your resource as you normally would. After you enable the allow-two-primaries option for this resource, you will be able to promote the resource to the primary role on both nodes.
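Enabling the option after the initial synchronization does not require downtime; a hedged sketch (assuming the resource is named ocfs2, as in the Pacemaker examples in this chapter):

```shell
# after removing the comment sign from allow-two-primaries
# in the resource configuration, on both nodes:
drbdadm adjust ocfs2
# then promote the resource on the second node as well:
drbdadm primary ocfs2
```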

Creating an OCFS2 filesystem

Now, use OCFS2's mkfs implementation to create the file system:

mkfs -t ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0
	mkfs.ocfs2 1.4.0
Filesystem label=ocfs2_drbd0
Block size=1024 (bits=10)
Cluster size=4096 (bits=12)
Volume size=205586432 (50192 clusters) (200768 blocks)
7 cluster groups (tail covers 4112 clusters, rest cover 7680 clusters)
Journal size=4194304
Initial number of node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful

This will create an OCFS2 file system with two node slots on /dev/drbd0, and set the filesystem label to ocfs2_drbd0. You may specify other options on mkfs invocation; please see the mkfs.ocfs2 system manual page for details.

Pacemaker OCFS2 management

Adding a Dual-Primary DRBD resource to Pacemaker

An existing Dual-Primary DRBD resource may be added to Pacemaker resource management with the following crm configuration:

primitive p_drbd_ocfs2 ocf:linbit:drbd \
params drbd_resource="ocfs2"
ms ms_drbd_ocfs2 p_drbd_ocfs2 meta master-max=2 clone-max=2
[Important]Important

Note the master-max=2 meta variable; it enables dual-Master mode for a Pacemaker master/slave set. This requires that allow-two-primaries is also set in the DRBD configuration. Otherwise, Pacemaker will flag a configuration error during resource validation.

Adding OCFS2 management capability to Pacemaker

In order to manage OCFS2 and the kernel Distributed Lock Manager (DLM), Pacemaker uses a total of three different resource agents:

  • ocf:pacemaker:controld — Pacemaker's interface to the DLM;

  • ocf:ocfs2:o2cb — Pacemaker's interface to OCFS2 cluster management;

  • ocf:heartbeat:Filesystem — the generic filesystem management resource agent which supports cluster file systems when configured as a Pacemaker clone.

You may enable all nodes in a Pacemaker cluster for OCFS2 management by creating a cloned group of resources, with the following crm configuration:

primitive p_controld ocf:pacemaker:controld
primitive p_o2cb ocf:ocfs2:o2cb
group g_ocfs2mgmt p_controld p_o2cb
clone cl_ocfs2mgmt g_ocfs2mgmt meta interleave=true

Once this configuration is committed, Pacemaker will start instances of the controld and o2cb resource types on all nodes in the cluster.

Adding an OCFS2 filesystem to Pacemaker

Pacemaker manages OCFS2 filesystems using the conventional ocf:heartbeat:Filesystem resource agent, albeit in clone mode. To put an OCFS2 filesystem under Pacemaker management, use the following crm configuration:

primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
  params device="/dev/drbd/by-res/ocfs2" directory="/srv/ocfs2" \
         fstype="ocfs2" options="rw,noatime"
clone cl_fs_ocfs2 p_fs_ocfs2

Adding required Pacemaker constraints to manage OCFS2 filesystems

In order to tie all OCFS2-related resources and clones together, add the following constraints to your Pacemaker configuration:

order o_ocfs2 ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
colocation c_ocfs2 cl_fs_ocfs2 cl_ocfs2mgmt ms_drbd_ocfs2:Master

Legacy OCFS2 management (without Pacemaker)

[Important]Important

The information presented in this section applies to legacy systems where OCFS2 DLM support is not available in Pacemaker. It is preserved here for reference purposes only. New installations should always use the Pacemaker approach.

Configuring your cluster to support OCFS2

Creating the configuration file

OCFS2 uses a central configuration file, /etc/ocfs2/cluster.conf.

When creating your OCFS2 cluster, be sure to add both your hosts to the cluster configuration. The default port (7777) is usually an acceptable choice for cluster interconnect communications. If you choose any other port number, be sure to choose one that does not clash with an existing port used by DRBD (or any other configured TCP/IP service).

If you feel less than comfortable editing the cluster.conf file directly, you may also use the ocfs2console graphical configuration utility, which is usually more convenient. Regardless of the approach you select, your /etc/ocfs2/cluster.conf file contents should look roughly like this:

node:
    ip_port = 7777
    ip_address = 10.1.1.31
    number = 0
    name = alice
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 10.1.1.32
    number = 1
    name = bob
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2

When you have completed your cluster configuration, use scp to distribute it to both nodes in the cluster.
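For example, from the node where you edited the file (the host name bob is taken from the sample configuration above):

```shell
scp /etc/ocfs2/cluster.conf bob:/etc/ocfs2/cluster.conf
```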

Configuring the O2CB driver

  • SUSE Linux Enterprise systems. On SLES, you may utilize the configure option of the o2cb init script:

    /etc/init.d/o2cb configure
    Configuring the O2CB driver.
    
    This will configure the on-boot properties of the O2CB driver.
    The following questions will determine whether the driver is loaded on
    boot.  The current values will be shown in brackets ('[]').  Hitting
    <ENTER> without typing an answer will keep that current value.  Ctrl-C
    will abort.
    
    Load O2CB driver on boot (y/n) [y]:
    Cluster to start on boot (Enter "none" to clear) [ocfs2]:
    Specify heartbeat dead threshold (>=7) [31]:
    Specify network idle timeout in ms (>=5000) [30000]:
    Specify network keepalive delay in ms (>=1000) [2000]:
    Specify network reconnect delay in ms (>=2000) [2000]:
    Use user-space driven heartbeat? (y/n) [n]:
    Writing O2CB configuration: OK
    Loading module "configfs": OK
    Mounting configfs filesystem at /sys/kernel/config: OK
    Loading module "ocfs2_nodemanager": OK
    Loading module "ocfs2_dlm": OK
    Loading module "ocfs2_dlmfs": OK
    Mounting ocfs2_dlmfs filesystem at /dlm: OK
    Starting O2CB cluster ocfs2: OK

  • Debian GNU/Linux systems. On Debian, the configure option to /etc/init.d/o2cb is not available. Instead, reconfigure the ocfs2-tools package to enable the driver:

    dpkg-reconfigure -p medium -f readline ocfs2-tools
    Configuring ocfs2-tools
    -----------------------
    Would you like to start an OCFS2 cluster (O2CB) at boot time? yes
    Name of the cluster to start at boot time: ocfs2
    The O2CB heartbeat threshold sets up the maximum time in seconds that a node
    awaits for an I/O operation. After it, the node "fences" itself, and you will
    probably see a crash.
    
    It is calculated as the result of: (threshold - 1) x 2.
    
    Its default value is 31 (60 seconds).
    
    Raise it if you have slow disks and/or crashes with kernel messages like:
    
    o2hb_write_timeout: 164 ERROR: heartbeat write timeout to device XXXX after NNNN
    milliseconds
    O2CB Heartbeat threshold: 31
    		Loading filesystem "configfs": OK
    Mounting configfs filesystem at /sys/kernel/config: OK
    Loading stack plugin "o2cb": OK
    Loading filesystem "ocfs2_dlmfs": OK
    Mounting ocfs2_dlmfs filesystem at /dlm: OK
    Setting cluster stack "o2cb": OK
    Starting O2CB cluster ocfs2: OK
    

Using your OCFS2 filesystem

When you have completed cluster configuration and created your file system, you may mount it as any other file system:

mount -t ocfs2 /dev/drbd0 /shared

Your kernel log (accessible by issuing the command dmesg) should then contain a line similar to this one:

ocfs2: Mounting device (147,0) on (node 0, slot 0) with ordered data mode.

From that point forward, you should be able to simultaneously mount your OCFS2 filesystem on both your nodes, in read/write mode.

Chapter 14. Using Xen with DRBD

This chapter outlines the use of DRBD as a Virtual Block Device (VBD) for virtualization environments using the Xen hypervisor.

Xen primer

Xen is a virtualization framework originally developed at the University of Cambridge (UK), and later maintained by XenSource, Inc. (now a part of Citrix). It is included in reasonably recent releases of most Linux distributions, such as Debian GNU/Linux (since version 4.0), SUSE Linux Enterprise Server (since release 10), Red Hat Enterprise Linux (since release 5), and many others.

Xen uses paravirtualization — a virtualization method involving a high degree of cooperation between the virtualization host and guest virtual machines — with selected guest operating systems for improved performance in comparison to conventional virtualization solutions (which are typically based on hardware emulation). Xen also supports full hardware emulation on CPUs that support the appropriate virtualization extensions; in Xen parlance, this is known as HVM (hardware-assisted virtual machine).

[Note]Note

At the time of writing, CPU extensions supported by Xen for HVM are Intel's Virtualization Technology (VT, formerly codenamed Vanderpool), and AMD's Secure Virtual Machine (SVM, formerly known as Pacifica).

Xen supports live migration, which refers to the capability of transferring a running guest operating system from one physical host to another, without interruption.

When a DRBD resource is used as a replicated Virtual Block Device (VBD) for Xen, it serves to make the entire contents of a domU's virtual disk available on two servers, which can then be configured for automatic fail-over. That way, DRBD provides redundancy not only for Linux servers (as in non-virtualized DRBD deployment scenarios), but also for any other operating system that can be virtualized under Xen — which, in essence, includes any operating system available on 32- or 64-bit Intel compatible architectures.

Setting DRBD module parameters for use with Xen

For Xen Domain-0 kernels, it is recommended to load the DRBD module with the disable_sendpage parameter set to 1. To do so, create (or open) the file /etc/modprobe.d/drbd.conf and enter the following line:

options drbd disable_sendpage=1
[Note]Note

The disable_sendpage parameter is available in DRBD 8.3.2 and later.

Creating a DRBD resource suitable to act as a Xen VBD

Configuring a DRBD resource that is to be used as a Virtual Block Device for Xen is fairly straightforward — in essence, the typical configuration matches that of a DRBD resource being used for any other purpose. However, if you want to enable live migration for your guest instance, you need to enable dual-primary mode for this resource:

resource resource {
  net {
    allow-two-primaries;
    ...
  }
  ...
}

Enabling dual-primary mode is necessary because Xen, before initiating live migration, checks for write access on all VBDs a resource is configured to use on both the source and the destination host for the migration.

Using DRBD VBDs

In order to use a DRBD resource as the virtual block device, you must add a line like the following to your Xen domU configuration:

disk = [ 'drbd:resource,xvda,w' ]

This example configuration makes the DRBD resource named resource available to the domU as /dev/xvda in read/write mode (w).

Of course, you may use multiple DRBD resources with a single domU. In that case, simply add more entries like the one provided in the example to the disk option, separated by commas.
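For instance, a domU with two DRBD-backed disks might use the following disk option (the resource names r0 and r1 are hypothetical):

```shell
disk = [ 'drbd:r0,xvda,w', 'drbd:r1,xvdb,w' ]
```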

[Note]Note

There are three sets of circumstances under which you cannot use this approach:

  • You are configuring a fully virtualized (HVM) domU.

  • You are installing your domU using a graphical installation utility, and that graphical installer does not support the drbd: syntax.

  • You are configuring a domU without the kernel, initrd, and extra options, relying instead on bootloader and bootloader_args to use a Xen pseudo-bootloader, and that pseudo-bootloader does not support the drbd: syntax.

    pygrub (prior to Xen 3.3) and domUloader.py (shipped with Xen on SUSE Linux Enterprise Server 10) are two examples of pseudo-bootloaders that do not support the drbd: virtual block device configuration syntax.

    pygrub from Xen 3.3 forward, and the domUloader.py version that ships with SLES 11 do support this syntax.

Under these circumstances, you must use the traditional phy: device syntax and the DRBD device name that is associated with your resource, not the resource name. That, however, requires that you manage DRBD state transitions outside Xen, which is a less flexible approach than that provided by the drbd resource type.
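In that case, assuming your resource is associated with device minor 0, the corresponding disk line would reference the DRBD device node directly:

```shell
disk = [ 'phy:/dev/drbd0,xvda,w' ]
```

Remember that with this syntax, you must promote and demote the resource yourself (or through your cluster manager) around domU start, stop, and migration.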

Starting, stopping, and migrating DRBD-backed domU's

Starting the domU. Once you have configured your DRBD-backed domU, you may start it as you would any other domU:

xm create domU
Using config file "/etc/xen/domU".
Started domain domU

In the process, the DRBD resource you configured as the VBD will be promoted to the primary role, and made accessible to Xen as expected.

Stopping the domU. This is equally straightforward:

xm shutdown -w domU
Domain domU terminated.

Again, as you would expect, the DRBD resource is returned to the secondary role after the domU is successfully shut down.

Migrating the domU. This, too, is done using the usual Xen tools:

xm migrate --live domU destination-host

In this case, several administrative steps are automatically taken in rapid succession:

  1. The resource is promoted to the primary role on destination-host.

  2. Live migration of domU is initiated on the local host.

  3. When migration to the destination host has completed, the resource is demoted to the secondary role locally.

The fact that the resource must briefly run in the primary role on both hosts is the reason for having to configure the resource in dual-primary mode in the first place.

Internals of DRBD/Xen integration

Xen supports two Virtual Block Device types natively:

  • phy: This device type is used to hand "physical" block devices, available in the host environment, off to a guest domU in an essentially transparent fashion.

  • file: This device type is used to make file-based block device images available to the guest domU. It works by creating a loop block device from the original image file, and then handing that block device off to the domU in much the same fashion as the phy device type does.

If a Virtual Block Device configured in the disk option of a domU configuration uses any prefix other than phy:, file:, or no prefix at all (in which case Xen defaults to using the phy device type), Xen expects to find a helper script named block-prefix in the Xen scripts directory, commonly /etc/xen/scripts.

The DRBD distribution provides such a script for the drbd device type, named /etc/xen/scripts/block-drbd. This script handles the necessary DRBD resource state transitions as described earlier in this chapter.

Integrating Xen with Heartbeat

In order to fully capitalize on the benefits of having DRBD-backed Xen VBDs, it is recommended to have Heartbeat manage the associated domUs as Heartbeat resources.

CRM configuration mode. If you are using the Heartbeat cluster manager (in CRM configuration mode), you may configure a Xen domU as a Heartbeat resource, and automate resource failover. To do so, use the Xen OCF resource agent. If you are using the drbd Xen device type described in this chapter, you will not need to configure any separate drbd resource for use by the Xen cluster resource. Instead, the block-drbd helper script will do all the necessary resource transitions for you.

R1-compatible configuration mode. If you are using Heartbeat in R1-compatible mode, you cannot use OCF resource agents. You may, however, configure Heartbeat to use the xendomains LSB service as a cluster resource. Again, if you are using the drbd Xen VBD type, you will not need to create separate drbddisk resources for your Xen domains.