
DRBD leads to highly available data using affordable commodity hardware
Every service depends on some data. DRBD makes your data highly available using commodity hardware components.
by Lars Ellenberg
Note: The original article dealt with drbd-0.6 and was published in Linux-Magazine 2003/11; see also http://www.linux-mag.com/2003-11/toc.html[1]. For any feedback, please contact the author or the mailing list[2].
The Distributed Replicated Block Device (DRBD[3]) system can save the day, your data, and your job. DRBD provides data redundancy at a fraction of the cost of other solutions.
Typical resynchronization time after connection loss or crash is independent of total storage size; it is a function of the (configurable!) active set size. DRBD resynchronizes only those regions that have actually been modified.
of it for at least one year
Though you can use raw devices for special purposes, the typical direct client to a block device is a filesystem. It is recommended to use one of the journaling filesystems, e.g. ext3, reiserfs, or xfs if you like.
LinBit[4] provides binary packages for support customers for most "Enterprise" distributions, or for any distribution upon request.
See also: http://www.drbd.org/download.html[5]. SuSE officially includes drbd and heartbeat in its standard distributions, as well as in its fully supported SuSE Linux Enterprise Server (SLES) 9. The most recent "unofficial" SuSE packages can be found in Lars Marowsky-Brée's subtree, ftp.suse.com/pub/people/lmb/drbd, and its mirrors. For Debian users, thanks to David Krovich, the currently best resource is probably:
deb http://fsrc.csee.wvu.edu/debian/apt-repository binary/
deb-src http://fsrc.csee.wvu.edu/debian/apt-repository source/
Please refer to DRBD/HowTo/Install[6].
if not, there is a well commented one in the drbd/scripts subdirectory[7].
This configuration file divides into at most one global{} section and arbitrarily many resource [resource id] {} sections, where [resource id] is typically something like drbd2 or r1, but may be any valid identifier (alphanumeric string). In the global section, you can specify how many drbds you want to be able to configure (minor-count), in case you want to define more resources later without reloading the module (which would interrupt services).
Each resource{} section further splits into resource-wide settings, partially grouped into startup{}, disk{}, net{} and syncer{} subsections, and node-specific settings, which are grouped in on [hostname] {} subsections.
Parameters you need to change are hostname, drbd device, the lower level physical disk to use, the meta-disk (and index, if not internal), and Internet address and port number. For further details refer to "drbd.conf details" below.
Note that you must not ever access the lower level device while you are using drbd. You no longer mount the lower level device; you mount the virtual drbd device!
If you have any trouble setting up DRBD, check http://www.drbd.org[8], and if that does not help, feel free to subscribe and ask questions on the mailing list[2]. If you feel that write throughput is way too low, try to identify the bottleneck. Sustained write throughput cannot be better than the minimum of your underlying disk hardware and network throughputs. Make sure you enabled DMA mode for IDE disks (hdparm -d1 /dev/hdX). Note that network bandwidth is typically given in bits, not bytes, so 100 Mbit FastEthernet has a maximum bandwidth of 12.5 MB/s, and that's without the protocol overhead, one way, and with only your data on the wire. For short, synchronous writes, it is typically not bandwidth but latency that kills your performance, because local disk, network, and remote disk latencies add to each other.
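The throughput and latency reasoning above can be sketched as a few lines of arithmetic. This is only an illustration: the function names and the example disk and latency figures are assumptions made for this sketch, not measurements or part of DRBD.

```python
def mbit_to_mbyte_per_s(link_mbit_per_s: float) -> float:
    """Convert a nominal link speed in Mbit/s to MB/s (8 bits per byte)."""
    return link_mbit_per_s / 8.0


def max_sustained_write(disk_mb_per_s: float, link_mbit_per_s: float) -> float:
    """Upper bound on sustained write throughput in MB/s.

    The slower of the disk and the network link is the bottleneck.
    Protocol overhead is ignored, so the real figure is lower.
    """
    return min(disk_mb_per_s, mbit_to_mbyte_per_s(link_mbit_per_s))


def sync_write_latency_ms(local_disk_ms: float, network_ms: float,
                          remote_disk_ms: float) -> float:
    """Latency of one short synchronous write: the three latencies add up."""
    return local_disk_ms + network_ms + remote_disk_ms


if __name__ == "__main__":
    # Hypothetical disk that sustains 40 MB/s behind a 100 Mbit link.
    print(max_sustained_write(40.0, 100.0))
    # Hypothetical latencies in milliseconds.
    print(sync_write_latency_ms(5.0, 0.5, 5.0))
```

With these example figures the network link, not the disk, is the limiting factor.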
Please refer to DRBD/QuickStart07[9].
paul IPaddr::192.168.99.99/24/eth0 drbddisk::r0 \
    Filesystem::/dev/drbd0::/mnt/ha0::ext3 smbd
Now you can bring down for maintenance the PDC of your Win Net (a SAMBA server, of course), or your main web, database or file server, without anyone noticing it, since it was HA clustered using heartbeat and DRBD...
Do not attempt to mount a drbd in Secondary state. On 2.6 kernels, we don't allow it. On 2.4 kernels it is still possible to mount a Secondary device read-only, but changes made to the Primary are mirrored to it underneath the filesystem and buffer cache of the Secondary, so you won't see those changes on the Secondary. And changing meta-data underneath a filesystem is a risky habit, since it may confuse your kernel to death. So don't do that.
Symptoms would be loads of Assert (mdev->state == Primary) in syslog. However, work is underway to support true shared disk semantics for use with cluster aware file systems such as GFS.
Sponsors please contact office@linbit.com ...
Once you set up DRBD, never -- as in never!! -- bypass it or access the underlying device directly, unless it is the last chance to recover data after some worst case event. If you for some reason need to start a cluster in degraded mode,
do so with the drbd start and drbddisk start commands, then use the services as normal. To make sure the first sync is in the direction you expect after you've rebuilt the other node, make sure that your good copy is in Primary state.
If necessary, you can say drbdadm invalidate on the bad copy.
Rather contact office@linbit.com and explain your needs; they will find a solution for you.
earthquake at your primary location. You want to use protocol A and a huge sndbuf-size here, and probably adjust the timeout, too.
Think about privacy! Since with DRBD the complete disk content goes over the wire, if this wire is not a crossover cable but the (supposedly hostile) Internet, you should route DRBD traffic through some virtual private network (VPN).
Make sure no one other than the partner node can access the DRBD ports, or someone might provoke a connection loss, and then race for the first reconnect, to get a full sync of your disk's content.
resource r0 {
  on hostname1 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    meta-disk internal;
    address   192.168.77.1:7788;
  }
  on hostname2 {
    device    /dev/drbd0;
    disk      /dev/hdc;
    meta-disk internal;
    address   192.168.77.2:7788;
  }
}
should match exactly what uname -n reports on the respective nodes, case is significant.
the device node to use, typically /dev/drbd# . Obviously needs to be unique within the configuration.
meta-disk is either internal or /dev/ice/name [idx]. You can use a single block device to store meta-data of multiple DRBDs.
E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1]; for two different resources. In this case, hde6 would need to be at least 256 MB in size.
internal means that the last 128 MB (aligned to 4K) of the lower device are used to store the meta-data. You must not give an index with internal.
local TCP send buffer. (see also sndbuf-size in the net{} section) As this violates O_SYNC semantics even more than B, this will lose transactions on fail-over!
reached both local and remote disk.
or to man drbd.conf (which is hopefully up-to-date and correct). If you cannot find it, you can view it as drbd.conf[10] in the online subversion repository.
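The meta-disk sizing rule described above (two indices on /dev/hde6 need at least 256 MB, and internal meta-data uses 128 MB) can be sketched as follows. The sketch assumes that index N occupies the N-th 128 MB slot of the shared device, which is what the 256 MB example suggests; the function name and constant are hypothetical and not part of any DRBD tool.

```python
META_MB_PER_INDEX = 128  # per-resource meta-data size, taken from the text


def min_meta_device_mb(indices):
    """Smallest meta-disk size in MB that can hold every index in `indices`.

    Assumption of this sketch: index N lives in the N-th 128 MB slot,
    so the device must cover slots 0 through the highest index used.
    """
    indices = list(indices)
    if not indices:
        raise ValueError("at least one meta-disk index is required")
    if min(indices) < 0:
        raise ValueError("meta-disk indices must not be negative")
    return (max(indices) + 1) * META_MB_PER_INDEX


if __name__ == "__main__":
    # Two resources using /dev/hde6[0] and /dev/hde6[1], as in the text.
    print(min_meta_device_mb([0, 1]))
```

Note that under this assumption an unused lower index still costs space: a device used only with index 3 must be as large as one used with indices 0 through 3.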
server, and Silas is standby. In the normal state, Paul and Silas are up and running. If one of them is down, the cluster is degraded. If both are up but each believes the other node is dead, this is split-brain[11] -- Heartbeat tries to avoid this by using as many communication paths as possible.
Typical state changes are degraded -> normal and normal -> degraded.
log extents, al-extents in the syncer{} section), independently of the actual storage size, this limits the typical resync time to "(active region size)/(resync bandwidth)", regardless of the storage size.
In the more likely case that Silas took over the active role, Paul becomes the sync target when he comes back, this time receiving the previously active regions plus those blocks that were modified on Silas while he was alone. If both nodes were down (main power failure or something), the situation after the cluster reboot is similar: resync of the previously active regions from Paul to Silas.
Now it seems like whenever one node was down we need to resynchronize. This is not exactly true. You can stop the services on Paul, unmount the drbd, and make it Secondary. The cluster is then connected, but both nodes are passive/standby. From here you have two options. You can shut down both nodes cleanly in any order; when they see each other the next time, there will be no sync at all, since they know from their meta-data that both were Secondary last time and that they belong to the same "generation", so the data is still identical. Or you can assign Silas the active role now: make drbd Primary on Silas, mount it, and start the services. This way you can bring down Paul for maintenance. (hb_standby should do this for you, too.)
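The resync estimate "(active region size)/(resync bandwidth)" mentioned above can be worked through numerically. This is a rough sketch, not DRBD's own accounting: it assumes each activity log extent covers 4 MB of storage (check man drbd.conf for your version), and the function names and example numbers are made up for illustration.

```python
AL_EXTENT_MB = 4  # assumption: storage covered by one activity log extent


def active_set_mb(al_extents: int) -> int:
    """Size of the active set in MB for a given al-extents setting."""
    return al_extents * AL_EXTENT_MB


def resync_seconds(al_extents: int, resync_mb_per_s: float,
                   modified_mb: float = 0.0) -> float:
    """Estimated resync time in seconds.

    (active region size + blocks modified while the peer was away)
    divided by the resync bandwidth. Total storage size does not appear.
    """
    if resync_mb_per_s <= 0:
        raise ValueError("resync bandwidth must be positive")
    return (active_set_mb(al_extents) + modified_mb) / resync_mb_per_s


if __name__ == "__main__":
    # Hypothetical setup: al-extents 256, 8 MB/s resync bandwidth.
    print(resync_seconds(256, 8.0))
    # Same setup, plus 512 MB modified on the surviving node meanwhile.
    print(resync_seconds(256, 8.0, modified_mb=512.0))
```

The estimate depends only on the al-extents setting, the amount of data modified while a node was alone, and the resync bandwidth, which is why it stays the same for a 10 GB or a 1 TB device.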
If you really need availability, and don't care about possibly inconsistent, out-of-date data, you can tell Silas to become Primary anyway. It will refuse to become Primary at first, but with the explicit operator override drbdsetup /dev/drbd0 primary --do-what-I-say you can force it to. Since you used brute force, you take the blame.
| [1] | http://www.linux-mag.com/2003-11/toc.html |
| [2] | http://lists.linbit.com/listinfo/drbd-user |
| [3] | http://www.linux-ha.org/DRBD |
| [4] | http://www.linux-ha.org/LinBit |
| [5] | http://www.drbd.org/download.html |
| [6] | http://www.linux-ha.org/DRBD/HowTo/Install |
| [7] | http://svn.drbd.org/drbd/trunk/scripts/ |
| [8] | http://www.drbd.org |
| [9] | http://www.linux-ha.org/DRBD/QuickStart07 |
| [10] | http://svn.drbd.org/drbd/branches/drbd-0.7/scripts/drbd.conf |
| [11] | http://www.linux-ha.org/SplitBrain |
This information provided courtesy of the Linux-HA project at http://linux-ha.org/