This web page is no longer maintained. Information presented here exists only to avoid breaking historical links.
The project itself remains maintained and lives on: see the Linux-HA Reference Documentation.
To get rid of this notice, browse the old wiki instead.


DRBD leads to highly available data using affordable commodity hardware

Every service depends on some data. DRBD makes your data highly available using commodity hardware components.

by Lars Ellenberg

Contents

  1. Data Redundancy By DRBD
    1. Why you want to have Data Redundancy
    2. Real Time Backup with Replication
    3. Installation with binary packages
    4. Installation from source
    5. Configuration
    6. Troubleshooting
    7. Testing
    8. Unattended Mode
  2. Some Do's and Don'ts
    1. Disaster Recovery with "Tele-DRBD"
  3. drbd.conf details
  4. Technical Details
    1. How it works
    2. When synchronization is needed
      1. Case 1: Secondary fails
      2. Case 2: Primary fails
      3. Double Failure


Data Redundancy By DRBD


  • Has your database (or mail or file) server crashed? Is your entire department waiting for you to restore service? Are your most recent backups a month old? Are those backups off-site? Is this a frighteningly real scenario? Oh, yeah. Can it be avoided? Oh, yeah.

    The Distributed Replicated Block Device (DRBD) system can save the day, your data, and your job. DRBD provides data redundancy at a fraction of the cost of other solutions.

Why you want to have Data Redundancy

  • Almost every service depends on data. So, to offer a service, its data must be available. And if you want to make that service highly available, you must first make the data it depends on highly available. The most natural way to do this (and hopefully something you already do on a regular basis) is to back up your data. If you lose your active data, you just restore it from the most recent backup, and the data is available again. Or, if the host your service runs on is (temporarily) unusable, you can replace it with another host configured to provide the identical service, and restore the data there.

    To reduce possible downtime, you can have a second machine ready to take over. Whenever you change the data on one machine, you back it up on the other. You can keep the secondary machine switched off and just turn it on if the primary host goes down; this is typically referred to as cold standby. Or you can have the backup machine up and running, a configuration known as hot standby. However, whether your standby is hot or cold, one problem remains: if the active node fails, you lose the changes to the data made after the most recent backup.

    But even that can be addressed... if you have the bucks. One solution is to use some kind of shared storage device. With media shared between machines, both nodes have access to the most recent data when they need it. Shared storage can be simple SCSI sharing, dual controller RAID arrangements like IBM's ServeRAID, shared fiber-channel disks, or high-end storage like IBM Shark or the various EMC solutions. While effective, these systems are relatively costly, ranging from five thousand to millions of dollars. And unless you purchase the most expensive of these systems, shared storage systems typically have single points of failure (SPOFs) associated with them -- whether they're obvious or not. For example, some systems provide separate paths to a single shared bus, but have a single, internal electrical path to access the bus.

    Another solution -- and one that's as good as the most expensive hardware -- is live replication.

Real Time Backup with Replication

  • DRBD provides live replication of data. DRBD provides a mass storage device (a block device) and distributes the device over two machines. Whenever one node writes to the distributed device, the changes are replicated to the other node in real time. DRBD layers transparently over any standard block device (the "lower level device"), and uses TCP/IP over standard network interfaces for data replication. Though you can use raw devices for special purposes, the typical direct client to a block device is a filesystem, and it's recommended that you use one of the journaling filesystems, such as Ext3 or Reiserfs. (XFS is not yet usable with DRBD.) You can think of DRBD as RAID1 over the network.
    • Typical resynchronization time after connection loss or crash is independent of total storage size, but a function of the (configurable!) active set size. We always resynchronize intelligently: only those regions that have actually been modified.

    No special hardware is required, though it's best to have a dedicated (crossover) network link for the data replication. If you need high write throughput, you should eliminate the bottleneck of 10/100 megabit Ethernet and use Gigabit Ethernet instead. (To tune it further, you can increase the MTU to something greater than the typical filesystem block size, say, 5000 bytes; see the sketch below.) Thus, for the cost of a single, proprietary shared storage solution, you can set up several DRBD clusters; and even support the further development of DRBD for at least one year ;)
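
    As an illustration of the MTU tuning just mentioned, here is a minimal sketch. The interface name eth1 and the value 5000 are illustrative assumptions; verify that your NICs and any switch on the replication link accept frames this large.

      # raise the MTU on the dedicated replication link (run on both nodes)
      ip link set dev eth1 mtu 5000
      # or, with the older tool set:
      ifconfig eth1 mtu 5000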

Installation with binary packages

  • If there are (official or unofficial) packages available for your favorite distribution, you can just install those and you're done.

    LinBit provides binary packages for support customers for most "Enterprise" distributions, or for any distribution upon request. See also: http://www.drbd.org/download.html

    SuSE officially includes drbd and heartbeat in its standard distributions, as well as in its fully supported SuSE Linux Enterprise Server (SLES) 9. The most recent "unofficial" SuSE packages can be found in Lars Marowsky-Brée's subtree, ftp.suse.com/pub/people/lmb/drbd, and its mirrors. For Debian users, thanks to David Krovich, the currently best resource is probably:

      deb http://fsrc.csee.wvu.edu/debian/apt-repository binary/
      deb-src http://fsrc.csee.wvu.edu/debian/apt-repository source/
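
    As a concrete sketch of installing such packages (the exact package names vary between distributions and releases; the ones below are assumptions, so check what your repository actually provides):

      # Debian / Ubuntu style, using a repository such as the one above:
      apt-get update
      apt-get install drbd0.7-utils drbd0.7-module-source
      # RPM based distributions (again, package names vary):
      rpm -ihv drbd-*.rpm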

Installation from source

Configuration

  • Now you need to tell DRBD about its environment. You should find a sample configuration file in /etc/drbd.conf; if not, there is a well commented one in the drbd/scripts subdirectory.

    This configuration file divides into at most one global{} section and arbitrarily many resource [resource id] {} sections, where [resource id] is typically something like drbd2 or r1, but may be any valid identifier (alphanumeric string). In the global section, you can specify how many drbds you want to be able to configure (minor-count), in case you want to define more resources later without reloading the module (which would interrupt services); see the sketch below.

    Each resource{} section further splits into resource settings, partially grouped into startup{}, disk{}, net{} and syncer{} subsections, and node specific settings, which are grouped in on [hostname] {} subsections.

    Parameters you need to change are hostname, drbd device, the lower level physical disk to use, the meta-disk (and index, if not internal), and Internet address and port number. For further details refer to "drbd.conf details" below.

    Note that you must not ever access the lower level device while you are using DRBD. You no longer mount the lower level device; you mount the virtual drbd device!
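
    A skeleton sketch of the global{} section just described (not a complete, parseable configuration; the value 5 is an illustrative assumption, and a full resource example appears under "drbd.conf details" below):

      global {
        minor-count 5;   # allow defining up to 5 devices without reloading the module
      }
      resource r0 {
        # on {} host sections go here; see the complete example in "drbd.conf details"
      }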

Troubleshooting

  • If you have any trouble setting up DRBD, check http://www.drbd.org, and if that does not help, feel free to subscribe and ask questions on the mailing list. If you feel that write throughput is way too low, try to identify the bottleneck; a rough check is sketched below. Sustained write throughput cannot be better than the minimum of your underlying disk hardware and network throughputs. Make sure you enabled DMA mode for IDE disks (hdparm -d1 /dev/hdX). Note that network bandwidth is typically given in bits, not bytes, so 100MBit Fast Ethernet has a maximum bandwidth of 12.5MB/s, and that's without the protocol overhead, one way, and with only your data on the wire. For short, synchronous writes, it is typically not bandwidth but latency that kills your performance, because local disk, network, and remote disk latencies add to each other.
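
    A rough way to measure the two candidate bottlenecks separately (the paths, sizes and peer address below are assumptions; the dd test needs a mounted filesystem with enough free space, and iperf must be installed on both nodes):

      # raw sequential write throughput of the lower level storage:
      dd if=/dev/zero of=/mnt/test/ddtest bs=1M count=1024 conv=fsync
      # network throughput over the replication link:
      # on the peer:   iperf -s
      # on this node:
      iperf -c 192.168.77.2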

Testing

Unattended Mode

  • Up to now we only replicate the data. If one node fails, we need manual intervention. To automate this, you want to have a cluster manager and monitoring (daemon) process running: heartbeat ... A typical haresources line might be:
       paul IPaddr::192.168.99.99/24/eth0 drbddisk::r0 \
            Filesystem::/dev/drbd0::/mnt/ha0::ext3 smbd
    Now you can bring down for maintenance the PDC of your Windows network (a Samba server, of course), or your main web, database or file server, without anyone noticing, since it was HA clustered using heartbeat and DRBD...
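
    To go with the haresources line above, a minimal ha.cf sketch (the node names match the example; the interface and timing values are illustrative assumptions, see the heartbeat documentation for the full set of directives):

      # /etc/ha.d/ha.cf (minimal sketch)
      bcast eth1            # heartbeat over the dedicated replication link
      keepalive 2           # seconds between heartbeats
      deadtime 30           # seconds until a silent node is declared dead
      auto_failback on
      node paul
      node silas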

Some Do's and Don'ts

  • Do not attempt to mount a drbd in Secondary state. On 2.6 kernels, we don't allow it. Though on 2.4 kernels it is still possible to mount a Secondary device readonly, changes made to the Primary are mirrored to it underneath the filesystem and buffer-cache of the Secondary, so you won't see the changes on the Secondary. And changing meta-data underneath a filesystem is a risky habit, since it may confuse your kernel to death. So don't do that. Symptoms would be loads of Assert(mdev->state == Primary) messages in syslog.

    However, work is underway to support true shared disk semantics for use with cluster aware filesystems such as GFS.

    Sponsors please contact office@linbit.com ...

  • Once you set up DRBD, never -- as in never!! -- bypass it or access the underlying device directly, unless it is the last chance to recover data after some worst case event. If you for some reason need to start a cluster in degraded mode, do so with the drbd start and drbddisk start commands, then use the services as normal. To make sure the first sync is in the direction you expect after you've rebuilt the other node, make sure that your good copy is in Primary state. If necessary, you can say drbdadm invalidate on the bad copy; see the sketch after this list.

  • DRBD on top of a loop device, or vice versa, is expected to deadlock, so don't do it. It might work on 2.6 kernels, but you have to try that yourself.
  • You can stack DRBD on top of md; md on top of DRBD, however, is nonsense.
  • DRBD on top of LVM2 is possible, but you have to be careful about when and which LVM2 features you use, and how you do it; otherwise what actually happens will not necessarily match your expectations. Snapshots, for example, won't know how to notify the filesystem (possibly on the remote node) to flush its journal to disk to make the snapshot consistent (which is less of an issue now that there are snapshots that can be mounted read-write, so you can replay the journal in the snapshot...). DRBD as an LVM2 "physical volume" does work, but you should know what you are doing ...
  • If you are considering stacking DRBD on top of DRBD, think it over again. In a fail-over case this will cause you more trouble than you would have without it.

    Rather, contact office@linbit.com and explain your needs; they will find a solution for you.
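
    The sketch referred to above: forcing the direction of the first sync after rebuilding a node (the resource name r0 is an assumption, matching the examples elsewhere on this page):

      # on the node that holds the good data:
      drbdadm primary r0
      # on the rebuilt node, whose data should be overwritten by the next sync:
      drbdadm invalidate r0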

Disaster Recovery with "Tele-DRBD"

  • The typical use of DRBD and HA clustering is probably two machines connected with normal networks, and one or more crossover cables, a few meters apart, within one server room, or at least within the same building. Possibly even a few hundred meters apart, in the next building. But you can use DRBD over long distance links, too. When you have the replica several hundred kilometers away in some other data center for Disaster Recovery, your data will survive even a major earthquake at your primary location. You want to use protocol A and a huge sndbuf-size here, and probably adjust the timeout, too; see the sketch below.

    Think about privacy! Since with DRBD the complete disk content goes over the wire, if this wire is not a crossover cable but the (supposedly hostile) Internet, you should route DRBD traffic through some virtual private network (VPN).

    Make sure no one other than the partner node can access the DRBD ports, or someone might provoke a connection loss, and then race for the first reconnect, to get a full sync of your disk's content.
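
    A sketch of the relevant drbd.conf fragment for such a long-distance setup (the values are illustrative assumptions; check man drbd.conf for the exact option names and units of your DRBD version):

      resource r0 {
        protocol A;              # asynchronous: completion after local disk + TCP send buffer
        net {
          sndbuf-size 512k;      # large send buffer for the high-latency link
          timeout     120;       # in tenths of a second
        }
        # on {} host sections as in the example under "drbd.conf details"
      }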

drbd.conf details

  •     resource r0 {
            on hostname1 {
              device /dev/drbd0;
              disk /dev/sda3;
              meta-disk internal;
              address 192.168.77.1:7788;
            }
            on hostname2 {
              device /dev/drbd0;
              disk /dev/hdc;
              meta-disk internal;
              address 192.168.77.2:7788;
            }
        }
      
  • hostname

should match exactly what uname -n reports on the respective nodes; case is significant.

    device

the device node to use, typically /dev/drbd#. Obviously needs to be unique within the configuration.

    disk
    the actual physical (lower level) device to use
    meta-disk

    meta-disk is either internal or an external device given as /dev/<device> [idx]. You can use a single block device to store the meta-data of multiple DRBDs; see the sketch at the end of this section.

    E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1]; for two different resources. In this case, hde6 would need to be at least 256 MB in size.

    internal means that the last 128 MB (aligned to 4K) of the lower device are used to store the meta-data. You must not give an index with internal.

    address
    the inet address and port to bind to locally, or to connect to on the partner node. If you use a dedicated crossover link, then this is typically a private address from a different address space -- otherwise you might run into routing problems. This should not be confused with the administration address of the node, nor with the (typically virtual) service address of the cluster.
    protocol
    the transfer protocol to use.

    A
    for high latency networks. Write IO is reported as completed if it has reached the local disk and the local TCP send buffer (see also sndbuf-size in the net{} section). As this violates O_SYNC semantics even more than B, this will lose transactions on fail-over!

    B
    for lower-risk scenarios. Write IO is reported as completed if it has reached the local disk and the remote buffer cache, thus no guarantees can be made whether the filesystem can recover.

    C
    for most cases; preserves transactional semantics. Write IO is reported as completed only once we know it has reached both the local and the remote disk.

    For further details, please refer to the example drbd.conf, or to man drbd.conf (which is hopefully up-to-date and correct). If you cannot find it, you can view it as drbd.conf in the online subversion repository.
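
    The sketch referred to under meta-disk above: two resources sharing one external meta-data device. The resource, host, device and address values are illustrative assumptions modelled on the example at the top of this section.

      resource r0 {
          on hostname1 {
            device    /dev/drbd0;
            disk      /dev/sda3;
            meta-disk /dev/hde6[0];   # first 128 MB slot on the shared meta device
            address   192.168.77.1:7788;
          }
          # on hostname2 { ... } accordingly
      }
      resource r1 {
          on hostname1 {
            device    /dev/drbd1;
            disk      /dev/sda4;
            meta-disk /dev/hde6[1];   # second slot on the same device
            address   192.168.77.1:7789;
          }
          # on hostname2 { ... } accordingly
      }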

Technical Details

How it works

  • Whenever a higher level application, typically a journalled filesystem, issues an IO request, the kernel dispatches this request based on the target device's major/minor numbers. If DRBD is registered for this major number, it passes READ requests down the stack to the lower level device locally. WRITE requests are passed down the stack, too, but additionally they are sent over to the partner node. Every time something changes on the local disk, the same changes are made at the same offset on the partner node's device.

    If some WRITE request is finished locally, a "write barrier" is sent over to the partner, to make sure that it is finished there before another request comes in. Since later WRITE requests might depend on successfully finished previous ones, this is needed to assure strict write ordering on both nodes. Thus with protocol C it is guaranteed that after an (f)sync operation both devices are bit-for-bit identical. Just for the blocks affected by the fsync(), of course.

When synchronization is needed

  • The most important decision DRBD has to make is when it needs a synchronization, and whether that has to be a full synchronization or just an incremental one. To make this decision possible, DRBD keeps several event and generation counters in its meta-data. Let's have a look at the failure cases. Say Paul is our primary server, and Silas is the standby. In the normal state, Paul and Silas are up and running. If one of them is down, the cluster is degraded. If both are up but each believes the other node is dead, this is split-brain -- Heartbeat tries to avoid this by using as many communication paths as possible.

    Typical state changes are degraded -> normal and normal -> degraded.

Case 1: Secondary fails

  • When Silas is standby and leaves the cluster (for whatever reason: network, power, or hardware failure), this is not a real problem, as long as Paul keeps on running. In degraded mode, Paul flags all blocks that are written to as dirty. Some technician comes by, fixes the problem, and Silas joins the cluster again. Now Silas needs all the changes made on Paul since Silas left the cluster. Since Paul has its "block is dirty" flags, it can do an incremental synchronization.

    If Paul failed (or was shut down) while he was alone, the dirty flags are still in the meta-data, and for in-flight operations we have the activity log. So the next time both nodes see each other, they will still know which parts of the disk are clean, and they can restrict the resync to those regions which were active at the moment of the crash (and therefore are not known to be clean or dirty), or which are known to be dirty because of the dirty bitmap.

Case 2: Primary fails

  • When Paul fails as the active primary node, the situation is a bit different. If Silas remains standby (unlikely; heartbeat should make it active), and later Paul comes back, Paul will become the sync source and resync to Silas just those regions that were active at the moment of the crash, since it is unknown which blocks in this region might have been modified on Paul just before the failure but had not reached Silas because of the crash. Since you can configure the size of the active region (activity log extents, al-extents in the syncer{} section) independently of the actual storage size, this limits the typical resync time to "(active region size)/(resync bandwidth)", regardless of the storage size.

    In the more likely case that Silas took over the active role, when Paul comes back he will become the sync target, this time receiving the previously active regions, plus those blocks that have been modified on Silas while he was alone. If both nodes were down (main power failure or something), after the cluster reboot the situation is similar: resync of the previously active regions from Paul to Silas.

    Now it seems as if we need to resynchronize whenever one node was down. This is not exactly true. You can stop the services on Paul, unmount the drbd, and make it Secondary. The cluster is then connected, but both nodes are passive/standby. You can now shut down both nodes cleanly, in any order. When they see each other the next time, there will be no sync at all, since they know from their meta-data that both have been Secondary last time and belong to the same "generation", thus the data is still identical. Or you can assign Silas the active role now, make drbd Primary on Silas, mount it, and start the services. This way you can bring down Paul for maintenance; see the sketch below. (hb_standby should do this for you, too.)
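
    A sketch of such a manual switchover, using the resource, mount point and service from the haresources example above (init script names and paths are assumptions; in a heartbeat-managed cluster you would normally let hb_standby do this instead):

      # on Paul (currently active):
      /etc/init.d/smb stop            # stop the service using the data
      umount /mnt/ha0                 # unmount the drbd device
      drbdadm secondary r0            # demote; both nodes are now Secondary
      # on Silas (taking over):
      drbdadm primary r0
      mount /dev/drbd0 /mnt/ha0
      /etc/init.d/smb start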

Double Failure

  • If one of the nodes (or the network) fails during a synchronization, this is a double failure, since the first failure caused the sync to happen. Note that double failures are logically impossible to tolerate with double redundancy, so you should treat any failure in the HA cluster as very serious and repair it ASAP, to regain redundancy. Paul is active and has the good data. Silas receives the sync. The cluster is still degraded, since Silas is not yet ready for takeover; it has inconsistent, only partially up-to-date data. If Silas fails and later comes back, the sync will resume where it left off. If Paul fails while being the sync source, we now have a non-operational cluster: Silas cannot take over, since it still has inconsistent data, and Paul is dead.

    If you really need availability, and don't care about possibly inconsistent, out-of-date data, you can tell Silas to become Primary anyway. It will refuse to become Primary at first, but with the explicit operator override drbdsetup /dev/drbd0 primary --do-what-I-say you can force it to. Since you used brute force, you take the blame.