HEAD: DRBD state machine and recovery strategies
AUTHOR: Lars Ellenberg
- This text describes how I think drbd should handle "events" and state changes. In general, error recovery is currently implemented at least similarly to what is described here, though not all of it is implemented exactly as outlined. Where it differs, either the special case needs more thought, we still need to choose which recovery strategy is best, or we simply implement some simpler but suboptimal strategy for now. The main deficiency of the current DRBD code is that recovery and special case handling is done in small code pieces spread all over the code. An audit, extension or implementation change would probably be much easier if it were all in one place, in some sort of state machine. The goal is to describe here how it should be, agree that what is described here is as good as we can get, and then verify the implementation against this document.
- In this context, events are administrative requests (which DRBD may refuse to handle, and which are therefore allowed to fail), and failures (or self-healing, think network hiccup). Since failures are "force majeure", they must be handled. Though this probably implies that we need some emergency catch-all handler, we prefer to have a specific recovery strategy for each possible failure case.
Terms used below:
- DRBD: the driver
- drbd: one device
- CM: cluster manager/operator
- active: drbd in writable state
- coordinator: drbd coordinator for two (or more, once implemented) active nodes
- mirror: drbd in "slave" state, only mirroring/serving requests from the peer. I don't like the term "slave", and it is not strictly "passive" either...
init DRBD driver
- Done on kernel boot or module load. Do nothing but register some basic "comm channel", e.g. a character device, like DM does with /dev/mapper/control (/dev/drbd/control?). All further administrative requests go through this channel.
- And before someone asks "then why don't you implement DRBD as a DM target": we thought about that. Maybe it comes down to that at some point in time. But it would need some special code in DM, and it won't gain DRBD much benefit. We won't need the basic device driver housekeeping stuff duplicated in DRBD, but we'd need to special-case some things in DM. Think about making sure that, on whatever level, if you have one DRBD target in your device, it must be the only target (otherwise it does not make sense at all); making sure that DRBD specific requests are indeed handled by this target, and not by some layer on top of it; and so on. On the other hand, keeping it as a separate device does not do any harm. DM can work with any device; you can even layer DM on top of DRBD on top of DM, and it will just work. For now, doing some things that DM does well the same way in DRBD is the way to go, and if this eventually comes to a point where it really looks like a DM target, it makes moving there easier, too.
- The driver gets a request to create a drbd. It will create the worker thread for that device, probably create the device nodes, and initialize all necessary structures. If we choose to combine creation with configuration, then this will fake local storage and network peer configuration requests.
- To configure a drbd, it needs to be created, obviously. We can choose to have a combined create/configure operation, either as one ioctl, or implemented via user space wrappers. We want to be able to configure certain aspects of the device independently, particularly the performance attributes for the resynchronization process, the network peer and authentication credentials, and the local (data/meta-data) storage. This implies that the device may already be (partially or completely) configured when it receives such a (re)configure request, and that it needs to react differently depending on the state it currently is in. We first make sure that we know each possible state.
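Since a (re)configure request may arrive at an already partially configured device, the handler has to dispatch on what is configured so far. A minimal sketch of that idea in Python (the aspect names and the "refuse while attached" policy are illustrative assumptions, not the actual DRBD ioctl interface):

```python
class Drbd:
    """Toy model of per-device configuration state."""

    def __init__(self):
        self.storage = None    # local data/meta-data storage
        self.peer = None       # network peer and authentication credentials
        self.sync_rate = None  # resync performance attributes

    def configure(self, aspect, value):
        # Each aspect is configurable independently; a request may be
        # refused depending on the state the device currently is in.
        if aspect == "storage":
            if self.storage is not None:
                raise RuntimeError("already attached; detach first")
            self.storage = value
        elif aspect == "peer":
            self.peer = value       # reconfiguring the peer is allowed here
        elif aspect == "sync_rate":
            self.sync_rate = value  # tunable at any time
        else:
            raise ValueError("unknown aspect: %s" % aspect)
```

The point is only that "configure" is not one monolithic operation: each aspect carries its own precondition check against the current state.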
Internal DRBD states
- A drbd has a number of attributes. I first list all of them, then maybe simplify things again.
- generation counts
- peer state; assumption or knowledge about it
- The peer can be active, active coordinator, or just waiting for mirroring requests. As long as we are connected, we have a certain non-authoritative knowledge about the health of its data storage, too. If we cannot talk to it, the peer state is "Unknown", and we do not make any assumptions about it. To avoid internal split-brain, there are situations where we block IO and wait for the CM to confirm that the peer is fenced, after which we will mark the peer as "Dead" and resume.
- local data storage
- attached Data storage can be detached, if one chooses to configure a diskless client, or if the local storage failed and the recovery strategy for this was to detach it and continue as a diskless client. BTW, a diskless client must use protocol C semantics, regardless of what is configured! During normal operation it obviously needs to be attached. Attaching the storage is typically the first configuration request.
- knowingly outdated We either are explicitly told that we are outdated (in response to some connection loss event, where the CM told us that the peer is still alive and continues service in standalone mode), or we assume to be outdated if we (during the attach storage request) recognize that we crashed hard, and that before the crash we had been connected to an active node. There may be other situations where we want to assume we are outdated, but this needs more thought, and involves cluster wide recovery strategies for possible cluster wide failures.
- knowingly inconsistent Because we don't implement any data journalling, and therefore cannot do delayed mirroring or bring the sync target up-to-date by "replaying transactions", we are inconsistent after we start resynchronization, and until we successfully finish a resynchronization.
- knowingly the only good copy This is an important thing to know; certain error scenarios may choose a different strategy if this is the only remaining good copy.
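The storage attributes above are better modeled as combinable flags than as one exclusive enum, since "outdated", "inconsistent" and "only good copy" all qualify an attached disk. A rough sketch (the names are illustrative, not the actual DRBD identifiers), including the rule that a diskless client must use protocol C regardless of configuration:

```python
from enum import Flag, auto

class DiskState(Flag):
    DETACHED       = auto()  # diskless client, or storage failed and was detached
    ATTACHED       = auto()
    OUTDATED       = auto()  # told so by the CM, or assumed after a hard crash while connected
    INCONSISTENT   = auto()  # resync started but not yet successfully finished
    ONLY_GOOD_COPY = auto()  # last remaining good copy; changes recovery strategy

def effective_protocol(disk, configured):
    """A diskless client must use protocol C semantics,
    regardless of what protocol is configured."""
    if disk & DiskState.DETACHED:
        return "C"
    return configured
```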
- local meta-data storage
- We refuse to attach data storage if we cannot access meta-data storage. Should access to meta-data storage fail during operation while we are currently active, we should continue as a diskless client if possible. Maybe we choose not to detach the local data storage just yet, but then we'd need to handle additional failures in some special way. Whether we must, or only should, refuse to become active if access to meta-data storage failed during operation needs further thought.
- network link(s) to peer(s)
- no network config; working standalone This is the case if it has never been configured yet, or has been explicitly requested, or in response to a split-brain situation, e.g. if during the handshake we found out that we have two active nodes that are supposed to be in _exclusive_ active mode; though one can define different recovery strategies for the latter case, too.
- trying to connect The first thing we do as soon as we establish a connection is the handshake.
- handshake Authenticate to the peer, and request credentials from the peer. Exchange state information. From here we can either go directly to "connected", or typically we will go to "resync". In certain special cases we may choose to go "standalone", though.
- connected The mode for normal operation. Left only due to some CM request to shut down, or because of link failures.
- resync running It is split into multiple sub-states internally: sync-target, sync-source, paused to give a higher priority sync group a performance boost, ... Being sync-target or sync-source could as well be an attribute of the data storage, but we chose to make it an attribute of the network, because it only makes sense while we can talk to our peer. From here we go to connected, unless we are interrupted by some failure and go to try-connect.
- cleanup This is an intermediate state after connection loss. Whether due to link failure or CM request does not matter. It may be split into multiple sub-states internally.
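The network states above form a small cycle, and a transition table makes the legal moves explicit. The following sketch uses the state names of this text, not actual DRBD code; the exact edge set (in particular which states route through "cleanup") is my reading of the descriptions above:

```python
# Legal transitions between the network states described above.
TRANSITIONS = {
    "standalone":  {"try-connect"},               # admin request to connect
    "try-connect": {"handshake", "standalone"},   # connection established, or give up
    "handshake":   {"connected", "resync", "standalone"},  # standalone e.g. on split-brain
    "connected":   {"cleanup"},                   # CM shutdown request, or link failure
    "resync":      {"connected", "cleanup"},      # finished, or interrupted by failure
    "cleanup":     {"try-connect", "standalone"}, # after housekeeping is done
}

def can_go(frm, to):
    """True if the state machine allows the transition frm -> to."""
    return to in TRANSITIONS.get(frm, set())
```

Having one table like this is exactly the "all in one place" property the introduction asks for: an audit only has to read the table, not hunt for scattered special cases.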
- To simplify, I think we can reduce the states to
- shaking hands
- normal operation
- active (exclusive)
- active (shared)
- active (coordinator)
- degraded operation
- unreachable peer
- unreachable local or remote data storage
- resync running
- online verification

As long as we are unconfigured, we fail every IO request. During normal operation, we just do our housekeeping stuff and pass on IO requests to local and remote data storage. During degraded operation, we maybe do some more, special, housekeeping stuff, and pass on IO requests to wherever appropriate depending on the particular degraded situation. When we enter reconfigure, shaking hands, or recovery, we suspend all new IO requests, and wait for currently pending IO requests (as far as possible: recovery might need to fake remote completion events...). Then we do all necessary cleanup and housekeeping, update state flags, update persistent meta data, ..., and finally resume IO (when possible). The interesting part obviously is recovery. We cover that in detail below.
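The IO handling rules of the previous paragraph reduce to a small dispatch on the simplified states. A hypothetical sketch (state names follow this text; the return values are placeholders for "queue the bio" vs. "submit it"):

```python
# States in which all new IO is suspended until housekeeping is done.
SUSPEND_STATES = {"reconfigure", "shaking-hands", "recovery"}

def handle_io(state, request):
    """Decide what to do with an incoming IO request, per the rules above."""
    if state == "unconfigured":
        raise IOError("device not configured")  # fail every IO request
    if state in SUSPEND_STATES:
        return "queued"     # suspend: hold until cleanup/handshake/recovery finishes
    # normal or degraded operation: pass on to local and/or remote storage,
    # depending on the particular (degraded) situation
    return "submitted"
```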
recovery actions on mirror node
- notify CM
- go "try-connect"
- otherwise do nothing
recovery actions on active node
- suspend IO
- notify CM
- go "try-connect"
- wait for "peer-is-fenced" (== resume) or reconnect An early reconnect triggers resume, a peer-is-fenced triggers resume, resume while not suspended is a no-op.
recovery actions of CM
- Fence (mark outdated) the mirror node, confirm fence operation to active node (tell active to resume).
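Put together, the recovery choreography above might look like the following sketch. Everything here is illustrative: the event names and the CM callback shape are made up, but the ordering (suspend, notify, try-connect, resume-on-fence-or-reconnect) and the idempotent resume are taken from the lists above.

```python
class ActiveNode:
    """Toy model of the active node's recovery actions on connection loss."""

    def __init__(self):
        self.suspended = False
        self.events = []

    def on_connection_loss(self, notify_cm):
        self.suspended = True            # 1. suspend IO
        notify_cm("peer-unreachable")    # 2. notify CM
        self.events.append("try-connect")  # 3. go "try-connect"

    def resume(self):
        # Both an early reconnect and a "peer-is-fenced" confirmation
        # trigger resume; resume while not suspended is a no-op.
        if self.suspended:
            self.suspended = False
            self.events.append("resumed")

def cm_fence_mirror(active, mark_outdated):
    """Recovery action of the CM."""
    mark_outdated()   # fence (mark outdated) the mirror node
    active.resume()   # confirm the fence operation == tell active to resume
```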
... To be continued ...