This is a proposal for a smart fencing daemon.
A smart fencing daemon is one which knows all about fencing in a clustered environment, but not about much else, and provides a high-level interface to its callers when performing a fencing operation. Below is an example of what the main fencing API for such a subsystem might look like:
cl_nodefence(nodeid, fence-type)
Where nodeid is some appropriate designation of the node to be fenced, and fence-type is the type of node-level fencing preferred. Options would probably include RESET, POWEROFF, or other similar node-fencing techniques.
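As a rough illustration only (the type names and exact signature below are hypothetical, not part of the proposal), the synchronous form of this call might be declared in C roughly as:

    /* Hypothetical sketch; names and types are illustrative only. */
    typedef enum {
        CL_FENCE_RESET,     /* reset (reboot) the node             */
        CL_FENCE_POWEROFF   /* power the node off and leave it off */
    } cl_fence_type_t;

    /* Fence the given node; returns 0 on success, negative on failure. */
    int cl_nodefence(const char *nodeid, cl_fence_type_t fence_type);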
LarsMarowskyBree: I think we need an asynchronous interface so the Transitioner (I presume it would directly call into this API) does not block on this operation. The interface also should be capable of accepting a list of nodeids.
AndrewBeekhof agrees.
AlanRobertson agrees with the asynchronous idea, but (particularly if the asynchronous idea is carried out) sees no particular reason to complicate the interface further by having an array of nodes to fence.
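A possible shape for the asynchronous variant discussed above (again purely illustrative; it reuses the hypothetical cl_fence_type_t from the sketch earlier, and per AlanRobertson's point it takes a single nodeid, a list of nodes simply being handled by repeated calls):

    /* Hypothetical sketch of an asynchronous interface. */
    typedef void (*cl_fence_done_fn)(const char *nodeid,
                                     int rc,          /* 0 = node was fenced */
                                     void *userdata);

    /* Request that 'nodeid' be fenced and return immediately;
     * 'done' is invoked later with the outcome. */
    int cl_nodefence_async(const char *nodeid,
                           cl_fence_type_t fence_type,
                           cl_fence_done_fn done,
                           void *userdata);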
It is an open question whether such a daemon would be appropriate to add resource-level fencing to. At the moment, it is basically a cluster-wide smart STONITH daemon.
LarsMarowskyBree: I really think that resource-level fencing should not be handled by this daemon, but either as a resource on which the to-be-fenced resource depends (i.e., a Filesystem mount depending on a SCSI reservation), or internally to the resource itself (i.e., self-fencing like the IBM ServeRAID). Node-level STONITH-style fencing is sufficiently different from this.
The daemons themselves are able to basically run forever. They certainly can run as long as any resource can run.
These individual STONITH objects live as dual-realm beasts: resources, yet something more. They are resources because they support the {start, stop, status, monitor, promote, demote, reload, restart} operations. It is expected (but not truly mandatory) that these operations will come through the LRM interface.
The cl_nodefence() operation, on the other hand, will not come through the LRM; it goes directly into the daemon, bypassing the LRM entirely. Since cl_nodefence is typically real-time critical and in the critical path for recovery, being able to access it directly is a good thing.
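To make the two entry points concrete (a sketch only; none of these structures or helper functions exist, they are made up for illustration), the daemon might distinguish and prioritise its requests like this:

    /* Hypothetical sketch of how the daemon might tell its two kinds of
     * requests apart; none of these names come from actual code. */
    enum request_kind {
        REQ_RESOURCE_OP,   /* start/stop/monitor/... arriving via the LRM */
        REQ_NODEFENCE      /* a cl_nodefence() call arriving directly     */
    };

    struct request {
        enum request_kind kind;
        /* ... operation details ... */
    };

    void handle_resource_op(struct request *req);  /* assumed helpers */
    void handle_nodefence(struct request *req);

    static void dispatch(struct request *req)
    {
        if (req->kind == REQ_NODEFENCE) {
            handle_nodefence(req);    /* recovery-critical: handled at once */
        } else {
            handle_resource_op(req);  /* routine resource management        */
        }
    }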
These daemons should probably be started before the CRM, so that the CRM does not have to wait for them to register...
AndrewBeekhof: My understanding is that the start consists of telling the local fencing daemon (which is already running) that it should load plugin X with a particular configuration. Note: The fencing daemon needs to be able to cope with multiple active configurations of the same plugin. For instance, there might be two controllers of the same type in the cluster.
LarsMarowskyBree agrees; the basic fencing daemon is already running, the CRM only tells it to load the appropriate plugins et cetera.
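One way to picture the multiple-configuration requirement (again a made-up sketch, not a description of existing code) is a per-instance table inside the daemon, keyed by resource rather than by plugin type:

    /* Hypothetical sketch: each configured STONITH resource becomes one
     * independent plugin instance inside the fencing daemon, so two
     * configurations of the same plugin type can coexist. */
    struct plugin_instance {
        char *instance_id;    /* e.g. the CRM resource id                */
        char *plugin_type;    /* which STONITH plugin to load            */
        char *parameters;     /* instance-specific configuration         */
        void *plugin_state;   /* whatever the loaded plugin keeps around */
        struct plugin_instance *next;
    };

    static struct plugin_instance *instances;  /* all currently loaded instances */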
The CRM would model the STONITH resources, not the smart fencing daemon.
However, the STONITH resources are modelled as ClusterResourceManager/MultiStateResources, with one (or more) of them designated as active. Only the active instances actually monitor the device. In other words, the monitor operation has very different semantics depending on whether the instance is active: a non-master instance does nothing when it is given a monitor operation.
For example, a STONITH network switch might be reachable from nodes A, B, and C. The CRM tells the LRM to create three STONITH resources, one on each machine, declares one of them to be master, and then issues monitor operations on them all. Only the one in master mode does anything.
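A minimal sketch of the role-dependent monitor semantics (the function names here are hypothetical):

    /* Hypothetical sketch of a monitor that is a no-op unless this
     * instance currently holds the master (active) role. */
    int check_device(void *device);   /* assumed: really polls the device */

    static int stonith_monitor(int is_master, void *device)
    {
        if (!is_master) {
            return 0;                 /* non-master: do nothing, report OK   */
        }
        return check_device(device);  /* master: actually contact the device */
    }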
They would normally be configured as hot-standby resources (or whatever the right term is), as described above.
This configuration is instantiated by telling the CRM to create these resources just like normal ones. The CRM then tells the LRM, which issues the start, stop, and other operations, and those operations are in turn passed on to the smart fencing daemon.
LarsMarowskyBree asks: Where do we place the metadata descriptions for the various different kinds of plugins? Do we put wrappers in place (inside the resource agents) or do we extend the STONITH plugin API?
In non-CRM environments, they can be configured manually using the LRM admin tool, or something similar specifically written for the situation.
They are monitored through the LRM. The active state is used to ensure that any particular STONITH device isn't simultaneously monitored by more than the supported number of nodes.
LarsMarowskyBree just points out that the monitor operation, too, needs to be non-blocking, i.e., monitoring one plugin shouldn't interfere with another.
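One way such non-blocking monitoring could be done, assuming a single-threaded daemon (a sketch only; the helper names are invented), is to push each device check into a child process:

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical sketch: run a (possibly slow) device check in a child
     * process so that monitoring one plugin never blocks the daemon's main
     * loop or any other plugin.  The parent collects the exit status later
     * through its normal SIGCHLD/waitpid handling (not shown). */
    static pid_t start_monitor(int (*check_device)(void *), void *device)
    {
        pid_t pid = fork();
        if (pid == 0) {
            _exit(check_device(device) == 0 ? 0 : 1);  /* child: report result */
        }
        return pid;  /* parent: child's pid, or -1 if fork() failed */
    }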
They start up and prepare for someone to connect. Although they have two kinds of clients, they don't need two separate client/server interfaces for this...
LarsMarowskyBree suggests at least thinking about retrieving the node list only for the active resource incarnations, to avoid blocking or contention. These could then also claim the connection and prevent others from messing with the device while we may need it any second to fence a node. If we lose too many active incarnations, promoting one of them (with the plugin already in memory etc.) is going to be quite fast, and it seems conceptually cleaner than potentially having 32 nodes pound and retry a STONITH device to retrieve the node list.
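A minimal sketch of that claim-on-promote idea (helper names invented for illustration):

    /* Hypothetical sketch: only the master incarnation holds the device
     * connection, so no other node contends for it, and it is ready the
     * moment a fencing request arrives. */
    void *open_device_connection(void);           /* assumed helpers */
    void  close_device_connection(void *conn);

    static void *device_conn;  /* NULL while this incarnation is not master */

    static int on_promote(void)
    {
        device_conn = open_device_connection();   /* claim the device */
        return device_conn != NULL ? 0 : -1;
    }

    static int on_demote(void)
    {
        if (device_conn != NULL) {
            close_device_connection(device_conn); /* release the claim */
            device_conn = NULL;
        }
        return 0;
    }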
A rough, text-only sketch of the communications paths involved in this proposal is shown below:
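(Arrows show the direction of requests; this only restates the paths described above.)

    CRM --(start/stop/monitor ops)--> LRM --> fencing daemon --> STONITH plugin --> STONITH device
    CRM --(cl_nodefence(), direct; bypasses the LRM)--> fencing daemon --> STONITH plugin --> STONITH device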
After a discussion on IRC with SeanReifschneider, it seems necessary to support a node which is connected to several STONITH devices. The fencing subsystem should detect this and act accordingly.
AlanRobertson notes that this is covered by the original proposal above.