Proposal for a Smart Fencing Daemon

This is a proposal for a smart fencing daemon.

What is a smart fencing daemon?

A smart fencing daemon is one which knows all about fencing in a clustered environment, but not about much else, and provides a high-level interface to its callers when performing a fencing operation. Below is an example of what the main fencing API for such a subsystem might look like:

cl_nodefence(nodeid, fence-type)

Where nodeid is some appropriate designation of the node which is to be fenced, and fence-type is the type of node-level fencing preferred. Options would probably include something like RESET or POWEROFF, or other similar kinds of node-fencing techniques.
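
As a rough illustration only, a C declaration for this call might look like the sketch below; the enum values and the integer return convention are assumptions, since the proposal only names the nodeid and fence-type arguments.

  /* Hypothetical sketch only -- names and conventions are not part of the proposal. */
  enum cl_fence_type {
      CL_FENCE_RESET,      /* reset (reboot) the node */
      CL_FENCE_POWEROFF    /* power the node off and leave it off */
  };

  /* Returns 0 on success, nonzero on failure (assumed convention). */
  int cl_nodefence(const char *nodeid, enum cl_fence_type fence_type);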

LarsMarowskyBree: I think we need an asynchronous interface so the Transitioner (I presume it would directly call into this API) does not block on this operation. The interface also should be capable of accepting a list of nodeids.

AndrewBeekhof agrees.

AlanRobertson agrees with the asynchronous idea, but (particularly if the asynchronous idea is carried out) sees no particular reason to complicate the interface further by having an array of nodes to fence.
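
Given the agreement above on an asynchronous interface, and Alan's preference for a single node per call, one possible shape, again purely a sketch with assumed names and reusing the enum from the earlier sketch, is a completion callback:

  /* Hypothetical asynchronous variant -- a sketch, not an agreed interface. */
  typedef void (*cl_fence_callback)(const char *nodeid, int rc, void *userdata);

  /* Submits the fencing request and returns immediately; the callback is
   * invoked later with the outcome, so the caller (for example the
   * Transitioner) never blocks on the fencing operation itself. */
  int cl_nodefence_async(const char *nodeid,
                         enum cl_fence_type fence_type,
                         cl_fence_callback done,
                         void *userdata);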

It is an open question whether such a daemon would be appropriate to add resource-level fencing to. At the moment, it is basically a cluster-wide smart STONITH daemon.

LarsMarowskyBree: I really think that resource-level fencing should not be handled by this daemon, but either as a resource on which the to-be-fenced resource depends (e.g., a Filesystem mount depending on a SCSI reservation), or internally to the resource itself (e.g., self-fencing like the IBM ServeRAID). Node-level STONITH-style fencing is sufficiently different from this.

How These Daemons Model Individual STONITH objects

The daemons themselves can, in effect, run forever. They can certainly run as long as any resource can run.

These individual STONITH objects live as dual-realm beasts: resources, but yet something more. They are resources because they support the {start, stop, status, monitor, promote, demote, reload and restart} operations. It is expected (but not truly mandatory) that these operations will come through the LRM interface.

The cl_nodefence() operation will not come through the LRM, but goes directly into the daemon, bypassing the LRM. Since the cl_nodefence operation is typically real-time critical and in the critical path for recovery, being able to access it directly is a good thing.
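
A sketch of the two entry points, with all names assumed: resource-style operations arrive through the LRM, while node fencing is requested directly by the client.

  /* Resource-style operations on a STONITH object (sketch; names assumed).
   * These normally arrive through the LRM. */
  enum stonith_rsc_op {
      RSC_START, RSC_STOP, RSC_STATUS, RSC_MONITOR,
      RSC_PROMOTE, RSC_DEMOTE, RSC_RELOAD, RSC_RESTART
  };
  int stonith_rsc_request(const char *rsc_id, enum stonith_rsc_op op);

  /* Node fencing arrives directly from the client, bypassing the LRM,
   * because it sits in the time-critical recovery path. */
  int cl_nodefence(const char *nodeid, enum cl_fence_type fence_type);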

When Are These Daemons Started

These daemons should probably be started before the CRM, so that the CRM doesn't have to wait for the daemon to register...

AndrewBeekhof: My understanding is that the start consists of telling the local fencing daemon (which is already running) that it should load plugin X with a particular configuration.

Note: The fencing daemon needs to be able to cope with multiple active configurations of the same plugin. For instance, there might be two controllers of the same type in the cluster.

LarsMarowskyBree agrees; the basic fencing daemon is already running, the CRM only tells it to load the appropriate plugins et cetera.
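
To cope with multiple active configurations of the same plugin, the daemon presumably keys each configured device by its resource id rather than by plugin type. A minimal sketch, with all field names assumed:

  #include <glib.h>

  /* One configured instance of a STONITH plugin (sketch; names assumed). */
  struct stonith_instance {
      char       *rsc_id;      /* unique resource id assigned by the CRM     */
      char       *plugin;      /* plugin type; two instances may share this  */
      GHashTable *parameters;  /* name/value configuration for this device   */
      gboolean    is_master;   /* only master instances monitor the device   */
  };

  /* Instances are looked up by rsc_id, so two controllers of the same type
   * can coexist in the cluster with different configurations. */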

How the CRM models these daemons

The CRM would model the STONITH resources, not the smart fencing daemon.

However, the STONITH resources are modelled as ClusterResourceManager/MultiStateResources, with one (or more) of them designated as active. Only the active instances actually monitor the resource. In other words, the monitor operation has very different semantics depending on whether the instance is active: a non-master instance does nothing when given a monitor operation.

For example, a STONITH network switch might be reachable from nodes A, B, and C. The CRM tells the LRM to create three STONITH devices, one on each machine. It then declares one to be master, and then issues monitor operations on them all. Only the one in master mode does anything.
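
A sketch of how the monitor operation might branch on role; the helper function is hypothetical, and the proposal only states that non-master instances do nothing:

  /* Hypothetical, plugin-specific health check. */
  int stonith_device_check(struct stonith_instance *inst);

  /* Sketch: monitor semantics depend on whether this instance is master. */
  int stonith_instance_monitor(struct stonith_instance *inst)
  {
      if (!inst->is_master) {
          /* Non-master instances do nothing and simply report success. */
          return 0;
      }
      /* The master instance actually contacts the device and checks it. */
      return stonith_device_check(inst);
  }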

How These Daemons are Normally Configured In a CRM environment

They would normally be configured as hot-standby resources (or whatever the right term is), as described above.

This configuration is instantiated by telling the CRM to create these resources just like normal. The CRM then tells the LRM, which issues the start, stop, and similar operations, and these are passed on to the smart fencing daemon.

LarsMarowskyBree asks: Where do we place the metadata descriptions for the various different kinds of plugins? Do we put wrappers in place (inside the resource agents) or do we extend the STONITH plugin API?

How These Daemons can be Configured in non-CRM environments

In non-CRM environments, they can be configured manually using the LRM admin tool, or something similar specifically written for the situation.

How These Daemons are Monitored

They are monitored through the LRM. The active state is used to ensure that any particular STONITH device isn't simultaneously monitored by more than the supported number of nodes.

LarsMarowskyBree just points out that the monitor operation also needs to be non-blocking, i.e., monitoring one plugin shouldn't interfere with another one.

How These Daemons initialize themselves

They start up and prepare for someone to connect. Although they have two kinds of clients, they don't need two separate client/server interfaces for this...

What has to happen when a STONITH resource is created

  • A "start" operation is received from a client
  • It looks up the resource to see if it exists. If it does, it returns success immediately.
  • If not, it creates the STONITH object and registers it in its own node-local lookup table.
  • It then queries the STONITH object using the standard STONITH API, asking the object what it can reset. It stores these results in a linked list of lookup tables which point back to the STONITH object able to do the reset. Exact data structures can vary, but they're not hard.
  • It then returns success to the create operation.

LarsMarowskyBree would like to suggest at least thinking about retrieving the node list only for the active resource incarnations, to avoid the blocking or contention. These could then also claim the connection and prevent others from messing with the device while we may need it any second to fence a node. If we lose too many active incarnations, promoting one of them (with the plugin already in memory etc.) is going to be quite fast, and it seems conceptually cleaner than potentially having 32 nodes pound and retry a STONITH device to retrieve the node list.
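
A sketch in C of the creation steps listed above; every helper name here is a hypothetical placeholder, and Lars's refinement would simply defer the query step until the instance is promoted:

  /* Hypothetical helpers corresponding to the steps above. */
  struct stonith_instance *lookup_instance(const char *rsc_id);
  struct stonith_instance *create_and_register_instance(const char *rsc_id);
  char **query_resettable_nodes(struct stonith_instance *inst);
  void index_node_to_instance(const char *node, struct stonith_instance *inst);

  int on_start(const char *rsc_id)
  {
      struct stonith_instance *inst = lookup_instance(rsc_id);
      if (inst != NULL)
          return 0;                             /* exists: succeed immediately */

      inst = create_and_register_instance(rsc_id);  /* node-local lookup table */

      /* Ask the device which nodes it can reset, and index each node name
       * back to this instance for later cl_nodefence() lookups. */
      char **nodes = query_resettable_nodes(inst);
      for (int i = 0; nodes != NULL && nodes[i] != NULL; i++)
          index_node_to_instance(nodes[i], inst);

      return 0;                                 /* success for the start op */
  }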

What has to happen when a STONITH resource is destroyed

  • The stop operation is received from a client
  • It looks up the resource to see if it exists. If not, it succeeds immediately.
  • If so, it cleans out the local hash table results for what that STONITH object can reset
  • It then returns success
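
The matching stop path, mirroring the sketch above (again with assumed helper names):

  /* Hypothetical helpers mirroring the creation sketch. */
  void remove_node_index_entries(struct stonith_instance *inst);
  void unregister_instance(struct stonith_instance *inst);

  int on_stop(const char *rsc_id)
  {
      struct stonith_instance *inst = lookup_instance(rsc_id);
      if (inst == NULL)
          return 0;                      /* nothing to do: succeed immediately */

      remove_node_index_entries(inst);   /* forget what this device can reset */
      unregister_instance(inst);
      return 0;
  }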

How These Daemons Perform a Node-fencing Operation

  • A cl_nodefence() operation is received by the user interface.
  • The daemon searches its local hash table to see if it can reset that node itself.
  • If it can then:
    • It sets up a timer for the operation
    • It forks a child to perform the operation
    • The child performs the operation
    • It returns success or failure through an exit code
    • The parent process waits for the child to finish or time out.
    • If it times out, it kills the child and fails.
    • If not, it returns the return code gotten by the operation.
  • If it cannot, then:
    • It sets up a timer for the operation
    • It sends a "WHOCANRESET" message to its peers with the information on what node is to be reset
    • If it gets a "YES" response from a peer, it then chooses the first one which hasn't previously failed on this operation.
    • The active-fencer then sends a PLEASERESET message to the selected peer, and remembers who it was.
    • It then waits for a response.
    • If the response is timeout or failure, the process repeats from the WHOCANRESET step above.
    • If the response is success, it cleans up the "has failed" list and notifies the requesting client process.
    • The process fails when no nodes remain to respond, and failure is reported back to the requesting client process.
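
The local branch of this procedure (timer, fork, wait or kill) might look roughly like the C sketch below; the one-second polling loop and every function name are assumptions, and the peer-to-peer WHOCANRESET/PLEASERESET exchange is not shown.

  #include <sys/types.h>
  #include <sys/wait.h>
  #include <signal.h>
  #include <unistd.h>

  /* Hypothetical call into the STONITH plugin that does the actual reset. */
  int perform_reset(struct stonith_instance *inst, const char *nodeid);

  /* Fork a child to run the fence operation and enforce a timeout. */
  int run_local_fence(struct stonith_instance *inst, const char *nodeid,
                      int timeout_secs)
  {
      pid_t child = fork();
      if (child < 0)
          return -1;                               /* could not fork */

      if (child == 0) {
          /* Child: perform the operation and report the result via exit code. */
          _exit(perform_reset(inst, nodeid) == 0 ? 0 : 1);
      }

      /* Parent: wait for the child to finish, but no longer than the timeout. */
      for (int waited = 0; waited < timeout_secs; waited++) {
          int status;
          if (waitpid(child, &status, WNOHANG) == child)
              return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
          sleep(1);
      }

      kill(child, SIGKILL);                        /* timed out: kill the child */
      waitpid(child, NULL, 0);
      return -1;                                   /* ...and fail */
  }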

How These Daemons reconfigure themselves while running

  • Resource operations {stop, start, reload} are issued
  • Reload means stop then start, effectively...

Things These Daemons Don't Need/Don't Do

  • A local copy of any other node's fencing capabilities (by virtue of the multicast discovery)
  • Membership. The only real prerequisite is trust of the other machine(s).

Diagram

A diagram of the communications paths involved in this proposal is shown below:

Open Issues

  • retries on resource creation
  • extra work because it has to handle more than one STONITH resource
  • The STONITH API ought to be name/value pair oriented like everything else... Then everything would be nicer. We need to do this.
  • After a discussion on IRC with SeanReifschneider, it seems necessary to support a node which is connected to several STONITH devices. The fencing subsystem should detect this and act accordingly.

    • AlanRobertson notes that this is covered by the original proposal above.

  • Should this interface be by node uuid? Should a by-nodename interface also be provided?