It is common for businesses to configure backup sites as part of their business continuity plans, so that when one site goes down, the other site can take over its workload.
When you configure a Heartbeat cluster in this way with some nodes in one site and some in another site, we call it a split-site or stretch configuration.
When a cluster is contained in a single location, it is relatively easy to create highly reliable communications between the nodes in the cluster. Combined with fencing techniques like STONITH, this makes it straightforward to guarantee that it is nearly impossible for a SplitBrain condition to arise.
However, in a split-site configuration it is virtually impossible to guarantee reliable communications between the cluster nodes, and fencing techniques are often unusable in such a situation - because they too rely on reliable communications.
So in a split-site configuration, one is left with the uncomfortable situation that SplitBrain conditions can routinely arise, and there is no fencing technique available to render them harmless. Unless properly compensated for by other methods, BadThingsWillHappen to a split-site cluster.
One of the key issues to be considered in implementing SplitSite clusters is the replication of state over distance. This is an interesting and difficult problem, but is outside the scope of the Heartbeat discussion.
This will be handled by other software or hardware components. For example, DRBD or HADR could be used to replicate data in software, or IBM's PPRC (Peer-to-Peer Remote Copy) product could replicate it at the disk hardware level. What Heartbeat will do is manage these replication services.
There are several possible variants of this problem, each of which leads to its own unique issues:
(a) 2-node split-site clusters, with one node at each site
(b) n-node split-site clusters with servers split evenly across two sites
(c) n-node split-site clusters with servers split unevenly across the two sites
We would like to handle at least cases (a) and (b) above. Handling case (c) well would be a bonus, but isn't completely essential (case (c) requires changes to the structure of the hostcache file).
In the CCM, the quorum process is broken up into two different pieces: the quorum method and the tie-breaker method. Both are implemented as plugins, so that a variety of different methods can be written and combined into solutions for different kinds of configurations.
It is the job of the quorum plugin to decide whether or not the cluster has quorum. When invoked, the quorum plugin can return any of the following possible answers:
HAVEQUORUM - this subcluster has quorum
NOQUORUM - this subcluster does not have quorum
TIEQUORUM - the result is a tie, which the tiebreaker plugin must resolve
As of this writing (2.0.5), we have implemented only one type of quorum plugin, using the classic majority-vote scheme. When one subcluster has an absolute majority (more than INT(n/2) nodes), the plugin returns HAVEQUORUM. When the subcluster has exactly half of the nodes in the cluster, it returns TIEQUORUM. When the subcluster has fewer than n/2 nodes, it returns NOQUORUM.
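To make the arithmetic concrete, here is a minimal sketch of that majority rule in C. The function and enum names are made up for the example; they are not the actual CCM plugin interface.
{{{
/* Sketch of the classic majority-vote quorum calculation described above.
 * The enum values and function name are illustrative only, not the real
 * CCM plugin interface.
 */
enum quorum_result { HAVEQUORUM, TIEQUORUM, NOQUORUM };

enum quorum_result
majority_quorum(int members_in_subcluster, int total_nodes)
{
        if (2 * members_in_subcluster > total_nodes) {
                return HAVEQUORUM;   /* absolute majority: more than n/2 */
        }
        if (2 * members_in_subcluster == total_nodes) {
                return TIEQUORUM;    /* exactly half: let the tiebreaker decide */
        }
        return NOQUORUM;             /* fewer than half of the nodes */
}
}}}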
When the quorum plugin returns TIEQUORUM, the tiebreaker plugin is called. It is the job of this plugin to use some method of breaking the tie such that it is virtually impossible for both sides to believe the tie was broken in their favor.
There is currently only one tiebreaker plugin available - twonode. The twonode tiebreaker breaks the tie if it is called in a two-node cluster, and does not break the tie if called in a larger cluster. This is consistent with the behavior of the R1 cluster manager.
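As a rough sketch (again with invented names rather than the real plugin entry points), the twonode behavior amounts to something like this:
{{{
/* Sketch of the twonode tiebreaker described above: it only breaks the
 * tie when the whole cluster consists of exactly two nodes.  Names are
 * illustrative, not the actual plugin interface.
 */
#include <stdbool.h>

bool
twonode_tiebreaker(int total_nodes)
{
        /* In a two-node cluster the surviving side wins the tie;
         * in any larger cluster this plugin declines to break it. */
        return total_nodes == 2;
}
}}}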
Many different types of quorum plugins are possible. The only constraint is that the combination of quorum plugin and tiebreaker plugin must be highly unlikely to grant quorum to both sides at once. In the absence of fencing mechanisms, it is also necessary to guarantee that there is sufficient time for resources to be stopped before quorum is moved from one subcluster to another.
Here are a few possible types of quorum plugins that come to mind:
(needs hostcache file changes).
(needs hostcache file changes).
(needs hostcache file changes).
Note: human intervention may be an attractive alternative to any of these methods. The R2 CRM basically has this built in, because you can tell a subcluster to ignore quorum, which has the same effect as saying "you have quorum unconditionally".
Connect to a tiebreaker server which guarantees that it will never break the tie (grant quorum) for more than one subcluster at a time. This method is somewhat similar to the disk reserve operation, but is software-based and well-suited to a split-site configuration.
Note that with this arrangement, it is possible to lose or gain quorum without any change in membership. This is not yet supported by the R2 CCM (but it needs to be).
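To illustrate the idea, here is a purely hypothetical sketch of how such a tiebreaker-server client could look. The server, its one-line text protocol, and all of the names below are assumptions made for illustration; no such component currently exists in Heartbeat.
{{{
/* Hypothetical sketch of a tiebreaker-server client: the subcluster asks an
 * external arbitrator to break the tie in its favour.  The server, the
 * one-line text protocol and all names here are assumptions only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns 1 if the server grants us the tie, 0 otherwise. */
int
ask_tiebreaker_server(const char *server_ip, int port, const char *cluster_id)
{
        char request[128];
        char reply[16];
        int granted = 0;
        struct sockaddr_in addr;
        ssize_t n;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0) {
                return 0;
        }
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, server_ip, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
                snprintf(request, sizeof(request), "TIEBREAK %s\n", cluster_id);
                write(fd, request, strlen(request));
                n = read(fd, reply, sizeof(reply) - 1);
                if (n > 0) {
                        reply[n] = '\0';
                        /* The server promises to answer GRANT to at most
                         * one subcluster at a time. */
                        granted = (strncmp(reply, "GRANT", 5) == 0);
                }
        }
        close(fd);
        return granted;
}
}}}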
It might be more general to revert to only one kind of plugin - the quorum plugin. One could then configure an ordered set of m plugins, where the result of a particular plugin is only taken into account when all previous plugins (if any) returned QUORUMTIE.
In the end, if one of the plugins returned HAVEQUORUM when all previous plugins had returned QUORUMTIE, then quorum would be granted.
Another (maybe simpler) way of saying this is: the plugins are consulted in order, and the first plugin that does not return QUORUMTIE decides whether the cluster has quorum.
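A minimal sketch of that ordered-plugin scheme, with illustrative names only, might look like the following (treating "every plugin tied" as no quorum, which seems like the safe default but is an assumption here):
{{{
/* Sketch of the ordered-plugin scheme described above: an ordered list of
 * quorum plugins is consulted, and the first plugin that does not return
 * a tie decides the outcome.  Types and names are illustrative only.
 */
enum quorum_result { HAVEQUORUM, QUORUMTIE, NOQUORUM };

typedef enum quorum_result (*quorum_fn)(int members, int total_nodes);

enum quorum_result
compute_quorum(quorum_fn plugins[], int nplugins, int members, int total_nodes)
{
        int i;

        for (i = 0; i < nplugins; i++) {
                enum quorum_result r = plugins[i](members, total_nodes);
                if (r != QUORUMTIE) {
                        return r;       /* first definite answer wins */
                }
        }
        /* Every plugin reported a tie: deny quorum as the safe default. */
        return NOQUORUM;
}
}}}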
Note that many combinations of plugins make little or no sense.
(don't know if this is clear, or a really good idea ;-))
zhenh: The way I see it, it may work like this:
(This is definitely not how I think it should work. This would be (IMHO) broken. Once a plugin returns HAVEQUORUM, the result should be HAVEQUORUM. This is a prioritization scheme, not a voting scheme -- AlanR)
Let's consider that we have a local quorum plugin (majority), a local tiebreaker (twonode), and a global quorum (3rd quorum).
(Twonode should never be used in a split-site situation. In fact, if we add site designations to nodes in the cluster, it should automatically disable itself).
zhenh's understanding:
Change the CCM so that plugins can set up callbacks and change their quorum calculation without any change in membership (see the sketch below)
Implement the tiebreaker server quorum tiebreaker plugin
This would allow us to handle split sites with an equal number of nodes in each site.
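To make the first step above more concrete, here is a hypothetical sketch of what such a callback interface could look like. This API does not exist in the CCM today; every name below is an assumption.
{{{
/* Hypothetical sketch of the CCM change mentioned above: letting a quorum
 * plugin register a callback so it can announce a quorum change that is not
 * triggered by a membership change (e.g. the tiebreaker server revoking its
 * grant).  None of these names exist in the current CCM.
 */
#include <stddef.h>

enum quorum_result { HAVEQUORUM, TIEQUORUM, NOQUORUM };

typedef void (*quorum_changed_cb)(enum quorum_result new_result, void *user_data);

struct quorum_plugin {
        const char        *name;
        quorum_changed_cb  notify;      /* set by the CCM at registration */
        void              *user_data;
};

/* Called by the CCM when it loads the plugin. */
void
quorum_plugin_register(struct quorum_plugin *p, quorum_changed_cb cb, void *ud)
{
        p->notify = cb;
        p->user_data = ud;
}

/* Called from inside the plugin (e.g. when the tiebreaker server connection
 * drops) to tell the CCM to recompute quorum without a membership event. */
void
quorum_plugin_announce(struct quorum_plugin *p, enum quorum_result r)
{
        if (p->notify != NULL) {
                p->notify(r, p->user_data);
        }
}
}}}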
If somehow we get these things done, and still have time, then these enhancements are worth considering, in approximately this priority order.
Optional steps (maybe we can get gshi to do some of this work?)
Implement the site designations and node weighting updates to the hostcache file.
Disable the twonode override when multiple sites are configured.
Implement a new CCM API call to allow the list of quorum modules to be set by the CRM. This will make the configuration much more manageable.
This will allow n-site arrangements to be supported
Implement a tiebreaker module that counts ping nodes and returns TRUE if all are reachable (I know this has nothing to do with SplitSite, but it would be nice anyway).
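A rough sketch of such a ping-node tiebreaker follows, kept self-contained by shelling out to /bin/ping rather than using Heartbeat's own ping membership. All names are illustrative, not an existing module.
{{{
/* Sketch of the ping-node tiebreaker mentioned above: break the tie only if
 * every configured ping node answers.  A real plugin would reuse Heartbeat's
 * ping facilities instead of invoking /bin/ping.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

static bool
ping_node_reachable(const char *host)
{
        char cmd[256];

        /* One ICMP echo request with a one-second deadline. */
        snprintf(cmd, sizeof(cmd), "ping -c 1 -w 1 %s >/dev/null 2>&1", host);
        return system(cmd) == 0;
}

bool
pingnode_tiebreaker(const char *ping_nodes[], int count)
{
        int i;

        for (i = 0; i < count; i++) {
                if (!ping_node_reachable(ping_nodes[i])) {
                        return false;   /* at least one ping node unreachable */
                }
        }
        return count > 0;   /* all configured ping nodes answered: break the tie */
}
}}}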