The ClusterResourceManagerDaemon co-ordinates the actions of all other ClusterResourceManager subsystems on a given ClusterNode. One of these is elected as the DesignatedCoordinator based on a list of FullyConnected ClusterNodes from the ClusterConcensusMembership.
Regular instances of the ClusterResourceManagerDaemon are slaves to the DesignatedCoordinator and effectively end up as glorified routers. Passing messages to and from the DesignatedCoordinator for sub-systems such as the ClusterInformationBase and the LocalResourceManager.
The actions of the ClusterResourceManagerDaemon are controlled by a FiniteStateMachine which is described here.
1. Local Startup
Connect to the HeartbeatProgram,
Connect to the ClusterConcensusMembership,
Connect to the LocalResourceManager
Start the ClusterInformationBase
If any startup actions fail, the ClusterResourceManagerDaemon will terminate as it considers itself unhealthy. Therefore the HeartbeatProgram and the LocalResourceManager must all have been started prior to the ClusterResourceManagerDaemon. In particular the ClusterConcensusMembership daemon must not only have been started but also needs to have established itself in the cluster.
NOTE: For more information, see Starting the FirstNode.
Start DC Timer
The DC Timer tracks how long it has been since this node heard from the DesignatedCoordinator and is used to trigger elections (see below).
Wait for an event
Events include being asked to shutdown, an election being triggered (locally or by another node), or receiving communication from the DesignatedCoordinator.
At this point the node is in a pending state. Eligible to participate in the resource layer of the cluster but not yet with permission to do so. Such permission can only be granted by the DesignatedCoordinator and is done so as part of the JoinProtocol outlined below.
ClusterNode Announces Presence to the DesignatedCoordinator
This action is initiated upon receiving a DC_Heartbeat from the DesignatedCoordinator.
DesignatedCoordinator Verifies Announce Message
At this step, the DesignatedCoordinator will check if the node has established layer2 membership (Ie. is in the list supplied by the ClusterConcensusMembership) and if not, discards the announcement and aborts the join protocol for that node.
DesignatedCoordinator Issues Invitation to Join the CRM Layer (layer3)
NOTE: Upon election, the DesignatedCoordinator will also send this invitation to all ClusterNodes with layer2 membership.
ClusterNode Receives Membership Offer
ClusterNode Updates its ClusterInformationBase "status" Entry Locally
The information being updated is primarily from the LocalResourceManager about what resources are currently running under its control. However, ClusterNode's copy may contain information about nodes that the DesignatedCoordinator is not aware of yet which much be preserved.
ClusterNode Acknowledges the DesignatedCoordinator
The ClusterNode's updated status section is attached to the acknowledgement as data.
DesignatedCoordinator Receives the ClusterNode's Acknowledgement
The DesignatedCoordinator uses the status section attached to the acknowedgement to update the ClusterInformationBase but does not yet replicate the changes. Updates are not taken "verbatum" but are completed based on the timestamps of entries contained in the update.
DesignatedCoordinator Updates the ClusterNode's ClusterInformationBase Entry
The node's state in the ClusterInformationBase is updated to "active" after which it is able to be considered by the PolicyEngine when allocating resources.
DesignatedCoordinator Distributes the Updated ClusterInformationBase
The PolicyEngine is Invoked by the DesignatedCoordinator
The PolicyEngine is invoked to make sure existing resources are allocated correctly and determine if new ones are able to be started.
The election algorithm has been adapted from http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR521
Loosely known as the Bully Algorithm, its major points (with some local substitutions) are:
The DC Timer
If any node goes without receiving any DC_Heartbeats in the interval in which (currently) 4 should have been received, it means that there is no DesignatedCoordinator and an election is required.
Receiving a DC_Heartbeat from another DesignatedCoordinator
If one DesignatedCoordinator receives a heartbeat from another DesignatedCoordinator (such as after a cluster partition has healed), an election is also triggered.
User Intervention
Currently there is no facility to manually instigate an election. This may be considered at a later date depending on demand.
Nodes must satisfy the following criteria in order to be eligible to vote
Votes are sent as broadcast HeartbeatMessages to all other nodes.
The winner of the election in a PerfectWorld is the ClusterNode in the list supplied by the ClusterConcensusMembership with the lowest value of born_on. born_on was chosen as it indicates how long a ClusterNode has been in the cluster for and it is desirable that the DesignatedCoordinator should be placed on a stable ClusterNode as migrating it requires resources and time.
If there are more than one ClusterNode with the same born_on (which is likely), nodes are differentiated by their uuid. uuid was chosen as it is constant to a node and globally unique. The node with the lowest uuid (as determined by strcmp()) is deemed to be the winner.
In a LessThanPerfectWorld, the expected winner might be sick or shutting down and thus will not be able to cast their vote and claim the election.
In this case, the winner is the node that, of those that cast their vote, has the lowest born_on and uuid.
When Node1 counts a vote from Node2, the Winner Algothim is consulted to determine which of the two nodes (the current node or the node casting the vote) would win the election. If Node2 would win, then Node1 has lost the election and returns to a pending state, waiting for contact from the DesignatedCoordinator. However if Node1 would win, then it broadcasts (possibly not for the first time) its own vote. Next, Node1 then checks to see if it is the PerfectWorld election winner. If so, Node1 cancels the election and proceeds to take over the role of DesignatedCoordinator. Otherwise, Node1 must wait for the election expiration timer to go off before concluding it has won the election.
NOTE: At some point the voting process may be augmented to include information sent as part of the voting process (ie. version information or whether the node is currently a DesignatedCoordinator). Should this be the case, then a PerfectWorld election winner cannot be calculated and the winner must always wait for the election expiration timer to go off.
Start the PolicyEngine
Start the Transitioner
We are now the DesignatedCoordinator Further actions are:
Instigate ClusterNode discovery
This is achieved by (re)-initiating the join protocol for every node with layer 2 membership. See Join Protocol above for details.
Resource Allocation - Invoking the PolicyEngine
Once the JoinProtocol has completed for all layer2 nodes, the PolicyEngine is invoked to determine the next stable state of the cluster. We wait for all nodes to join to allow the cluster to stabilize as much as is possible before the expensive process of reallocating resources is initiated. Since not all layer2 nodes may be able to be promoted to layer3, a timeout exists which initiates the resource allocation process.
Stop sending DC_Heartbeats
This will lead to one of the other nodes triggering an election. Eventually the DC may be able to trigger an election itself. However at the moment it can't do that without voting and if it votes then it is more than likely to win this election for the same reasons it won last time.
Stop the PolicyEngine
Stop the Transitioner
Wait for a new DesignatedCoordinator to be elected
NOTE: See Stopping the LastNode for variations on this sequence.
Request the DesignatedCoordinator shut us down
Let the DesignatedCoordinator / PolicyEngine / Transitioner do their work
Disconnect from the LocalResourceManager
Disconnect from the ClusterConcensusMembership
Stop the ClusterInformationBase and write it out to disk.
NOTE: Might require a third expected state "shut_me_down" so that the PE doesnt just STONITH it.
If any stage of the shutdown process stalls, an escalation process is initiated.
The escalation process is two-phased.
Phase 1 is to re-issue the request to the DesignatedCoordinator.
NOTE: Currently only phase 2 is implemented and it is broken as the timers are not working as expected.
This case is defined as trying to start a ClusterNode when no other ClusterNodes have layer2 membership or better (as determined from consulting the ClusterConcensusMembership list).
Starting the FirstNode will take longer than following nodes for two reasons.
Waiting for ClusterConcensusMembership
Currently connections to the ClusterConcensusManager will fail (it appears) until it has had a chance to establish a view of the cluster. Ie. timed out all the other nodes. Because of this, it would not be unusual for Heartbeat to respawn the ClusterResourceManagerDaemon a number of times before startup is successful.
This is not desirable behaviour and is likely to be addressed soon.
Waiting for the DesignatedCoordinator
The FirstNode currently must wait for the "DesignatedCoordinator timeout" before it realises that it there is no-one in charge. This could be avoided by initiating an election straight away if there is only one node listed in the ClusterConcensusMembership list. However this is currently not a high priority.
Slave Nodes
Slave nodes have no tasks to perform except to move to the S_PENDING state and await contact from the DesignatedCoordinator.
Previous DesignatedCoordinator Nodes
Any nodes that, prior to the election were the DesignatedCoordinator but are not anymore have a number of tasks to perform.
Release DesignatedCoordinator Status
This should be done as soon as possible to avoid having multiple nodes responding DC requests.
Stop the PolicyEngine
Stop the Transitioner
NOTE: It is unclear to me right now if this completely is safe. It would seem to be since:
As a result of step 1, any replies to outstanding requests will only be processed by the new DesignatedCoordinator.
It is unlikely any further actions should be instigated by an old DesignatedCoordinator.
This case is defined as trying to shutdown a ClusterNode when no other ClusterNodes have layer2 membership or better (as determined from consulting the ClusterConcensusMembership list).
If we are the last active node, then there is little point starting the escalation process to and we know ahead of time that there is no-one to take over as DesignatedCoordinator. In this case, once we have reached the S_PENDING state we should continue shutting down by:
Disconnect from the LocalResourceManager
Disconnect from the ClusterConcensusMembership
Stop the ClusterInformationBase and write it out to disk
Immediately after two or more cluster partitions heal together there will be more than one DesignatedCoordinator. This will be detected within the period of one DC_Heartbeat. The case where one DesignatedCoordinator receives a DC_Heartbeat from another DesignatedCoordinator is a trigger for an election after which only one ClusterNode will remain the DesignatedCoordinator.
See Losing an Election for information on the actions of losing nodes.
When a Cluster Enters a new epoch (ClusterNodes joining or leaving, elections, ...) there may be a number of requests in-transit or pending replies.
NOTE: This section is still, more than most sections anyway, a work in progress.
Starting Facts
The order of messages from NodeA to NodeB and from NodeB to NodeA can NOT be guarenteed
The internal operations in the ClusterResourceManager are "push" rather than "pull"
The DesignatedCoordinator in epoch X may not be the DesignatedCoordinator in epoch X + 1
Inferred Facts
All requests for epoch X from the DesignatedCoordinator will be received before a new epoch is declared
Responses to requests made (to or from the DesignatedCoordinator) during epoch X may be received during epoch X + 1
Design
The current epoch is broadcast via DC_Heartbeat messages
Requests sent to the DesignatedCoordinator should be kept for possible resending (because the DesignatedCoordinator may change)
Otherwise, it is possible that the old node will crown itself DesignatedCoordinator. Thought: Does this open us up to a possible DoS attack? I.e constantly in an election.
Consequences
The DesignatedCoordinator must keep track of which epoch its slave nodes are up to
Replies from newer epochs should be discarded by the DesignatedCoordinator
Replies from newer epochs from the DesignatedCoordinator should be processed as normal by slave nodes
After an election of a new DesignatedCoordinator (as apposed to a re-election of a DesignatedCoordinator), NodeA's "current" epoch should be 0. Thought: Does this open us up to a possible replay attack?
The following outlines why these possible replies would not be of interest to a newly elected DesignatedCoordinator:
lrm_op
These can safely be ignored as any update to resources will be discovered during the JoinProtocol when the slave node updates it's status section.
shutdown
These can safely be ignored as the node's expected state would already have been updated to "down" and we will receive notice of its successful completion from the CCM. QUESTION: Should/Will there be a "shutdown" that would only shutdown the ClusterResourceManagerDaemon and NOT the HeartbeatProgram as well? If so, perhaps these answers should not be ignored.
update,replace,delete
We're about to overwrite 90% of their copy of the ClusterInformationBase and any problems in the status section will become evident during the JoinProtocol
query
If we are interested in finding something out, we will ask.
Questions