This would allow upgrades without any perturbation at all to the running cluster. This is not really a RollingUpgrade. AlanRobertson notes that there is no budget assigned for this feature, and classifies it as future work beyond 2004. Clearly the HeartbeatProgram doesn't provide this feature, and as of now, no one who uses the HeartbeatProgram has ever asked for this capability. Getting a working RollingUpgrade capability seems a higher priority.
Of course, everyone knows the LocalResourceManager will never fail, core dump, or do anything so rude as that :-). But perhaps someone might rudely kill it with kill -9. In that case, the ClusterResourceManager could simply go through the ClusterInformationBase, ask it about each declared resource in turn, and perform a status operation on each. At the end, all will be well, and all the information will have been recovered. This same technique could easily be used for both the untimely-crash and the TransparentUpgrade scenarios.
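The rediscovery loop described above can be sketched as follows. This is a minimal illustration, not the real Heartbeat/LRM API; the `Resource` class and the `status()` return values are assumptions standing in for the actual declared-resource list in the CIB and the resource agents' status operations.

```python
# Hypothetical sketch: after an untimely LRM restart, walk every resource
# declared in the CIB and run a status probe on each to rebuild the
# picture of what is running. No persistent LRM state is consulted.

class Resource:
    """Stand-in for a resource declared in the ClusterInformationBase."""
    def __init__(self, rid, running):
        self.rid = rid
        self._running = running          # stands in for the real status op

    def status(self):
        # In reality this would invoke the resource agent's status action.
        return "running" if self._running else "stopped"

def rediscover(cib_resources):
    """Rebuild resource state purely from status probes, one per resource."""
    return {r.rid: r.status() for r in cib_resources}

cib = [Resource("ip-addr-1", True), Resource("filesystem-1", False)]
state = rediscover(cib)
# state == {"ip-addr-1": "running", "filesystem-1": "stopped"}
```

The point of the design is visible here: nothing survives the crash except the CIB itself, and the status probes regenerate everything else.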
Now, an argument for why this is the better solution... The ClusterResourceManager already has to handle versioning of the CIB, and starting a new version of the code against an old version of the database. So the problem of reading a 5- or 10-year-old database format with new code is something it already has to solve. If the LRM keeps persistent state, the LocalResourceManager has to deal with this same issue, solely for TransparentUpgrades - so this annoying and nasty problem (database versioning) has to be solved in two places instead of one. If one takes the no-persistent-memory approach, then in the process you get recovery from kill -9's and (heaven forbid!) crashes at no additional charge, and a lower overall development effort.
If the ClusterResourceManager can accept this implementation, then the only feature which needs to be added to the LocalResourceManager is the ability to shut down without releasing resources. We can commit to doing that in our first production release.
LarsMarowskyBree: Yes, that is a possible approach. I have to admit it may be better than the original plan. However, I'd like to point out some corner-case OpenIssues which need to be addressed with this version:
First, the ResourceInstance which is currently running could be running with different parameters than the ones in the ClusterInformationBase - for example, a port number which was changed in the CIB but which has not yet taken effect. The status operation would report that the ResourceInstance is offline or failed, but would in fact be lying. A possible workaround might be to simply disallow changes to such parameters for running resources.
Second, there is some status loss involved here. Assume for a second the following states a ResourceInstance can be in: stopped, starting, running, failed, restarting, stopping, stop failure. The last one is special because it implies an error during the stop operation: a failure which means the resource got fatally stuck, and to free it we would very likely need to reboot the entire node. Now, the stop failure state may not be distinguishable from stopped by the status operation, or at least we would need to make sure we kept proper track of it in the ResourceAgent - and init scripts might not be powerful enough to do this. But forgetting about a stop failure state could mess up the cluster too.
Third, one of the reasons why the LocalResourceManager was supposed to keep this state itself, and not the ClusterResourceManager in the CIB, was to meet a request by AlanRobertson: We should not assume the ClusterResourceManager to be the only client of the LocalResourceManager. This seemed to imply that the LRM needed to keep its own, independent state. If the restriction that we would only rediscover the CRM resources is OK, then we can go this route.
Fourth, with the NodeFencing proposal, we assume that the LocalResourceManager is also doing STONITH for us. The idea was that the LRM would also remember when it has (un)successfully STONITHed another node, information which is not easily kept in the CIB. (Because the current proposal for figuring out the most recent CIB is a simple "who has the higher generation counter" comparison, rather than a full-fledged merge of status information from the different partitions.) In the worst case, we might forget about a STONITH operation and STONITH a node again.
AlanRobertson replies: With regard to the first issue: This is not something which the LRM is prepared to deal with. It cannot change the parameters of any running resource. From its point of view, changing the parameters of a resource creates a new resource. It is not going to do any kind of comparison between resource parameters. Therefore this is not going to be helped or hurt by this proposal. This issue is a RedHerring.
With regard to the second issue: There are only three states a resource can be in from our point of view: stopped, started, or in-transition. Any time you have a resource start fail, or a resource stop fail, nothing is reliably known any more, and you cannot start the resource again on any node until this node has been STONITHed. Doing half-measures isn't good enough, and may destroy data because the meaning of being half-started or half-stopped is undefined -- and will stay that way. This is a second RedHerring.
With regard to the third issue: It's a RedHerring too... You don't have to keep state. You just have to know the universe of possible resources. You had to know that before, and you have to know it now. That hasn't changed. Having multiple clients doesn't somehow force the LRM to keep local state; even with multiple clients, it doesn't need any.
With regard to the fourth issue: If a STONITH operation cannot be well-modeled as a resource, then that argues against modelling STONITH operations as resources, not that resources need to be made more complex in order to somehow squeeze in support for this not-a-resource. It is worth noting that *resources* often do keep persistent state; however, I don't think that helps STONITH. Because if I issue a STONITH and it succeeds, what is the status of the STONITH resource? I think the answer to that question is "stopped" - because a STONITH operation is self-stopping. This was my concern about trying to model a STONITH operation as a resource, and it would be made worse by trying to keep LRM-local persistent state: the LRM would not be aware of any special property of the resource, and wouldn't realize that although it had been started, it was now stopped. However, if the resource itself keeps its own persistent state (as some pseudo-resources need to do already), then the resource would be well aware of its own special nature. I believe that STONITH operations would not need even this, because they are self-resetting, and should report "stopped" unless a power-off (not reset) STONITH was requested. Although I'm not as sure as I was for the previous issues, I believe that this fourth issue is also a RedHerring.
The first issue isn't as much of a RedHerring as it looks. Maybe I explained it wrong. Of course, we are not prepared to change the parameters of a running ResourceInstance on the fly. But that's exactly the point: the database may have changed since we started it. When we inquire about the status of the resource from the LRM, though, we use the UUID, which is static. So it will tell us that yes, the resource with UUID 38484343943 is still running. But, as I said, changing the parameters of online resources is probably not the smartest idea and opens up tons of corner cases, so disallowing it seems the best answer.
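The parameter-drift problem can be illustrated with a tiny sketch. The UUID is the one quoted above; the dictionaries, the `port` parameter, and `status_by_uuid` are purely hypothetical names for illustration, not anything in the Heartbeat code.

```python
# Sketch of the drift: the LRM tracks a ResourceInstance by a static
# UUID, so a status query keyed on the UUID still reports "running"
# even though the CIB's copy of the parameters has since been edited.

lrm_instance = {"uuid": "38484343943",
                "params": {"port": 80},      # what is actually running
                "state": "running"}

cib_entry = {"uuid": "38484343943",
             "params": {"port": 8080}}       # edited later in the CIB

def status_by_uuid(uuid):
    """The LRM answers by UUID alone; parameters are never compared."""
    if lrm_instance["uuid"] == uuid:
        return lrm_instance["state"]
    return "stopped"

status_by_uuid(cib_entry["uuid"])  # -> "running", despite differing params
```

Disallowing parameter changes for running resources, as proposed above, makes the two copies trivially identical and the question moot.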
The second issue: A start fail can potentially be cleaned up by a stop.
But yes, the fact that nothing is reliably known anymore is exactly the point: we are not allowed to forget about these states, yet a status operation may not retrieve this failure status again. If we rely on the LRM status operation to refresh our memory, we may get into trouble here. That's why I'm saying this information needs to be persistently kept by the LRM until the next reboot of the node. A STONITH is indeed the only answer to clean that up, agreed. However, we may decide not to do that right now - the resource in question may be low-priority, while the high-priority resource is running just fine on that node. STONITHing it then wouldn't be a smart move. So, we need to remember that state.
I'm also not sure whether started, stopped, and in-transition are sufficient states - but maybe this is a separate OpenIssue. I'd say that stopped, starting, started, stopping, monitor failed, restarting, and stop failed, with the appropriate transitions between them, are needed. If the CRM sees a resource in the starting state, it can assume that - given the optimistic case - it will eventually arrive in the started state, but it has to wait until then to issue start orders for resources depending on this one.
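The state set above can be written down as a small transition table. The state names follow the list in the text; the particular set of legal transitions is my own illustrative guess (e.g. that a failed start can be cleaned up by a stop, as mentioned earlier, and that stop failed is terminal until a reboot/STONITH), not an agreed design.

```python
# Hypothetical transition table for the proposed ResourceInstance states.
# "stop failed" is deliberately terminal: only a node reboot or STONITH
# clears it, per the discussion above.

TRANSITIONS = {
    "stopped":        {"starting"},
    "starting":       {"started", "stopping"},          # stop cleans up a failed start
    "started":        {"stopping", "monitor failed"},
    "monitor failed": {"restarting", "stopping"},
    "restarting":     {"started", "monitor failed"},
    "stopping":       {"stopped", "stop failed"},
    "stop failed":    set(),                            # terminal until reboot/STONITH
}

def transition(current, new):
    """Move to a new state, refusing anything the table doesn't allow."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current!r} -> {new!r}")
    return new
```

A CRM waiting on a dependency would then simply refuse to issue downstream start orders until the resource has reached "started".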
Third: As I said, it seemed to imply this; I don't believe it is a RedHerring, but rather a now-cleared-up issue.
Fourth: I agree. The STONITH request results should not need to be tracked by the LRM, they probably need to be tracked by the STONITH resources themselves or in the CIB. Thanks for this clarification.
One more clarification on this, though: as NodeFencing explains, we are not actually modelling a STONITH operation as a Resource; we are modelling the STONITH controller as a resource, i.e. the gateway through which the ClusterResourceManager routes its STONITH requests to a particular STONITH device. The requests themselves do not need to go through the LocalResourceManager, though it would make a certain amount of sense both from a design perspective and in terms of the amount of coding necessary: the LRM is our gateway to resources, and it already has all that infrastructure in place; we are just asking it to perform one more action on a resource.
As a fifth issue, rediscovering all resource instances on a node might be a very expensive operation, depending on how many resources are defined in the cluster. This might not be a pressing issue, but if the LRM were able to recover quickly from restarting or crashing, it certainly wouldn't hurt - but this may be an optimization for the future.
And as a sixth one (sorry, I'm not trying to annoy you, it just occurred to me), the rediscovery of resource state is not simple - or at least not reliable - in particular for init scripts. We can put this down as a design limitation and say that people need to fix their init scripts in that case, but we already get a fair bit of trouble right now from init scripts being unable to implement the status operation correctly... Keeping track of "this is what you are supposed to have" internally might make things easier for them.
As a seventh issue (sorry!), the need to serialize resource operations on a given ResourceInstance also seems to require state tracking. Imagine this: we crash after having just issued a stop operation. We restart, run status to rediscover everything, and oops - we have just stepped on each other's toes, because we forgot that a stop operation was still in flight. (Or we crash during a monitor operation, or whatever else.)
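One way to remember an in-flight operation across a crash is to journal the intent before issuing it and clear the journal afterwards - a write-ahead record. This is only a sketch of what such tracking could look like; the file layout, function names, and JSON format are all assumptions, and the discussion below resolves this issue differently (KISS: just restart the node).

```python
# Hypothetical write-ahead journal for in-flight LRM operations:
# record the intent on disk *before* issuing the operation, remove the
# record when it completes, and check for leftovers after a restart.

import json
import os
import tempfile

def begin_op(journal, rid, op):
    """Record the intent before issuing the operation."""
    with open(journal, "w") as f:
        json.dump({"resource": rid, "op": op}, f)

def end_op(journal):
    """Clear the record once the operation has completed."""
    os.remove(journal)

def recover(journal):
    """On restart: report any operation still in flight, or None."""
    if os.path.exists(journal):
        with open(journal) as f:
            return json.load(f)
    return None

journal = os.path.join(tempfile.mkdtemp(), "lrm-inflight.json")
begin_op(journal, "ip-addr-1", "stop")
# ...crash here: after restart, recover() shows a stop was in flight,
# so a fresh status probe must not be trusted blindly.
leftover = recover(journal)   # {"resource": "ip-addr-1", "op": "stop"}
```

Whether this complexity is worth it versus simply rebooting on such a rare double-crash is exactly the trade-off discussed next.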
So, the first issue is cleared up by simply disallowing this (which makes sense); three and four are cleared up; and five we can postpone. The sixth is probably not a real issue, but something I wanted to mention at least.
This leaves issues two and seven still open. I believe these actually are issues, but I don't see an immediate answer to them, other than saying that status ought to discover resources hanging around in all possible failure cases (which seems difficult, and hard on the RA/init script writers), or keeping this persistent tracking ourselves. Falling back to a full rediscovery of resources using the status operation is backup plan B, but doesn't seem to be the most reliable nor the speediest one. Please let me know your thoughts.
AlanRobertson says Whew! and then makes the observation that this item has morphed into tracking resource status - which I think is another major issue number. I've forgotten which one. But be that as it may...
Subitem (two): LarsMarowskyBree and I talked about this extensively this morning (2/19/2004), and came to the mutual conclusion that the actions that have to be taken when a resource start or stop fails are cluster-wide and not local. In particular, the ClusterResourceManager has to stop all the resources which depend on the failed resource, and then remember not to start this resource on any other node in the cluster until proper recovery can occur. There are two kinds of "proper recovery": STONITH the node on which the stop or start failed, or stop the resource and, if that succeeds, double-check the status of the resource to see if it really stopped. Note that STONITH is risk-free (but very annoying), while the other option depends on knowing how well the resource agent is coded. Nevertheless, since resource dependencies potentially span nodes, and the recovery requires policies and knowledge the LRM doesn't know anything about, it is safe to say that the ClusterResourceManager has to track this kind of occurrence very specially, to keep the resource from starting on another node before this potentially-extensive recovery has been completed. As a result, we agreed that since the CRM needed to track this itself, there was no point in requiring the LRM to also track it.
Subitem (seven): We agreed that this should be a rare occurrence, and that if it happens we can require a complete system restart (or STONITH). Moreover, it seems that this very improbable occurrence would be even less likely to actually restart and go through the process fast enough to run into the problems cited. KISS dictates either ignoring it or restarting the computer, because the latter recovery action is very simple to implement and always works.
It is my (AlanRobertson's) belief that all the subitems of this item have been resolved for the time being. I would ask LarsMarowskyBree to move this item to the closed list if he agrees. If not, we can go on and add subitems 8, 9, and 10.