This page is formatted and updated according to the IssueTrackingWikiProtocol.
21) Outstanding items regarding LRM/CRM integration
The current manual LRM testing setup is neither comprehensive nor adequate.
A suite of regression tests that verifies all facets of LRM functionality is required before it will be enabled in the CRM. Debugging the CRM alone is complex enough without trying to debug the LRM at the same time. The tests should be able to be run by others based on what is in CVS.
Off the top of my head, below are some parts of the LRM that need to be tested for each resource class and with both good and bad data/inputs where appropriate:
metadata operations (including faking metadata, the "description" field IIRC, in the case of LSB RAs)
event generation (maybe just start with a fake heartbeat RA that is rigged to fail after X seconds)
monitoring (is this different from monitoring?)
doing things at the wrong time (stop a resource when it's not started, for example)
doing bad things (try to add duplicate resources, perform an action that isn't supported)
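As a rough illustration of what the "wrong time" and "bad things" cases could look like as automated checks, here is a minimal Python sketch. FakeLRM, its method names and the use of ValueError are all hypothetical stand-ins; the real tests would drive the actual LRM API instead.

```python
class FakeLRM:
    """Stand-in for the LRM, just enough to show the shape of the tests."""

    SUPPORTED = {"start", "stop", "status"}

    def __init__(self):
        self.resources = {}   # resource id -> "started" or "stopped"

    def add_resource(self, rid):
        if rid in self.resources:
            raise ValueError("duplicate resource: %s" % rid)
        self.resources[rid] = "stopped"

    def perform(self, rid, action):
        if action not in self.SUPPORTED:
            raise ValueError("unsupported action: %s" % action)
        if action == "start":
            self.resources[rid] = "started"
        elif action == "stop":
            self.resources[rid] = "stopped"   # stopping a stopped resource is harmless
        return self.resources[rid]


def raises(func, *args):
    """Return True if calling func(*args) raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


lrm = FakeLRM()
lrm.add_resource("ip0")
```

Each bullet above would then become one or more checks of this kind, for example that adding "ip0" a second time is rejected, or that stopping a resource that was never started still reports it as stopped.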
20) Understanding about OCF RA specification
Please refer to the file resource-agent-api.txt
* (Around line 149)
But can more definite rules be given for an implementation?
* (Around line 306)
RA executor. But what values should be assigned to some variables? For example, for OCF_RESOURCE_INSTANCE, OCF_RESOURCE_TYPE
OCF_RESOURCE_TYPE should be the name of the resource being invoked -- for example Filesystem, or datadisk, or LinuxSCSI, or ServeRAID -- it's normally the same as the script name. As far as OCF_RESOURCE_INSTANCE, I think you need to ask LMB about this one... -- AlanRobertson
If OCF_RESOURCE_TYPE is always the name of the resource being invoked, maybe it doesn't need to be set by the RA executor. Right? -- SunJiangDong
Partially for my own clarification, examples of OCF_RESOURCE_TYPE are IPaddr and apache (as opposed to 10.0.0.1 and myWebServer). -- AndrewBeekhof
OCF_RESOURCE_INSTANCE is the configured name for the resource (eg. 10.0.0.1 and myWebServer from above). -- AndrewBeekhof
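To make the distinction between the two variables concrete, here is a minimal Python sketch of how an RA executor might prepare the environment for an agent. The function name build_ra_env and the base_env argument are hypothetical; only the two variable names and the example values come from the comments above.

```python
import os


def build_ra_env(resource_type, resource_instance, base_env=None):
    """Return a copy of the environment with the two OCF variables set.

    resource_type     -- kind of resource, e.g. "IPaddr" or "apache"
    resource_instance -- configured name, e.g. "10.0.0.1" or "myWebServer"
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OCF_RESOURCE_TYPE"] = resource_type
    env["OCF_RESOURCE_INSTANCE"] = resource_instance
    return env


# Start from an empty environment so the result is easy to inspect.
env = build_ra_env("IPaddr", "10.0.0.1", base_env={})
```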
* (Around line 168)
leave the resource instance in the requested state.
What's the meaning?
It means that stopping twice in a row is the same as stopping once. Starting twice in a row is the same as starting once. Note that this is a property of the individual ResourceAgent, not of the LocalResourceManager. This means you don't have to do anything to make this happen - it's supposed to be a property of the ResourceAgent. If you wanted (for some reason) to rely on this being true, then you would be permitted to rely on this property. But, I suspect you don't care one way or the other about this property. -- AlanRobertson
I'm writing OCF-compliant RAs based on the former heartbeat resource agent scripts, so I do have to care about it -- SunJiangDong
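For anyone writing such an agent, the property described above can be sketched in a few lines of Python. FakeAgent and its real_starts counter are hypothetical and exist only to show that a repeated start or stop succeeds without doing the work twice.

```python
class FakeAgent:
    """Toy resource agent whose start and stop are idempotent."""

    def __init__(self):
        self.running = False
        self.real_starts = 0   # how many times we actually did work

    def start(self):
        if not self.running:   # already started: nothing to do
            self.running = True
            self.real_starts += 1
        return 0               # success either way

    def stop(self):
        self.running = False   # stopping a stopped resource is a no-op
        return 0


agent = FakeAgent()
rc_first, rc_second = agent.start(), agent.start()
```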
19) Clarification of flush command behavior.
SunJiangDong asks: A flush command for an RA is received while the last operation on the same RA has not finished yet. How should this situation be dealt with? Flush at once, or hold the flush until the last operation has finished? The current choice is: execute the "flush" operation only once the last operation has finished.
AndrewBeekhof points out that since the LocalResourceManager is single threaded, this should not be an issue as (with the exception of monitor()) it will have always finished the current action before it gets the flush operation. If other asynchronous operations (from the point-of-view of the LocalResourceManager) are added, we should readdress this issue at that time.
AlanRobertson begs to both disagree and clarify. Although the LocalResourceManager is not threaded, it can have many operations going on at once in child processes for different ResourceInstances, and several queued for a given ResourceInstance. From the point of view of the user interface to the LocalResourceManager, almost all operations are asynchronous. So, the question is not moot. However, there are probably only two operations which can be safely interrupted. Those are status and monitor. All other interruptions risk damaging the integrity of the resource. Therefore the LRM should either not interrupt any operations at all, or only interrupt status and monitor operations.
AlanRobertson also suggests that other issues get separate issue files, so we can move this one to closed status, if there are no disagreements. The other cases where we had several sub-issues came out as a result of the flow of the discussion, not by intention from the beginning.
AndrewBeekhof, while possibly choosing a sub-optimal way to express it, agrees with AlanRobertson that the LRM should either not interrupt any operations at all, or only interrupt status and monitor operations.
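The agreed behaviour can be sketched as follows. This is only an illustration of the second option (interrupt status and monitor, nothing else); the function signature and the return value are hypothetical and not taken from lrm_api.h.

```python
from collections import deque

INTERRUPTIBLE = {"status", "monitor"}   # the only operations safe to interrupt


def flush(running_op, queue):
    """Drop all queued operations; say whether the running one may be interrupted.

    Returns (interrupt_running, dropped_ops). The running operation is only
    marked for interruption when it is a status or monitor operation.
    """
    dropped = list(queue)
    queue.clear()
    interrupt_running = running_op in INTERRUPTIBLE
    return interrupt_running, dropped


pending = deque(["stop", "start"])
interrupt, dropped = flush("start", pending)
```

Here a running start is left alone while the two queued operations are discarded; with a running monitor the first element of the result would be True.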
18) This issue was raised by SunJiangDong. He needs clarification about the format of the command-line parameters transferred to the LRM from the CRM. In addition, the environment parameters need to be transferred to the LRM separately from the command-line parameters; for example, don't merge them into one GHashTable.
AlanRobertson replies: I don't quite know what you mean by command line parameters transferred to the LRM from the CRM. Since the CRM won't start up (or invoke) the LRM, it won't be possible to pass command-line parameters to the LRM from the CRM, and the LRM won't inherit its environment from the CRM either. Are you referring to environment parameters passed as part of the API?
AndrewBeekhof chimes in: We had this chat yesterday on IRC and I believe SunJiangDong is speaking of InstanceParameters. It was agreed that InstanceParameters would be passed to the LRM in a HashTable of the form ("param_name", "value"). The LRM, based on the type of the RA, would convert this into a compatible form (ie. --param_name=value, or param_name=value as an environment variable, or any other appropriate form). It was seen that this was a job for the LRM, as it would otherwise require the CRM to know too much about the internals of a resource.
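A minimal Python sketch of that conversion step, assuming a plain dict stands in for the HashTable. The class names "cmdline" and "env" are made up for the example; which real RA classes map to which form is not decided here.

```python
def to_command_line(params):
    """Render parameters as --name=value arguments (sorted for stable output)."""
    return ["--%s=%s" % (k, v) for k, v in sorted(params.items())]


def to_environment(params):
    """Render parameters as name=value environment entries."""
    return {k: str(v) for k, v in params.items()}


def convert(ra_class, params):
    """Pick a representation based on the RA class (class names are illustrative)."""
    if ra_class == "cmdline":
        return to_command_line(params)
    if ra_class == "env":
        return to_environment(params)
    raise ValueError("unknown RA class: %r" % ra_class)


params = {"ip": "10.0.0.1", "port": 80}
args = convert("cmdline", params)
env = convert("env", params)
```

The point of keeping this inside the LRM is that the caller only ever builds the params mapping and never needs to know which of the two forms a given agent expects.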
HuangZhen said, flush: flush has been added to the new version of lrm_api.h, please refer to Issue 16. drain: would you like to explain why we need drain? the question: according to the current design, there is no automatic flush.
AlanRobertson replies: This is exactly why an automatic flush of existing, but not executed operations might be desirable. The problem is that you only return the status of the last command, and also the next commands on the operation might not be appropriate if an operation in the queue failed first. This is the kind of thing that disk device drivers and others that queue commands do. But, no one has decided whether this should happen. This is really just a comment at this point, sort of a reminder that we need to make a decision on what the right thing to do is.
AlanRobertson continues on about drain(). This is another type of operation which is common when command queueing is present. The purpose of a drain operation is that one might want to know when all existing operations on a particular resource are completed. The drain operation completes when all the operations in the queue (regardless of how many there are and where they came from) complete. As was mentioned before, this is not something which had been decided is a requirement. It's a reminder to Andrew to decide for sure, and then if he says yes, we can go ahead and do it. It would probably be important that the drain operation not change the last command executed or last command return code.
LarsMarowskyBree adds that an automatic flush() on failure seems to be desirable. It is not mandatory though. A drain() however seems not to be needed, given that we have get_cur_state() and flush(). But it sure wouldn't hurt either, we may just not end up using it. It remains to be defined if a drain() would also cancel any pending monitor operations (as flush() likely should too)? My gut feeling is that it should.
AlanRobertson clarifies some more. A drain() operation can only be approximated by a status operation. However, it is not a perfect approximation, since it clears the last operation state. The point of a drain operation is not error recovery but determination of status, so IMHO, it should not meddle with anything else like cancel current monitor operations (like a flush would). Another way to add a drain which might be more useful would be to add an immediate/queued parameter to the get_cur_state() operation. Given that we aren't actually doing anything much to our resources, IMHO many of the other traditional operations like suspend and resuming queues, and giving immediate (unqueued) operations would be overkill here.
AndrewBeekhof chips in. I'm thinking that neither drain() nor flush() should cancel the monitor op. As I see it, the intention of both functions is to (help) determine the most up-to-date information for a resource - stopping the monitor op goes against this purpose (or at least the purpose I had in mind). However I can see that if any action fails, it would be nice if the monitor stopped so it doesn't overwrite the last action & result information.
AndrewBeekhof rambles on... as for drain() vs. flush(), I think having flush() and a get_cur_state() that had an immediate/queued parameter would do very nicely, thank you.
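To show how flush() plus a get_cur_state() with an immediate/queued parameter could cover the drain() use case, here is a toy Python model. ResourceQueue and all of its members are hypothetical and do not reflect the actual lrm_api.h; every operation is assumed to succeed with return code 0.

```python
class ResourceQueue:
    """Toy per-resource operation queue (not the lrm_api.h interface)."""

    def __init__(self):
        self.pending = []      # operations not yet executed
        self.last_op = None    # last operation that completed
        self.last_rc = None    # its return code

    def submit(self, op):
        self.pending.append(op)

    def _run_pending(self):
        # Pretend every operation succeeds with return code 0.
        while self.pending:
            self.last_op = self.pending.pop(0)
            self.last_rc = 0

    def flush(self):
        """Discard operations that have not started yet."""
        dropped, self.pending = self.pending, []
        return dropped

    def get_cur_state(self, queued=False):
        """Report (last_op, last_rc).

        queued=False answers immediately; queued=True first waits for the
        queue to empty, which gives drain-like behaviour. Either way the
        query itself never overwrites last_op or last_rc.
        """
        if queued:
            self._run_pending()
        return self.last_op, self.last_rc


q = ResourceQueue()
q.submit("start")
immediate = q.get_cur_state()
drained = q.get_cur_state(queued=True)
```

The immediate query sees nothing completed yet, while the queued query returns only after the pending start has run, and neither query records itself as the last operation.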
16) LarsMarowskyBree asked AlanRobertson to design the API for the LocalResourceManager in order to flush out the remaining issues. AlanRobertson agreed. This may cause some items which were thought to be resolved to be moved back to open.
LarsMarowskyBree does like this in general, but the passing of parameters to the ResourceAgent is still missing. As OCF ResourceAgents take different parameter syntax than heartbeat resource scripts, while init scripts take no parameters at all, a single const char * parameter does not seem adequate.
15) LarsMarowskyBree: Discussion of virtual resources.
Fake resources which are auto-discovered by the nodes at bootup; ie, Connection to FC RAID backend FOOBAZ found, or I am an ia64 CPU
The ping nodes might fall into this category: Yes, I can ping them
Additional flags / fake resources which the administrator wants to associate with a node.
Some of these may be auto-discovered (such as the hardware types), or manually set by the administrator either persistently (requiring state to be kept!) or just until the next reboot.
AlanRobertson replies: The LocalResourceManager can handle any resource which has a ResourceAgent. That agent can be smart or dumb. It can be a pseudo-resource or a real resource. In heartbeat, there are already several kinds of pseudo-resources. It is usually necessary for a pseudo-resource to keep state. But, the kind and nature and persistency of the state depend on the resource - so the ResourceAgent for the pseudo-resource has to keep this state. In many cases real resource agents keep state too. For example, pid files, etc. This does not require the LocalResourceManager to keep persistent state of the resources - since any given resource needs to respond to stop, start and status correctly. [In particular it needs to respond to status]. The LocalResourceManager does not have any particular requirements on a resource and what operations it supports. If you aren't going to ask a resource to perform an unsupported operation (like start or stop), then we aren't either. Therefore, we don't care. It is worth noting that such resources are illegal resources according to the OCF resource agent definition. But, if you don't care, the LRM won't care. In fact, the LRM won't care no matter whether you care or not.
To get on my soapbox on a related subject: If it's not a cluster resource, then what we do to monitor it should not be cluster-specific but should work for single nodes as well. This is the recovery manager issue we've talked about before.
I consider this issue resolved. If you agree, please move it to resolved issues.
The ClusterResourceManager does not track state information and the design assumes that all node state is tracked by the nodes, and the LRM does know this.
AlanRobertson does not fully understand this request. The use of the word "Tracking" tends to imply the desire is to do something with the information. Since there was no requirement specified to do anything with this data, and the LocalResourceManager is PolicyFree, it isn't obvious what the word tracking means in this context. It could mean something as simple as logging the information. If that's what's intended, I'm sure we can do that.
LarsMarowskyBree: With tracking I mean to just keep the records around until the reboot of the node. For example, we require the LRM to keep a list of active ResourceInstances (obviously). This request extends that to failed/stopped resources: even if the resource instance has been stopped on that node, the LRM can still tell me that it is stopped, or that it has failed. The LRM should remember the last known state of a ResourceInstance on that node. That the LocalResourceManager should/could also easily keep a reboot counter of the node is just a nice touch.
The rationale for this is that the LocalResourceManager is authoritative for the node status, so that the current relevant cluster status can always be easily accessed by combining the data from all LRMs, avoiding the need for truly distributed book-keeping.
AlanRobertson has no foggy idea why the LocalResourceManager should be in charge of the status of nodes. It is in charge of the status of resources, not nodes. It knows absolutely nothing about clusters or nodes at all. It knows about resources on the current machine - which from its point of view is all there is to know about. PS: heartbeat keeps a restart count already. We can add it to the API if you want.
But with regard to "status"... The LRM can tell you the status of any resource, including ones which have never been started by it. That's inherent in the ResourceAgents. You just have to populate us with the configuration information for a resource, and we're off and running... In fact, on that subject, please see the next item...
LarsMarowskyBree: That the LocalResourceManager would track local resources (which in turn represent the full node state) was just the current assumption to work with; as a node does not have any other state except for the resources which it holds, that seemed sensible. We can discuss that, obviously, but it was the way of the original proposal - the full cluster status (wrt resources) could always be assembled by querying all LocalResourceManagers.
This would allow upgrades to the cluster without any kind of perturbation to the running cluster at all. This is not really a RollingUpgrade. AlanRobertson notes that there is no budget assigned for this feature, and classifies it as future work, that is beyond 2004. Clearly the HeartbeatProgram doesn't provide this feature. As of now, no one who uses the HeartbeatProgram has ever asked for this capability. Getting a working RollingUpgrade capability seems a higher priority.
Of course, everyone knows the LocalResourceManager will never fail, core dump, or do anything so rude as that :-). But, perhaps someone might rudely kill it with kill -9. In this case, the ClusterResourceManager could then simply go through the ClusterInformationBase and ask it about each declared resource in turn, and do a status operation on each. At the end, all will be well, and all the information will be gotten. This same technique could easily be used for both the untimely crash and the TransparentUpgrade scenarios.
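The rediscovery loop described above amounts to very little code. In this Python sketch the function name, the list of resource ids and the run_status callable are all hypothetical; a real implementation would read the ids from the CIB and invoke each agent's status operation.

```python
def rediscover(declared_resources, run_status):
    """Ask for the status of every declared resource and collect the answers.

    declared_resources -- iterable of resource ids known from the CIB
    run_status         -- callable(resource_id) -> "started" or "stopped"
    """
    return {rid: run_status(rid) for rid in declared_resources}


# Stand-in for real status operations: only "web" is running on this node.
actually_running = {"web"}
state = rediscover(
    ["web", "db"],
    lambda rid: "started" if rid in actually_running else "stopped",
)
```

Nothing is read back from disk, which is the whole attraction of the approach: a restarted LRM starts empty and is repopulated from the outside.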
Now an argument about why this is the better solution... The ClusterResourceManager already has to handle the problem of versions of the CIB, and starting up a new version of the code with an old version of the database. So, the problem of reading a 5 or 10 year old database format with new code is something they already have to do. If this approach is chosen, the LocalResourceManager has to deal with this issue solely for TransparentUpgrades. So, now this kind of annoying and nasty problem (database versions) has to be solved in two places instead of one. If one takes the no persistent memory approach, then in the process, you get recovery from kill -9's and (heaven forbid!) crashes for no additional charge, and a lower overall development effort.
If the ClusterResourceManager can accept this implementation, then the only feature which needs to be added to the LocalResourceManager is the ability to shut down without releasing resources. We can commit to doing that in our first production release.
LarsMarowskyBree: Yes, that is a possible approach. I have to admit it may be better than the original plan. However, I'd like to point out some corner-case OpenIssues which need to be addressed with this version:
First, the ResourceInstance which is currently running could be running with different parameters than the one in the ClusterInformationBase. For example, a port number which was changed in the CIB, but which has not yet taken effect - the status operation would report that the ResourceInstance is offline or failed, but would in fact be lying. A possible work around might be to simply disallow changes of such parameters for running resources.
Second, there is some status loss associated here. Assume for a second the following states a ResourceInstance can be in: stopped, starting, running, failed, restarting, stopping, stop failure. The last case is special, because it implies an error during the stop operation: a failure which means the resource got fatally stuck, and if we wanted to free it, we'd need to reboot the entire node, very likely. Now, the state stop failure may not be distinguished from stopped by the status operation, or at least we would need to make sure that we kept proper track of it in the ResourceAgent. init scripts might not be powerful enough to do this, though. But forgetting about a stop failure state could mess up the cluster too.
Third, one of the reasons why the LocalResourceManager was supposed to keep this state itself, and not the ClusterResourceManager in the CIB, was to meet a request by AlanRobertson: We should not assume the ClusterResourceManager to be the only client of the LocalResourceManager. This seemed to imply that the LRM needed to keep its own, independent state. If the restriction that we would only rediscover the CRM resources is OK, then we can go this route.
Fourth, with the NodeFencing proposal, we assume that the LocalResourceManager is also doing STONITH for us. The idea was that the LRM would also remember when it has (un)successfully STONITHed another node, information which is not easily kept in the CIB. (Because the current proposal for figuring out the most recent CIB is a simple who has the higher generation counter instead of a full-fledged merge of status information from the different partitions.) In the worst case, we might forget about a STONITH operation and STONITH a node again.
AlanRobertson replies: With regard to the first issue: This is not something which the LRM is prepared to deal with. It cannot change the parameters of any running resource. From its point of view, changing the parameters of a resource is making a new resource. It is not going to do any kind of comparison between resource parameters. Therefore this is not going to be helped or hurt by this proposal. This issue is a RedHerring.
With regard to the second issue: There are only three states a resource can be in from our point of view: stopped, started, or in-transition. Any time you have a resource start fail, or a resource stop fail, nothing is reliably known any more, and you cannot start the resource again on any node until this node has been STONITHed. Doing half-measures isn't good enough, and may destroy data because the meaning of being half-started or half-stopped is undefined -- and will stay that way. This is a second RedHerring.
With regard to the third issue: It's a RedHerring too... You don't have to keep state. You just have to know the universe of possible resources. You had to know that before, and you have to know it now. That hasn't changed. Having multiple clients doesn't somehow make the LRM have to keep local state. Even if it has multiple clients, it doesn't need local state.
With regard to the fourth issue: If a STONITH operation cannot be well-modeled as resources, then that argues against modelling STONITH operations as resources, not that resources need to be more complex in order to somehow squeeze in support for this not-a-resource. However, it is worth noting that *resources* often do keep persistent state. However, I don't think that helps STONITH. Because if I issue a STONITH and it succeeds, what is the status of the STONITH resource? I think the answer to that question is "stopped" - because a STONITH operation is self-stopping. This was my concern about trying to model a STONITH operation as a resource. And, it would be made worse by trying to keep LRM-local persistent state. The LRM would not be aware of any special property of the resource, and wouldn't realize that although it had been started, it was now stopped. However, if the resource itself keeps its own persistent state (as some pseudo-resources need to do already), then the resource would be well-aware of its own special nature. However, I believe that for STONITH operations, that they would not need this because they are self-resetting, and should report "stopped" unless a power-off (not reset) STONITH was requested. Although I'm not as sure as I was for the previous issues, I believe that this fourth issue is also a RedHerring.
The first issue isn't as much of a RedHerring as it looks like. Maybe I explained it wrong. Of course, we are not prepared to change the parameters of a running ResourceInstance on-the-fly. But, that's exactly the point. The database may have changed since we started it. When we inquire the status of the resource from the LRM though, we use the UUID, which is static. So it will tell us that yes, the resource with UUID 38484343943 is still running. But, as I said, changing the parameters of online resources is probably not the smartest idea and opens up tons of corner cases, so disallowing it seems the best answer.
The second issue: A start fail can potentially be cleaned up by a stop.
But yes, the fact that nothing is reliably known anymore is the exact point, we are not allowed to forget about these states - but, a status operation may not retrieve this failure status again. If we rely on the LRM status operation to refresh our memory, we may get into trouble here. That's why I'm saying this information needs to be persistently kept by the LRM until the next reboot of the node. A STONITH is indeed the only answer to clean that up, agreed. However, we may decide not to do that right now - the resource in question may be low-priority, while the high-priority resource is running just fine on that node. STONITHing it then wouldn't be a smart move. So, we need to remember that state.
I'm also not sure whether started, stopped or in-transition are sufficient states. But maybe this is a separate OpenIssue. I'd say that stopped, starting, started, stopping, monitor failed, restarting, stop failed with the appropriate transitions between them are needed. If the CRM sees a resource in starting stage, it can assume that - given the optimistic case - it will eventually arrive in the started stage, but has to wait until then to issue start orders for resources depending on this one.
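One way to pin down "the appropriate transitions" is to write them out as a table. The Python sketch below uses the seven state names proposed above, but the transition table itself is an assumption made for illustration, not something that has been agreed.

```python
# Which states may follow which. The table is illustrative only.
TRANSITIONS = {
    "stopped":        {"starting"},
    "starting":       {"started", "stopping"},   # a failed start may be cleaned up by a stop
    "started":        {"stopping", "monitor failed"},
    "monitor failed": {"restarting", "stopping"},
    "restarting":     {"started", "stopping"},
    "stopping":       {"stopped", "stop failed"},
    "stop failed":    set(),                     # nothing follows without a reboot or STONITH
}


def can_transition(current, target):
    """Return True if the table allows moving from current to target."""
    return target in TRANSITIONS[current]


def may_start_dependents(state):
    """Dependent resources must wait until this one is fully started."""
    return state == "started"
```

With such a table the CRM can treat starting as "wait" rather than "failed", and can see at a glance that stop failed is a dead end.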
Third: As I said, it seemed to imply; I don't believe this is a RedHerring, but an issue that has now been cleared up.
Fourth: I agree. The STONITH request results should not need to be tracked by the LRM, they probably need to be tracked by the STONITH resources themselves or in the CIB. Thanks for this clarification.
One more clarification on this though: As NodeFencing explains, we are not actually modelling a STONITH operation as a Resource; we are modelling the STONITH controller as a resource, ie the gateway through which the ClusterResourceManager routes its STONITH requests to a particular STONITH device. The requests themselves do not need to go through the LocalResourceManager, though it would make a certain amount of sense both from a design perspective and from the amount of coding necessary; the LRM is our gateway to resources, and it already has all that infrastructure in place, we are just asking it to perform one more action on a resource.
As a fifth issue, rediscovering all resource instances on a node might be a very expensive operation, depending on how many resources are defined in the cluster. This might not be a pressing issue, but if the LRM was able to recover from itself restarting or crashing fast, it certainly wouldn't hurt - but, this may be an optimization for the future.
And as a sixth one (sorry, I'm not trying to annoy you, it just occurred to me), the rediscovery of resource state is not simple, in particular for init scripts. (Or at least not reliable.) We can put this down as a design limitation and say that people need to fix their init scripts in that case, but we already get a bit of trouble with them being unable to implement the status operation correctly right now... Keeping track of the "this is what you are supposed to have" state internally might make things easier for them.
As a seventh issue (sorry!!!!), the need to serialize resource operations on a given ResourceInstance also seems to require state tracking. Imagine this: We crash after having just issued a stop operation. We restart, run status to rediscover everything and Oops, we have just stepped on each other's toes, because we forgot that a stop operation was still in flight. (Or crashing during a monitor operation, or whatever else.)
So, the first issue is cleared up by simply disallowing this (which makes sense), and three and four are cleared up, and five we can postpone. The sixth is probably not a real issue, but something which I wanted to mention at least.
This leaves us with two issues, two and seven, still open. I believe these to be real issues, but I don't see an immediate answer to them, besides saying that status ought to discover resources hanging around in all possible failure cases (which seems difficult and hard on the RA/init script writers), or keeping this persistent tracking ourselves. Falling back to a full rediscovery of resources using the status operation is backup plan B, but it seems neither the most reliable nor the speediest one. Please let me know your thoughts.
AlanRobertson says Whew! and then makes the observation that this item has morphed into tracking resource status - which I think is another major issue number. I've forgotten which one. But be that as it may...
Subitem (two): LarsMarowskyBree and I talked about this extensively this morning (2/19/2004), and came to the mutual conclusion that the actions that had to be taken when a resource start or stop failed, were cluster-wide and not local. In particular, the ClusterResourceManager has to stop all the resources which depend on the failed resource, and then remember not to start this resource on any other node in the cluster until proper recovery can occur. There are two kinds of "proper recovery": STONITH the node which failed in stop or start, or stop the resource, and if it succeeds, double check the status of the resource to see if it really stopped. Note that STONITH is risk-free (but very annoying), but that the other option depends on knowing things about how well the resource agent is coded. Nevertheless, since resource dependencies potentially span nodes, and the recovery requires policies and knowledge the LRM doesn't know anything about, it is safe to say that the ClusterResourceManager has to track this kind of occurrence very specially to keep the resource from starting on another node before this potentially-extensive recovery has been completed. As a result, we agreed that since the CRM needed to track this itself, there was no point in requiring the LRM to also track it.
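The bookkeeping this puts on the CRM side can be sketched as follows. FailureTracker and its methods are hypothetical, and only the STONITH form of recovery is modelled; the "stop and double check status" path would clear an entry in the same way.

```python
class FailureTracker:
    """Toy cluster-wide record of resources that failed to start or stop."""

    def __init__(self):
        self.blocked = {}   # resource id -> node where the failure happened

    def record_failure(self, resource, node):
        self.blocked[resource] = node

    def may_start(self, resource):
        """A resource may not be started anywhere while recovery is pending."""
        return resource not in self.blocked

    def node_fenced(self, node):
        """STONITH of the failing node clears every failure recorded for it."""
        self.blocked = {r: n for r, n in self.blocked.items() if n != node}


tracker = FailureTracker()
tracker.record_failure("db", "node1")
blocked_before = not tracker.may_start("db")
tracker.node_fenced("node1")
allowed_after = tracker.may_start("db")
```

Because the record lives with the CRM and is keyed by resource rather than by node, it blocks a start on every node, which a purely local LRM record could not do.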
Subitem (seven): We agreed that this should be a rare occurrence and that, should it happen, we can require a complete system restart (or STONITH). Moreover, even if this very improbable event did happen, it seems even less likely that the LRM would restart and go through the process fast enough to run into the problems cited. KISS dictates either ignoring it, or restarting the computer, because the latter recovery action is very simple to implement and always works.
It is my (AlanRobertson's) belief that all the subitems of this item have been resolved for the time being. I would ask LarsMarowskyBree to move this item to the closed list if he agrees. If not, we can go on and add subitems 8, 9, and 10.