In versions prior to 2.0.5, there is no way to do this.
Subsequent versions can use the resource-failure-stickiness property of a resource (or the global default: default-resource-failure-stickiness) to control when failover happens.
In a cluster where:
default-resource-failure-stickiness is -100
default-resource-stickiness is 500
Resource my_rsc prefers to run on nodeA with score 1500
Resource my_rsc prefers to run on nodeB with score 1000
my_rsc can fail up to ten times on nodeA before being moved to nodeB.
(nodeA score - nodeB score + stickiness) / abs(failure stickiness) ==> (1500 - 1000 + 500) / abs(-100) ==> 1000 / 100 ==> Answer: 10
However, if default-resource-failure-stickiness were set to -1001 or lower, my_rsc would be moved to nodeB after its first failure.
(nodeA score - nodeB score + stickiness) / abs(failure stickiness) ==> (1500 - 1000 + 500) / abs(-1001) ==> 1000 / 1001 ==> Answer: 0.999
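The arithmetic above can be sketched in Python. This is only an illustration of the formula as written here, not Heartbeat code: the function name failures_before_failover is hypothetical, and rounding the quotient down to a whole number of failures is an assumption.

```python
def failures_before_failover(score_here, score_there, stickiness, failure_stickiness):
    """Whole number of failures tolerated before the resource is moved.

    Implements (score_here - score_there + stickiness) / abs(failure_stickiness),
    rounded down (the rounding is an assumption of this sketch).
    """
    if failure_stickiness >= 0:
        raise ValueError("failure stickiness should be a negative value")
    margin = score_here - score_there + stickiness
    # A margin of zero or less means there is nothing holding the resource here.
    return max(0, margin // abs(failure_stickiness))


# The two cases from the text: nodeA 1500, nodeB 1000, stickiness 500.
print(failures_before_failover(1500, 1000, 500, -100))   # 10
print(failures_before_failover(1500, 1000, 500, -1001))  # 0
```

A result of 0 corresponds to the second case, where the quotient (0.999) is below one.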
NOTE: Failure stickiness should be a negative value.
NOTE: The rules for +/-INFINITY apply...
INFINITY +/- -INFINITY : -INFINITY
INFINITY +/- int : INFINITY
-INFINITY +/- int : -INFINITY
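As a rough sketch, the three rules can be written as a small Python helper. The name add_scores and the value 1000000 for INFINITY are assumptions made for illustration. Only addition is modelled, and finite values are simply added.

```python
INFINITY = 1000000  # assumed magnitude standing in for INFINITY in this sketch


def add_scores(a, b):
    """Add two scores following the +/-INFINITY rules listed above."""
    # -INFINITY wins over everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # +INFINITY absorbs any finite value.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Two finite scores are added normally.
    return a + b


print(add_scores(INFINITY, -INFINITY))  # -1000000
print(add_scores(INFINITY, -100))       # 1000000
print(add_scores(-INFINITY, 500))       # -1000000
```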
If the combined score for my_rsc on a node is less than zero, it will never be able to run there again until the failure count is reset.
The current failure count (for a given resource and node) is multiplied by the resource's failure stickiness to produce a failover score. When the magnitude of the failover score exceeds the regular preference for a given node, that node is excluded from running the resource again (until the failure count is reset).
In the first example above (where default-resource-failure-stickiness is -100), my_rsc can fail up to 15 times on nodeA before Heartbeat will no longer consider running it there. Likewise, it can fail up to 10 times on nodeB before that node will no longer be considered either.
If default-resource-failure-stickiness were -500 instead, these values would drop to 3 and 2 failures respectively (1500 / 500 and 1000 / 500).
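The exclusion limits can be checked with a short Python sketch that multiplies the failure count by the failure stickiness and adds the node preference. The function names are hypothetical, and the sketch assumes a node stays eligible while the combined score is not below zero, as described above.

```python
def combined_score(preference, failure_stickiness, fail_count):
    """Node preference plus the failover score (fail_count * failure stickiness)."""
    return preference + fail_count * failure_stickiness


def failures_until_excluded(preference, failure_stickiness):
    """Largest failure count for which the combined score is still >= 0."""
    return preference // abs(failure_stickiness)


# nodeA (preference 1500) and nodeB (preference 1000)
print(failures_until_excluded(1500, -100))  # 15
print(failures_until_excluded(1000, -100))  # 10
print(failures_until_excluded(1500, -500))  # 3
print(failures_until_excluded(1000, -500))  # 2

# One failure past the limit pushes the combined score below zero.
print(combined_score(1500, -100, 15))  # 0
print(combined_score(1500, -100, 16))  # -100
```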
To reset the failure count for my_rsc on nodeA, the following command can be used:
crm_failcount -D -U nodeA -r my_rsc
To query the current failure count, use:
crm_failcount -G -U nodeA -r my_rsc
To prevent a resource from bouncing between two or more "bad" nodes, the failure count is not reset automatically when the resource is moved.
Things get more complicated when you use groups or other resources with mandatory colocation constraints. Every set of resources that you force to be colocated on the same machine is treated as though its members were in a group together. Keep that in mind in the explanation below.
For a group with n resources, the resource-stickiness of the group is the sum of the stickiness values of all its members. Unless you have overridden the default for resources in the group, that is n times the default-resource-stickiness. It is the stickiness of the group as a whole that is used, rather than the stickiness of any individual member. The resource-failure-stickiness value for a group is computed in the same way.
In addition, if you want every member of the group to have a resource-stickiness or resource-failure-stickiness other than the default, you can set the attribute on the group itself. However, the value is effectively multiplied by n, because it is inherited by each primitive resource in the group and then summed as described above.
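The summing described above can be sketched in Python. This illustrates the rule as stated here rather than the actual implementation. The function name group_stickiness is hypothetical, and None is used to mean that a member does not set its own value.

```python
def group_stickiness(member_values, inherited_value):
    """Sum stickiness over group members.

    member_values: one entry per member; None means the member does not set
    its own value and inherits inherited_value (the group attribute or the
    cluster default).
    """
    return sum(inherited_value if v is None else v for v in member_values)


# Three members, none overridden, default-resource-stickiness of 500.
print(group_stickiness([None, None, None], 500))  # 1500

# The same group with resource-stickiness 200 set on the group itself.
print(group_stickiness([None, None, None], 200))  # 600

# One member overrides the inherited value.
print(group_stickiness([None, 1000, None], 500))  # 2000
```

The same sum applies to resource-failure-stickiness, so a default of -100 across three members yields -300 for the group.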
You probably want every resource in the group to score the same, so that computing all these score differences for failures is simpler. A simple way to do this is to put your location constraints on your groups rather than on individual resources in the group. Note that, unlike the stickiness values, the scores for location constraints on groups are not multiplied by n, so to make things balance out correctly you may need to do this yourself when you create the corresponding CIB.
You cannot make a group fail over only when one particular resource in the group fails n times, because failure counts for groups are cumulative. If you set it up so that 3 failures will cause a failover, then 3 failures of any resources in the group will cause a failover, not 3 failures of a particular resource. If you have a web server, an IP address and a mount point, then the group will fail over after 3 web server failures, or after one failure of each resource in the group.
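A minimal Python sketch of the cumulative behaviour follows. It reduces the score comparison to a plain failure threshold for illustration, and the resource names and the function group_fails_over are hypothetical.

```python
def group_fails_over(fail_counts, threshold):
    """True once the failures of all group members together reach the threshold."""
    return sum(fail_counts.values()) >= threshold


# Three failures of the web server alone trigger a failover...
print(group_fails_over({"webserver": 3, "ip": 0, "mount": 0}, 3))  # True

# ...and so does one failure of each resource in the group.
print(group_fails_over({"webserver": 1, "ip": 1, "mount": 1}, 3))  # True

# Two failures in total are not enough.
print(group_fails_over({"webserver": 2, "ip": 0, "mount": 0}, 3))  # False
```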
In outline, the requirements to be met are:
One group using pingd for pinging a gateway
Don't run on a node which has failed 3 times without resetting failure counts (resource-failure-stickiness)
Run on a node with ping access according to pingd
Stay on the node you're on if it has ping access (resource-stickiness)
To be supplied