How Do I Force a Resource to be Migrated After a Failure?

In versions prior to 2.0.5, there is no way to do this.

Subsequent versions can use the resource-failure-stickiness property of a resource (or the global default: default-resource-failure-stickiness) to control when failover happens.

Example

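In a cluster where my_rsc has a score of 1500 on nodeA and 1000 on nodeB, the resource stickiness is 500, and the resource failure stickiness is -100, my_rsc can fail up to ten times on nodeA before being moved to nodeB.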

(nodeA score - nodeB score + stickiness) / abs(failure stickiness)

==> (1500 - 1000 + 500) / abs(-100)

==> 1000 / 100

==> Answer: 10

However, if default-resource-failure-stickiness were set to -1001 or less, my_rsc would be moved immediately, that is, after its first failure.

(nodeA score - nodeB score + stickiness) / abs(failure stickiness)

==> (1500 - 1000 + 500) / abs(-1001)

==> 1000 / 1001

==> Answer: 0.999
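
The arithmetic in both examples can be reproduced with a short Python sketch. The function name failures_before_move is hypothetical and only restates the formula above; it is not part of Heartbeat.

```python
def failures_before_move(current_score, other_score, stickiness, failure_stickiness):
    """Restate the formula above:
    (current node score - other node score + stickiness) / abs(failure stickiness)
    """
    return (current_score - other_score + stickiness) / abs(failure_stickiness)


# First example: failure stickiness of -100
print(failures_before_move(1500, 1000, 500, -100))             # 10.0

# Second example: failure stickiness of -1001
print(round(failures_before_move(1500, 1000, 500, -1001), 3))  # 0.999
```

A result below 1 means that a single failure is already enough to move the resource.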

NOTE: Failure stickiness should be a negative value.
NOTE: The usual rules for arithmetic with +/-INFINITY apply:

 INFINITY +/- -INFINITY : -INFINITY
 INFINITY +/-  int      :  INFINITY
-INFINITY +/-  int      : -INFINITY
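
As an illustration, the three rules can be modelled in Python. This is a sketch only: the name add_scores is hypothetical, and the numeric value used for INFINITY (1,000,000) is an assumption that does not come from this page.

```python
INFINITY = 1_000_000  # assumed numeric stand-in for INFINITY, not taken from this page


def add_scores(a, b):
    """Add two scores following the rules in the table above."""
    # Any -INFINITY operand wins, even against +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # Otherwise +INFINITY absorbs any finite integer.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Two finite scores add normally, clamped to the +/-INFINITY range.
    return max(-INFINITY, min(INFINITY, a + b))


print(add_scores(INFINITY, -INFINITY))  # -1000000
print(add_scores(INFINITY, -500))       # 1000000
print(add_scores(-INFINITY, 500))       # -1000000
```

Subtracting an integer is treated here as adding its negative.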

Multiple Failures

If the combined score for my_rsc on a node is less than zero, it will never be able to run there again until the failure count is reset.

The current failure count (for a given resource and node) is multiplied by the resource's failure stickiness to produce a failover score. When the failover score exceeds the regular preference to a given node, the node will be excluded from running the resource again (until the failure count is reset).

In the first example above (where default-resource-failure-stickiness is -100), my_rsc can fail up to 15 times on nodeA before Heartbeat will no longer consider running it there. Likewise, it can fail up to 10 times on nodeB before that node will no longer be considered either.

With a default-resource-failure-stickiness of -500, these values drop to 3 and 2 failures respectively.
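
The limits quoted above follow from the rule that a node is excluded once its combined score drops below zero. A minimal Python sketch, in which the function names combined_score and max_tolerated_failures are hypothetical:

```python
def combined_score(node_score, fail_count, failure_stickiness):
    """Node preference plus the failover score (fail count * failure stickiness)."""
    return node_score + fail_count * failure_stickiness


def max_tolerated_failures(node_score, failure_stickiness):
    """Largest failure count for which the combined score is still >= 0."""
    return node_score // abs(failure_stickiness)


for failure_stickiness in (-100, -500):
    for node, score in (("nodeA", 1500), ("nodeB", 1000)):
        limit = max_tolerated_failures(score, failure_stickiness)
        print(failure_stickiness, node, limit)
```

With a failure stickiness of -100 this gives 15 for nodeA and 10 for nodeB; with -500 it gives 3 and 2.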

Resetting Failure Counts

To reset the failure count for my_rsc on nodeA, the following command can be used:

To query the current failure count, use:
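
A minimal sketch of how both operations might be scripted. It assumes the crm_failcount utility shipped with Heartbeat 2.x and its -D (delete), -G (query), -U (node name) and -r (resource) options; check these against the tool's own help output on your version. The helper name failcount_cmd is hypothetical. The script only builds and prints the command lines; it does not run them.

```python
def failcount_cmd(action, node, resource):
    """Build a crm_failcount command line (options are assumptions, see above)."""
    flags = {"reset": "-D", "query": "-G"}
    return ["crm_failcount", flags[action], "-U", node, "-r", resource]


# Reset the failure count for my_rsc on nodeA
print(" ".join(failcount_cmd("reset", "nodeA", "my_rsc")))

# Query the current failure count for my_rsc on nodeA
print(" ".join(failcount_cmd("query", "nodeA", "my_rsc")))
```

On a cluster node, the returned list could be passed to subprocess.run to execute the command.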

Why the Failure Count is not Automatically Reset When a Resource is Moved

The failure count is deliberately not reset when a resource is moved. Otherwise, the resource could bounce back and forth between two or more "bad" nodes.

How this works out with groups

When you use groups, or other resources tied together by mandatory colocation constraints, it gets more complicated. Everything that you force to be colocated on the same machine is treated as though it were in a group together. Keep that in mind in the explanation below.

For a group with n resources, the resource-stickiness of the group is the sum of the stickiness values of its members. Unless you have overridden the default for resources in the group, that is n times the default-resource-stickiness. It is the stickiness of the group as a whole that is used, rather than the stickiness of any one member. The resource-failure-stickiness value for a group is computed in the same way.

In addition, if you want to give every member of the group a non-default resource-stickiness or resource-failure-stickiness, you can set this attribute on the group itself. However, the value is inherited by each primitive resource in the group and then summed as described above, so it is effectively multiplied by n.
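
The summing can be illustrated with a small Python sketch. The function name group_value, the group size of 3 and the group-level value of 200 are hypothetical; 500 and -100 are the defaults from the earlier example.

```python
def group_value(member_values):
    """Stickiness (or failure stickiness) of a group: the sum over its members."""
    return sum(member_values)


n = 3  # hypothetical group size

# Members that all use the defaults from the earlier example
print(group_value([500] * n))    # 1500, i.e. n * default-resource-stickiness
print(group_value([-100] * n))   # -300, i.e. n * default-resource-failure-stickiness

# Attribute set on the group itself: inherited by each member, then summed
group_attribute = 200            # hypothetical value
print(group_value([group_attribute] * n))  # 600
```

To end up with a particular total for the group, the value set on the group would therefore have to be the desired total divided by n.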

Things to keep in mind

What you cannot do

You cannot make a group fail over only when one particular resource in the group has failed n times. Failure counts for groups are cumulative. If you set things up so that 3 failures cause a failover, then any 3 failures within the group cause a failover, not 3 failures of a particular resource. For a group consisting of a web server, an IP address and a mount point, the group will fail over after 3 web server failures, or after one failure of each resource.
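
A simplified Python model of this cumulative behaviour, using the web server, IP address and mount point from the paragraph above. The function name group_fails_over and the dictionary keys are hypothetical, and the model ignores scores and stickiness entirely.

```python
def group_fails_over(fail_counts, threshold):
    """True once the failures of all group members together reach the threshold."""
    return sum(fail_counts.values()) >= threshold


# Three failures of a single resource
print(group_fails_over({"web_server": 3, "ip_address": 0, "mount_point": 0}, 3))  # True

# One failure of each resource
print(group_fails_over({"web_server": 1, "ip_address": 1, "mount_point": 1}, 3))  # True

# Only two failures in total
print(group_fails_over({"web_server": 2, "ip_address": 0, "mount_point": 0}, 3))  # False
```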

A worked example involving a Group

Problem Definition

The outline of the requirements to be met is:

Problem Solution

To be supplied



This information provided courtesy of the Linux-HA project at http://linux-ha.org/