Last site update:
2008-07-04 17:24:58

How Do I Force a Resource to be Migrated After a Failure?

In versions prior to 2.0.5, there is no way to do this.

Subsequent versions can use the resource-failure-stickiness property of a resource (or the global default: default-resource-failure-stickiness) to control when failover happens.

Example

In a cluster where:

  • default-resource-failure-stickiness is -100

  • default-resource-stickiness is 500

  • Resource my_rsc prefers to run on nodeA with score 1500

  • Resource my_rsc prefers to run on nodeB with score 1000

my_rsc can fail up to ten times on nodeA before being moved to nodeB.

(nodeA score - nodeB score + stickiness) / abs(failure stickiness)

==> (1500 - 1000 + 500) / abs(-100)

==> 1000 / 100

==> Answer: 10

However, if default-resource-failure-stickiness were set to -1001 or lower, my_rsc would be moved on its first failure, because the result is less than 1.

(nodeA score - nodeB score + stickiness) / abs(failure stickiness)

==> (1500 - 1000 + 500) / abs(-1001)

==> 1000 / 1001

==> Answer: 0.999
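
The arithmetic in both examples can be reproduced with a short Python sketch. The function name and its parameters are illustrative only and are not part of Heartbeat; the function simply evaluates the formula shown above.

```python
def failures_before_move(score_here, score_there, stickiness, failure_stickiness):
    """Evaluate (score_here - score_there + stickiness) / abs(failure_stickiness)."""
    if failure_stickiness == 0:
        raise ValueError("failure stickiness must be non-zero (and should be negative)")
    return (score_here - score_there + stickiness) / abs(failure_stickiness)


# First example: failure stickiness of -100
print(failures_before_move(1500, 1000, 500, -100))   # 10.0

# Second example: failure stickiness of -1001 gives a result just under 1
print(failures_before_move(1500, 1000, 500, -1001))
```

A result of 1 or more means the resource can absorb that many failures on its current node; a result below 1 means the first failure already moves it.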

NOTE: Failure stickiness should be a negative value.
NOTE: The rules for +/-INFINITY apply...

 INFINITY +/- -INFINITY : -INFINITY
 INFINITY +/-  int      :  INFINITY
-INFINITY +/-  int      : -INFINITY
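
As a rough illustration of these rules, the sketch below adds two scores in Python. The sentinel value of 1000000 for INFINITY, the function name, and the clamping of finite sums are assumptions made for this sketch; only addition is covered.

```python
INFINITY = 1_000_000  # assumed sentinel value for this sketch


def add_scores(a, b):
    """Add two scores, applying the +/-INFINITY rules listed above."""
    # -INFINITY wins over everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # +INFINITY absorbs any finite integer.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite scores add normally; clamping to the sentinel range is an assumption.
    return max(-INFINITY, min(INFINITY, a + b))


print(add_scores(INFINITY, -INFINITY) == -INFINITY)  # True
print(add_scores(INFINITY, -500) == INFINITY)        # True
print(add_scores(-INFINITY, 500) == -INFINITY)       # True
```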

Multiple Failures

If the combined score for my_rsc on a node is less than zero, it will never be able to run there again until the failure count is reset.

The current failure count (for a given resource and node) is multiplied by the resource's failure stickiness to produce a failover score. When the failover score exceeds the regular preference to a given node, the node will be excluded from running the resource again (until the failure count is reset).

In the first example above (where default-resource-failure-stickiness is -100), my_rsc can fail up to 15 times on nodeA before Heartbeat will no longer consider running it there. Likewise, it can fail up to 10 times on nodeB before that node will no longer be considered either.

If default-resource-failure-stickiness were -500, these values would drop to 3 and 2 failures respectively (1500 / 500 and 1000 / 500). With the -1001 used in the second example, they drop to 1 and 0.
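
The exclusion rule described in this section can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that only the node's preference score (1500 for nodeA, 1000 for nodeB) is weighed against the failover score.

```python
def node_excluded(preference, failure_stickiness, fail_count):
    """True once the combined score for the node drops below zero."""
    return preference + fail_count * failure_stickiness < 0


def max_tolerated_failures(preference, failure_stickiness):
    """Largest failure count that still leaves the combined score at zero or above."""
    if failure_stickiness >= 0:
        raise ValueError("failure stickiness should be negative")
    return preference // abs(failure_stickiness)


print(max_tolerated_failures(1500, -100))  # 15 (nodeA)
print(max_tolerated_failures(1000, -100))  # 10 (nodeB)
print(max_tolerated_failures(1500, -500))  # 3
print(max_tolerated_failures(1000, -500))  # 2
```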

Resetting Failure Counts

To reset the failure count for my_rsc on nodeA, the following command can be used:

  • crm_failcount -D -U nodeA -r my_rsc

To query the current failure count, use:

  • crm_failcount -G -U nodeA -r my_rsc

Why the Failure Count is not Automatically Reset When a Resource is Moved

The failure count is deliberately left untouched when a resource is moved. Resetting it automatically would allow the resource to bounce between two or more "bad" nodes.

How this works out with groups

When you use groups or other things with mandatory colocation constraints, it gets more complicated. If you take everything which you force to be colocated on the same machine, then all those things will be treated just as though they were in a group together. So, keep that in mind in the explanation below.

For a group with n resources, the resource-stickiness of the group of n resources is the sum of all the stickinesses in the group. Unless you've overridden the default for the resources in the group, that would be n times the default-resource-stickiness, and it is the stickiness of the group as a whole which is used, rather than the stickiness of any member of the group. The resource-failure-stickiness value for a group is computed in a similar way.

In addition, if you want to give every member of the group the same non-default resource-stickiness or resource-failure-stickiness, you can set this attribute on the group itself. However, it will effectively be multiplied by n, because the value is inherited by each primitive resource in the group and then subjected to the summing process described above.
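
The summing behaviour for groups can be illustrated with a small Python sketch. The function and the resource names (ip_rsc, fs_rsc, web_rsc) are hypothetical; the sketch only models the inheritance and summing described above, not Heartbeat's actual implementation.

```python
def group_total(members, default_value, group_value=None):
    """Sum a stickiness attribute over a group.

    members maps each resource name to an explicit value, or to None when the
    resource inherits the value from the group (if set) or the cluster default.
    """
    inherited = default_value if group_value is None else group_value
    return sum(inherited if value is None else value for value in members.values())


group = {"ip_rsc": None, "fs_rsc": None, "web_rsc": None}

print(group_total(group, default_value=500))                    # 1500, n times the default
print(group_total(group, default_value=500, group_value=200))   # 600, group value times n
print(group_total(group, default_value=-100))                   # -300, failure stickiness
```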

Things to keep in mind

  • You probably want every resource in the group to score the same, so that computing all these score differences for failures is simpler. A simple way to do this is to put your location constraints on your groups rather than on individual resources in the group. Note that, unlike the stickiness values, the scores for location constraints on groups are not multiplied by n, so to make things balance out correctly you may need to do this multiplication yourself when you create the corresponding CIB.

  • If you want to reset the failure counts so that things can fail back, you probably want to reset the failure counts of every resource in the group, since you don't know which one failed without searching the logs (see "What you cannot do" below).
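
Resetting the failure count of every resource in a group means running crm_failcount once per member. The Python sketch below only builds those command lines, using the -D, -U and -r options shown earlier; the function name and the resource names are hypothetical.

```python
def failcount_reset_commands(node, resources):
    """Build one 'crm_failcount -D' argument list per resource on the given node."""
    return [["crm_failcount", "-D", "-U", node, "-r", rsc] for rsc in resources]


# Hypothetical group members; substitute the ids of your own resources.
for argv in failcount_reset_commands("nodeA", ["ip_rsc", "fs_rsc", "web_rsc"]):
    print(" ".join(argv))
```

Each argument list could then be passed to subprocess.run(argv, check=True) on a cluster node to actually perform the reset.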

What you cannot do

You cannot make a group fail over only when one particular resource in the group has failed n times. Failure counts for groups are cumulative. If you set things up so that 3 failures will cause a failover, then 3 failures of any resources in the group will cause a failover, not 3 failures of a particular resource. If you have a web server, an IP address and a mount point, then the group will fail over after 3 web server failures, or after one failure of each resource in the group.
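
The cumulative behaviour can be shown with a minimal Python sketch. It assumes every member carries the same failure stickiness, so that only the total number of failures matters; the function and resource names are hypothetical.

```python
def group_fails_over(fail_counts, threshold):
    """True when the failures summed over all group members reach the threshold."""
    return sum(fail_counts.values()) >= threshold


print(group_fails_over({"web_rsc": 3, "ip_rsc": 0, "fs_rsc": 0}, 3))  # True
print(group_fails_over({"web_rsc": 1, "ip_rsc": 1, "fs_rsc": 1}, 3))  # True
print(group_fails_over({"web_rsc": 2, "ip_rsc": 0, "fs_rsc": 0}, 3))  # False
```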

A worked example involving a Group

Problem Definition

The requirements to be met are:

  • Two-node cluster
  • One group using pingd for pinging a gateway

  • Fail over the group on the third failure of any resource in the group
  • One node is preferred over the other
  • These location criteria are listed in order of decreasing importance:
    1. Don't run on a node which has failed 3 times without resetting failure counts (resource-failure-stickiness)

    2. Run on a node with ping access according to pingd

    3. Stay on the node you're on if it has ping access (resource-stickiness)

    4. Run the resource group on the designated (preferred) cluster node

Problem Solution

To be supplied