This web page is no longer maintained. Information presented here exists only to avoid breaking historical links.
The Project stays maintained, and lives on: see the Linux-HA Reference Documentation.

Last site update: 2017-11-24 22:16:42

CTS - Cluster Test Suite

CTS is an automated random test suite for Linux-HA (heartbeat).

It is a key part of the Linux-HA test plan. It is normally run for a minimum of 500 iterations. Full major release tests commonly run the suite for 5000 or more iterations. It is usually found in /usr/lib/heartbeat/cts/.

CTS' basic strategy is simple: beat the software to death. Such testing has sometimes been called Bamm-Bamm testing.

General Methodology

CTS runs a sequence of tests, and validates each for correct operation individually.
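As a rough illustration of a randomized but reproducible test sequence, the sketch below picks a test and a target node for each iteration from a seeded generator. The function name `schedule`, the integer seed, and the uniform choice are assumptions made for this example; the page does not describe how CTS actually selects tests.

```python
import random


def schedule(tests, nodes, iterations, seed):
    """Return a reproducible list of (test, node) pairs.

    Hypothetical helper: uniform random selection is an assumption,
    not the documented CTS behaviour.
    """
    rng = random.Random(seed)  # a fixed seed makes a run repeatable
    return [(rng.choice(tests), rng.choice(nodes)) for _ in range(iterations)]


if __name__ == "__main__":
    plan = schedule(["Restart", "Flip", "Bandwidth"], ["sgi1", "sgi2"], 5, seed=184)
    for number, (test, node) in enumerate(plan, start=1):
        print(f"Running test {test} ({node}) [{number}]")
```

Reusing the same seed replays the same sequence, which helps when a failure seen in a long run has to be reproduced.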

The following steps are followed for each test performed:

  1. Each test is executed and validated by examining the system logs of the systems under test and verifying that the correct output is produced in each case.
  2. The logs are also examined for any messages indicating that something bad happened; these are flagged and treated as errors.
  3. At the end of each test, the state of the cluster is examined for "sanity". The following conditions are validated at the end of each test iteration:
    • Each resource is running exactly once in the cluster
    • All resources in a resource group are checked to be running on the same server
    • Each resource is checked for "correct operation".
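The first two end-of-test conditions above can be sketched as a small audit function. This is a minimal sketch under stated assumptions: the input formats (a list of `(resource, node)` pairs and a dict of group members) and the name `audit_cluster` are hypothetical, and the third condition ("correct operation") is resource-specific and is not modelled here.

```python
from collections import Counter


def audit_cluster(running, groups):
    """Return a list of audit failures; an empty list means the state is sane.

    running: list of (resource, node) pairs, one per running instance
             (hypothetical format).
    groups:  dict mapping group name -> list of member resources
             (hypothetical format).
    """
    failures = []
    counts = Counter(resource for resource, _node in running)
    known = set(counts) | {r for members in groups.values() for r in members}

    # Condition 1: each resource is running exactly once in the cluster.
    for resource in sorted(known):
        if counts[resource] != 1:
            failures.append(f"{resource} is running {counts[resource]} times")

    # Condition 2: all running members of a group share one server.
    for group, members in sorted(groups.items()):
        nodes = {node for resource, node in running if resource in members}
        if len(nodes) > 1:
            failures.append(f"group {group} is spread across {sorted(nodes)}")
    return failures


if __name__ == "__main__":
    state = [("ip1", "sgi1"), ("ip1", "sgi2"), ("fs1", "sgi1")]
    print(audit_cluster(state, {"g1": ["ip1", "fs1"]}))
```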

For maximum effectiveness, our software is largely instrumented to audit itself for internal consistency, and all inconsistencies discovered result in ERROR: messages. All ERROR: messages are flagged automatically by the software as "something bad" as noted above.

For this reason, the choice of whether a message should be an ERROR or a warning is largely dictated by our testing strategy.
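To make the flagging rule concrete, here is a minimal sketch that collects every log line containing an ERROR: message. The function name `find_bad_news` and the sample log lines are invented for illustration; the real suite may match additional patterns.

```python
import re

# Assumption: a single pattern. Only ERROR: is named on this page.
BAD_NEWS = re.compile(r"ERROR:")


def find_bad_news(log_lines):
    """Return the log lines that should be treated as test errors."""
    return [line.rstrip("\n") for line in log_lines if BAD_NEWS.search(line)]


if __name__ == "__main__":
    # Hypothetical log lines, not taken from a real heartbeat log.
    sample = [
        "heartbeat: info: Link up\n",
        "heartbeat: ERROR: lost lock\n",
        "heartbeat: WARN: slow response\n",
    ]
    for line in find_bad_news(sample):
        print("BadNews:", line)
```

Because warnings are not matched, logging a condition as a warning rather than an ERROR keeps it from failing a test run, which is the trade-off the paragraph above describes.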

The combination of instrumentation, choice of ERROR messages and the unrelenting nature of the CTS tests results in an extremely effective testing methodology. This is a classic example of the whole being more than the sum of the parts.

Test Descriptions

Each test in the test suite is described in the following sections.

Flip

Find a node in the cluster. If it's up, bring it down. If it's down, bring it up.

STONITH

Find a node in the cluster and crash it ungracefully. This is the non-CRM version of the test.

StonithD

Find a node in the cluster and crash it ungracefully using the StonithDaemon. This is the CRM version of the test.

Restart

Find a node in the cluster, then stop and restart heartbeat on it.

SimulStart

Stop all nodes in the cluster, and start them all simultaneously.

SimulStop

Start all nodes in the cluster, and stop them all simultaneously.

StartOneByOne

Stop all nodes in the cluster, and start them all one at a time in a random order.

StopOneByOne

Start all nodes in the cluster, and stop them all one at a time in a random order.

RestartOneByOne

Start all nodes in the cluster, and restart them one at a time in a random order.

StandbyTest

Find a node in the cluster, and put it into standby mode. There are CRM and non-CRM versions of this test, since the two have quite different semantics.

FastDetection

Kill heartbeat processes on a node ungracefully, and measure how long it takes for the failure to be detected.

Bandwidth

Determine how much bandwidth heartbeat is consuming.

SplitBrain

Create a split-brain condition in heartbeat, and see if it recovers correctly.

Redundant Path

This test kills communication through a single path and sees if heartbeat withstands this. This requires that you configure multiple communication paths in your test systems. If you do not, then it will not be run.

DRBD

This tests DRBD to see if it is maintaining proper integrity of the disk data. This requires that you configure DRBD into your configuration. If you do not, then it will not be run.

Resource Recover

This CRM-only test will stop a resource and watch the system recover from this resource failure.

ComponentFail

This test will kill a process in the system and then watch the system recover from the death of a single process.

Special Test 1

This test has been proven to cause problems with the CRM. It does the following sequence of things:

  • stop all nodes
  • start one node
  • start all other nodes at once

It was created after we discovered that this sequence, when it occurred randomly in other tests, tended to cause certain kinds of failures repeatedly. So we made it its own special test. It has continued to demonstrate pre-release problems from time to time.

Near Quorum Point

The near quorum point test tries to bring the cluster to near the quorum (half-up/half-down) point. For each node it decides whether it should be up or down, then it simultaneously brings nodes up or down to reach the decided-upon state. This tends to make the cluster bounce up and down over the point of having quorum a few times very rapidly. It also tends to counter the bias the tests have of keeping most nodes up most of the time.
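The planning step of such a test might look like the sketch below. The fair coin flip per node and the name `plan_near_quorum` are assumptions for illustration; the page says only that an up/down decision is made for each node.

```python
import random


def plan_near_quorum(current_up, nodes, rng):
    """Decide a target state per node and return (to_start, to_stop).

    current_up: set of nodes that are up right now.
    The 50/50 choice is an assumption, not the documented rule.
    """
    target_up = {node: rng.random() < 0.5 for node in nodes}
    to_start = [n for n in nodes if target_up[n] and n not in current_up]
    to_stop = [n for n in nodes if not target_up[n] and n in current_up]
    return to_start, to_stop


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    start, stop = plan_near_quorum({"n1", "n2", "n3"}, nodes, random.Random())
    # The test would then issue all of these starts and stops at once.
    print("start:", start, "stop:", stop)
```

With a fair coin, about half the nodes end up in the target "up" set on average, so repeated runs keep the cluster close to the half-up/half-down point.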

Sample Output

2004/04/12_13:57:19     Random seed is: (184, 160, 216)
2004/04/12_13:57:19     >>>>>>>>>>>>>>>> BEGINNING 500 TESTS
2004/04/12_13:57:19     HA configuration directory: /etc/ha.d
2004/04/12_13:57:19     System log files: /var/log/ha-log-local7
2004/04/12_13:57:19     Enable Stonith: 0
2004/04/12_13:57:19     Enable Standby: 1
2004/04/12_13:57:19     Resource Monitoring is disabled
2004/04/12_13:57:19     Cluster nodes: ['sgi1', 'sgi2']
2004/04/12_13:57:20     Stopping Cluster Manager on all nodes
2004/04/12_13:57:23     Starting Cluster Manager on all nodes
2004/04/12_13:57:56     Running test Restart (sgi2)     [1]
2004/04/12_13:58:34     Running test DetectionTime (sgi2)       [2]
2004/04/12_13:58:38     ...failure detection time: 560 ms
2004/04/12_13:58:41     Running test standby (sgi2)     [3]
2004/04/12_13:58:44     Running test Bandwidth (sgi2)   [4]
2004/04/12_13:58:50     ...heartbeat bandwidth: 33364 bits/sec
2004/04/12_13:58:52     Running test SimulStart (sgi1)  [5]
2004/04/12_13:59:12     Running test Restart (sgi1)     [6]
2004/04/12_13:59:50     Running test Split_brain (sgi2) [7]
2004/04/12_14:00:12     Running test flip (sgi2)        [8]
2004/04/12_14:00:48     Running test standby (sgi1)     [9]
2004/04/12_14:01:00     Running test DetectionTime (sgi1)       [10]
2004/04/12_14:01:01     Running test Restart (sgi1)     [11]
2004/04/12_14:01:59     Running test flip (sgi1)        [12]
2004/04/12_14:02:42     Running test Split_brain (sgi2) [13]
2004/04/12_14:03:35     Running test Restart (sgi2)     [14]
2004/04/12_14:04:14     Running test DetectionTime (sgi2)       [15]
2004/04/12_14:04:16     ...failure detection time: 270 ms
2004/04/12_14:04:20     Running test DetectionTime (sgi2)       [16]
2004/04/12_14:04:23     ...failure detection time: 300 ms
2004/04/12_14:04:26     Running test Split_brain (sgi1) [17]
2004/04/12_14:04:49     Running test flip (sgi1)        [18]
2004/04/12_14:05:24     Running test DetectionTime (sgi2)       [19]
2004/04/12_14:05:25     Running test SimulStart (sgi1)  [20]
2004/04/12_14:05:42     Running test Restart (sgi1)     [21]
2004/04/12_14:06:20     Running test Bandwidth (sgi2)   [22]
2004/04/12_14:06:25     ...heartbeat bandwidth: 34427 bits/sec
2004/04/12_14:06:28     Running test standby (sgi1)     [23]
2004/04/12_14:06:31     Running test Restart (sgi1)     [24]
2004/04/12_14:07:10     Running test Bandwidth (sgi2)   [25]
2004/04/12_14:07:18     ...heartbeat bandwidth: 20305 bits/sec
2004/04/12_14:07:21     Running test DetectionTime (sgi1)       [26]
2004/04/12_14:07:24     ...failure detection time: 320 ms
2004/04/12_14:07:28     Running test Split_brain (sgi1) [27]
2004/04/12_14:07:49     Running test DetectionTime (sgi1)       [28]
2004/04/12_14:07:53     ...failure detection time: 300 ms
2004/04/12_14:07:57     Running test Bandwidth (sgi1)   [29]
2004/04/12_14:08:03     ...heartbeat bandwidth: 29263 bits/sec
2004/04/12_14:08:05     Running test Restart (sgi2)     [30]
2004/04/12_14:08:43     Running test DetectionTime (sgi1)       [31]
2004/04/12_14:08:47     ...failure detection time: 310 ms
2004/04/12_14:08:51     Running test standby (sgi1)     [32]
2004/04/12_14:08:54     Running test Split_brain (sgi1) [33]
2004/04/12_14:09:17     Running test DetectionTime (sgi2)       [34]
2004/04/12_14:09:20     ...failure detection time: 460 ms
2004/04/12_14:09:24     Running test standby (sgi1)     [35]
2004/04/12_14:09:28     Running test DetectionTime (sgi1)       [36]
2004/04/12_14:09:31     ...failure detection time: 300 ms
2004/04/12_14:09:35     Running test standby (sgi2)     [37]
2004/04/12_14:09:39     Running test Restart (sgi2)     [38]
2004/04/12_14:10:17     Running test standby (sgi1)     [39]
2004/04/12_14:10:21     Running test SimulStart (sgi2)  [40]
2004/04/12_14:10:41     Running test Restart (sgi2)     [41]
2004/04/12_14:11:20     Running test Bandwidth (sgi1)   [42]
2004/04/12_14:11:26     ...heartbeat bandwidth: 28571 bits/sec
2004/04/12_14:11:28     Running test Restart (sgi1)     [43]
2004/04/12_14:12:06     Running test flip (sgi2)        [44]
2004/04/12_14:12:42     Running test Bandwidth (sgi1)   [45]
2004/04/12_14:12:53     ...heartbeat bandwidth: 16695 bits/sec
2004/04/12_14:12:54     Running test standby (sgi1)     [46]
2004/04/12_14:13:06     Running test SimulStart (sgi1)  [47]
2004/04/12_14:13:23     Running test standby (sgi2)     [48]
2004/04/12_14:13:26     Running test standby (sgi1)     [49]
2004/04/12_14:13:30     Running test Bandwidth (sgi1)   [50]
2004/04/12_14:13:36     ...heartbeat bandwidth: 28523 bits/sec

Output deleted...

2004/04/12_16:39:38     Running test Split_brain (sgi1) [499]
2004/04/12_16:39:59     Running test Bandwidth (sgi1)   [500]
2004/04/12_16:40:05     ...heartbeat bandwidth: 28524 bits/sec
2004/04/12_16:40:07     Stopping Cluster Manager on all nodes
2004/04/12_16:40:12     ****************
2004/04/12_16:40:12     Overall Results:{'failure': 0, 'success': 500, 'BadNews': 0}
2004/04/12_16:40:12     ****************
2004/04/12_16:40:12     Detailed Results
2004/04/12_16:40:12     Test Split_brain:{'elapsed_time': 1570.0471291542053, 'skipped': 0,
         'calls': 77, 'success': 77, 'auditfail': 0, 'failure': 0,
         'max_time': 51.453474998474121, 'min_time': 18.692641019821167}
2004/04/12_16:40:12     Test standby:{'elapsed_time': 126.56771874427795, 'skipped': 6,
         'calls': 57, 'success': 51, 'nostandby': 5, 'standby': 46, 'auditfail': 0,
         'failure': 0, 'max_time': 10.525829076766968, 'min_time': 6.5088272094726562e-05}
2004/04/12_16:40:12     Test flip:{'elapsed_time': 2388.8293540477753, 'skipped': 0,
         'calls': 75, 'success': 75, 'started': 8, 'down->up': 8, 'auditfail': 0,
         'failure': 0, 'stopped': 67, 'max_time': 42.44762396812439,
         'min_time': 1.6613788604736328,  'up->down': 67}
2004/04/12_16:40:12     Test SimulStart:{'elapsed_time': 1235.1379368305206, 'skipped': 0,
         'calls': 73, 'success': 73, 'stops': 121, 'auditfail': 0, 'failure': 0, 
         'max_time': 23.456228971481323, 'min_time': 12.120557069778442}
2004/04/12_16:40:12     Test Bandwidth:{'elapsed_time': 581.98309683799744, 'skipped': 1,
         'calls': 83, 'success': 82, 'min': 16471.653680795403, 'max': 181804.02544776717,
         'totalbandwidth': 2737540.0476286379, 'auditfail': 0, 'failure': 0,
         'max_time': 20.161562919616699, 'min_time': 7.7009201049804688e-05}
2004/04/12_16:40:12     Test DetectionTime:{'totaltime': 15.28000000026077,
         'elapsed_time': 331.31979942321777, 'skipped': 14, 'calls': 70, 'success': 56,
         'min': 0.020000000018626451, 'max': 0.57000000029802322, 'auditfail': 0,
         'failure': 0, 'max_time': 21.377086877822876, 'min_time': 8.4161758422851562e-05}
2004/04/12_16:40:12     Test Restart:{'elapsed_time': 2566.4030539989471, 'skipped': 0,
         'node:sgi2': 35, 'calls': 65, 'success': 65, 'node:sgi1': 30, 'WasStopped': 5,
         'auditfail': 0, 'failure': 0, 'max_time': 87.125192880630493,
         'min_time': 36.072878122329712}
2004/04/12_16:40:12     <<<<<<<<<<<<<<<< TESTS COMPLETED
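The result lines in this output are printed as Python dictionary literals, so a post-processing script can read them back safely with `ast.literal_eval`. The helper name `parse_overall_results` is made up for this sketch, and it assumes the "Overall Results:" line has the single-line format shown above.

```python
import ast


def parse_overall_results(line):
    """Return the Overall Results dict from a CTS output line, or None."""
    _prefix, marker, payload = line.partition("Overall Results:")
    if not marker:
        return None  # not an Overall Results line
    return ast.literal_eval(payload.strip())


if __name__ == "__main__":
    line = ("2004/04/12_16:40:12     Overall Results:"
            "{'failure': 0, 'success': 500, 'BadNews': 0}")
    results = parse_overall_results(line)
    clean = results["failure"] == 0 and results["BadNews"] == 0
    print(results["success"], "tests succeeded; clean run:", clean)
```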

See Also

BasicSanityCheck, SyslogNgConfiguration

Caveats

I need to supply more information here eventually ;-) -- AlanRobertson