CTS is an automated random test suite for Linux-HA (heartbeat).
It is a key part of the Linux-HA test plan. It is normally run for at least 500 iterations, and full major-release tests commonly run it for 5000 or more. It is usually installed in /usr/lib/heartbeat/cts/.
CTS' basic strategy is simple: beat the software to death. Such testing has sometimes been called Bamm-Bamm testing.
CTS runs a sequence of tests and validates each one individually for correct operation.
For each test, CTS performs the following steps:
For maximum effectiveness, most of our software is instrumented to audit itself for internal consistency, and every inconsistency it discovers produces an ERROR: message. CTS automatically flags every ERROR: message as "something bad", as noted above.
For this reason, the choice of whether a message should be an ERROR or a warning is largely dictated by our testing strategy.
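As a minimal sketch of the flagging step, assuming the log has already been collected as a list of lines, a scanner might treat every line containing `ERROR:` as bad news while letting warnings through. The function name, pattern, and sample lines below are hypothetical and are not taken from CTS itself.

```python
import re
from typing import Iterable, List

# Hypothetical pattern: any "ERROR:" message counts as "something bad".
# Warnings are deliberately not matched, so they never fail a test.
BAD_NEWS = re.compile(r"\bERROR:")


def find_bad_news(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that should cause the current test to fail."""
    return [line for line in log_lines if BAD_NEWS.search(line)]


if __name__ == "__main__":
    # Made-up log lines for illustration only.
    sample = [
        "node1 heartbeat: info: hypothetical normal message",
        "node1 heartbeat: WARN: hypothetical warning",
        "node1 heartbeat: ERROR: hypothetical consistency audit failure",
    ]
    bad = find_bad_news(sample)
    print(f"BadNews: {len(bad)}")  # prints "BadNews: 1"
```

Because warnings pass silently in this sketch, classifying a condition as an ERROR rather than a warning is exactly what decides whether a test fails on it.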
The combination of instrumentation, choice of ERROR messages and the unrelenting nature of the CTS tests results in an extremely effective testing methodology. This is a classic example of the whole being more than the sum of the parts.
Each test in the test suite is described in the following sections.
Find a node in the cluster. If it's up, bring it down. If it's down, bring it up.
Find a node in the cluster and crash it ungracefully. This is the non-CRM version of the test.
Find a node in the cluster and crash it ungracefully using the StonithDaemon. This is the CRM version of the test.
Find a node in the cluster, then stop and restart heartbeat on it.
Stop all nodes in the cluster, and start them all simultaneously.
Start all nodes in the cluster, and stop them all simultaneously.
Stop all nodes in the cluster, and start them all one at a time in a random order.
Start all nodes in the cluster, and stop them all one at a time in a random order.
Start all nodes in the cluster, and restart them one at a time in a random order.
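The tests above differ mainly in whether an action is applied to every node at once or to one node at a time in a random order. Here is a minimal sketch of those two strategies. It assumes a hypothetical `action` callable that starts, stops, or restarts heartbeat on one node; how CTS actually reaches the nodes is not described here.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_simultaneously(nodes: List[str], action: Callable[[str], T]) -> List[T]:
    """Apply `action` to every node at once, one worker thread per node."""
    if not nodes:
        return []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # pool.map returns results in the same order as `nodes`.
        return list(pool.map(action, nodes))


def run_one_at_a_time(nodes: List[str], action: Callable[[str], T],
                      rng: random.Random) -> List[T]:
    """Apply `action` to each node in turn, in a random order."""
    order = list(nodes)
    rng.shuffle(order)
    return [action(node) for node in order]


def fake_start(node: str) -> str:
    # Hypothetical stand-in for "start heartbeat on this node".
    return f"started {node}"


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    print(run_simultaneously(nodes, fake_start))
    print(run_one_at_a_time(nodes, fake_start, random.Random(42)))
```

Passing in a seeded `random.Random` keeps a run reproducible, which matters when you need to replay a failing random order.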
Find a node in the cluster, and put it into standby mode. There are separate CRM and non-CRM versions of this test, because standby mode has quite different semantics in the two cases.
Kill heartbeat processes on a node ungracefully, and measure how long it takes for the failure to be detected.
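This page does not describe how the detection time is measured. One way to do it is to start a clock at the kill and poll until the failure is noticed. The sketch below assumes a hypothetical `failure_detected` predicate, for example one that checks a surviving node's log for the failure notice.

```python
import time
from typing import Callable, Optional


def measure_detection_ms(kill: Callable[[], None],
                         failure_detected: Callable[[], bool],
                         timeout_s: float = 30.0,
                         poll_s: float = 0.01) -> Optional[float]:
    """Kill something, then poll until the failure is noticed.

    Returns the elapsed time in milliseconds, or None on timeout.
    """
    start = time.monotonic()
    kill()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if failure_detected():
            return (time.monotonic() - start) * 1000.0
        time.sleep(poll_s)
    return None


if __name__ == "__main__":
    # Fake failure that is "detected" about 0.3 s after the kill.
    t_kill = time.monotonic()
    ms = measure_detection_ms(lambda: None,
                              lambda: time.monotonic() - t_kill >= 0.3)
    print(f"...failure detection time: {ms:.0f} ms")
```

Polling adds up to `poll_s` of error to each measurement. A log-timestamp approach avoids that error but depends on synchronized clocks.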
Determine how much bandwidth heartbeat is consuming.
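The results are reported in bits per second, which is simply total bits divided by elapsed time. Here is a minimal sketch, assuming heartbeat packets have already been captured by some means (for example, a packet sniffer) as hypothetical `(arrival_time_s, size_bytes)` pairs.

```python
from typing import Sequence, Tuple


def bandwidth_bits_per_sec(packets: Sequence[Tuple[float, int]]) -> float:
    """Average bandwidth of captured packets.

    `packets` holds (arrival_time_s, size_bytes) pairs for heartbeat traffic.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to define an interval")
    times = [t for t, _ in packets]
    interval_s = max(times) - min(times)
    if interval_s <= 0:
        raise ValueError("packets must span a positive time interval")
    total_bits = 8 * sum(size for _, size in packets)
    return total_bits / interval_s


if __name__ == "__main__":
    # Made-up capture: one 1000-byte packet per second over 3 seconds.
    capture = [(0.0, 1000), (1.0, 1000), (2.0, 1000), (3.0, 1000)]
    # prints "...heartbeat bandwidth: 10667 bits/sec"
    print(f"...heartbeat bandwidth: {bandwidth_bits_per_sec(capture):.0f} bits/sec")
```

Counting the first packet's bytes slightly overstates the rate for short captures. Over a few seconds of regular heartbeats the difference is small.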
Create a split-brain condition in heartbeat, and see if it recovers correctly.
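To create a split brain, the test has to divide the cluster into two groups that can no longer hear each other. The sketch below shows only that partitioning step, under the assumption that the groups are chosen at random. Blocking traffic between the groups, restoring it, and checking recovery are left out because this page does not say how CTS does them.

```python
import random
from typing import List, Tuple


def random_partition(nodes: List[str],
                     rng: random.Random) -> Tuple[List[str], List[str]]:
    """Split the nodes into two non-empty groups that will stop hearing each other."""
    if len(nodes) < 2:
        raise ValueError("a split brain needs at least two nodes")
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    cut = rng.randint(1, len(shuffled) - 1)  # both sides get at least one node
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    print(random_partition(["node1", "node2", "node3", "node4"], random.Random()))
```

On a two-node cluster like the one in the sample output, the only possible split is one node against the other.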
This test breaks communication along a single path and checks whether heartbeat withstands the loss. It requires that you configure multiple communication paths on your test systems; if you do not, the test is not run.
This test checks whether DRBD is maintaining the integrity of the disk data. It requires that DRBD be part of your configuration; if it is not, the test is not run.
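Both of these tests run only when the cluster has the required setup. Below is a minimal sketch of that gating. The `ClusterConfig` fields, the class names, and the `is_applicable` method are all hypothetical, and CTS may organize this differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ClusterConfig:
    # Hypothetical summary of the test cluster's setup.
    comm_paths: List[str] = field(default_factory=list)
    drbd_configured: bool = False


class ClusterTest:
    name = "generic"

    def is_applicable(self, cfg: ClusterConfig) -> bool:
        return True


class RedundantPathTest(ClusterTest):
    name = "RedundantPath"

    def is_applicable(self, cfg: ClusterConfig) -> bool:
        # Cutting one path only makes sense if another path remains.
        return len(cfg.comm_paths) > 1


class DrbdTest(ClusterTest):
    name = "DRBD"

    def is_applicable(self, cfg: ClusterConfig) -> bool:
        return cfg.drbd_configured


def select_tests(tests: List[ClusterTest],
                 cfg: ClusterConfig) -> Tuple[List[ClusterTest], List[str]]:
    """Split tests into those that can run and the names of those skipped."""
    runnable = [t for t in tests if t.is_applicable(cfg)]
    skipped = [t.name for t in tests if not t.is_applicable(cfg)]
    return runnable, skipped


if __name__ == "__main__":
    cfg = ClusterConfig(comm_paths=["eth0"], drbd_configured=False)
    runnable, skipped = select_tests([ClusterTest(), RedundantPathTest(), DrbdTest()], cfg)
    print([t.name for t in runnable], skipped)  # prints "['generic'] ['RedundantPath', 'DRBD']"
```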
This CRM-only test stops a resource and watches the system recover from the resource failure.
This test kills a single process in the system and then watches the system recover from its death.
This test has proven to trigger problems in the CRM. It performs the following sequence of actions:
It was created after we noticed that this sequence, when it occurred randomly during other tests, repeatedly caused certain kinds of failures, so we made it a special test of its own. It has continued to expose pre-release problems from time to time.
The near quorum point test tries to bring the cluster close to the quorum point (half of the nodes up, half down). For each node it decides whether the node should be up or down, then simultaneously brings nodes up or down to reach the chosen state. This tends to make the cluster cross the quorum threshold several times in rapid succession. It also counteracts the other tests' bias toward keeping most nodes up most of the time.
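Here is a minimal sketch of the decision step, assuming the goal is "about half the nodes up" and that an odd-sized cluster lands randomly just below or just above the midpoint. Both function names are hypothetical. The second function works out which nodes must be started and which stopped; the test would then act on both lists at the same time.

```python
import random
from typing import Dict, List, Tuple


def near_quorum_targets(nodes: List[str], rng: random.Random) -> Dict[str, bool]:
    """Decide for each node whether it should be up (True) or down (False)."""
    order = list(nodes)
    rng.shuffle(order)
    n_up = len(order) // 2
    if len(order) % 2:
        # Odd cluster: randomly land just below or just above the midpoint.
        n_up += rng.randint(0, 1)
    return {node: i < n_up for i, node in enumerate(order)}


def plan_transitions(current: Dict[str, bool],
                     target: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Return (nodes to start, nodes to stop) to move from current to target."""
    to_start = sorted(n for n, up in target.items() if up and not current.get(n, False))
    to_stop = sorted(n for n, up in target.items() if not up and current.get(n, False))
    return to_start, to_stop


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    current = {n: True for n in nodes}  # start with every node up
    target = near_quorum_targets(nodes, random.Random(7))
    print(plan_transitions(current, target))
```

Because every node starts up in this example, the plan contains only stops. In a real run the current state is whatever the previous test left behind.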
Here is sample output from a 500-test run on a two-node cluster:
2004/04/12_13:57:19 Random seed is: (184, 160, 216)
2004/04/12_13:57:19 >>>>>>>>>>>>>>>> BEGINNING 500 TESTS
2004/04/12_13:57:19 HA configuration directory: /etc/ha.d
2004/04/12_13:57:19 System log files: /var/log/ha-log-local7
2004/04/12_13:57:19 Enable Stonith: 0
2004/04/12_13:57:19 Enable Standby: 1
2004/04/12_13:57:19 Resource Monitoring is disabled
2004/04/12_13:57:19 Cluster nodes: ['sgi1', 'sgi2']
2004/04/12_13:57:20 Stopping Cluster Manager on all nodes
2004/04/12_13:57:23 Starting Cluster Manager on all nodes
2004/04/12_13:57:56 Running test Restart (sgi2) [1]
2004/04/12_13:58:34 Running test DetectionTime (sgi2) [2]
2004/04/12_13:58:38 ...failure detection time: 560 ms
2004/04/12_13:58:41 Running test standby (sgi2) [3]
2004/04/12_13:58:44 Running test Bandwidth (sgi2) [4]
2004/04/12_13:58:50 ...heartbeat bandwidth: 33364 bits/sec
2004/04/12_13:58:52 Running test SimulStart (sgi1) [5]
2004/04/12_13:59:12 Running test Restart (sgi1) [6]
2004/04/12_13:59:50 Running test Split_brain (sgi2) [7]
2004/04/12_14:00:12 Running test flip (sgi2) [8]
2004/04/12_14:00:48 Running test standby (sgi1) [9]
2004/04/12_14:01:00 Running test DetectionTime (sgi1) [10]
2004/04/12_14:01:01 Running test Restart (sgi1) [11]
2004/04/12_14:01:59 Running test flip (sgi1) [12]
2004/04/12_14:02:42 Running test Split_brain (sgi2) [13]
2004/04/12_14:03:35 Running test Restart (sgi2) [14]
2004/04/12_14:04:14 Running test DetectionTime (sgi2) [15]
2004/04/12_14:04:16 ...failure detection time: 270 ms
2004/04/12_14:04:20 Running test DetectionTime (sgi2) [16]
2004/04/12_14:04:23 ...failure detection time: 300 ms
2004/04/12_14:04:26 Running test Split_brain (sgi1) [17]
2004/04/12_14:04:49 Running test flip (sgi1) [18]
2004/04/12_14:05:24 Running test DetectionTime (sgi2) [19]
2004/04/12_14:05:25 Running test SimulStart (sgi1) [20]
2004/04/12_14:05:42 Running test Restart (sgi1) [21]
2004/04/12_14:06:20 Running test Bandwidth (sgi2) [22]
2004/04/12_14:06:25 ...heartbeat bandwidth: 34427 bits/sec
2004/04/12_14:06:28 Running test standby (sgi1) [23]
2004/04/12_14:06:31 Running test Restart (sgi1) [24]
2004/04/12_14:07:10 Running test Bandwidth (sgi2) [25]
2004/04/12_14:07:18 ...heartbeat bandwidth: 20305 bits/sec
2004/04/12_14:07:21 Running test DetectionTime (sgi1) [26]
2004/04/12_14:07:24 ...failure detection time: 320 ms
2004/04/12_14:07:28 Running test Split_brain (sgi1) [27]
2004/04/12_14:07:49 Running test DetectionTime (sgi1) [28]
2004/04/12_14:07:53 ...failure detection time: 300 ms
2004/04/12_14:07:57 Running test Bandwidth (sgi1) [29]
2004/04/12_14:08:03 ...heartbeat bandwidth: 29263 bits/sec
2004/04/12_14:08:05 Running test Restart (sgi2) [30]
2004/04/12_14:08:43 Running test DetectionTime (sgi1) [31]
2004/04/12_14:08:47 ...failure detection time: 310 ms
2004/04/12_14:08:51 Running test standby (sgi1) [32]
2004/04/12_14:08:54 Running test Split_brain (sgi1) [33]
2004/04/12_14:09:17 Running test DetectionTime (sgi2) [34]
2004/04/12_14:09:20 ...failure detection time: 460 ms
2004/04/12_14:09:24 Running test standby (sgi1) [35]
2004/04/12_14:09:28 Running test DetectionTime (sgi1) [36]
2004/04/12_14:09:31 ...failure detection time: 300 ms
2004/04/12_14:09:35 Running test standby (sgi2) [37]
2004/04/12_14:09:39 Running test Restart (sgi2) [38]
2004/04/12_14:10:17 Running test standby (sgi1) [39]
2004/04/12_14:10:21 Running test SimulStart (sgi2) [40]
2004/04/12_14:10:41 Running test Restart (sgi2) [41]
2004/04/12_14:11:20 Running test Bandwidth (sgi1) [42]
2004/04/12_14:11:26 ...heartbeat bandwidth: 28571 bits/sec
2004/04/12_14:11:28 Running test Restart (sgi1) [43]
2004/04/12_14:12:06 Running test flip (sgi2) [44]
2004/04/12_14:12:42 Running test Bandwidth (sgi1) [45]
2004/04/12_14:12:53 ...heartbeat bandwidth: 16695 bits/sec
2004/04/12_14:12:54 Running test standby (sgi1) [46]
2004/04/12_14:13:06 Running test SimulStart (sgi1) [47]
2004/04/12_14:13:23 Running test standby (sgi2) [48]
2004/04/12_14:13:26 Running test standby (sgi1) [49]
2004/04/12_14:13:30 Running test Bandwidth (sgi1) [50]
2004/04/12_14:13:36 ...heartbeat bandwidth: 28523 bits/sec
Output deleted...
2004/04/12_16:39:38 Running test Split_brain (sgi1) [499]
2004/04/12_16:39:59 Running test Bandwidth (sgi1) [500]
2004/04/12_16:40:05 ...heartbeat bandwidth: 28524 bits/sec
2004/04/12_16:40:07 Stopping Cluster Manager on all nodes
2004/04/12_16:40:12 ****************
2004/04/12_16:40:12 Overall Results:{'failure': 0, 'success': 500, 'BadNews': 0}
2004/04/12_16:40:12 ****************
2004/04/12_16:40:12 Detailed Results
2004/04/12_16:40:12 Test Split_brain:{'elapsed_time': 1570.0471291542053, 'skipped': 0, 'calls': 77, 'success': 77, 'auditfail': 0, 'failure': 0, 'max_time': 51.453474998474121, 'min_time': 18.692641019821167}
2004/04/12_16:40:12 Test standby:{'elapsed_time': 126.56771874427795, 'skipped': 6, 'calls': 57, 'success': 51, 'nostandby': 5, 'standby': 46, 'auditfail': 0, 'failure': 0, 'max_time': 10.525829076766968, 'min_time': 6.5088272094726562e-05}
2004/04/12_16:40:12 Test flip:{'elapsed_time': 2388.8293540477753, 'skipped': 0, 'calls': 75, 'success': 75, 'started': 8, 'down->up': 8, 'auditfail': 0, 'failure': 0, 'stopped': 67, 'max_time': 42.44762396812439, 'min_time': 1.6613788604736328, 'up->down': 67}
2004/04/12_16:40:12 Test SimulStart:{'elapsed_time': 1235.1379368305206, 'skipped': 0, 'calls': 73, 'success': 73, 'stops': 121, 'auditfail': 0, 'failure': 0, 'max_time': 23.456228971481323, 'min_time': 12.120557069778442}
2004/04/12_16:40:12 Test Bandwidth:{'elapsed_time': 581.98309683799744, 'skipped': 1, 'calls': 83, 'success': 82, 'min': 16471.653680795403, 'max': 181804.02544776717, 'totalbandwidth': 2737540.0476286379, 'auditfail': 0, 'failure': 0, 'max_time': 20.161562919616699, 'min_time': 7.7009201049804688e-05}
2004/04/12_16:40:12 Test DetectionTime:{'totaltime': 15.28000000026077, 'elapsed_time': 331.31979942321777, 'skipped': 14, 'calls': 70, 'success': 56, 'min': 0.020000000018626451, 'max': 0.57000000029802322, 'auditfail': 0, 'failure': 0, 'max_time': 21.377086877822876, 'min_time': 8.4161758422851562e-05}
2004/04/12_16:40:12 Test Restart:{'elapsed_time': 2566.4030539989471, 'skipped': 0, 'node:sgi2': 35, 'calls': 65, 'success': 65, 'node:sgi1': 30, 'WasStopped': 5, 'auditfail': 0, 'failure': 0, 'max_time': 87.125192880630493, 'min_time': 36.072878122329712}
2004/04/12_16:40:12 <<<<<<<<<<<<<<<< TESTS COMPLETED
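Each "Detailed Results" line ends in a Python dict literal, so the results can be read back mechanically. This minimal sketch uses `ast.literal_eval`; the regular expression and function name are our own, not part of CTS.

```python
import ast
import re
from typing import Any, Dict, Iterable

# Matches the "Test <name>:{...}" lines in the "Detailed Results" section.
RESULT_RE = re.compile(r"Test (\w+):(\{.*\})\s*$")


def parse_detailed_results(lines: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    """Map each test name to its statistics dictionary."""
    results: Dict[str, Dict[str, Any]] = {}
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            # The statistics are printed as a Python dict literal.
            results[match.group(1)] = ast.literal_eval(match.group(2))
    return results


if __name__ == "__main__":
    log = [
        "2004/04/12_16:40:12 Test Split_brain:{'elapsed_time': 1570.0471291542053, 'skipped': 0, 'calls': 77, 'success': 77, 'auditfail': 0, 'failure': 0, 'max_time': 51.453474998474121, 'min_time': 18.692641019821167}",
        "2004/04/12_16:40:12 Test standby:{'elapsed_time': 126.56771874427795, 'skipped': 6, 'calls': 57, 'success': 51, 'nostandby': 5, 'standby': 46, 'auditfail': 0, 'failure': 0, 'max_time': 10.525829076766968, 'min_time': 6.5088272094726562e-05}",
    ]
    for name, stats in parse_detailed_results(log).items():
        print(f"{name}: {stats['success']}/{stats['calls']} passed, "
              f"{stats['skipped']} skipped, {stats['failure']} failed")
```

In the sample run above, the `calls` values for the seven tests sum to 500, matching the overall count. For every test, `success` plus `skipped` equals `calls`.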
BasicSanityCheck, SyslogNgConfiguration
I need to supply more information here eventually -- AlanRobertson