This page is a work in progress. Feel free to add missing things, correct me where I'm wrong, or clarify...
The linux-ha project claims to add high availability to your services when used the intended way, possibly in conjunction with some related projects.
To verify that it works as intended, we need to test our systems. We have a number of subsystems of varying complexity, more or less tightly interacting with each other.
We want to be able to
 * audit our managed services for proper operation at any time
 * verify the integrity of important things at any time
We ignore the subsystem tests for now (but have a look at the BasicSanityCheck), and concentrate on the "system as a whole" tests.
I apologize that the presented hardware abstractions are Linux-centric. This is just because I know Linux best, and I guess it is the OS of choice for most deployments that use this project anyway.
We expect to have a TestingEnvironment where one non-cluster machine runs the test and is in full and exclusive control of the to-be-tested set of cluster nodes, and possibly some other machines as well (which could represent ping nodes, clients, or whatever we may come up with later). Let's call this controlling and monitoring box (or, equivalently, the controlling software) the Exerciser. Obviously, "full and exclusive control" implies that the Exerciser can issue arbitrary commands as root on all nodes.

=== Mini Howtos ===
An internal or white-box service/resource audit can be done by just calling the <RA> status operation (as appropriate for the resource agent class), which basically means: execute the resource agent scripts on every node with the status query. With the CRM running, we should query the CRM on each node for the CIB, and verify that it is all the same as on the master CRM (the DC), that it matches our own findings from the direct status operations, and, of course, our overall expectations.
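As a sketch of the node-local half of such an audit (the agent directory, the agent invocation, and the output convention are my own assumptions, not part of any existing tool):

```shell
#!/bin/sh
# Run the "status" operation of each named heartbeat-class resource
# agent and print one line per agent.  The Exerciser could run this
# on every node and compare the outputs.  RADIR and the exact output
# format are illustrative assumptions; real agents may also need
# their instance parameters passed before "status".

audit_agents() {
    radir=${RADIR:-/etc/ha.d/resource.d}
    for ra in "$@"; do
        if "$radir/$ra" status >/dev/null 2>&1; then
            echo "$ra: running"
        else
            echo "$ra: stopped (or agent error)"
        fi
    done
}
```

Used as, e.g., `audit_agents IPaddr Filesystem`; collecting and diffing these lines across all nodes would form the cluster-wide part of the audit.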
External or black-box service verification: from the test master (or from a farm of clients controlled by it), issue regular client requests to the cluster services. Trigger faults (as above) and verify that the cluster recovers the service, measuring the time and/or the percentage of failed requests. LarsMarowskyBree would really love this feature; it's the most important testing capability still missing, in his opinion... These application-level audits could be done by scripts in some app-level-audit and/or simul-client-requests directory, which should be named after the respective resource agent scripts.
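A minimal sketch of such a simul-client script, assuming a placeholder probe command (a real one would issue protocol-level requests, HTTP, NFS, or whatever the service speaks, against the real service address):

```shell
#!/bin/sh
# Issue N client requests against a cluster service and report the
# failure percentage.  PROBE and the example address are assumptions;
# override PROBE with the real client request command.
PROBE=${PROBE:-"wget -q -O /dev/null -T 5 http://10.0.0.1/"}

run_client_load() {
    # $1 = number of requests to issue
    n=$1 failed=0 i=0
    while [ "$i" -lt "$n" ]; do
        $PROBE || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "$failed of $n requests failed ($((100 * failed / n))%)"
}
```

The Exerciser would trigger a fault, run something like `run_client_load 100`, and compare the reported failure percentage against the acceptable recovery budget.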
Our hardware abstractions support two operations: fail and heal. Since hardware failures are "force majeure", a fail can happen at any time. Some hardware failures may be "self-healing" under certain circumstances (e.g. a temporary network failure). Most hardware failures in the real world require operator intervention to be resolved. So the Exerciser is free to "decide" after some time that a certain failure has been resolved by a technician, and then tell our abstraction object to heal again.

==== Node ====
Simulating a node failure is relatively easy: halt -nf. This has a drawback: the node would then remain dead. So we rather want to reboot it. But here we have to be careful, since the Exerciser wants to decide when exactly a node is "healed" again.
The nodes need to come up in a certain READY state, which makes it possible for the Exerciser to issue commands on them (to set up our various hardware abstraction objects, or to be "healed" now and (re)join the cluster), while to the other nodes they should remain "invisible" or DEAD... This basically boils down to not starting the cluster manager after reboot, but only on request from the Exerciser. So we now simulate a node failure by reboot -nf (the -nf means non-flushing, forced, which should basically be identical to hitting the reset button). We could of course use a STONITH device too, but those are supposedly controlled by our cluster manager, and we might interfere with it.

==== Storage ====

In Linux, we can abstract block devices with the Linux device mapper. It remaps IO requests to some underlying device(s) following some target mapping scheme. Its two simplest targets are "linear" and "error", and it allows changing the target mapping at runtime. So a working block device will just be mapped transparently linear. When the Exerciser decides that it failed, it remaps the device, or parts of it, to the error target. The next IO request on that device will then fail with an IO error, and we expect that this is recognized somewhere and appropriate action is taken (maybe the node panics, reboots, hangs itself, whatever).

==== Network Links ====

In Linux, we can use a special catch-all iptables rule as the first rule in all relevant chains, and atomically change the target of that rule from ACCEPT to DROP... In case we test HA firewalls, or iptables is otherwise used internally, we need to use RETURN and DROP instead. And we obviously have to make sure that we don't cut off the Exerciser itself, so it can revert that change. This needs to be done on all endpoints that would be affected by a real-world link failure (NIC-specific). Failures of NICs (single endpoints) would be simulated by DROPing packets only via the respective NIC on one specific node. This can easily be extended with a rate-limited DROP rule to "randomly" drop packets and see how the cluster communication layer copes with it.
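As a rough sketch of how these fail/heal operations might be wrapped (this is not part of cts or cth; all names, the sector counts, and the DRYRUN convention are my own assumptions, and a real version must run as root and remember the original device-mapper tables so it can heal them):

```shell
#!/bin/sh
# Hedged sketch of the fail/heal hardware abstractions discussed
# above.  With DRYRUN=1 (the default here) commands are only echoed,
# so the sketch can be exercised without root or a real cluster.
DRYRUN=${DRYRUN:-1}

run() { if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

fail_node() {  # $1 = node name
    run ssh -n "root@$1" reboot -nf
}
heal_node() {  # $1 = node name; the node booted without the CM
    run ssh -n "root@$1" /etc/init.d/heartbeat start
}

fail_disk() {  # $1 = dm device name, $2 = size in 512-byte sectors
    run dmsetup suspend "$1"
    echo "0 $2 error" | run dmsetup reload "$1"
    run dmsetup resume "$1"
}
heal_disk() {  # $1 = dm name, $2 = size in sectors, $3 = backing device
    run dmsetup suspend "$1"
    echo "0 $2 linear $3 0" | run dmsetup reload "$1"
    run dmsetup resume "$1"
}

fail_link() {  # $1 = NIC, $2 = the Exerciser's address, to be spared
    run iptables -I INPUT 1 -i "$1" ! -s "$2" -j DROP
    run iptables -I OUTPUT 1 -o "$1" ! -d "$2" -j DROP
}
heal_link() {  # remove the DROP rules again
    run iptables -D INPUT 1
    run iptables -D OUTPUT 1
}
```

The dry-run wrapper doubles as documentation: running, say, `fail_disk vg0-data 2048` just prints the dmsetup command sequence it would issue.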
We currently have the cts (ClusterTestSuite) and the cth (ClusterTestHarness) ... The cts is intended to be run in the presence of some cluster manager, to verify its proper operation. The cth implements its own sort-of cluster manager (very limited), and is intended to stress particular subsystems or cluster resources (e.g. DRBD) with hardware failures and client load in greater detail.
The cts in its current implementation and concept is the established QA method of this project, and has proven very useful in catching bugs early and making sure that fixed bugs stay fixed. Therefore it obviously cannot readily be replaced, and all its features must be preserved going forward.

All operations and commands of the Exerciser are asynchronous, i.e. whether, or when, they had the desired effect generally has to be verified by some other means. To recognize success or failure of its triggered operations, the cts looks for certain patterns in some consolidated logfile, respecting some timeout. Once you get the log consolidation right (and you want to have this anyway), this is a very cool concept and a handy piece of code. Of course, this imposes the requirement that all actions taken by the Exerciser, and each single event, as well as the respective success or failure, can be precisely identified in this logfile by a simple one-line regular expression search. Though it is possible to wait for N "patterns" in any order, we must make sure that the single patterns, and therefore the log messages, are unambiguous.
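A naive polling version of this pattern-waiting idea can be sketched in a few lines of shell (the real cts follows the log far more cleverly; the LOGFILE default and the example pattern below are assumptions):

```shell
#!/bin/sh
# Wait until an extended regex appears in the consolidated logfile,
# or give up after a timeout.  This re-greps the whole file once per
# second, which is only good enough for a sketch.
LOGFILE=${LOGFILE:-/var/log/ha-log}

wait_for_pattern() {
    # $1 = extended regex, $2 = timeout in seconds
    # Returns 0 as soon as the pattern matches, 1 on timeout.
    deadline=$(( $(date +%s) + $2 ))
    while [ "$(date +%s)" -le "$deadline" ]; do
        grep -Eq "$1" "$LOGFILE" && return 0
        sleep 1
    done
    return 1
}
```

A test class would then do something like `wait_for_pattern 'heartbeat.*node1.*active' 60 || echo "timed out waiting for node1"` after triggering a fault.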
(LarsMarowskyBree thinks this is a fairly severe restriction...) The cts is currently "limited" to predetermined "test classes", which are typically called in some "random" order, with a randomly chosen node as argument in case they need it... If a test is sufficiently intelligent to recognize that it would not be a good idea to run, it can return immediately by just increasing its skipped count. All tests maintain their specific called, success, and failure counts. Currently defined test classes include:
Maybe this should go into the cts page. It is in there anyway, but not as complete as here, and not in my words...
I'd suggest replacing this with a more general hardware abstraction and failing the node, as described above.
Again, I suggest replacing this with a more general hardware abstraction and failing the link or endpoint.
and you use DRBD 0.6.x, where x < 13, since it requires read access to the drbd devices in Secondary state.
pairs within its CM class. I suggest moving them as functions into some bash file, similar to what I did for the cth. Advantage: they can easily be reused by some other testing system, even by simple bash scripts or interactively (again, see how this is done in the cth now).
The CtsLab class should be extended with information about the various hardware components, and maybe the topology, so the abstraction described above can be implemented and one can write test classes that fail/heal some of the hardware components. This can be done similarly to how the cth does it now.
The test classes need to be reviewed for their multi-node (>2) awareness. The audit code needs to be supplemented with CRM-aware audits. Test classes simulating administrative requests should be added. Test classes for simulating client load should be added.
Maybe the cts (the CtsLab) should itself be configured and then push this configuration into the cluster, instead of parsing the config files on the cluster nodes. Though a similar effect could be achieved by some wrapper script around the cts.
(LarsMarowskyBree thinks this is a very important feature to easily test several different scenarios. Pointing the test harness at a bunch of nodes which are setup for ssh login and have all required software installed should be sufficient; all other scenario configuration should, as far as possible, come from the test scenario description.)
The consolidated logfile should include all relevant data, such as the drbd state at critical points, et cetera. If necessary, the test agents should go out and gather this information themselves. But in general, it should be totally unnecessary to manually retrieve additional logfiles to pin-point a problem found by the test harness. If this can easily be enabled in a runtime cluster, it will also ease in-the-field debugging and support. Call this a meta-test of the sensibility of our logging, if you will.
(LarsMarowskyBree would like it even more if the cluster software internally generated this consolidated logfile without requiring a central syslog server to be configured. Or at least we need to make sure customers set this up correctly and thus have a good howto. Note that standard syslog is lossy and thus not a good idea to use.)
There are a number of Mini HowTos explaining this.
to the RandomTests class. It would present an audit/status window and, in some control window, the nodes, comm links, and storage devices, as well as the managed services and maybe some simulated clients. Then one could simply click on some component to fail or heal it, add or remove client load, and see the effect in the audit/status window... With some gtk plugin, this seems an affordable effort. Yes, compared to 500,000 automatic test iterations, the QA effect would be negligible. But to debug specific problems, it would be nice to have. And by its high coolness factor, this gadget would probably be cheerfully used by end users...
... cts <-> cth: some more comments eventually ...