In an idealized view, a high-availability system is supposed to run for 100 years across many different revisions of applications, operating systems, hardware, architecture, and networking.
This means that HA systems must contain very few errors, and they must report what they are doing so that administrators can respond correctly. We want to be able to diagnose half or more of a system's problems without turning on debugging.
Both of these require proper logging, done right. Since debug-level logging and "normal" logging have some different characteristics, we will discuss them separately.
The audience for non-debug logging is the SysAdmin and/or systems management tools. Ideally, for a "simple" transition, we should not produce more than a dozen or so log messages at this level. Beyond that, the messages clutter the system administrator's logs with information they don't know what to do with, and the only thing this accomplishes is teaching the SysAdmin to ignore Linux-HA messages.
CRIT: Get a system administrator now. Something is in dire need of fixing. This is an abnormal condition which has probably resulted in the loss of data integrity or service, and probably reflects an unrecovered (or poorly recovered) error.
ERR: This should not happen, but the software may have recovered from it correctly. Due to the nature of error recovery, the recovery may or may not be satisfactory. System service may be impaired and recovery is not certain. These may be the result of internal audits which indicate that the software does not believe itself sane. ERR and CRIT messages deserve administrator attention.
WARN: Something has happened which is noteworthy, but with which the software is fully prepared to cope. Such errors might be classified as normal, minor, or expected. But they may reflect some underlying condition which ought to be dealt with in the near future. For example, in an HA system, a machine crashing is a WARN-level condition, because we're fully prepared to deal with it.
INFO: Something has occurred which is not considered an error (by itself), and this message provides information about exactly what has occurred for the SysAdmin. It is not generally something to worry about. It may be accompanied by other messages which are cause for concern. The intent of such a message is similar to "thought you might like to know...".
Note that the boundary between ERR and WARN is of special importance. Anything which is marked ERR or CRIT will cause a test to be marked as failed. Whenever there is any doubt when writing the code, make the message ERR rather than WARN, so that we are forced to investigate the situation more fully when it actually occurs.
If, after investigating, running CTS, and examining the situation carefully and thoughtfully, you decide that a particular condition should not result in automatic failure of a test, it is appropriate to change a message from ERR to WARN.
However, this should never be done lightly - as mistakes in classification may result in masking problems. Since our goal is to run 100 years without stopping, we cannot afford to mask any errors. Only our designers and coders can make this decision, and their integrity is something we rely on - as well as their skill.
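The rule that any ERR or CRIT message fails a test can be sketched as a simple scan of the logs. This is only an illustration: the log format (a severity token such as "ERR:" inside each line), the sample messages, and the function names are assumptions made for the example, not the actual checks that CTS performs.

```python
import re

# Assumption: each log line carries its severity as a token such as
# "ERR:" or "CRIT:". The real log format and the real CTS checks may differ.
FAILING = re.compile(r"\b(ERR|CRIT):")


def failing_messages(log_lines):
    """Return the log lines that would cause a test to be marked as failed."""
    return [line for line in log_lines if FAILING.search(line)]


def run_is_clean(log_lines):
    """A run is clean only if no ERR or CRIT message was logged."""
    return not failing_messages(log_lines)


# Hypothetical log excerpt, invented for illustration.
sample = [
    "node1 crmd: INFO: transition started",
    "node1 crmd: WARN: node2 has crashed",
    "node1 crmd: ERR: could not stop PE, already disconnected",
]
```

With a check of this kind, downgrading a message from ERR to WARN silently removes it from the list of failures, which is why the decision deserves so much care.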
One may discover that a low-level function is reporting something as an error which is not always an ERR-level condition. In this case, deciding to raise an ERR can be delegated to the caller, who probably has more context in which to make this decision.
For example, one might have a library function which fetched the value of a variable. From the library layer, attempting to fetch a value which isn't there is not an ERR type condition, as it doesn't know whether the value ought to be there. But, the caller probably does, and should make the decision about the severity of that data item not being present. So, it would be inappropriate for such a library function to print an ERR for this condition. But, it would be appropriate for it to print an ERR (or maybe even a CRIT) if the data structures which it was given to operate on are corrupted.
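A minimal sketch of this division of responsibility follows. The names (fetch_value, required_value, the "ha-example" logger) are hypothetical and chosen only for illustration, and Python's standard logging module stands in for whatever logging calls the real code uses.

```python
import logging

log = logging.getLogger("ha-example")  # hypothetical logger name


def fetch_value(table, name):
    """Library layer: return the value stored under name, or None if absent.

    A missing value is not logged here, because this layer cannot know
    whether the value ought to be present.
    """
    if not isinstance(table, dict):
        # Corrupted input is an error no matter who the caller is.
        log.error("fetch_value: table is corrupted (got %s)",
                  type(table).__name__)
        raise TypeError("table must be a dict")
    return table.get(name)


def required_value(table, name):
    """Caller: knows the value is mandatory, so its absence is an ERR."""
    value = fetch_value(table, name)
    if value is None:
        log.error("required value '%s' is missing", name)
    return value
```

A caller for which the value is optional would call fetch_value directly and log nothing, or at most an INFO message, when it gets None back.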
Recently the CRMd was complaining that when it tried to stop the PE, the PE had already disconnected. It was very tempting to think that stopping something that is already stopped isn't really an ERR and to change it to a warning. However, by _not_ doing that, I discovered that the PE and TE were shutting down because of heartbeat's SIGTERM, before the CRMd was actually ready for them to stop.
Hopefully this emphasises Alan's point about being very sure that the log message really isn't an ERR before you change it to something else. A skeptical attitude and an eye for detail are great assets here.
In about half of the circumstances, it should be possible to debug a problem without any special debug information beyond normal logs. But, for the other half, we have additional debug information.
Debug level 1: This is intended to produce hundreds or at most a few thousand lines of output for each transition. Nothing should be logged in level 1 which occurs continually (i.e., at times without a transition). It is hoped that 80% or more of problems can be debugged with debug level 1 or less. The system can operate usably (if verbosely) for long periods of time at debug level 1.
Debug level 2: This is intended to produce a great deal more detail than level 1 but not millions of lines for a single transition. It is probable that the system cannot operate usefully for more than a few hours at debug level 2.
Higher levels: Higher debug levels should mainly be needed while doing unit testing, debugging libraries, and individual developer testing. It is probable that the system can operate practically for only a few minutes at such levels.
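One way to picture graded debug levels is a wrapper that drops any message whose level is above the configured one. The sketch below is an assumption-laden illustration: the DebugLog class, its integer level (0 meaning debugging is off), and the example messages are invented here and are not the project's actual interface.

```python
import logging


class DebugLog:
    """Emit a debug message only if its level is enabled (0 = debug off)."""

    def __init__(self, logger, level=0):
        self.logger = logger
        self.level = level  # configured debug level, a hypothetical setting

    def debug(self, level, msg, *args):
        """Log msg if level <= configured level; report whether it was sent."""
        if level > self.level:
            return False
        self.logger.debug(msg, *args)
        return True


dbg = DebugLog(logging.getLogger("ha-debug-example"), level=1)
dbg.debug(1, "transition %d: resource %s started", 7, "ip0")  # passed on
dbg.debug(2, "raw message dump: %r", b"\x00\x01")             # suppressed
```

Under this scheme, a message that fires continually, even when no transition is in progress, would be given level 2 or higher so that level 1 stays usable for long periods.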