
FAQ

From Linux-HA



No Local Heartbeat

I got this message "ERROR: No local heartbeat. Forcing shutdown" and then Heartbeat shut itself down for no reason at all!

First of all, Heartbeat never shuts itself down for no reason at all. This kind of occurrence indicates that Heartbeat is not working properly, which in our experience can be caused by one of two things:

  • System under heavy I/O load, or
  • Kernel bug.

To deal with the first cause (heavy load), please read the answer to the next FAQ item. If your system was not under moderate to heavy load when it got this message, you probably have the kernel bug. The 2.4.18-2.4.20 Linux kernels had a bug that could keep Heartbeat from being scheduled for very long periods of time when the system was idle, or nearly so. If this is the case, you need to get a kernel that isn't broken.

Heavy Load

How to tune Heartbeat on a heavily loaded system to avoid split-brain?

"No local heartbeat" or "Cluster node returning after partition" under heavy load is typically caused by too small a deadtime interval, or by an older version of Heartbeat. Make sure you're running at least version 3.0.2. Here is a suggestion for how to tune deadtime:

If you never saw a "late heartbeat" message, then your chosen deadtime is fine - use it. Otherwise,

  • Set your deadtime to 1.5-2 times the longest heartbeat delay reported in those messages.
  • Set warntime to keepalive*2.
  • Continue to monitor logs for warnings about long heartbeat times. If you don't do this, you may get "Cluster node ... returning after partition", which will cause Heartbeat to restart on all machines in the cluster. This will almost certainly annoy you at a minimum.
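
As a rough illustration of the tuning steps above, a ha.cf fragment might look like the following. The numbers are assumptions chosen only for the example: a 2-second keepalive and a longest observed late heartbeat of about 10 seconds. Your own logs will give different figures.

```
# Illustrative values only; substitute what your own logs show.

# Seconds between heartbeats (assumed for this example).
keepalive 2

# keepalive*2: warn about a late heartbeat after 4 seconds.
warntime 4

# 2 x the assumed 10-second worst-case delay seen in the logs.
deadtime 20
```

Starting at the upper end of the 1.5-2 range leaves more headroom under load, at the cost of slower failure detection. You can tighten it later if the logs stay quiet.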

Adding memory to the machine generally helps, as does limiting the workload on it. Newer versions of Heartbeat are better about this than pre-3.0.x versions. Some customers report being able to set sub-second deadtimes in their applications. YMMV (!)

TTY timeout

I got this message "TTY write timeout on [/dev/ttyxxx]" but both nodes are up and I tested my serial cable

If both nodes are up, and your serial cable passes data, then the most probable explanation for the problem is that the serial cable does not pass the CTS and RTS leads through from end to end properly. Heartbeat requires these leads in order to avoid data loss.

How to use Heartbeat with an Ipchains firewall?

To make Heartbeat work with Ipchains, you must accept incoming and outgoing traffic on UDP port 694. Add something like the following to your firewall rules:

/sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d <dest_IP>  -j ACCEPT
/sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d <dest_IP>  -j ACCEPT

How to run multiple clusters on the same network segment?

Use multicast and give each cluster its own multicast group. If you need or want to use broadcast, then run each cluster on a different port number. An example of a configuration using multicast would be to have the following line in your ha.cf file:

mcast eth0 224.1.2.3 694 1 0

This sets eth0 as the interface over which to send the multicast, 224.1.2.3 as the multicast group (the same on every node in the same cluster), UDP port 694 (the Heartbeat default), a time to live of 1 (which limits the multicast to the local network segment so that it does not propagate through routers), and multicast loopback disabled (typical).
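
To make the separation concrete, here is a sketch of the relevant ha.cf lines for two clusters sharing one segment. The second group address 224.1.2.4 and the alternate port 695 are arbitrary values picked for illustration, not recommendations.

```
# Cluster A, on every node of cluster A:
mcast eth0 224.1.2.3 694 1 0

# Cluster B, on every node of cluster B (different multicast group):
mcast eth0 224.1.2.4 694 1 0

# Broadcast alternative: separate the clusters by port instead.
# Cluster A:
#   udpport 694
#   bcast eth0
# Cluster B:
#   udpport 695
#   bcast eth0
```

Whichever method you choose, all nodes within one cluster must use identical settings, and no two clusters on the segment should share the same group or port.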
