Linux-HA Architecture (Release 2)

Release 2 has a number of components which implement its high-availability cluster management capabilities. This page provides an overview of the architecture into which these components fit.

The components are as follows:

  • heartbeat - strongly authenticated communications module

  • CRM - Cluster Resource Manager

    • PE - CRM Policy Engine

    • TE - CRM Transition Engine

    • CIB - Cluster Information Base

  • CCM - Strongly Connected Consensus Cluster Membership

  • LRM - Local Resource Manager

  • Stonith Daemon - provides node reset services

  • logd - non-blocking logging daemon

  • apphbd - application level watchdog timer service

  • Recovery Manager - application recovery service

  • Infrastructure - shared plugin, IPC and plumbing libraries

  • CTS - Cluster Testing System - cluster stress tests

  • Special Glib notes

Architecture Diagram

heartbeat - strongly authenticated communication module

Function: startup, shutdown, strongly authenticated communication.
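
The authentication side is driven by a shared-secret key file (authkeys). A minimal sketch - the signing method and passphrase below are placeholders:

  auth 1
  1 sha1 SomeSecretPassPhrase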

Provides locally-ordered multicast messaging over basically any media - IP-based or not.

Currently talks over the following media plugin types:

  • unicast UDP ipv4

  • broadcast UDP ipv4

  • multicast UDP ipv4

  • serial links (non-IP) - useful alongside firewalls and other setups that manipulate iptables.

  • openais - uses OpenAIS' evs communication layer as a communication medium.

  • Pseudo-membership communication services:
    • ping - pings an individual router - allows us to treat it as a pseudo-member.

    • ping_group - similar to ping, except that the group is considered up if any member of the group responds

    • hbaping - "pings" a Fibre Channel disk to verify connectivity

Heartbeat can reliably detect node failure in less than half a second. With the low-latency patches (and perhaps a bug fix or two), that time could be lowered significantly.

It will register with the system watchdog timer if configured to do so.
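
The media above, the failure-detection timing and the watchdog registration are all declared in ha.cf. An illustrative fragment - interface names, addresses and timings are placeholders:

  bcast  eth0                      # broadcast UDP IPv4
  mcast  eth0 239.0.0.1 694 1 0    # multicast UDP IPv4: device, group, port, ttl, loop
  ucast  eth1 192.168.1.2          # unicast UDP IPv4 to a specific peer
  serial /dev/ttyS0                # non-IP serial link
  ping   192.168.1.254             # treat this router as a pseudo-member
  ping_group routers 192.168.1.254 192.168.1.253   # up if any member answers
  keepalive 1                      # seconds between heartbeat packets
  deadtime 10                      # declare a node dead after this long
  watchdog /dev/watchdog           # register with the system watchdog timer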

The heartbeat layer has an API which provides the following classes of services:

  • intra-cluster communication - sending and receiving packets to cluster nodes
  • configuration queries
  • connectivity information (which nodes the current node can hear packets from) - available both as queries and as state change notifications
  • basic group membership services

Cluster Resource Manager

The Cluster Resource Manager (CRM) is the brains of Linux-HA. It maintains the resource configuration, decides which resources should run where, and works out how to move from the current state to one in which every resource is running where it ought to be. In addition, it supervises the LRM in carrying all of this out. The CRM interacts with every component in the system:

  • It uses heartbeat for communication

  • It receives membership updates from the CCM.

  • It directs the work of and receives notifications from the LRM.

  • It tells the Stonith Daemon when and what to reset

  • It logs using the logging daemon

In effect, the PE, TE, and CIB can be viewed as components of the CRM.

CRM Policy Engine (PE)

The primary goal of the policy engine is to compute a transition graph from the current state of the world. This transition graph takes into account the current location and state of resources, availability of nodes and current static configuration (aka the CurrentClusterState).

CRM Transition Engine (TE)

The Transition Engine effectively executes the transition graph produced by the PE - to make its dreams a reality. That is, it takes the computed next state for the cluster, together with the list of actions, and attempts to reach that state by instructing the LRMs on the cluster nodes to start and stop resources.

Cluster Information Base (CIB)

The CIB is an automatically replicated repository for cluster resource and node information as seen by the CRM. This information includes static information such as dependencies, and more dynamic information such as what resources are running where, and what their states are.

All the information in the CIB is represented in XML for maximum usability. We have an annotated DTD which defines all the information we currently manage in the CIB.

Because of this, many of the features of the CRM can be understood by reading the annotated DTD.
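
As a rough illustration of the kind of XML the CIB holds - the annotated DTD is the authoritative definition, and the identifiers and values below are placeholders:

  <cib>
    <configuration>
      <crm_config/>
      <nodes/>
      <resources>
        <primitive id="web-ip" class="ocf" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="web-ip-params">
            <attributes>
              <nvpair id="web-ip-addr" name="ip" value="192.168.1.10"/>
            </attributes>
          </instance_attributes>
        </primitive>
      </resources>
      <constraints/>
    </configuration>
    <status/>  <!-- dynamic state: which resources are running where -->
  </cib>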

Consensus Cluster Membership

Provides strongly connected consensus cluster membership services. Ensures that every node in a computed membership can talk to every other node in this same membership. Implements both an OCF draft membership API, and the SAF AIS membership API. Typically it computes membership in sub-second time.

Local Resource Manager

The Local Resource Manager is basically a resource agent abstraction. It starts, stops and monitors resources as directed by the CRM.

It has a plugin architecture, and can support many classes of resource agents. The types currently supported include:

  • OCF - Open Cluster Framework
  • heartbeat (release 1) style resource agents
  • LSB - Linux Standards Base (normal init script) resource agents
  • stonith - resource agent interface for instantiating STONITH objects to be used by the Stonith Daemon

Other types are readily added for compatibility with other systems.
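
As a concrete sketch of what the LRM expects from an OCF-class resource agent: the action name arrives as the first argument, resource parameters arrive in OCF_RESKEY_* environment variables, and the exit code reports the result. The agent below is illustrative only; its name and meta-data are placeholders:

  #!/bin/sh
  # Sketch of an OCF-class resource agent as invoked by the LRM.
  # Parameters, if any, arrive as OCF_RESKEY_<name> environment variables.
  case "$1" in
    start)     # start the service; exit 0 on success
               exit 0 ;;
    stop)      # stop the service; exit 0 on success
               exit 0 ;;
    monitor)   # exit 0 if running, 7 (OCF "not running") if cleanly stopped
               exit 7 ;;
    meta-data) # print an XML description of the agent and its parameters
               echo '<resource-agent name="Sketch" version="0.1"/>'
               exit 0 ;;
    *)         exit 3 ;;   # OCF "unimplemented" for anything else
  esac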

Stonith Daemon

The STONITH daemon provides cluster-wide node reset facilities using the improved release 2 stonith API.

The Stonith library includes support for around a dozen types of 'C' STONITH plugins and native support for script-based plugins - allowing scripts in any scripting language to be treated as full-fledged STONITH plugins.
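
A script-based plugin is just an executable which is invoked with the requested operation as its first argument. A hypothetical sketch - check the script plugin documentation for the exact sub-command set; the host names and device handling below are placeholders:

  #!/bin/sh
  # Hypothetical script-based STONITH plugin.
  case "$1" in
    gethosts)      echo "node1 node2" ;;   # nodes this device can reset
    reset|off|on)  # talk to the power switch or management board here
                   exit 0 ;;
    status)        exit 0 ;;               # is the device itself reachable?
    *)             exit 1 ;;
  esac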

The Stonith Daemon locks itself into memory, and 'C'-based plugins are pre-loaded so that no disk I/O is required for a STONITH operation using 'C' methods.

The Stonith Daemon provides full support for asymmetric STONITH configurations - allowing for the possibility that certain STONITH devices may be accessible from only a subset of the nodes in the cluster.

Note that it is not currently intended that the Stonith Daemon be used for providing resource-granular fencing. Current thinking calls for resource-granular fencing to be done by clone resource agents: the resources which need fencing depend on these fencing clones, and clone resource agents are notified when their various peer resources start and stop.

logd - non-blocking logging daemon

Can log to syslog, to files, or to both. logd never blocks.

Rather than block, logd drops messages if they fall too far behind; a count of the messages lost is printed the next time output is possible. Queue sizes are controllable per application - and overall.
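
logd reads a small configuration file (logd.cf); the sketch below shows the sort of directives involved - names and values are illustrative and should be checked against the shipped sample file:

  logfile     /var/log/ha-log      # plain-file log destination
  debugfile   /var/log/ha-debug    # debug-level messages
  logfacility daemon               # also log via syslog with this facility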

Application Heartbeat Daemon (apphbd)

The application heartbeat daemon is a general service which provides watchdog timer facilities for individual HA-aware applications. When applications fail to check in with it within their prescribed time, interested parties are notified and (presumably) recovery actions are taken. This daemon is as simple as we can make it, so that it can be the most reliable component in the system. Many Linux-HA system components tie into it, but it is not commonly enabled in the field at this writing. It will register with the system watchdog timer if requested to do so.

Recovery Manager Daemon

The recovery manager daemon is notified by apphbd when a process fails to heartbeat or exits unexpectedly. It then takes actions to (kill and) restart the application.

Infrastructure

A key part of our success in implementation comes from having a consistent, flexible, reliable and general infrastructure underneath all the major components.

This infrastructure has several key elements:

  • A flexible general plugin mechanism
  • Non-blocking IPC layer
  • Complete avoidance of threads (so far)
  • Use of Glib mainloop as our uniform dispatching (scheduling) and event processing method

  • Significant consistency and integration with each other

PILS - Plugin and Interface Loading System

PILS provides a very general plugin loading system which is used extensively in Linux-HA.

This allows for great flexibility and power in the system while minimizing the size of the running system, and it tends to improve the architecture of the subsystems which use plugins - all while keeping the resource usage of the Linux-HA system on the host servers low.

An unexpected benefit of plugins is that almost all the contributions from non-core members come in the area of plugins.

Plugins are currently used in the following areas: communication, authentication, stonith, resource agents, compression, apphbd notification methods.

IPC Library

All interprocess communication is performed using a very general IPC library which provides non-blocking access to IPC using a flexible queueing strategy, and includes integrated flow control. This IPC API does not require sockets, but the currently available implementations use UNIX (Local) Domain sockets.

This API also includes built-in authentication and authorization of peer processes, and is portable to most POSIX-like OSes.

Although use of mainloop with these APIs is not required, we provide simple and convenient integration with mainloop.

Cluster Plumbing Library

The Cluster plumbing library is a collection of very useful functions which provide a variety of services used by many of our main components. A few of the major objects provided by this library include:

  • compression API (with underlying compression plugins)
  • Non-blocking logging API
  • memory management oriented to continuously running services
  • Hierarchical name-value pair messaging facility - promoting portability and version upgrade compatibility. Also provides optional message compression facilities.
  • Signal unification - allowing signals to appear as mainloop events
  • Core dump management utilities - promoting capture of core dumps in a uniform way, and under all circumstances
  • timers (like glib mainloop timers - but they work even when the time of day clock jumps)
  • child process management - the death of a child causes the corresponding process object to be invoked, with configurable death-of-child messages.
  • Triggers - arbitrary events triggered by software
  • Realtime management - setting and unsetting high priorities and locked-in-memory attributes of processes.
  • 64-bit HZ-granularity time manipulation (longclock_t)
  • User id management for security purposes - for processes which need some root privileges.
  • Mainloop integration for IPC, plain file descriptors, signals, etc. This means that all these different event sources are managed and dispatched consistently.

Cluster Testing System

There are two kinds of bugs one finds in reports from users - those that one would not expect to find during testing and those which really should have gotten caught during testing. Linux-HA has a very low bug rate overall, and an extremely low number of bugs of the "should have gotten caught" category.

The Cluster Testing System (CTS) is the primary reason for these low bug rates.

CTS is an automated cluster testing system which runs random stress tests on the cluster. Although it is in most ways a modest system with what seem like largely straightforward tests, it has proven extremely effective in practice.

Its basic strategy is: beat the software to death. Such testing has sometimes been called Bamm-Bamm testing.

CTS is an example of a system where the whole is more than simply the sum of its parts.

Glib Usage

The Linux-HA project makes extensive use of version 2 of the Gnome Glib library, and special use is made of its mainloop event-processing structure.

The use of mainloop has made many things much easier and more uniform, and has allowed us (so far) to completely avoid threads and their attendant portability and debugging difficulties.

Data Flow through R2

  1. All communication starts with the heartbeat layer, and every component which communicates with other cluster members does it through the heartbeat layer. In addition, the heartbeat layer provides connectivity information indicating when communication with another node is lost, and when it's restored.

  2. These connectivity change events are then given to the membership layer (CCM), which then sends packets to its peers in the cluster and figures out exactly which nodes are in the current membership, and which ones aren't.
  3. Once it has computed a new membership, the CCM notifies its clients of the change in membership. Two of the CCM's most important clients are the CRM and the CIB.
  4. When it receives a new membership from the CCM, the CIB updates itself with the information from the latest membership update and informs its clients that the CIB has changed.
  5. When the CRM notices that the CIB has changed, it then notifies the policy engine (PE).
  6. The PE then looks at the CIB (including the status section) and sees what things need to be done to bring the state of the cluster (as shown by the status section of the CIB) in line with the defined policies (found in the configuration section of the CIB).
  7. The PE then creates a directed graph of actions that need to be taken (if any) to bring the cluster into line with the policy, and gives them to the CRM.
  8. The CRM then gives these actions to the transition engine (TE) which then proceeds to carry them out.
  9. The TE then (through the CRM) directs the various LRMs across the cluster to take the specified actions.
  10. Each time an action completes or times out, the TE gets notified of the status.
  11. The TE then continues to give more actions to the LRMs according to the graph it was given.
  12. When all actions have been completed, the TE reports success back to the CRM.