This site best when viewed with a modern standards-compliant browser. We recommend Firefox Get Firefox!.

Linux-HA project logo
Providing Open Source High-Availability Software for Linux and other OSes since 1999.

USA Flag UK Flag

Japanese Flag


About Us

Contact Us

Legal Info

How To Contribute

Security Issues

This web page is no longer maintained. Information presented here exists only to avoid breaking historical links.
The Project stays maintained, and lives on: see the Linux-HA Reference Documentation.
To get rid of this notice, you may want to browse the old wiki instead.

1 February 2010 Hearbeat 3.0.2 released see the Release Notes

18 January 2009 Pacemaker 1.0.7 released see the Release Notes

16 November 2009 LINBIT new Heartbeat Steward see the Announcement

Last site update:
2020-01-22 06:18:49

SFEX (Shared Disk File EXclusiveness Control Program)


Basic Consept

  • SFEX is resource which control ownership of shared disk.
  • SFEX uses an special partition on the shared disk, and maintains the following data.
    • "status" shows whether a disk is owned by somebody.
    • "node" shows the node name to own the disk.
    • "count" is used for judgment that owner node is up or down.
  • Typically, resources which use data partition on the shared disk(like PostgreSQL) make resource group with SFEX.
  • Resource of node which is holding ownership can access data partition.

  • When node can get ownership?
    • case1: Nobody has ownership.
    • case2: Node can judge another node is down.

sequence diagram

start up process

SFEX can start on the node which has the highest score in cib.xml because more than one nodes do not access shared disk at the same time.

Node A

  1. SFEX reads data from shared disk, and get "status". Usually "status" is "NO_OWNED" because nobody has owned shared disk.
  2. Writes data that include node=Node A and status=OWNED.
  3. Reads data again, and get "node=Node A".
  4. Compareses it with my node name. If node name has not been changed, Node A get ownership!!
  5. SFEX increments "count" on the shared disk by monitor processing of heartbeat. This processing means the update of ownership.

Heartbeat communication failuer

Node A

  1. SFEX updates ownership by HB monitor processing.

Node B

  1. When heartbeat communication fail, standby node(Node B) starts resources.

  2. SFEX reads data on the sheard disk.
  3. Waits a while. Wait time should be longer than sfex monitor interval. By this wait time, it waits for periodical update from Node A and confirms that Node A maintains ownership.
  4. Reads data again.
  5. Checks value of new "count". When the values of two "count" are different, it is able to think that Node A is up.
  6. SFEX starts up process is stopped.

Active Node failure

Node A

  1. Node A is downed by failure.

Node B

This Node B start up in the same way as HB communication failure.

  1. Waits for a while. It waits for periodical update from Node A but confirms that Node A does not it.
  2. SFEX reads data again.
  3. Checks value of new "count". The values of two "count" are SAME, it is able to think that Node A is DOWN.
  4. Writes data that include node=Node B and status=OWNED.
  5. Reads data again.
  6. Compareses it with my node name. If node name has not been changed, Node B get ownership!!
  7. Afterwards, other resources start.

Disk access on the same time

This is hardly generated. However for example, this case occurs when multiple nodes start up at the same time without heartbeat communication.

Node A / Node B

Writing to shared disk is serialized finally because writable area is "one". As a result, the node name written at the last time remains. In this example, Node B remains.

  1. Read data again
  2. Node A: value of "owner" is changed. this node does not get ownership. Node B: value of "owner" is name of Node B. Node B get ownnership!!

sample cib.xml

 <cib admin_epoch="0" epoch="1" have_quorum="false" cib_feature_revision="1.3">
      <cluster_property_set id="set01">
          <nvpair id="symmetric-cluster"
            name="symmetric-cluster" value="true"/>
          <nvpair id="no-quorum-policy"
            name="no-quorum-policy" value="ignore"/>
          <nvpair id="stonith-enabled"
            name="stonith-enabled" value="false"/>
          <nvpair id="short-resource-names"
            name="short-resource-names" value="true"/>
          <nvpair id="is-managed-default"
            name="is-managed-default" value="true"/>
          <nvpair id="default-resource-stickiness"
            name="default-resource-stickiness" value="INFINITY"/>
          <nvpair id="stop-orphan-resources"
            name="stop-orphan-resources" value="true"/>
          <nvpair id="stop-orphan-actions"
            name="stop-orphan-actions" value="true"/>
          <nvpair id="remove-after-stop"
            name="remove-after-stop" value="false"/>
          <nvpair id="default-resource-failure-stickiness"
            name="default-resource-failure-stickiness" value="-INFINITY"/>
          <nvpair id="stonith-action"
            name="stonith-action" value="reboot"/>
          <nvpair id="default-action-timeout"
            name="default-action-timeout" value="120s"/>
          <nvpair id="dc-deadtime"
            name="dc-deadtime" value="10s"/>
          <nvpair id="cluster-recheck-interval"
            name="cluster-recheck-interval" value="0"/>
          <nvpair id="election-timeout"
            name="election-timeout" value="2min"/>
          <nvpair id="shutdown-escalation"
            name="shutdown-escalation" value="20min"/>
          <nvpair id="crmd-integration-timeout"
            name="crmd-integration-timeout" value="3min"/>
          <nvpair id="crmd-finalization-timeout"
            name="crmd-finalization-timeout" value="10min"/>
          <nvpair id="cluster-delay"
            name="cluster-delay" value="180s"/>
          <nvpair id="pe-error-series-max"
            name="pe-error-series-max" value="-1"/>
          <nvpair id="pe-warn-series-max"
            name="pe-warn-series-max" value="-1"/>
          <nvpair id="pe-input-series-max"
            name="pe-input-series-max" value="-1"/>
          <nvpair id="startup-fencing"
            name="startup-fencing" value="true"/>
      <group id="grpPostgreSQLDB">
        <primitive id="prmExPostgreSQLDB" class="ocf" type="sfex" provider="heartbeat">
            <op id="exPostgreSQLDB_start"
              name="start" timeout="180s" on_fail="fence"/>
            <op id="exPostgreSQLDB_monitor"
              name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
            <op id="exPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          <instance_attributes id="atrExPostgreSQLDB">
              <nvpair id="dskPostgreSQLDB"
                name="device" value="/dev/cciss/c1d0p1"/>
              <nvpair id="idxPostgreSQLDB"
                name="index" value="1"/>
              <nvpair id="cltPostgreSQLDB"
                name="collision_timeout" value="1"/>
              <nvpair id="lctPostgreSQLDB"
                name="lock_timeout" value="70"/>
              <nvpair id="mntPostgreSQLDB"
                name="monitor_interval" value="10"/>
              <nvpair id="fckPostgreSQLDB"
                name="fsck" value="/sbin/fsck -p /dev/cciss/c1d0p2"/>
              <nvpair id="fcmPostgreSQLDB"
                name="fsck_mode" value="check"/>
              <nvpair id="hltPostgreSQLDB"
                name="halt" value="/sbin/halt -f -n -p"/>
        <primitive id="prmFsPostgreSQLDB" class="ocf" type="Filesystem" provider="heartbeat">
            <op id="fsPostgreSQLDB_start"
              name="start" timeout="60s" on_fail="fence"/>
            <op id="fsPostgreSQLDB_monitor"
              name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
            <op id="fsPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          <instance_attributes id="atrFsPostgreSQLDB">
              <nvpair id="devPostgreSQLDB"
                name="device" value="/dev/cciss/c1d0p2"/>
              <nvpair id="dirPostgreSQLDB"
                name="directory" value="/mnt/shared-disk"/>
              <nvpair id="fstPostgreSQLDB"
                name="fstype" value="ext3"/>
        <primitive id="prmIpPostgreSQLDB" class="ocf" type="IPaddr" provider="heartbeat">
            <op id="ipPostgreSQLDB_start"
              name="start" timeout="60s" on_fail="fence"/>
            <op id="ipPostgreSQLDB_monitor"
              name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
            <op id="ipPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          <instance_attributes id="atrIpPostgreSQLDB">
              <!-- chenge ip address attribute -->
              <nvpair id="ipPostgreSQLDB" name="ip" value="aaa.bbb.ccc.ddd"/>
              <nvpair id="maskPostgreSQLDB" name="netmask" value="nn"/>
              <nvpair id="nicPostgreSQLDB" name="nic" value="bond0"/>
        <primitive id="prmApPostgreSQLDB" class="ocf" type="pgsql" provider="heartbeat">
            <op id="apPostgreSQLDB_start"
              name="start" timeout="60s" on_fail="fence"/>
            <op id="apPostgreSQLDB_monitor"
              name="monitor" interval="30s" timeout="60s" on_fail="fence"/>
            <op id="apPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          <instance_attributes id="atrApPostgreSQLDB">
              <nvpair id="pgctl01"
                name="pgctl" value="/usr/local/pgsql/bin/pg_ctl"/>
              <nvpair id="psql01"
                name="psql" value="/usr/local/pgsql/bin/psql"/>
              <nvpair id="pgdata01"
                name="pgdata" value="/mnt/shared-disk/pgsql/data"/>
              <nvpair id="pgdba01"
                name="pgdba" value="postgres"/>
              <nvpair id="pgdb01"
                name="pgdb" value="template1"/>
              <nvpair id="logfile01"
                name="logfile" value="/var/log/pgsql.log"/>
      <rsc_location id="rlcPostgreSQLDB" rsc="grpPostgreSQLDB">
        <rule id="rulPostgreSQLDB_node01" score="200">
          <expression id="expPostgreSQLDB_node01"
            attribute="#uname" operation="eq" value="sfex01" />
        <rule id="rulPostgreSQLDB_node02" score="100">
          <expression id="expPostgreSQLDB_node02"
            attribute="#uname" operation="eq" value="sfex02"/>
      <rsc_location id="ping1:disconn" rsc="grpPostgreSQLDB">
        <rule id="ping1:disconn:rule" score="-INFINITY" boolean_op="and">
          <expression id="ping1:disconn:expr:defined"
            attribute="default_ping_set" operation="defined"/>
          <expression id="ping1:disconn:expr:positive"
            attribute="default_ping_set" operation="lt" value="100"/>

Release Notes