
Db2 (resource agent)

From Linux-HA



DB2 Cluster on shared storage

Scenario

A DB2 instance should be made highly available with two servers and shared storage. If a server or the instance fails, Pacemaker will restart the instance or relocate it to the surviving server. Clients connect to a service address that is always active together with the healthy instance.

Functions of the agent

When Pacemaker calls the agent's start method, the DB2 instance is started and all databases are activated (a subset of databases can be selected with the parameter dblist in version 1.0.5 or later). The monitor method checks whether the instance is running and then probes individual databases by selecting from internal tables. The stop method tries everything to bring the instance down: first db2stop force, and if that doesn't work or hangs, db2_kill. As a last resort Pacemaker will reboot the node with STONITH. Ultimately, bringing an instance down on one node is mandatory before starting it elsewhere.
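The stop method's escalation follows a simple try-then-force pattern. The sketch below only illustrates that pattern; try_stop and hard_kill are placeholder functions standing in for "su - db2inst1 -c 'db2stop force'" and db2_kill, not the agent's actual code:

```shell
#!/bin/sh
# Placeholder sketch of the stop escalation; try_stop stands in for
# "su - db2inst1 -c 'db2stop force'", hard_kill for db2_kill.
try_stop() { return 1; }                      # pretend the graceful stop failed
hard_kill() { echo "escalated to db2_kill"; } # the forceful fallback

try_stop || hard_kill
```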

Prerequisites / Assumptions

The environment for this example is:

nodes
    node-a and node-b
IP-address for instance
    192.168.178.17/24 with DNS entry ha-inst1
DB2 instance user
    db2inst1
    The home directory of db2inst1 is /db2/db2inst1 and must be on shared storage.

On node-a install the DB2 software (on shared storage!) and create the instance db2inst1 as specified in the DB2 documentation (e.g. DB2 Information Center[1]).

Configure IP-address ha-inst1 on node-a for now.

Cluster-enable your DB2 instance

Instance creation (db2icrt) creates a file ~db2inst1/db2nodes.cfg containing the hostname of the node where the instance was created.

db2inst1@node-a:~> cat sqllib/db2nodes.cfg
0 node-a 0
db2inst1@node-a:~>

Of course this entry will be wrong on the other node node-b, so we have to replace it with something that is valid on either node, e.g. ha-inst1.

db2inst1@node-a:~> cat sqllib/db2nodes.cfg
0 ha-inst1 0
db2inst1@node-a:~>
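If you script the setup, the substitution can be done non-interactively with sed. The demo below works on a scratch copy under /tmp for safety; the real file is ~db2inst1/sqllib/db2nodes.cfg:

```shell
# Demo of the db2nodes.cfg edit on a scratch copy; the real file is
# ~db2inst1/sqllib/db2nodes.cfg and contains the line created by db2icrt.
printf '0 node-a 0\n' > /tmp/db2nodes.cfg
sed -i 's/^0 node-a 0$/0 ha-inst1 0/' /tmp/db2nodes.cfg
cat /tmp/db2nodes.cfg
```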

db2start and db2stop now consider this partition to be on a different node and try to access it with rsh. You either have to configure rsh or ssh, or you can create and enable the script below:

db2inst1@node-a:~> cat db2_local_rsh 
#!/bin/sh
#
# Emulate inter partition call by simply doing it locally
# Install with
#
#   db2set DB2RSHCMD=$INSTHOME/db2_local_rsh
#

# Called rsh-style, as
# db2_local_rsh mynode.mydomain.my -n -l my_instance ARGS

# remove the first 4 args (host, -n, -l, instance name)
shift
shift
shift
shift
eval "$@"
db2inst1@node-a:~>
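To check that the wrapper really drops the four rsh-style arguments and runs the remainder locally, call it with a harmless command; the echo here is just a stand-in for whatever db2start would send:

```shell
# Demo: the wrapper drops the first four rsh-style arguments and
# executes the rest locally. Recreate it under /tmp for the demo.
cat > /tmp/db2_local_rsh <<'EOF'
#!/bin/sh
shift; shift; shift; shift
eval "$@"
EOF
chmod +x /tmp/db2_local_rsh
/tmp/db2_local_rsh node-a -n -l db2inst1 echo "runs locally"
```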

Now try out db2start / db2stop. It should work.

Configure Pacemaker

The configuration of shared storage and IP addresses is described elsewhere. For DB2 it's essential that the file system and the service IP address are co-located with the DB2 resource and ordered before the DB2 resource e.g. put them in a group.

node-a:~ # crm configure
primitive fs_db2inst1 ocf:heartbeat:Filesystem \
        op monitor interval="20" timeout="40" \
        params device="..." directory="/db2/db2inst1" fstype="..."

primitive ip_ha-inst1 ocf:heartbeat:IPaddr2 \
        op monitor interval="10s" timeout="20s" \
        params ip="192.168.178.17"

primitive db_db2inst1 ocf:heartbeat:db2 \
        op monitor interval="30" timeout="60" start-delay="10" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        params instance="db2inst1"

group gr-db2inst1 fs_db2inst1 ip_ha-inst1 db_db2inst1 \
        meta target-role="stopped"

commit
node-a:~ #
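Note that the group is created with target-role="stopped", so nothing starts while you verify the configuration. Once you are satisfied, start it with the standard crm shell command:

node-a:~ # crm resource start gr-db2inst1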

Be very deliberate when specifying timeout values, as these have to account for crash recovery etc.

Multipartition Support

Configure each partition as a separate resource using the dbpartitionnum parameter. Partition 0 must be started first, e.g.

node-a:~ # crm configure

# partition 0
primitive db_db2inst1_0 ocf:heartbeat:db2 \
        op monitor interval="30" timeout="60" start-delay="10" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        params instance="db2inst1" dbpartitionnum="0"

#partition 1
primitive db_db2inst1_1 ocf:heartbeat:db2 \
        op monitor interval="30" timeout="60" start-delay="10" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        params instance="db2inst1" dbpartitionnum="1"

# required stuff for partition 0
group gr-db2inst1_0 fs_db2inst1_0 ip_ha-inst1 db_db2inst1_0

# partition 1 after 0 is up
order part1_after_part0 INFINITY: db_db2inst1_0 db_db2inst1_1

# ... and other colocation / order constraints depending on your setup for partition 1

commit
node-a:~ #

So what part of DB2 is now highly available?

The instance, including all its databases. As pointed out above, the stop method brings down all databases. As a best practice, configure one database per instance.

DB2 Cluster with HADR (new with release 1.0.5)

Scenario

A DB2 database is configured for HADR on two servers with local storage. Each server has one instance configured on it. If an instance or a complete server fails, Pacemaker will perform a db2 takeover hadr on the surviving instance. Clients will connect to a service address that will always be active together with the Primary of the HADR pair.

Functions of the agent

The agent must be configured as a master/slave resource. When Pacemaker calls the agent's start method, the DB2 instances are started and the databases are activated: one in Primary and one in Standby mode. Pacemaker then decides which member of the pair to promote; the database is then brought into the Primary role on that node. The monitor method checks whether the instances are running and then probes the Primary by selecting from internal tables. Should the Primary fail, the Standby is promoted (i.e. a db2 takeover hadr is performed).
The stop method tries everything to bring the instance down: first db2stop force, and if that doesn't work or hangs, db2_kill.

Prerequisites / Assumptions

The environment for this example is:

nodes
    node-a and node-b
DB2 instance user
    db2inst1
    This instance is configured on both servers on local storage.
DB2 database
    A database with name db1 is configured for HADR on node-a and node-b.
IP-address for the database
    192.168.178.17/24 with DNS entry ha-db1

Install the software and configure HADR for database db1 as specified in the DB2 documentation (e.g. DB2 Information Center[2]).


Configure Pacemaker

Configure a resource for the IP address and a master/slave resource for the database. The IP address should always be colocated with the master (a.k.a. Primary) and started after promotion.

node-a:~ # crm configure

# the IP resource
primitive ip_ha-db1 ocf:heartbeat:IPaddr2 \
        op monitor interval="10s" timeout="20s" \
        params ip="192.168.178.17"

# the DB resource, note the additional monitor op with role "Master"
primitive db_db2inst1 ocf:heartbeat:db2 \
        params instance="db2inst1" dblist="db1" \
        op start interval="0" timeout="130" \
        op stop interval="0" timeout="120" \
        op promote interval="0" timeout="120" \
        op demote interval="0" timeout="120" \
        op monitor interval="30" timeout="60" \
        op monitor interval="45" role="Master" timeout="60"

# the m/s resource, notifications are required 
ms ms_db2_db1 db_db2inst1 \
        meta target-role="stopped" notify="true"

colocation ip_db_with_master inf: ip_ha-db1:Started ms_db2_db1:Master
order ip_db_after_master inf: ms_db2_db1:promote ip_ha-db1:start

commit
node-a:~ #
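As with the shared-storage setup, the master/slave resource is created with target-role="stopped". Once HADR is confirmed to be in peer state on both nodes, start it with the standard crm shell command:

node-a:~ # crm resource start ms_db2_db1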


Interaction of DB2's split brain prevention and Pacemaker

DB2 HADR has built-in split-brain prevention that can be summarized as follows:

  • A Primary cannot be cold-started when the Standby is down.
  • A takeover can be constrained to the HADR_PEER_WINDOW (db2 takeover hadr on db mydb by force peer window only).


That means:

  • HADR_PEER_WINDOW must be configured (available with DB2 version >= V9).
  • On a cold start, a database in the Master role will not come up as long as the other Slave resource is down. The resource will be stuck in a start/stop loop for as long as Pacemaker allows.
  • Monitoring timeouts must be set so that failure detection and takeover can be completed within HADR_PEER_WINDOW. The new Master then continues to work even if the other instance is down.
  • After a crash and a possible takeover, the crashed database also comes up as Primary but will not activate because there is no Standby. Both databases exchange their "First active Log" position. Once the outdated 'old' Primary can be safely identified, it is restarted as Standby.
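As a rough sanity check for the peer-window requirement, the worst-case detection-plus-takeover time implied by the sample configuration can be added up. The numbers come from the operations defined earlier; how much margin HADR_PEER_WINDOW needs beyond this sum is an assumption you must tune for your environment:

```shell
# Worst case before a takeover completes with the sample timeouts:
# up to one Master monitor interval may pass before the failure is
# probed, the monitor may run into its timeout, then the promote
# itself has to finish.
MONITOR_INTERVAL=45     # op monitor interval for role="Master" (s)
MONITOR_TIMEOUT=60      # op monitor timeout (s)
PROMOTE_TIMEOUT=120     # op promote timeout (s)
echo $((MONITOR_INTERVAL + MONITOR_TIMEOUT + PROMOTE_TIMEOUT))
```

HADR_PEER_WINDOW should comfortably exceed this sum (225 seconds here).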