
This page contains old information that used to be on the HaNFS page; it was moved here once it was determined that locks do not survive failover with current kernels. It is saved here in case that problem is one day solved. -- Dave Dykstra

HA-NFS testing

To verify the behavior of NFS locking, we did extensive testing of NFS in an HA environment with Heartbeat. This section describes that testing and its results. We tested NFS I/O with Bonnie++, and NFS locking both with the Connectathon test suite and with a multiple-client NFS locking test of our own design.

Test environment

  • Two machines as NFS servers and two machines as clients.
  • A QLogic Fibre Channel shared disk between the two servers.
  • All servers and clients run Red Hat 9 with kernel 2.4.26. The NFS package version is nfs-utils-1.0.6-1.
  • We used the following ha.cf file:

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
keepalive 2             # seconds between heartbeat packets
deadtime 10             # seconds of silence before a node is declared dead
warntime 10             # seconds before a late-heartbeat warning is logged
initdead 20             # deadtime used while the cluster first starts up
udpport 694             # UDP port for the broadcast heartbeat
bcast   eth0            # Linux
auto_failback off       # do not move resources back when a failed node returns
node    posic066
node    posic067
apiauth ping gid=haclient uid=gshi,hacluster
apiauth ccm  gid=haclient uid=hacluster
apiauth evms gid=haclient uid=root
apiauth ipfail gid=haclient uid=gshi,hacluster

Bonnie++

If you don't use any locking, NFS works quite well with Linux-HA. Bonnie++ (version 1.03a, available from http://www.coker.com.au/bonnie++/) ran to successful completion in around 6 hours while the two NFS servers failed over back and forth every 2 minutes.

Connectathon lock test

Steps to run a test:

  • start Heartbeat on both servers: posic066 and posic067

  • the client mounts the NFS directory from the floating IP: xxx.xxx.61.111
  • reboot posic066/posic067 alternately every 5 minutes. Since auto_failback is set to off, this makes the NFS server switch every five minutes.

  • the client starts to run the lock tests (a sketch of the kind of locking these tests exercise follows this list)
    1. test 1

      haresources:

        posic067 xxx.xxx.61.111 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs
        
      result: it failed after some iterations with errno=37 (ENOLCK, "No record locks available")
    2. test 2

      haresources:

        posic067  Filesystem::/dev/sdb1::/data::ext3 nfslock nfs  xxx.xxx.61.111
        
      result: failed with errno=37
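
As an illustration of what these lock tests exercise, here is a minimal sketch of a client acquiring and releasing fcntl() locks in a loop on a file on the NFS mount. This is not the Connectathon code itself, and the path /mnt/nfs/lockfile is an assumption; a client looping like this across failovers is where the errno 37 failures showed up:

    /* Minimal sketch of the kind of fcntl() locking the lock tests
     * exercise; not the actual Connectathon code.  The path is an
     * assumption -- point it at a file on the NFS mount. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl;
        int i, fd;

        fd = open("/mnt/nfs/lockfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (i = 0; ; i++) {
            memset(&fl, 0, sizeof(fl));
            fl.l_type = F_WRLCK;       /* exclusive write lock          */
            fl.l_whence = SEEK_SET;
            fl.l_start = 0;
            fl.l_len = 0;              /* length 0 locks the whole file */
            if (fcntl(fd, F_SETLKW, &fl) < 0) {
                /* errno 37 (ENOLCK) is what the failed runs reported */
                fprintf(stderr, "iteration %d: lock failed, errno=%d (%s)\n",
                        i, errno, strerror(errno));
                return 1;
            }
            fl.l_type = F_UNLCK;       /* release the lock */
            if (fcntl(fd, F_SETLK, &fl) < 0) {
                fprintf(stderr, "iteration %d: unlock failed, errno=%d (%s)\n",
                        i, errno, strerror(errno));
                return 1;
            }
            sleep(1);                  /* keep cycling across failovers */
        }
    }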

Multiple-client lock test

The source code for the multiple-client lock test can be found in the contrib/mlock/ directory of the Linux-HA Mercurial repository.
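
As a rough illustration of the idea (our sketch, not the contents of contrib/mlock/), each client in a multiple-client test holds a blocking write lock while it appends a record to a shared file; torn or interleaved records, or lock errors, after a failover indicate that the server silently dropped a lock. The path and record format below are assumptions:

    /* Rough sketch of a multiple-client lock test; the real code lives
     * in contrib/mlock/.  Run one copy on each client with a distinct
     * id argument.  Path and record format are assumptions. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *id = argc > 1 ? argv[1] : "client";
        struct flock fl;
        char rec[64];
        int fd;

        fd = open("/mnt/nfs/shared.log", O_RDWR | O_CREAT | O_APPEND, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (;;) {
            memset(&fl, 0, sizeof(fl));
            fl.l_type = F_WRLCK;                /* lock the whole file...    */
            fl.l_whence = SEEK_SET;
            if (fcntl(fd, F_SETLKW, &fl) < 0) { /* ...blocking until granted */
                fprintf(stderr, "lock failed, errno=%d (%s)\n",
                        errno, strerror(errno));
                return 1;
            }
            snprintf(rec, sizeof(rec), "%s pid=%d\n", id, (int)getpid());
            write(fd, rec, strlen(rec));        /* critical section */
            fl.l_type = F_UNLCK;
            fcntl(fd, F_SETLK, &fl);            /* release the lock */
            sleep(1);
        }
    }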

Steps to run a test:

  • start Heartbeat on both servers: posic066 and posic067

  • the two clients mount the NFS directory from the floating IP: xxx.xxx.61.111
  • reboot posic066/posic067 alternately every 5 minutes. Since auto_failback is set to off, this makes the NFS server switch every five minutes.

  • the two clients start to run the multiple-client lock tests (the test runs for ~10 hours if there is no failure)
    1. test 1

      haresources:

        posic067 xxx.xxx.61.111 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs
        
      result: failed with errno=37
    2. test 2

      haresources:

        posic067 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs xxx.xxx.61.111
        
      result: succeeded once, but failed with errno=11 (EAGAIN) when we tested on 5/18/2004
    3. test 3: use portblock to block the sunrpc port (111) while the IP is up but nfs/nfslock is not yet running

      haresources:

        posic067  portblock::tcp::111::block portblock::udp::111::block xxx.xxx.61.111 \
                Filesystem::/dev/sdb1::/data::ext3 nfslock nfs                         \
                portblock::tcp::111::unblock portblock::udp::111::unblock
        

      result: failed with errno=37

We also tried a wrapper function that overrides fcntl. The wrapper calls fcntl a second time if the first call fails. Using this wrapper does let tests return successfully sometimes, but they can still fail.
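
A minimal sketch of that approach (the name fcntl_retry and the one-second back-off are our illustration, not the original wrapper):

    /* Sketch of a wrapper that calls fcntl() a second time when the
     * first call fails, as described above.  The name and back-off are
     * illustrative; in our tests the failure was usually errno 37. */
    #include <fcntl.h>
    #include <unistd.h>

    int fcntl_retry(int fd, int cmd, struct flock *fl)
    {
        int rc = fcntl(fd, cmd, fl);
        if (rc < 0) {
            sleep(1);                /* give the new server's lockd/statd
                                        a moment to take over */
            rc = fcntl(fd, cmd, fl); /* second and final attempt */
        }
        return rc;
    }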

Bug in NFS?

When a client is running a lock test and a server failover happens, there is a chance that unmounting the file system on the server will fail. The lock test we ran was Connectathon. This can easily be reproduced by the following steps with only two machines (one server and one client):

  1. the server mounts a disk device to a directory
  2. the server starts nfs and nfslock
  3. the client mounts the exported directory from the server
  4. the client runs the Connectathon lock test
  5. the server shuts down nfs and nfslock
  6. the server tries to unmount the device
    • ====> this fails with the error: the device is busy (EBUSY)

We always used the same kernel version on both the server and the client. We tried kernels 2.4.20, 2.4.26, and 2.6.5-1.339, all with Red Hat 9. All of these kernels fail in the same way.

However, this is not a disaster from an HA point of view: Linux-HA (version 1.2.1 or newer) automatically reboots the machine when this occurs so that services continue. Although annoying, service continues virtually uninterrupted, and the integrity of the locks and data is unaffected.

JeffLayton found a fix for this problem on the linux-NFS mailing list which, as of May 2004, the distributions still need to incorporate into their NFS shutdown scripts. According to Jeff, if one sends a SIGKILL signal to the lockd kernel thread, it will release all of its locks and the filesystem can then be unmounted. This was discussed earlier on lkml.
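
In a shutdown script this amounts to signalling the lockd thread before the unmount. As a programmatic illustration (our sketch, not the mailing-list patch; an init script would more likely just signal the pid reported by pidof lockd), one can scan /proc for the thread name and send SIGKILL:

    /* Sketch: find the lockd kernel thread in /proc and send it SIGKILL
     * so it releases its locks before the filesystem is unmounted.
     * An illustration of the fix described above, not the actual patch. */
    #include <ctype.h>
    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    int main(void)
    {
        struct dirent *de;
        DIR *proc = opendir("/proc");

        if (!proc) {
            perror("opendir");
            return 1;
        }
        while ((de = readdir(proc)) != NULL) {
            char path[64], name[64];
            FILE *f;

            if (!isdigit((unsigned char)de->d_name[0]))
                continue;                   /* not a process directory */
            snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
            f = fopen(path, "r");
            if (!f)
                continue;
            /* the first line of /proc/<pid>/status is "Name:\t<comm>" */
            if (fscanf(f, "Name: %63s", name) == 1 &&
                strcmp(name, "lockd") == 0) {
                printf("sending SIGKILL to lockd (pid %s)\n", de->d_name);
                kill((pid_t)atoi(de->d_name), SIGKILL);
            }
            fclose(f);
        }
        closedir(proc);
        return 0;
    }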

Conclusion

If client applications do not use file locking, HA NFS works very well. However, if a client application uses locking, it may get errors that it would not get with a single, non-HA NFS server. IMHO there are some bugs in NFS that cause the problems described above. -- GuochunShi

However, most of these are now fixed if you are running the right kernel NFS code. Even so, the occasional lock failure under intensive locking can still occur, and there is not yet any known solution for that. -- AlanRobertson