
This page is old information that used to be on the HaNFS[1] page but was moved here once it was determined that locks do not survive failover with current kernels. It is saved here in case that problem is one day solved. -- Dave Dykstra
To verify the behavior of NFS locking, we tested NFS extensively in an HA environment with Heartbeat[2]. This section describes the tests and their results. We tested NFS I/O with Bonnie++, and tested NFS with the Connectathon suite and with a multiple-client NFS locking test of our own design.
We used the following ha.cf[3] file:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 10
warntime 10
initdead 20
udpport 694
bcast eth0 # Linux
auto_failback off
node posic066
node posic067
apiauth ping gid=haclient uid=gshi,hacluster
apiauth ccm gid=haclient uid=hacluster
apiauth evms gid=haclient uid=root
apiauth ipfail gid=haclient uid=gshi,hacluster
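The failover tests on this page depend on auto_failback being off and on both nodes being listed. As a minimal sketch, assuming ha.cf uses one whitespace-separated directive per line with `#` starting a comment, the settings could be checked before a run like this. The function name `parse_ha_cf` is made up for illustration and is not part of Heartbeat.

```python
def parse_ha_cf(text):
    """Parse ha.cf-style text into {directive: [argument string, ...]}.

    Assumes one directive per line and '#' introducing a comment.
    Directives that repeat (such as 'node') collect one entry per line.
    """
    directives = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        parts = line.split(None, 1)  # directive name, then the rest of the line
        name = parts[0]
        args = parts[1].strip() if len(parts) > 1 else ""
        directives.setdefault(name, []).append(args)
    return directives
```

For example, `parse_ha_cf(open("/etc/ha.d/ha.cf").read())["auto_failback"]` would be expected to contain `"off"` for the setup described here; the path is an assumption about where the file is installed.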
If you don't use any locking, NFS works quite well with Linux-HA. Bonnie++ (version 1.03a, available from http://www.coker.com.au/bonnie++/[4]) ran to completion in around 6 hours while the two NFS servers failed over back and forth every 2 minutes.
Steps to run a test:
start Heartbeat on both servers: posic066 and posic067
reboot posic066/posic067 every 5 minutes. Since auto_failback[5] is set to off, this makes the NFS service switch servers every five minutes.
haresources[6]:
posic067 xxx.xxx.61.111 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs
result: failed after some iterations with errno=37 "no lock record available"
haresources[6]:
posic067 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs xxx.xxx.61.111
result: failed with errno=37
The source code for the multiple-client lock test can be found in the contrib/mlock/ directory in the Linux-HA Mercurial[7] repository.
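To illustrate the kind of failure reported on this page, here is a minimal single-client sketch that repeatedly takes and releases a lock and records the errno of every failed attempt. It is not the contrib/mlock code, and the function name `lock_cycle` is made up for illustration. On Linux, errno 37 is ENOLCK and errno 11 is EAGAIN.

```python
import errno
import fcntl
import os


def lock_cycle(path, iterations=100):
    """Repeatedly take and release an exclusive lock on `path`.

    Returns a list of (iteration, errno_name) for every failed attempt,
    so an empty list means every lock and unlock succeeded.
    """
    failures = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for i in range(iterations):
            try:
                # Non-blocking exclusive lock on the whole file, then release.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                fcntl.lockf(fd, fcntl.LOCK_UN)
            except OSError as exc:
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                failures.append((i, name))
    finally:
        os.close(fd)
    return failures
```

Run against a file on the NFS mount while the servers fail over, for example `lock_cycle("/mnt/nfs/locktest", 10000)`; the mount point is a hypothetical path.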
Steps to run a test:
start Heartbeat on both servers: posic066 and posic067
reboot posic066/posic067 every 5 minutes. Since auto_failback[5] is set to off, this makes the NFS service switch servers every five minutes.
haresources[6]:
posic067 xxx.xxx.61.111 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs
result: failed with errno=37
haresources[6]:
posic067 Filesystem::/dev/sdb1::/data::ext3 nfslock nfs xxx.xxx.61.111
result: succeeded once, but failed with errno=11 when we tested on 5/18/2004
haresources[6]:
posic067 portblock::tcp::111::block portblock::udp::111::block xxx.xxx.61.111 \
Filesystem::/dev/sdb1::/data::ext3 nfslock nfs \
portblock::tcp::111::unblock portblock::udp::111::unblock
result: failed with errno=37
We also tried a wrapper function that overrides fcntl. The wrapper calls fcntl a second time if the first call fails. With this wrapper the lock call sometimes succeeds where it would otherwise have failed, but it can still fail.
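The retry idea can be sketched as follows. This is an illustration in Python, not the wrapper used in the tests. The name `lock_with_retry` and the injectable `lock` parameter are assumptions, added so the retry logic can be exercised without an NFS server.

```python
import fcntl


def lock_with_retry(fd, lock=fcntl.lockf, attempts=2):
    """Try to take an exclusive, non-blocking lock, retrying on failure.

    Returns the number of attempts used. If every attempt fails,
    the last OSError is re-raised, so a retry can still end in failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            lock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return attempt
        except OSError as exc:
            last_error = exc  # remember the failure and try again
    raise last_error
```

As with the wrapper described above, a second attempt only narrows the window in which a failure shows up. It does not remove the underlying problem.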
If a server failover happens while a client is running a lock test, there is a chance that unmounting the file system on the server will fail. The lock test we ran was Connectathon. This can be easily reproduced by the following steps with only two machines (one for server and one for client):
====> returns error: the device is busy
We always used the same kernel version on both the server and the client. We tried kernels 2.4.20, 2.4.26, and 2.6.5-1.339 with Red Hat 9. All of these kernels fail the same way.
However, this is not a disaster from an HA point of view, since Linux-HA (version 1.2.1 or newer) will automatically reboot the machine if this occurs, in order to continue services automatically. Although it is annoying, service continues virtually uninterrupted, and the integrity of the locks and data is unaffected.
JeffLayton[8] found a fix to this problem from the linux-NFS mailing list, which as of May, 2004 the distros need to incorporate into their NFS shutdown scripts. According to Jeff[9], if one sends a SIGKILL signal to the lockd kernel thread, then it will release all its locks and the filesystem can be unmounted. This was discussed earlier on lkml[10].
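As a sketch of what such a shutdown step might look like, the following finds processes named lockd by reading /proc/<pid>/stat and sends them SIGKILL before the unmount. The function names and the `proc_root` parameter are made up for illustration. The source does not establish whether every kernel responds to the signal this way.

```python
import os
import signal


def find_pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/stat command name equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'net'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name sits between the first "(" and the last ")".
        start, end = stat.find("("), stat.rfind(")")
        if start != -1 and end > start and stat[start + 1:end] == name:
            pids.append(int(entry))
    return sorted(pids)


def kill_lockd(proc_root="/proc"):
    """Send SIGKILL to every process named lockd (requires root)."""
    pids = find_pids_by_name("lockd", proc_root)
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
    return pids
```

A shutdown script would call `kill_lockd()` after stopping the NFS services and before unmounting the exported file system.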
If client applications do not use file locking, HA NFS works very well. However, if a client application uses locking, it may get errors that it would not get with a single NFS server. IMHO there are some bugs in NFS that cause the problems described above. -- GuochunShi[11]
However, most of these are now fixed if you're running the right NFS kernel. The occasional lock failure under intensive locking can still occur, and there is not yet any known solution. -- AlanRobertson[12]
| [1] | http://www.linux-ha.org/HaNFS |
| [2] | http://www.linux-ha.org/HeartbeatProgram |
| [3] | http://www.linux-ha.org/ha.cf |
| [4] | http://www.coker.com.au/bonnie++/ |
| [5] | http://www.linux-ha.org/ha.cf/AutoFailbackDirective |
| [6] | http://www.linux-ha.org/haresources |
| [7] | http://www.linux-ha.org/Mercurial |
| [8] | http://www.linux-ha.org/JeffLayton |
| [9] | http://lists.community.tummy.com/pipermail/linux-ha/2004-May/011128.html |
| [10] | http://seclists.org/lists/linux-kernel/2002/Sep/1841.html |
| [11] | http://www.linux-ha.org/GuochunShi |
| [12] | http://www.linux-ha.org/AlanRobertson |
This information provided courtesy of the Linux-HA project at http://linux-ha.org/