Apple have addressed this issue in Mac OS X 10.3.6. They even sent me an email about it...
...
Thank you for filing Bug ID # 3825557. Our engineering
team has tested against your report and they are no
longer able to reproduce the issue with the latest
build of Mac OS X, version 10.3.6.
...
After running the fix for a couple of weeks, it does seem to have addressed the issue - thanks Apple.
Original problem:
After months of irritating lock-ups I've finally isolated a bug in Apple's NFS server.
I've upgraded our Mac OS X file server from Jaguar (10.2) to the latest release of Panther (10.3). Three Linux machines currently NFS mount various exports off the server.
It is worth mentioning that the Mac OS X server is hooked up to an Xserve RAID cabinet, with plenty of free space on the partition the NFS exports are served from:
$ df -k .
Filesystem 1K-blocks Used Available Use% Mounted on
lemon:/export2/home/martin
1470545880 59722020 1410823860 5% /home/martin
linux$ uname -a
Linux blueberry.salad.taglab.com 2.4.20-20.9 #1 Mon Aug 18 11:45:58 EDT 2003 i686 i686 i386 GNU/Linux
macosx$ uname -a
Darwin lemon.salad.taglab.com 7.4.0 Darwin Kernel Version 7.4.0: Wed May 12 16:58:24 PDT 2004; root:xnu/xnu-517.7.7.obj~7/RELEASE_PPC Power Macintosh powerpc
Intermittently we experience a problem on the Linux machines where commands that look at the NFS mounted file system "hang" and don't seem to do anything.
Any command that reads the contents of a directory might stall, e.g. 'ls'.
The command is not actually idle, though, as strace reveals:
$ strace ls
...
brk(0) = 0x805a000
brk(0x805d000) = 0x805d000
open("/dev/null", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
and so on: the getdents64 call keeps being repeated against the same directory indefinitely.
Using tcpdump while doing the above, I can see some initial NFS traffic, which then stops when the loop begins. This made me think that the NFS cache on the Linux side is somehow being polluted.
Unmounting the NFS volume and remounting it always fixed the problem until it occurred again.
I tried various NFS mount options such as udp/tcp, nfsvers=2 and nolock, all to no avail.
Trond Myklebust quickly pointed out that the problem was most likely down to Mac OS X sending out non-unique cookies in the initial READDIR reply. These pollute the client-side cache, and any loops are a consequence of the local cache being wrong. My biggest problem was then to capture one of those READDIR replies, since the problem occurs so infrequently.
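To illustrate why a repeated cookie can produce an endless loop without any further network traffic, here is a minimal Python sketch. It is a toy model and not the Linux client code: the names cached_getdents and list_directory, the sample entries and the batch size are all made up for illustration, and it assumes that a lookup resumes after the first cached entry whose cookie matches.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # (file name, cookie marking the position after it)


def cached_getdents(cache: List[Entry], cookie: int, count: int) -> List[Entry]:
    """Return up to `count` cached entries that follow `cookie` (0 means start)."""
    start = 0
    if cookie != 0:
        for index, (_, entry_cookie) in enumerate(cache):
            if entry_cookie == cookie:
                # Assumption: the first match wins, even if the cookie is duplicated.
                start = index + 1
                break
        else:
            return []  # cookie not present in the cache at all
    return cache[start:start + count]


def list_directory(cache: List[Entry], count: int = 2,
                   max_calls: int = 20) -> Tuple[List[str], bool]:
    """Walk the cache in batches; the flag is True if we gave up after max_calls."""
    names: List[str] = []
    cookie = 0
    for _ in range(max_calls):
        batch = cached_getdents(cache, cookie, count)
        if not batch:
            return names, False  # reached the end of the directory normally
        names.extend(name for name, _ in batch)
        cookie = batch[-1][1]  # resume after the last entry we were given
    return names, True  # still reading after max_calls: stuck in a loop


# Hypothetical directory listings, not taken from the real capture.
good = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
bad = [("a", 1), ("b", 2), ("c", 3), ("d", 2), ("e", 5)]  # cookie 2 appears twice
```

In this model, listing `bad` never reaches "e": after reading "d" the reader resumes from cookie 2, which resolves to the position after "b", so "c" and "d" are returned over and over. The max_calls limit exists only so the sketch terminates.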
Today, however, I got lucky (or unlucky, I'm not sure): I now have a folder that displays this behaviour every time I try to look at it. No umount/mount helps this time. This made it possible to do a proper tcpdump of the READDIR call. I've now got proof that the cookie is not unique.
PNG showing the problem in 'ethereal': ethereal.png
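For anyone wanting to check their own capture, the same test can be done mechanically once the entry names and cookies have been copied out of the decoded READDIR reply. The following is only a sketch in Python: find_duplicate_cookies is a hypothetical helper, and the sample values are invented and are not the ones from my trace.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_cookies(entries: Iterable[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Map each cookie that occurs more than once to the entry names carrying it."""
    names_by_cookie: Dict[int, List[str]] = defaultdict(list)
    for name, cookie in entries:
        names_by_cookie[cookie].append(name)
    return {cookie: names
            for cookie, names in names_by_cookie.items()
            if len(names) > 1}


# Invented (name, cookie) pairs standing in for one decoded READDIR reply.
sample_reply = [(".", 12), ("..", 24), ("notes.txt", 48), ("todo.txt", 48)]
duplicates = find_duplicate_cookies(sample_reply)
```

An empty result means every cookie in the reply was unique; any entry in the result is a cookie the server handed out for more than one directory entry.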