Why the NFS client is at fault in the multi-filesystem NFS problem

November 5, 2009

In yesterday's entry, I said that the NFS clients were at fault in dealing with the duplicate inode number problem. Now it's time for the details, because on first look this appears a bit odd; how can it be the client's responsibility to avoid duplicate inode numbers, when the server gives it the inode numbers?

In the NFS v3 specification, inode numbers appear in essentially one spot that matters here (READDIR directory entries also carry them): they're part of the file attribute structure, fattr3, that the server returns for GETATTR requests. GETATTR is used for more than just stat(), but it is the NFS analog of the stat() system call, and the fattr3 structure that it returns is the analog of the kernel's struct stat that stat() fills in; much the same information appears in both.

In particular, the fattr3 structure has both a fileid (the inode number) and a fsid, the 'file system identifier for [the file's] file system'. While NFS v3 requires the inode number to be unique, it only requires that it be unique within a single server filesystem, that is, among files with the same fsid. And an NFS server is free to give you files with different fsids even though you have only made one NFS mount from it, of what you think is a single filesystem.

The simple way for clients to map between GETATTR and stat() is to turn the fileid into the inode number, fill in st_dev based on some magic internal number you're using for this NFS mount, and throw away the fsid. A kernel that does this has the duplicate inode number problem.
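To make the failure concrete, here is a minimal sketch of that simple mapping. It is not any real kernel's code: the Fattr3 and Stat types keep only the fields discussed here, and MOUNT_DEV is a made-up local device number.

```python
from typing import NamedTuple


class Fattr3(NamedTuple):
    """Only the two fattr3 fields relevant here; the real structure has more."""
    fsid: int    # the server's filesystem identifier
    fileid: int  # the inode number, unique only among files with the same fsid


class Stat(NamedTuple):
    """Only the two struct stat fields relevant here."""
    st_dev: int
    st_ino: int


def naive_stat(attr: Fattr3, mount_dev: int) -> Stat:
    # The fsid is thrown away: every file reached through this mount
    # gets the same st_dev, whatever server filesystem it lives on.
    return Stat(st_dev=mount_dev, st_ino=attr.fileid)


MOUNT_DEV = 0x2A  # hypothetical local device number for this NFS mount

# Two different files on two different server filesystems that happen
# to share an inode number, both visible through the one mount.
a = Fattr3(fsid=1, fileid=128)
b = Fattr3(fsid=2, fileid=128)

assert a != b                                                    # distinct to the server
assert naive_stat(a, MOUNT_DEV) == naive_stat(b, MOUNT_DEV)      # identical to stat()
```

Anything on the client that identifies files by the (st_dev, st_ino) pair will now treat these two files as the same file.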

Unfortunately, fixing this is complicated. The NFS client cannot simply use the fsid for st_dev, because st_dev must be unique on the local machine and the fsid comes from the server; thus, it can potentially collide both with local filesystems and with filesystems from other NFS servers. Using fsid at all in the stat() results requires somehow inventing a relatively persistent and unique st_dev value for every different fsid that every NFS server gives you, which is non-trivial.

(If you have a very big st_dev you can deal with the problem by mangling the fsid together with a unique local number for this NFS mount. But fsid is a 64-bit number, so you'd need a pretty epic st_dev.)
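The arithmetic behind that can be sketched as follows. The packing scheme, the mangle() and truncate() helpers, and the mount number are all invented for illustration; the only fact taken from NFS v3 is that fsid is 64 bits wide.

```python
FSID_BITS = 64  # NFS v3 defines fsid as an unsigned 64-bit value


def mangle(mount_id: int, fsid: int) -> int:
    """Pack a local per-mount number above the server's fsid (hypothetical scheme)."""
    if not 0 <= fsid < (1 << FSID_BITS):
        raise ValueError("fsid must fit in 64 bits")
    return (mount_id << FSID_BITS) | fsid


def truncate(value: int, dev_bits: int) -> int:
    """Keep only the low dev_bits bits, as a narrower st_dev would."""
    return value & ((1 << dev_bits) - 1)


# Two filesystems behind the same mount whose fsids differ only in high bits.
one = mangle(mount_id=3, fsid=0x0000_0001_0000_0007)
two = mangle(mount_id=3, fsid=0x0000_0002_0000_0007)

assert one != two                              # distinct, given enough room
assert one.bit_length() > 64                   # but the result needs more than 64 bits
assert truncate(one, 32) == truncate(two, 32)  # squeezed into 32 bits, they collide
```

Any scheme that squeezes the combined value into a smaller st_dev has to discard bits somewhere, so it can only make collisions less likely, not impossible.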

Sidebar: the Linux solution to this problem

The Linux NFS client has a creative solution to this problem: it actually creates new NFS-mounted filesystems on the fly, complete with new local st_dev values, every time you traverse through a point where the fsid changes. Comments in the source code say that this has the side effect of making df work correctly, at least as long as you are not dealing with something like ZFS.
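As a rough model of the idea (a toy, not the actual Linux code), the sketch below walks a path whose components each report an fsid and hands out a fresh local st_dev the first time a new fsid is seen. The walk() function, the device number pool, and the example path are all assumptions made for illustration.

```python
from itertools import count
from typing import Dict, Iterable, List, Tuple


def walk(components: Iterable[Tuple[str, int]],
         first_dev: int = 100) -> List[Tuple[str, int]]:
    """Return (name, st_dev) for each path component, given (name, fsid) pairs."""
    devs = count(first_dev)       # hypothetical pool of unused local device numbers
    known: Dict[int, int] = {}    # fsid -> local st_dev, for this one server
    result = []
    for name, fsid in components:
        if fsid not in known:
            # Crossing into an fsid we have not seen: treat it as a new
            # filesystem and give it its own local st_dev.
            known[fsid] = next(devs)
        result.append((name, known[fsid]))
    return result


# One NFS mount of /export, where the server has a separate filesystem
# mounted at /export/home/alice.
path = [("export", 1), ("home", 1), ("alice", 2), ("notes.txt", 2)]

assert walk(path) == [
    ("export", 100),
    ("home", 100),
    ("alice", 101),      # fsid changed here, so st_dev changes too
    ("notes.txt", 101),
]
```

Because st_dev now changes wherever the fsid does, two files with the same fileid on different server filesystems no longer share a (st_dev, st_ino) pair.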

Written on 05 November 2009.

Last modified: Thu Nov 5 00:02:51 2009