The fundamental problem faced by user-level NFS servers

January 10, 2013

In a comment on yesterday's entry, Perry Lorier wrote in part:

[...] I know that I don't like userspace NFS servers, because last time I tinkered with one it went horribly horribly wrong. [...]

User-level NFS servers basically always explode; they are intrinsically doomed unless they get very unusual support from the host Unix system. The core problem is our old friend NFS filehandles, which are the only identification the NFS server gets for what filesystem object the client wants to do something to. To create good ones, an NFS server needs some unique identifier for filesystem objects that has three properties: it is a stable identifier that stays with the file even if arbitrary things get renamed, it can be used to efficiently and rapidly access the file itself, and it must be invalidated when the file is deleted. Let's handwave the third property for now.

The problem is that Unix doesn't actually have an identifier with the first two properties that's generally available to user level. Filenames can be used to rapidly access an object but they aren't stable across renames, while inodes are stable across renames but there is no (general) 'open by inode' system call, so they can't be opened rapidly in the general case. In the absence of such an identifier, user-level NFS servers have no choice but to fake it in various ways, and their fakes break down every so often. When that happens you get some sort of explosion.
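(You can see the asymmetry directly. This is a small sketch, nothing more: it shows that a file's inode number survives a rename, while the name you had for it does not, and that Python's os module, like Unix itself, offers no 'open by inode' call.)

```python
# Demonstration: inode numbers are stable across renames, but there is
# no general way to open a file given only its inode number.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "old-name")
    new = os.path.join(d, "new-name")
    with open(old, "w") as f:
        f.write("hello")

    ino_before = os.stat(old).st_ino
    os.rename(old, new)
    ino_after = os.stat(new).st_ino

    # The inode number is a stable identifier across the rename...
    assert ino_before == ino_after
    # ...but the old name is now useless, and there is no
    # os.open_by_inode(); a server must somehow map inode -> name.
    assert not os.path.exists(old)
```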

(You can keep a cache of inode to filename mappings but your cache entry may be missing or invalid, at which point you have to search the entire filesystem to find the file with the right inode.)

By the way: I'm ignoring the size of this identifier because I'm giving the user-level NFS server a persistent, arbitrary-sized database that it uses to keep track of filehandle to identifier mappings. If you don't want to have such a database, the identifier also needs to be relatively small.

(I feel that such a database is feasible under most circumstances. Most filesystems have only a few tens of thousands or hundreds of thousands of filesystem objects; this is not a big database these days. Even a few million objects is feasible, especially since indexing is easy.)
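(To make this concrete, here is one way such a database might look, using sqlite3; the schema and function names are my own assumptions for illustration, not any particular server's actual format.)

```python
# Sketch of a persistent filehandle -> identifier database: each opaque
# NFS filehandle maps to a stable identifier (the inode number) plus
# the last known pathname for it.
import sqlite3

def open_fh_db(path):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS filehandles (
                      fh   BLOB PRIMARY KEY,   -- opaque NFS filehandle
                      ino  INTEGER NOT NULL,   -- stable identifier
                      name TEXT NOT NULL       -- last known path
                  )""")
    return db

def remember(db, fh, ino, name):
    db.execute("INSERT OR REPLACE INTO filehandles VALUES (?, ?, ?)",
               (fh, ino, name))
    db.commit()

def lookup(db, fh):
    # Returns (inode, last known name) or None for an unknown handle.
    return db.execute("SELECT ino, name FROM filehandles WHERE fh = ?",
                      (fh,)).fetchone()
```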

Kernel NFS servers don't have this problem because they have 'open by inode' (or really 'access by inode'). Well, usually they don't have this problem; they run into this exactly when they're trying to export a filesystem that doesn't use an inode-like identifier for its files. In Unix-like filesystems, generally either the filesystem's stable internal identifier is too large or the filesystem makes up its short 'inode numbers' in a way that doesn't let it look up an object by this number.

(In non-Unix-like filesystems, there may be no equivalent of an inode number at all; perhaps the file's name is the only identifier for it.)

PS: authors of new, sophisticated filesystems are not infrequently very grumpy about this inode number requirement and wish that it, along with NFS, would just go away. Sometimes they refuse to support inode numbers this way and then their new filesystem is not NFS exportable and a bunch of people become irritated with them. My view is that their time would be better spent implementing and advocating for a rich enough system call interface that a quality user-level NFS server is actually possible.

(Note that the 'open by stable identifier' system call would probably only be usable by root, which eliminates a whole class of security concerns.)


