Wandering Thoughts archives

2009-01-22

The NFS re-export problem

One of the things that people traditionally ask for is an NFS server that can re-export its own NFS mounts (possibly among other things). Unfortunately this is impossible in general, and now I can explain why.

(In the old days, one reason people wanted this was because user level NFS servers were basically the only way to do user level filesystems at all, and it would be useful if your machine could transparently re-export your interesting user level filesystem to other machines. I'm not sure why people ask for this today.)

The core problem is that, given only an NFS filehandle, the re-exporting server must both identify which real NFS server to pass the request to and recover the full original NFS filehandle to give that server. However, NFS filehandles are opaque blobs of limited size, every byte of which may be significant to the original NFS server; in general, there is no spare room to prepend your own identifier to a filehandle while being sure that you won't lose any information.

(If you are lucky, the NFS server that you are re-exporting uses short NFS v3 filehandles instead of full-sized ones, leaving you enough space to glue the information you need on the front. Don't count on it, though.)

You can make up your own filehandle for every original filehandle that you re-export, but then you have to keep track of the mapping and do so in such a way that it persists over crashes and reboots. This is not impossible, but you're potentially going to be dealing with a lot of filehandles and NFS's statelessness means that you have no idea when clients have stopped using a particular filehandle.

(I suppose that this is less of an issue today, since NFS v3 filehandles are at most 64 bytes long and disks get bigger every year; you can store a lot of mappings in a few gigabytes, and there are ways to revalidate and prune your database every so often.)
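To make the mapping approach concrete, here is a minimal sketch of a persistent surrogate-filehandle table, using Python's dbm module for crash-surviving storage. All the names here are hypothetical; a real re-exporting server would also need locking and some pruning policy, since (as noted above) NFS's statelessness means clients never tell you they're done with a filehandle.

```python
# Sketch: a persistent surrogate-filehandle table for an NFS re-exporter.
# We hand clients our own opaque handles and remember which real server
# and original filehandle each one stands for.
import dbm
import os


class FilehandleMap:
    def __init__(self, path):
        # "c" mode: create if needed; the database persists across
        # crashes and reboots, which is the whole point.
        self.db = dbm.open(path, "c")

    def surrogate_for(self, server, original_fh):
        # Return a stable surrogate handle for (server, original filehandle),
        # creating one if we have not seen this pair before.
        key = server.encode() + b"\0" + original_fh
        try:
            return self.db[b"fwd:" + key]
        except KeyError:
            pass
        surrogate = os.urandom(32)  # our own opaque 32-byte handle
        self.db[b"fwd:" + key] = surrogate
        self.db[b"rev:" + surrogate] = key
        return surrogate

    def resolve(self, surrogate):
        # Given a surrogate handle from a client, recover which real
        # server to talk to and the original filehandle to give it.
        server, _, original_fh = self.db[b"rev:" + surrogate].partition(b"\0")
        return server.decode(), original_fh
```

The forward and reverse entries together let the re-exporter translate in both directions; the unsolved part, as discussed above, is knowing when it is safe to delete an entry.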

NFSReexportProblem written at 00:36:18

2009-01-20

Why high availability NFS requires shared storage

Suppose that you have a situation where you need transparent high availability NFS for something that is read-only and updated only infrequently. Instead of going to the expense and bother of real shared storage, it's tempting to implement this by setting up a number of fileservers with local copies of the filesystem, synchronizing them from a master machine with rsync, and then using a slight variant of the basic HA NFS setup.

Unfortunately, the tempting easy way doesn't work; you can't have transparent high availability unless you have shared storage or something that fakes it very well.

In order to have transparent HA, you need transparent failover. In order to have transparent failover, you need to keep the NFS filehandles the same. In almost all NFS implementations and filesystems, keeping the NFS filehandles the same requires keeping the inode numbers and generation counts of every file and directory exactly the same across all copies of the data.

No user-level tool can do this; there is no Unix interface to set the inode number or the generation count when you create or manipulate a file (okay, this is not quite true; at least some Linux filesystems have a private interface to set the generation count of an inode, although this still doesn't help with the inode numbers). So the inevitable conclusion is that you must replicate your filesystem at some level below the normal user level one that rsync and company use.
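You can see the core of the problem in a few lines of Python: an ordinary user-level copy necessarily allocates a fresh inode, so its inode number (one of the things NFS filehandles are built from) differs from the original's even though the contents are identical.

```python
# Demonstrate that a user-level copy of a file gets a new inode number,
# which is why rsync-style replication can't keep NFS filehandles stable.
import os
import shutil
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "orig")
copy = os.path.join(d, "copy")

with open(orig, "w") as f:
    f.write("same contents\n")
shutil.copy(orig, copy)

# The contents match, but the inode numbers (st_ino) do not.
assert open(orig).read() == open(copy).read()
assert os.stat(orig).st_ino != os.stat(copy).st_ino
```

rsync is in exactly the same position as shutil.copy here: it creates files through the normal Unix interface, which hands out inode numbers as it pleases.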

The most general solution is shared storage, where you don't have to replicate anything at all. If you absolutely can't do shared storage, I can think of two general alternatives: synchronize the raw disk instead of the filesystem (possibly still with rsync), or do cross-network disk mirroring of some sort between the master and the replicas.

(Plausible methods of cross-network disk mirroring include Linux's DRBD and using an iSCSI or AOE target implementation on the replicas to export raw disks to the master.)

HANFSAndSharedStorage written at 18:26:42

The inner life of NFS filehandles

The NFS protocol uses something called a 'file handle' to identify what file a given operation applies to; this is sort of analogous to how a traditional Unix system internally identifies files by their inode number (well, their device plus their inode number).

In theory an NFS filehandle is opaque and an NFS server can use any scheme it wants in order to uniquely identify files. In practice there are a number of constraints on how an NFS server can form filehandles, based on preserving as many Unix semantics as possible; for example, the filehandle can't change just because someone renamed the file, and it has to be stable over server restarts.

A traditional, straightforward NFS server implementation puts three major things into a filehandle:

  • some identifier of the (server) filesystem that the file is on. Originally this was the (Unix) device that the filesystem was mounted on, but these days you can often set this explicitly (Linux uses the fsid= option in exports(5), for example), or the NFS server will somehow generate a stable number for each different filesystem based on various things.

  • some stable identifier of the file itself that will allow the NFS server to quickly look it up, traditionally the inode number.

  • a generation count for the inode number, to detect when an NFS filehandle refers to an older version of an inode.

The generation count is needed because traditional Unix filesystems can reuse the inode and thus the inode number of a deleted file, and when a file is deleted you want all the old NFS filehandles for it to stop working instead of suddenly referring to whatever new file inherits the inode. So it is the combination of the inode number and the generation count that uniquely identifies an abstract 'file' in a particular NFS-exported filesystem.
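As a purely illustrative sketch (no real NFS server uses exactly this layout), a traditional-style filehandle built from those three ingredients could be packed and unpacked like so:

```python
# Hypothetical on-the-wire layout for a traditional-style NFS filehandle:
# filesystem id, inode number, and inode generation count. The field
# widths here are made up for illustration.
import struct

# Network byte order: 32-bit fsid, 64-bit inode number, 32-bit generation.
FH_FORMAT = "!IQI"


def pack_filehandle(fsid, inode, generation):
    return struct.pack(FH_FORMAT, fsid, inode, generation)


def unpack_filehandle(fh):
    fsid, inode, generation = struct.unpack(FH_FORMAT, fh)
    return fsid, inode, generation


fh = pack_filehandle(42, 1048577, 3)
assert unpack_filehandle(fh) == (42, 1048577, 3)
assert len(fh) == 16  # comfortably under NFS v3's 64-byte limit
```

A stale-filehandle check then falls out naturally: look up the inode, compare its current generation count to the one in the filehandle, and return ESTALE on a mismatch.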

(Of course this is only necessary if your filesystem reuses its equivalent of inode numbers. If it does not, you don't need generation counts.)

Historically, bugs about when inode generation counts did or did not get updated have been a fruitful source of peculiar NFS problems. For example, I believe that there was once a Unix system that updated the generation count when you truncated a file.

NFSFilehandleInternals written at 01:02:25

2009-01-09

A Unix shell glob trick

This is the kind of trick where first I show the trick and then I explain it:

$ touch a-b; mkdir a-c
$ cd a-*
sh: cd: a-b: Not a directory
$ cd a-*/
$ pwd
/tmp/a-c

(This is also a good illustration of quality of implementation in error handling. A number of non-bash Bourne shells will report things like 'cd: too many arguments', while bash would happily work if a-b happened to be a directory.)

What this does is pick the directory out of an otherwise ambiguous wildcard expansion. When there's a / on the end, the shell conveniently restricts the wildcard expansion to directories (or, in the cases where I usually wind up using this, the directory).

(The usual case for me is that I have just unpacked foo-1.2.tar.gz, creating foo-1.2, and now I want to cd into the latter without having to type the full name (my usual shell doesn't have filename completion by default), but there are others that come up every so often.)

Reading very carefully between the lines, I think that this behavior is required by the SUS. In general a shell might as well support this, since you can always write the wildcard as 'a-*/.' to force the issue.

A closely related trick can be used to find all of the subdirectories in your current directory (or in general, somewhere): 'echo */.'. In theory 'echo */' should be equivalent, but many shells seem to need the issue forced. I don't understand why those shells need this; that they behave differently for these two cases makes my head hurt.

(Ironically, bash gets this one right, and I believe that getting it right is the SUS-required behavior.)
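Incidentally, the same directories-only behavior shows up in Python's glob module, which makes for a convenient way to check what these patterns "should" match, independent of any particular shell's quirks:

```python
# Python's glob module also treats a trailing '/' as "directories only",
# mirroring the shell trick: 'a-*/' matches just the directory.
import glob
import os
import tempfile

d = tempfile.mkdtemp()
os.chdir(d)
open("a-b", "w").close()  # a plain file
os.mkdir("a-c")           # a directory

assert sorted(glob.glob("a-*")) == ["a-b", "a-c"]
assert glob.glob("a-*/") == ["a-c/"]    # trailing slash: directories only
assert glob.glob("a-*/.") == ["a-c/."]  # the 'force the issue' spelling
```

Here both the plain trailing slash and the '/.' spelling behave identically, which is the behavior I'd argue the shells should have too.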

ShellGlobTrick written at 00:18:03

