Wandering Thoughts archives

2009-01-20

Why high availability NFS requires shared storage

Suppose that you have a situation where you need transparent high availability NFS for something that is read-only and updated only infrequently. Instead of going to the expense and bother of setting up real shared storage, it's tempting to try to implement this by setting up a number of fileservers, each with a local copy of the filesystem, synchronizing those copies from a master machine with rsync, and then using a slight variant of the basic HA NFS setup.

Unfortunately, the tempting easy way doesn't work; you can't have transparent high availability unless you have shared storage or something that fakes it very well.

In order to have transparent HA, you need transparent failover. In order to have transparent failover, you need to keep the NFS filehandles the same. In almost all NFS implementations and filesystems, keeping the NFS filehandles the same requires keeping the inode numbers and generation counts of every file and directory exactly the same across all copies of the data.
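To make the problem concrete, here is a minimal sketch (the paths are made up) that compares the inode number of the 'same' file in a master tree and in an rsync-made copy; the two numbers will almost never match:

    /* Sketch (made-up paths): compare the inode number of the "same"
     * file in a master tree and in an rsync-made replica of it. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat master, replica;

        if (stat("/export/master/data/somefile", &master) != 0 ||
            stat("/export/replica/data/somefile", &replica) != 0) {
            perror("stat");
            return 2;
        }
        printf("master inode:  %llu\n", (unsigned long long) master.st_ino);
        printf("replica inode: %llu\n", (unsigned long long) replica.st_ino);

        /* With an rsync-style copy these almost always differ, which is
         * enough to give the two fileservers different NFS filehandles
         * for the same logical file. */
        return master.st_ino == replica.st_ino ? 0 : 1;
    }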

No user-level tool can do this; there is no Unix interface to set the inode number or the generation count when you create or manipulate a file (okay, this is not quite true; at least some Linux filesystems have a private interface to set the generation count of an inode, although this still doesn't help with the inode numbers). So the inevitable conclusion is that you must replicate your filesystem at some level below the normal user-level one that rsync and company use.
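As far as I know, that Linux private interface is the FS_IOC_GETVERSION / FS_IOC_SETVERSION ioctl pair, supported by at least the ext2/ext3/ext4 family; here is a minimal sketch of using it (note that there is still nothing comparable for the inode number, and other filesystems may not support these ioctls at all):

    /* Sketch: read (and optionally set) an inode's generation count via
     * the FS_IOC_GETVERSION / FS_IOC_SETVERSION ioctls.  As far as I
     * know this works on the ext2/ext3/ext4 family; other filesystems
     * may not support it, and there is no equivalent for inode numbers. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char **argv)
    {
        int fd, gen = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file [new-generation]\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, FS_IOC_GETVERSION, &gen) != 0) {
            perror("FS_IOC_GETVERSION");
            return 1;
        }
        printf("current generation: %d\n", gen);

        if (argc > 2) {
            gen = atoi(argv[2]);
            if (ioctl(fd, FS_IOC_SETVERSION, &gen) != 0) {
                perror("FS_IOC_SETVERSION");
                return 1;
            }
        }
        close(fd);
        return 0;
    }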

The most general solution is shared storage, where you don't have to replicate anything at all. If you absolutely can't do shared storage, I can think of two general alternatives: synchronize the raw disk instead of the filesystem (possibly still with rsync), or do cross-network disk mirroring of some sort between the master and the replicas.

(Plausible methods of cross-network disk mirroring include Linux's DRBD and using an iSCSI or AOE target implementation on the replicas to export raw disks to the master.)

unix/HANFSAndSharedStorage written at 18:26:42

The inner life of NFS filehandles

The NFS protocol uses something called a 'file handle' to identify what file a given operation applies to; this is sort of analogous to how a traditional Unix system internally identifies files by their inode number (well, their device plus their inode number).

In theory an NFS filehandle is opaque and an NFS server can use any scheme it wants in order to uniquely identify files. In practice there are a number of constraints on how an NFS server can form filehandles, based on preserving as many Unix semantics as possible; for example, the filehandle can't change just because someone renamed the file, and it has to be stable over server restarts.
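As a concrete illustration of the rename constraint, renaming a file within a filesystem doesn't change its inode number, so a filehandle built from the inode number stays valid across the rename; here is a minimal sketch (the paths are made up):

    /* Sketch (made-up paths): renaming a file within a filesystem
     * leaves its inode number alone, which is part of what lets an NFS
     * filehandle built from the inode number survive the rename. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat before, after;

        if (stat("/export/data/oldname", &before) != 0) {
            perror("stat oldname");
            return 1;
        }
        if (rename("/export/data/oldname", "/export/data/newname") != 0) {
            perror("rename");
            return 1;
        }
        if (stat("/export/data/newname", &after) != 0) {
            perror("stat newname");
            return 1;
        }
        printf("inode before: %llu, after: %llu\n",
               (unsigned long long) before.st_ino,
               (unsigned long long) after.st_ino);
        return 0;
    }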

A traditional, straightforward NFS server implementation puts three major things into a filehandle:

  • some identifier of the (server) filesystem that the file is on. Originally this was the (Unix) device that the filesystem was mounted on, but these days you can often set this explicitly (Linux uses the fsid= option in exports(5), for example), or the NFS server will somehow generate a stable number for each different filesystem based on various things.

  • some stable identifier of the file itself that will allow the NFS server to quickly look it up, traditionally the inode number.

  • a generation count for the inode number, to detect when a NFS filehandle refers to an older version of an inode.

The generation count is needed because traditional Unix filesystems can reuse the inode and thus the inode number of a deleted file, and when a file is deleted you want all the old NFS filehandles for it to stop working instead of suddenly referring to whatever new file inherits the inode. The filesystem typically bumps the generation count each time it reuses an inode, so old filehandles (which carry the old generation count) stop matching and go stale. So it is the combination of the inode number and the generation count that uniquely identifies an abstract 'file' in a particular NFS-exported filesystem.

(Of course this is only necessary if your filesystem reuses its equivalent of inode numbers. If it does not, you don't need generation counts.)
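Putting the pieces together, you can think of a traditional filehandle as conceptually containing something like the following sketch (purely illustrative; real servers pack and encode these fields in their own opaque, implementation-specific formats):

    /* Illustrative sketch only: the conceptual contents of a
     * traditional NFS filehandle. */
    #include <stdint.h>

    struct nfs_filehandle_sketch {
        uint32_t fsid;        /* which exported filesystem (device number or fsid=) */
        uint64_t fileid;      /* stable per-file identifier, traditionally the inode number */
        uint32_t generation;  /* inode generation count, to catch reused inode numbers */
    };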

Historically, bugs about when inode generation counts did or did not get updated have been a fruitful source of peculiar NFS problems. For example, I believe that there was once a Unix system that updated the generation count when you truncated a file.

unix/NFSFilehandleInternals written at 01:02:25

