Why high availability NFS requires shared storage

January 20, 2009

Suppose that you have a situation where you need transparent high availability NFS for something that is read-only and updated only infrequently. Instead of going to the expense and bother of setting up real shared storage, it's tempting to try to implement this by setting up a number of fileservers with local copies of the filesystem, synchronizing it from a master machine with rsync, and then using a slight variant of the basic HA NFS setup.

Unfortunately, the tempting easy way doesn't work; you can't have transparent high availability unless you have shared storage or something that fakes it very well.

In order to have transparent HA, you need transparent failover. In order to have transparent failover, you need to keep the NFS filehandles the same. In almost all NFS implementations and filesystems, keeping the NFS filehandles the same requires keeping the inode numbers and generation counts of every file and directory exactly the same across all copies of the data.

No user-level tool can do this; there is no Unix interface to set the inode number or the generation count when you create or manipulate a file. (Okay, this is not quite true; at least some Linux filesystems have a private interface to set the generation count of an inode, although this still doesn't help with the inode numbers.) So the inevitable conclusion is that you must replicate your filesystem at some level below the normal user-level one that rsync and company use.
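To make the problem concrete, here is a minimal Python sketch that copies a small tree with ordinary system calls and then lists the paths whose inode numbers differ between the original and the copy. The helper names (`inode_map`, `mismatched_inodes`, `demo`) and the directory layout are made up for illustration, and `shutil.copytree` is only standing in for rsync. The sketch looks only at inode numbers and ignores generation counts.

```python
import os
import shutil
import tempfile


def inode_map(root):
    """Map each path under root (relative to root) to its inode number."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            result[os.path.relpath(full, root)] = os.lstat(full).st_ino
    return result


def mismatched_inodes(master, replica):
    """Return relative paths present in both trees whose inode numbers differ."""
    m, r = inode_map(master), inode_map(replica)
    return sorted(p for p in m if p in r and m[p] != r[p])


def demo():
    """Copy a small tree at user level and report which inode numbers changed."""
    with tempfile.TemporaryDirectory() as tmp:
        master = os.path.join(tmp, "master")
        replica = os.path.join(tmp, "replica")
        os.makedirs(os.path.join(master, "docs"))
        with open(os.path.join(master, "docs", "readme.txt"), "w") as f:
            f.write("read-only data\n")
        # copytree recreates every file and directory through the normal
        # interfaces, so the filesystem (not the tool) picks the inode numbers.
        shutil.copytree(master, replica)
        return mismatched_inodes(master, replica)


if __name__ == "__main__":
    for path in demo():
        print("inode differs:", path)
```

Here both trees live on one filesystem, so every copied path is guaranteed to get a different inode number. On separate fileservers each filesystem hands out its own numbers, and they would line up with the master's only by accident.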

The most general solution is shared storage, where you don't have to replicate anything at all. If you absolutely can't do shared storage, I can think of two general alternatives: synchronize the raw disk instead of the filesystem (possibly still with rsync), or do cross-network disk mirroring of some sort between the master and the replicas.

(Plausible methods of cross-network disk mirroring include Linux's DRBD and using an iSCSI or AoE target implementation on the replicas to export raw disks to the master.)
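As a rough illustration of the first alternative, synchronizing the raw disk instead of the filesystem, here is a minimal Python sketch that makes one disk image byte-identical to another while rewriting only the blocks that differ. This is not how rsync works internally; it is a simple stand-in. The function name `sync_image` and the block size are hypothetical, and it assumes that both images are regular files reachable from one machine and that the source is not changing while it is being read.

```python
import os

BLOCK_SIZE = 1024 * 1024  # hypothetical chunk size; any positive value works


def sync_image(src_path, dst_path, block_size=BLOCK_SIZE):
    """Make dst_path byte-identical to src_path, rewriting only differing blocks.

    Returns the number of blocks that had to be written.
    """
    written = 0
    # Create the destination if it is missing, without truncating an existing copy.
    fd = os.open(dst_path, os.O_RDWR | os.O_CREAT, 0o600)
    with open(src_path, "rb") as src, os.fdopen(fd, "r+b") as dst:
        offset = 0
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.seek(offset)
            if dst.read(len(block)) != block:
                dst.seek(offset)
                dst.write(block)
                written += 1
            offset += len(block)
        # Drop any leftover tail if the destination was larger than the source.
        dst.truncate(offset)
    return written
```

Because the copy happens below the filesystem, the replica ends up with the same on-disk inodes as the master, which is the property that a file-by-file copy cannot give you.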

Comments on this page:

From at 2009-01-21 09:02:46:

Note that the "raw disks" you rsync don't really have to be raw disks per se - they can just as easily be large files that are mounted through a loopback device, possibly themselves sitting on a RAID array of some type.

Of course, at this point I really think you've reached the level of ridiculousness. I can't imagine a scenario in which it's desirable to have HA NFS with failover where it wouldn't be just as desirable - and substantially less complicated - to do the failover at the application level.

-- DanielMartin, who's left his password somewhere at home

By cks at 2009-02-05 11:42:57:

I think that the attraction of HA NFS of this sort is that it solves the problem once, centrally, instead of having to solve it repeatedly in a bunch of different applications. It may also be simpler (especially overall) to solve it at the filesystem level, instead of forcing each application to think about failover and possibly replication.

(I note that the Google filesystem does something like this; it supplies reliable storage to applications, handling the issues involved itself.)

Written on 20 January 2009.

Last modified: Tue Jan 20 18:26:42 2009