NFS hard mounts versus soft mounts

November 9, 2014

On most Unix systems NFS mounts come in your choice of two flavours, hard or soft. The Linux nfs manpage actually has a very good description of the difference; the short summary is that a hard NFS mount will keep retrying NFS operations endlessly until the server responds, while a soft NFS mount will give up and return errors after a while.
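
To make the difference concrete, here is a minimal sketch of what the two flavours look like as /etc/fstab entries. The server name and export path here are made up, and the timeo (in tenths of a second) and retrans values are just the usual Linux defaults for NFS over TCP:

    # hard mount: requests are retried indefinitely until the server answers
    fileserver:/export/home  /home  nfs  hard,timeo=600,retrans=2  0 0

    # soft mount: requests fail with an error once the retransmissions
    # allowed by retrans have all timed out
    fileserver:/export/home  /home  nfs  soft,timeo=600,retrans=2  0 0

(Per nfs(5), the soft option makes the client fail a request after retrans retransmissions have been sent, returning an error to the calling application.)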

You can find people with very divergent opinions about which is better. My opinion is fairly strongly negative about soft mounts. The problem is that it is routine for a loaded NFS server to not respond to client requests within the client timeout interval, because the timeout is not for the NFS server to receive the request but for the server to fully process it. As you might imagine, a server under heavy IO and network load may not be able to finish your disk IO for some time, especially if it's write IO. This makes NFS timeouts that would trigger soft NFS mount errors a relatively routine event in many real world environments.
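
If you want to see how routine this is on one of your own clients, one way (a sketch; the exact output format varies between versions) is to look at the client's RPC retransmission counters with nfsstat. A steadily growing retrans count means requests are timing out and being resent:

    # show client-side RPC statistics, including retransmissions
    nfsstat -rc

Every one of those retransmissions is a request that took longer than the timeout; on a soft mount, enough of them in a row on one request turns into an IO error.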

(On Linux, any time a client reports 'nfs: server X not responding, still trying', that timeout would have been an IO error on a soft NFS mount. In our fileserver environment, some of these happen nearly every day.)

Many Unix programs do not really expect their IO to fail. Even programs that do notice IO errors often don't and can't do anything more than print an error message and perhaps abort. This is not a helpful response to transient errors, but then Unix programs are generally not designed for a world with routine transient IO errors. Even when programs report the situation, users may not notice, or may not be prepared to do much except perhaps retry the operation.
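
As an illustration of what careful handling would even look like, here is a minimal C sketch (my example, not anything from a real program) of a write loop that at least notices short writes and transient failures instead of silently losing data. On a soft NFS mount, the write() here can return EIO at any moment:

    #include <errno.h>
    #include <unistd.h>

    /* Write all of buf, retrying on EINTR; returns 0 on success,
     * -1 on error with errno set (eg EIO from a timed-out soft mount). */
    int write_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted by a signal: just retry */
                return -1;      /* a real error; the data may be lost */
            }
            buf += n;
            len -= (size_t)n;
        }
        return 0;
    }

Very few programs are even this careful, and note that all this code can do with an EIO is pass the failure upward; it cannot make the write succeed.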

(Write errors are especially dangerous because they can easily cause you to permanently lose data, but even read errors will cause you plenty of heartburn.)

Soft NFS mounts primarily make sense when you have some system that absolutely must remain responsive and cannot delay for too long for any reason. In this case a random but potentially very long kernel-imposed delay is a really bad thing and you'd rather have the operation error out entirely so that your user-level code can take action and at least respond in some way. Some NFS clients (or just specific NFS mounts) are only used in this way, for a custom system, and are not exposed to general use and general users.
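
For that kind of custom system, the user-level handling might look something like this hypothetical C fragment, where an EIO from a soft mount is treated as 'NFS is too slow right now' and the service falls back to cached data instead of hanging:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical: try to read fresh data from a file on a soft NFS
     * mount; on EIO, fall back to previously cached data so the service
     * stays responsive instead of stalling inside the kernel. */
    ssize_t read_or_fallback(int fd, char *buf, size_t len,
                             const char *cached, size_t cachedlen)
    {
        ssize_t n = read(fd, buf, len);
        if (n < 0 && errno == EIO) {
            fprintf(stderr, "NFS read timed out, serving cached data\n");
            if (cachedlen > len)
                cachedlen = len;
            memcpy(buf, cached, cachedlen);
            return (ssize_t)cachedlen;
        }
        return n;   /* success, or some other error for the caller */
    }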

(IO to NFS hard mounts can still be interrupted if you've sensibly mounted them with the intr option. It just requires an explicit decision at user level that the operation should be aborted, instead of the kernel deciding that all operations that have taken 'too long' should be aborted.)

PS: My bias here is that I've always been involved in running general use NFS clients, ones where random people will be using the NFS mounts for random and varied things with random and varied programs of very varied quality. This is basically a worst case for NFS soft mounts.


Comments on this page:

By Albert at 2014-11-10 05:59:27:

From man 5 nfs:

    intr / nointr   This option is provided for backward compatibility.
                    It is ignored after kernel 2.6.25.

From the same page on a slightly older machine:

    intr / nointr   Selects whether to allow signals to interrupt file
                    operations on this mount point. If neither option is
                    specified (or if nointr is specified), signals do
                    not interrupt NFS file operations. If intr is
                    specified, system calls return EINTR if an
                    in-progress NFS operation is interrupted by a
                    signal.

                    Using the intr option is preferred to using the soft
                    option because it is significantly less likely to
                    result in data corruption.

                    The intr / nointr mount option is deprecated after
                    kernel 2.6.25.  Only SIGKILL can interrupt a pending
                    NFS operation on these kernels, and if specified,
                    this mount option is ignored to provide backwards
                    compatibility with older kernels.

So it seems the intr option isn't the solution to all problems.

By Andrew R at 2014-11-10 12:23:01:

On the other hand, it is better to lose some data than all of it.

I once had a large rotating set of users where the user-facing systems were 1) physically near some of the users and 2) also used by remote users. On every NFS lock-up, the nearby users would power cycle the systems, killing everything for everyone. It was normal for a small issue on the NFS server to result in every user-facing system in a room being power cycled via the power plug.

Soft mounting NFS kept more of my users happy, at the risk of making a handful of them very unhappy.
