NFS's problem with (concurrent) writes

October 21, 2013

If you hang around distributed filesystem developers, you may hear them say grumpy things about NFS's handling of writes in general and concurrent writes in particular. If you're an outsider this can be a little bit opaque. I didn't fully remember the details until I was reminded of them recently, so in my usual tradition I am going to write down the core problem. To start with, I should say that the core problem is with NFS the protocol, not with any particular implementation.

Suppose that you have two processes, A and B. A is writing to a file and B is reading from it (perhaps they are cooperating database processes or something). If A and B are running on the same machine, the moment that A calls write() the newly-written data is visible to B when it next does a read() (or it's directly visible if B has the file mmap()'d). Now we put A and B on different machines, sharing access to the file over NFS. Suddenly we have a problem, or actually two problems.
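The same-machine behaviour can be sketched concretely. Here a single Python process stands in for both A and B by holding two independent file descriptors on one file (the filename is made up for the example); on a local filesystem, data written through one descriptor is immediately visible through the other, with no fsync() or reopen needed:

```python
import os
import tempfile

# Hypothetical single-machine demonstration: A's write() is immediately
# visible to B's read(), with no flushing or reopening required.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")

writer = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)  # process A's fd
reader = os.open(path, os.O_RDONLY)                      # process B's fd

os.write(writer, b"hello from A")   # A writes...
data = os.read(reader, 64)          # ...and B sees it at once

os.close(writer)
os.close(reader)
```

Run A and B on two machines over NFS and neither half of this is guaranteed any more, which is exactly the two problems below.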

First, NFS is silent on how long A's kernel can hold on to A's written data before sending it to the NFS server. If A close()s or fsync()s the file, the kernel must ship the writes off to the NFS server, but before then it may hang on to them for some amount of time at its convenience. Second, NFS has no protocol for the server to notify B's kernel that there is updated data in the file. Instead B's kernel may be holding what is now stale cached data that it will quietly give to B, even though the server has newer data. Properly functioning NFS clients check for this when you open() a file (and discard old data if necessary); I believe that they may also check at other times, but that's not necessarily guaranteed.

The CS way of putting this is that this is a distributed cache invalidation problem and NFS has only very basic support for it. Basically NFS punts and tells you to use higher-level mechanisms to make this work, mechanisms that mean A and B have to be at least a bit NFS-aware. Many modern distributed and cluster filesystems have much more robust support that guarantees processes A and B see a result much closer to what they would if they ran on the same machine (some distributed FSes probably guarantee that it's basically equivalent).

(Apparently one term of art for this is that NFS has only 'close to open' consistency, ie you only get consistent results among a pool of clients if A closes the file before B opens it.)
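Under close-to-open consistency, cooperating processes have to serialize themselves around close() and open(). A minimal sketch of the required discipline (run here in one process on a local filesystem, where it trivially works; over NFS, the close() is what forces A's flush and the fresh open() is what forces B's revalidation):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.dat")

# Process A's side: write and then close. Over NFS, close() obliges
# A's kernel to push the written data to the server.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"version 2")
os.close(fd)

# Process B's side: open *after* A's close, rather than reusing an
# already-open file. Over NFS, open() is where a properly functioning
# client revalidates its cached data against the server.
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 64)
os.close(fd)
```

If B instead keeps the file open across A's writes, nothing in the protocol promises that B ever sees the new data.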

Comments on this page:

By Anonymous at 2014-04-02 04:00:56:

So if A writes a file to an NFS directory and B needs to read it "immediately" as the file appears, is the only workaround to use low values of actimeo? Or should A and B be communicating directly with some simple mechanism instead of setting, say, actimeo=1?
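(For reference, the actimeo the commenter mentions is an NFS client mount option that caps how long the kernel trusts cached file attributes before rechecking with the server. A sketch, with a hypothetical server name and export path; shorter timeouts mean more attribute-refetch traffic to the server:)

```shell
# Hypothetical mount: cap all attribute cache timeouts at 1 second,
# so stale file sizes/mtimes are noticed within about a second.
mount -o actimeo=1 fileserver:/export/shared /mnt/shared
```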

By cks at 2014-04-17 00:12:23:

It took me a while but I finally have an answer to this in NFSWritePlusReadProblemII. The short answer is that I think the best approach is that A should write files with new names and then communicate directly to B to tell it 'process file <X>'.
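That pattern can be sketched briefly. Here one Python process stands in for both sides, a queue stands in for whatever real A-to-B channel you use, and all the names are made up; the point is that A only announces a file after it has been fully written, flushed, and renamed into place under a new, unique name:

```python
import os
import queue
import tempfile

workdir = tempfile.mkdtemp()
channel = queue.Queue()   # stands in for A's real channel to B

def produce(name, payload):
    # A's side: write under a temporary name, flush, rename to the
    # final (new, unique) name, then tell B 'process file <X>'.
    tmp = os.path.join(workdir, name + ".tmp")
    final = os.path.join(workdir, name)
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, payload)
    os.fsync(fd)             # force the data out to the (NFS) server
    os.close(fd)
    os.rename(tmp, final)    # the file appears atomically, complete
    channel.put(final)

def consume():
    # B's side: only ever open()s a name it was explicitly told about,
    # so close-to-open consistency is all it needs.
    path = channel.get()
    with open(path, "rb") as f:
        return f.read()

produce("job-0001", b"some data")
result = consume()
```

Because B never re-reads an existing file that A might still be writing, none of the cache-invalidation problems above come into play.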

From an unnamed commenter at 2021-05-17 11:49:34:

This is an old entry, but I think it's worth pointing out that NFSv4's leases improve this; clients caching data can hold a lease, and when another client wants to write, the server first recalls the leases that are held.
