Why file and directory operations are synchronous in NFS

July 22, 2019

One of the things that unpleasantly surprises people about NFS every so often is that file and directory operations like creating a file, renaming it, or removing it are synchronous. This can make operations like unpacking a tar file or doing a VCS clone or checkout startlingly slow, much slower than they are on a local filesystem. Even removing a directory tree can be drastically slower than it is locally.

(Anything that creates files also suffers from the issue that NFS clients normally force a flush to disk after they finish writing a file.)

In the original NFS, all writes were synchronous. This was quite simple but also quite slow, and for NFS v3, the protocol moved to a more complicated scheme for data writes, where the majority of data writes could be asynchronous but the client could force the server to flush them all to disk every so often. However, even in NFS v3 the protocol more or less requires that directory level operations are synchronous. You might wonder why.

One simple answer is that the Unix API provides no way to report delayed errors for file and directory operations. If you write() data, it is an accepted part of the Unix API that errors stemming from that write may not be reported until much later, such as when you close() the file. This includes not just 'IO error' type errors, but also problems such as 'out of space' or 'disk quota exceeded'; they may only appear and become definite when the system forces the data to be written out. However, there's no equivalent of close() for things like removing files or renaming them, or making directories; the Unix API assumes that these either succeed or fail on the spot.
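To make the asymmetry concrete, here is a minimal sketch (the function name and error handling are my own, not from any particular program) of how a careful program has to treat write(): it must check close() as well, because that is where a deferred error such as 'out of space' or 'disk quota exceeded' may finally surface. There is no equivalent second chance for mkdir() or rename().

```python
import os

def write_file_carefully(path, data):
    """Write data to path, checking both write() and close() for errors.

    On NFS, errors such as ENOSPC or EDQUOT may only be reported at
    close(), when the client flushes buffered data to the server.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # This write() may succeed locally even if the server will
        # later reject the data.
        os.write(fd, data)
    finally:
        # A deferred error can be raised here, long after write()
        # returned success.
        os.close(fd)
```

On a local filesystem both calls usually succeed immediately; the point is only that the API gives close() a place to report trouble, which directory operations lack entirely.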

(Of course, the Unix API doesn't necessarily promise that all errors are reported at close() or that close() flushes your data to disk. But at least close() explicitly provides the API a final opportunity to report that some errors happened somewhere, and thus allows it not to report all errors at write() time.)

This lack in the Unix API means that it's pretty dangerous for a kernel to accept such operations without actually committing them; if something goes wrong, there's no way to report the problem (and often no process left to report it to). It's especially dangerous in a network filesystem, where the server may crash and reboot without programs on the client noticing (there's no Unix API for that either). It would be very disconcerting if you did a VCS checkout, started working, had everything stall for a few minutes (as the server crashed and came back), and then suddenly all of your checkout was different (because the server hadn't committed it).

You could imagine a network filesystem where the filesystem protocol itself said that file and directory operations were asynchronous until explicitly committed, like NFS v3 writes. But since the Unix API has no way to expose this to programs, the client kernel would just wind up making those file and directory operations synchronous again so that it could immediately report any and all errors when you did mkdir(), rename(), unlink(), or whatever. Nor could the client kernel really batch up a bunch of those operations and send them off to the network filesystem server as a single block; instead it would need to send them one by one just to get them registered and get an initial indication of success or failure (partly because programs often do inconvenient things like mkdir() a directory and then immediately start creating further things in it).
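A hypothetical sketch of that dependency problem (the paths here are invented for illustration): each call below depends on the previous one having already succeeded on the server, so the client can't usefully batch them into one request.

```python
import os

def make_tree(base):
    # Each of these is a separate synchronous NFS round trip. The
    # client cannot batch them, because each later operation depends
    # on the earlier one having already taken effect on the server.
    os.mkdir(os.path.join(base, "project"))          # must exist before the next mkdir
    os.mkdir(os.path.join(base, "project", "src"))   # must exist before the file create
    fd = os.open(os.path.join(base, "project", "src", "main.c"),
                 os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)
```

This is exactly the shape of a tar unpack or VCS checkout, which is why those workloads feel the synchronous round trips so acutely.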

Given all of this, it's not surprising that neither the NFS protocol nor common NFS server implementations try to change the situation. With no support from the Unix API, NFS clients will pretty much always send NFS file and directory operations to the server as they happen and need an immediate reply. In order to avoid surprise client-visible rollbacks, NFS servers are then more or less obliged to commit these metadata changes as they come in, before they send back the replies. The net result is a series of synchronous operations; the client kernel has to send the NFS request and wait for the server reply before it returns from the system call, and the server has to commit before it sends out its reply.
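If you want to see the cost for yourself, a rough measurement sketch (my own, with no claims about what numbers you'll get) is to time a run of file creates; on an NFS mount each create is at least one client-server round trip plus a server-side commit, so the per-file average is typically far higher than on a local filesystem.

```python
import os
import time

def time_file_creates(dirpath, count=100):
    """Return the average time per file creation in dirpath.

    On NFS each create is a synchronous round trip to the server,
    which must commit the new directory entry before replying.
    """
    start = time.monotonic()
    for i in range(count):
        fd = os.open(os.path.join(dirpath, "f%d" % i),
                     os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        os.close(fd)
    return (time.monotonic() - start) / count
```

Running this against a local filesystem and then an NFS mount makes the difference between in-memory metadata updates and committed round trips quite visible.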

(In the traditional Unix way, some kernels and some filesystems do accept file and metadata operations without committing them. This leads to problems. Generally, though, the kernel makes it so that your operations will only fail due to a crash or an actual disk write error, both of which are pretty uncommon, not due to other delayed issues like 'out of disk space' or 'permission denied (when I got around to checking)'.)
