Complications in supporting 'append to a file' in a NFS server

November 7, 2024

In the comments of my entry on the general problem of losing network based locks, an interesting side discussion has happened between commentator abel and me over NFS servers (not) supporting the Unix O_APPEND feature. The more I think about it, the more I think it's non-trivial to support well in an NFS server and that there are some subtle complications (and probably more that I haven't realized). I'm mostly going to restrict this to something like NFS v3, which is what I'm familiar with.

The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data. If you and I do a single append write() of 128 Mbytes to the same file at the same time, either all of my 128 Mbytes winds up before yours or vice versa; your and my data will never wind up intermingled.
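
(As a purely local illustration of these semantics, and setting aside for the moment the question of partial writes that comes up in the comments below, an appending writer looks something like the sketch below; the function name is my own. Each write() puts its buffer contiguously at whatever the end of the file is at that instant, no matter how many other processes are appending to the same file.)

    #include <fcntl.h>
    #include <unistd.h>

    /* Minimal local sketch: append one record with O_APPEND.  On a local
       Unix filesystem the whole buffer ends up contiguously at the current
       end of file, even with other processes appending at the same time. */
    int append_record(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, buf, len);   /* seek-to-EOF plus write, atomically */
        close(fd);
        return (n == (ssize_t)len) ? 0 : -1;
    }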

This basic semantics is already a problem for NFS because NFS (v3) connections have a maximum size for single NFS 'write' operations, and that size may be (much) smaller than the user level write(). (For example, with a wsize of, say, 1 MiB, a single 128 Mbyte write() turns into 128 separate NFS write operations.) Without a multi-operation transaction of some sort, we can't reliably perform append write()s of more data than will fit in a single NFS write operation; either we fail those 128 Mbyte writes, or we accept the possibility that data from you and me will be intermingled in the file.

In NFS v2, all writes were synchronous (or were supposed to be; servers sometimes lied about this). NFS v3 introduced the idea of asynchronous, buffered writes that were later committed by clients. NFS servers are normally permitted to discard asynchronous writes that haven't yet been committed by the client; when the client tries to commit them later, the NFS server rejects the commit and the client resends the data. This works fine when the client's request has a definite position in the file, but it has issues if the client's request is a position-less append write. Suppose two clients do append writes to the same file, first A and then B, the server discards both, and client B is the first one to go through the 'COMMIT, fail, resend' process: where does its data wind up? It's not hard to wind up with situations where a third client that's repeatedly reading the file sees inconsistent results: first it sees A's data and then B's, and later it sees either B's data before A's or B's data with nothing from A at all (not even a zero-filled gap in the file, the way you'd get with ordinary writes).
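
(To make the 'COMMIT, fail, resend' dance a bit more concrete, here is a rough sketch of the client side logic. The structure and function names are invented for illustration, but the shape is real: in NFS v3, each asynchronous WRITE reply and each COMMIT reply carries a server 'write verifier', and if the verifier the client sees at COMMIT time doesn't match the one it saw when it wrote, the client has to resend that data.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch only; nfs3_write_unstable() and nfs3_commit()
       are invented names standing in for the real RPC exchanges. */
    int nfs3_write_unstable(int fh, uint64_t off, const void *buf,
                            size_t len, uint64_t *verifier_out);
    int nfs3_commit(int fh, uint64_t *verifier_out);

    struct pending_write {
        uint64_t offset;        /* fixed position in the file; a position-less
                                   append write has no such offset, which is
                                   exactly the problem discussed above */
        const void *data;
        size_t len;
        uint64_t verifier;      /* verifier from the original WRITE reply */
    };

    int flush_pending(int fh, struct pending_write *w, size_t count)
    {
        uint64_t commit_verf;
        if (nfs3_commit(fh, &commit_verf) < 0)
            return -1;
        for (size_t i = 0; i < count; i++) {
            if (w[i].verifier != commit_verf) {
                /* The server forgot this write (for instance, it rebooted);
                   resend the data at the same offset.  With a fixed offset
                   this is safe no matter what order clients resend in. */
                if (nfs3_write_unstable(fh, w[i].offset, w[i].data,
                                        w[i].len, &w[i].verifier) < 0)
                    return -1;
            }
        }
        return 0;
    }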

(While we can say that NFS servers shouldn't ever deliberately discard append writes, one of the ways that this happens is that the server crashes and reboots.)

You can get even more fun ordering issues created by retrying lost writes if there is another NFS client involved that is doing manual append writes by finding out the current end of file and writing at it. Suppose A and B do append writes, C does a manual append write, and all the writes are lost before they're committed; B redoes its write, then C, and finally A. A natural implementation could easily wind up with B's data, a hole the size of A's data, C's data, and then A's data appended after C's.
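
(A 'manual append write' here is just the ordinary Unix pattern of finding the current size of the file and writing at that fixed offset. A minimal sketch, with the function name my own invention:)

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Minimal sketch of a manual append: look up the current end of file
       and write there.  Unlike O_APPEND, the offset is computed once, so
       if other writes are lost and replayed in a different order, this
       data still lands at the originally computed offset, which is how
       the hole in the scenario above appears. */
    ssize_t manual_append(int fd, const void *buf, size_t len)
    {
        struct stat st;
        if (fstat(fd, &st) < 0)
            return -1;
        return pwrite(fd, buf, len, st.st_size);   /* write at the old EOF */
    }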

This also creates server side ordering dependencies for decisions about potentially discarding uncommitted asynchronous write data, decisions that a NFS server can normally make independently for each write. If A appended a lot of data and then B appended a little bit, you probably don't want to discard A's data but not B's, because there's no guarantee that A will later show up to fail a COMMIT and resend it (A could have crashed, for example). And if B requests a COMMIT, you probably want to commit A's data as well, even if there's much more of it.

One way around this would be to adopt a more complex model of append writes over NFS, where instead of the client requesting an append write, it requests 'write this here but fail if this is not the current end of file'. This would give all NFS writes a definite position in the file at the cost of forcing client retries on the initial request (if the client later has to repeat the write because of a failed commit, it must carefully strip this flag off). Unfortunately a file being appended to from multiple clients at a high rate would probably result in a lot of client retries, with no guarantee that a given client would ever actually succeed.
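
(Here is a sketch of the retry loop this scheme forces on an appending client. Nothing in it is a real NFS v3 operation; nfs_get_size() and nfs_write_if_eof() are invented names for the hypothetical 'write here but fail if this is not the current end of file' request.)

    #include <stdint.h>
    #include <stddef.h>

    #define APPEND_CONFLICT 1   /* the offset was no longer the end of file */

    /* Hypothetical operations, not part of NFS v3. */
    int nfs_get_size(int fh, uint64_t *size_out);
    int nfs_write_if_eof(int fh, uint64_t off, const void *buf, size_t len);

    int append_with_retries(int fh, const void *buf, size_t len)
    {
        for (;;) {
            uint64_t eof;
            if (nfs_get_size(fh, &eof) < 0)
                return -1;
            int r = nfs_write_if_eof(fh, eof, buf, len);
            if (r == 0)
                return 0;               /* our data landed at offset 'eof' */
            if (r != APPEND_CONFLICT)
                return -1;              /* a real error */
            /* Someone else extended the file first; go around and try
               again.  Under heavy contention there's no guarantee this
               loop ever wins. */
        }
    }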

(You could require all append writes to be synchronous, but that would do terrible things to NFS server performance for potentially common uses of append writes, like appending log lines to a shared log file from multiple machines. And people absolutely would write and operate programs like that if append writes over NFS were theoretically reliable.)


Comments on this page:

By abel at 2024-11-08 11:11:23:

Thanks, Chris. I agree that huge writes make things tricky, and that's something I'd failed to consider. But that made me wonder how they'd work even locally.

"The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data."

Are you sure about that? Here's what POSIX-2024 says for write(): "If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation." (I grepped the whole spec; the only hit with more detail relates to asynchronous I/O, and I don't consider it really relevant to this discussion.)

A write operation, though, is only required to atomically handle "all your data" under specific circumstances. Pipes, when writing PIPE_BUF bytes or less. SOCK_DGRAM sockets if connect() has been used. (Also SOCK_SEQPACKET if the data will fit, but you'd have to see recvmsg() to know that, because send() talks of "records" rather than messages.)

As for regular files? POSIX requires the seek and the write operation to be atomic when O_APPEND is set, but there's nothing about the write operation taking "all your data". A perverse system could probably decide never to take more than 1 byte at a time; write() would return 1, and as long as that one byte went to the (then) end of the file, I don't think there'd be a compliance problem. POSIX documents a number of reasons for short writes, such as signal interruption and lack of space, but I don't see any limitation on other reasons; it explicitly says "every application should be prepared to handle partial writes on other [than pipe/FIFO] kinds of file descriptors".
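
(For concreteness, the usual way applications cope is the standard 'keep writing until it's all written' loop, something like the sketch below. Note that with O_APPEND set, only each individual write() call is atomic; the retried remainder is a separate append, so another process's data can land in between.)

    #include <unistd.h>
    #include <errno.h>

    /* Standard partial-write handling loop: keep calling write() until
       the whole buffer has been accepted or a real error occurs. */
    ssize_t write_all(int fd, const char *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted by a signal; retry */
                return -1;
            }
            done += (size_t)n;
        }
        return (ssize_t)done;
    }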

That said, there's apparently a long-standing Unix tradition of only truncating writes on block boundaries, and I know tar will break if one does otherwise (my recollection is that it stops reading a file as soon as read() returns less than 512). I think this gives us a solution, anyway: truncate at NFS's "maximum write size" when O_APPEND is set. I doubt the people making this an "NFS FAQ" are complaining about their hundred-megabyte writes getting interleaved. It's probably mostly the type of line-by-line appending, mentioned in your previous post, where people are noticing this.
