POSIX write() is not atomic in the way that you might like

November 16, 2020

I was recently reading Evan Jones' Durability: Linux File APIs. In this quite good article, I believe that Jones makes a misstep about what you can assume about write() (both in POSIX and in practice). I'll start with a quote from the article:

The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an addition note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.

Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. [...]

Unfortunately, POSIX is not saying that writes are atomic in general, and even if POSIX tried to say it, it's extremely likely that no Unix system would actually comply and deliver fully atomic writes. First off, POSIX's explicit statements about atomicity apply only in two situations: when anything is writing to a pipe or a FIFO, or when there are multiple threads in the same process all performing operations. What POSIX says about writes interleaved with reads is much more limited, so let me quote it (emphasis mine):

After a write() to a regular file has successfully returned:

  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

This does not require any specific behavior for a read() of the file that another process starts before the write() returns (including one started before the write() began). If you issue such a read(), POSIX allows it to see none, some, or all of the data from the write(). Such a read() is only (theoretically) atomic if you issue it from another thread within the same process. This definitely doesn't provide the usual atomicity property, where everyone sees either all of an operation or none of it, since a cross-process read() performed during the write() is allowed to see partial results. We would not call a SQL database that allowed you to see partially complete transactions 'atomic', but that is what POSIX allows for write() to files.

(It is also what real Unixes almost certainly provide in practice, although I haven't tested this and there are many situations to consider. For instance, I wouldn't be surprised if aligned, page-sized writes (or filesystem block sized ones) were atomic in practice on many Unixes.)
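One rough way to probe this on a particular system is to have one process repeatedly overwrite the same range with a uniform pattern while another process reads it back and looks for a mixture. The following is only a sketch under assumptions: the names `is_torn` and `count_torn_reads`, the 1 MiB size, and the round count are all made up for illustration, and it relies on os.fork(), so it is Unix-only. A count of zero proves nothing about atomicity; a nonzero count shows that a cross-process read() saw part of a write().

```python
import os
import tempfile

SIZE = 1 << 20  # 1 MiB per write(); an arbitrary size picked for this sketch


def is_torn(buf: bytes) -> bool:
    """True if buf contains more than one distinct byte value."""
    return bool(buf) and buf.count(buf[:1]) != len(buf)


def count_torn_reads(rounds: int = 200) -> int:
    """Fork a writer that alternates all-'A' and all-'B' overwrites of one
    range, and count how many of the parent's pread() calls see a mixture."""
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, b"A" * SIZE, 0)
        pid = os.fork()
        if pid == 0:
            # Child: the writer process. Always leave via os._exit().
            try:
                for i in range(rounds):
                    pattern = b"B" if i % 2 == 0 else b"A"
                    os.pwrite(fd, pattern * SIZE, 0)
            finally:
                os._exit(0)
        # Parent: the reader process, racing against the child's writes.
        torn = 0
        while True:
            if is_torn(os.pread(fd, SIZE, 0)):
                torn += 1
            done, _status = os.waitpid(pid, os.WNOHANG)
            if done:
                break
        return torn
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("torn reads observed:", count_torn_reads())
```

The result will vary with the filesystem, the write size, and the kernel, which is part of why a single run on one machine says little about Unixes in general.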

If we think about what it would take to implement atomic file writes across processes, this should be unsurprising. Since Unix programs don't expect short writes on files, we can't make the problem simpler by limiting how large a write we have to make atomic and then capping write() to that size; people can ask us to write megabytes or even gigabytes in a single write() call and that would have to be atomic. This is too much data to be handled by gathering it into an internal kernel buffer and then flipping the visible state of that section of the file in one action. Instead this would likely require byte range locking, where write() and read() lock against each other where their ranges overlap. This is already a lot of locking activity, since every write() and every read() would have to participate.

(You could optimize read() from a file that no one has open for writing, but then it would still need to lock the file so that it can't be opened for writing until the read() completes.)
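To make the bookkeeping concrete, here is a minimal user-level sketch of byte range locking, where an operation waits as long as any held range overlaps its own. This is an illustration only, not how any kernel does it: `RangeLock`, `ranges_overlap`, and `try_acquire` are hypothetical names, it uses threads inside one process, and it treats every range as exclusive (a real implementation would presumably let read()s share with each other).

```python
import threading


def ranges_overlap(a_start: int, a_len: int, b_start: int, b_len: int) -> bool:
    """True if the half-open byte ranges [start, start+len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


class RangeLock:
    """Exclusive byte range lock: overlapping ranges wait for each other."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._held = []  # list of (start, length) currently locked

    def _conflicts(self, start: int, length: int) -> bool:
        return any(ranges_overlap(start, length, s, n) for s, n in self._held)

    def try_acquire(self, start: int, length: int) -> bool:
        """Lock the range if nothing overlaps it; never blocks."""
        with self._cond:
            if self._conflicts(start, length):
                return False
            self._held.append((start, length))
            return True

    def acquire(self, start: int, length: int) -> None:
        """Block until no held range overlaps, then lock the range."""
        with self._cond:
            while self._conflicts(start, length):
                self._cond.wait()
            self._held.append((start, length))

    def release(self, start: int, length: int) -> None:
        with self._cond:
            self._held.remove((start, length))
            self._cond.notify_all()
```

Even in this toy form, every read() and every write() would have to pass through acquire() and release(), and a gigabyte-sized write() would hold its range for the entire copy.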

But merely locking against read() is not good enough on modern Unixes, because many programs actually read data by mmap()'ing files. If you really want write() to be usefully atomic, you must make these memory mapped reads lock against write() as well, which requires relatively expensive page table manipulation. Worrying about mmap() also exposes a related issue, which is that when people read through memory mapping, write() isn't necessarily atomic even at the level of individual pages of memory. A reader using mapped memory may see a page that's half-way through the kernel's write() copying bytes into it.

(This may happen even with read() and write(), since they may both access the same page of data from the file in the kernel's buffer cache, but it is probably easier to lock things there.)
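As a small illustration of why mapped readers are awkward here, the sketch below maps a file read-only and then changes it with a write-style call; the mapped reader picks up the new bytes with an ordinary memory access and never enters read(), so there is no system call at which the kernel could make it wait. The helper name `mmap_sees_write` is hypothetical, and the behaviour shown assumes a system where shared mappings and write() go through the same page cache, which I believe is the case on Linux and other modern Unixes but have not checked broadly.

```python
import mmap
import os
import tempfile


def mmap_sees_write(size: int = 8192):
    """Return the first bytes seen through a mapping before and after a
    pwrite() to the same file."""
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, b"A" * size, 0)
        view = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
        try:
            before = view[:4]
            os.pwrite(fd, b"B" * size, 0)
            after = view[:4]  # a plain memory load, not a read() call
        finally:
            view.close()
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(mmap_sees_write())
```

Making that memory load wait for an in-progress write() is what would require the page table manipulation described above.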

On top of the performance issues, there are fairness issues. If write() is atomic against read(), a long write() or a long read() can stall the other side for potentially significant amounts of time. People do not enjoy slow and delayed read() and write() operations. This also provides a handy way to DoS writers of files that you can open for reading; simply set up to read() the entire file in one go (or in as few calls as possible) over and over again.

However, many of these costs arise because we want cross-process atomic write()s, which means that the kernel must be the one doing the locking work. Cross-thread atomic write() can be implemented entirely at user level within a single process (provided that the C library intercepts read() and write() operations when threading is active). In a lot of cases you can get away with some sort of simple whole file locking, although the database people will probably not be happy with you. Fairness and stalls are also much less of an issue within a single process, because the only person you're hurting is yourself.

(Most programs do not read() and write() from the same file at the same time in two threads.)
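The simple whole file locking version is easy to sketch at user level. What follows is an assumption-laden illustration, not a description of what any C library does: `LockedFile` and `torn_reads_with_lock` are hypothetical names, and the lock only covers threads that go through the wrapper, not other processes and not mmap() readers.

```python
import os
import tempfile
import threading


class LockedFile:
    """Serialize pread()/pwrite() on one fd behind a single whole file lock."""

    def __init__(self, fd: int) -> None:
        self._fd = fd
        self._lock = threading.Lock()

    def pwrite(self, data: bytes, offset: int) -> int:
        with self._lock:
            return os.pwrite(self._fd, data, offset)

    def pread(self, length: int, offset: int) -> bytes:
        with self._lock:
            return os.pread(self._fd, length, offset)


def torn_reads_with_lock(rounds: int = 200, size: int = 1 << 16) -> int:
    """One writer thread alternates all-'A' and all-'B' overwrites while the
    main thread reads; count reads that see a mixture of the two."""
    fd, path = tempfile.mkstemp()
    f = LockedFile(fd)
    try:
        f.pwrite(b"A" * size, 0)

        def writer() -> None:
            for i in range(rounds):
                f.pwrite((b"B" if i % 2 == 0 else b"A") * size, 0)

        t = threading.Thread(target=writer)
        t.start()
        torn = 0
        while t.is_alive():
            buf = f.pread(size, 0)
            if buf.count(buf[:1]) != len(buf):
                torn += 1
        t.join()
        return torn
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("torn reads with lock:", torn_reads_with_lock())
```

Because every read and write takes the same lock, a reader thread can only run between whole writes, at the cost of serializing all I/O on the file, including operations on ranges that do not overlap.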

PS: Note that even writes to pipes and FIFOs are only atomic if they are small enough; large writes explicitly don't have to be atomic (and generally aren't on real Unixes). It would be rather unusual for POSIX to specify limited size atomicity for pipes and unlimited size atomicity for regular files.
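The size limit for atomic pipe writes is PIPE_BUF, and you can ask the system what it is for a given pipe. A minimal sketch follows (the function name `pipe_atomic_limit` is made up; os.fpathconf() and the "PC_PIPE_BUF" name are standard on Unix versions of Python):

```python
import os


def pipe_atomic_limit() -> int:
    """Return PIPE_BUF for a freshly created pipe: the largest write()
    that POSIX requires to be atomic on it."""
    r, w = os.pipe()
    try:
        return os.fpathconf(w, "PC_PIPE_BUF")
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("PIPE_BUF:", pipe_atomic_limit())
```

POSIX only requires this value to be at least 512 bytes, so portable code that depends on atomic pipe writes has to keep each write() quite small.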

PPS: I would be wary of assuming that any particular Unix actually fully implemented atomic read() and write() between threads. Perhaps I'm being cynical, but I would test it first; it seems like the kind of picky POSIX requirement that people would cut out in the name of simplicity and speed.

Comments on this page:

By kib at 2020-11-17 12:09:26:

FreeBSD takes range lock over the read/written/truncated range for regular files. Of course mmaped reads are not locked.

Written on 16 November 2020.


Last modified: Mon Nov 16 23:40:15 2020