
2020-11-16

POSIX write() is not atomic in the way that you might like

I was recently reading Evan Jones' Durability: Linux File APIs. In this quite good article, I believe that Jones makes a misstep about what you can assume about write() (both in POSIX and in practice). I'll start with a quote from the article:

The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an additional note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.

Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. [...]

Unfortunately, POSIX is not saying that writes are atomic in general, and even if it tried to say so, it's extremely likely that no Unix system would actually comply and deliver fully atomic writes. First off, POSIX's explicit statements about atomicity apply in only two situations: when anything writes to a pipe or a FIFO, or when multiple threads in the same process are performing operations. What POSIX says about writes interleaved with reads is much more limited, so let me quote it (emphasis mine):

After a write() to a regular file has successfully returned:

  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

This does not require any specific behavior for a read() of the file that another process starts before the write() returns (including one started before the write() began). If you issue such a read(), POSIX allows it to see none, some, or all of the data from the write(). A concurrent read() is only (theoretically) atomic if you issue it from another thread within the same process. This definitely doesn't provide the usual atomicity property that everyone sees either all of an operation or none of it, since a cross process read() performed during the write() is allowed to see partial results. We would not call a SQL database that allowed you to see partially complete transactions 'atomic', but that is what POSIX allows for write() to files.

(It is also what real Unixes almost certainly provide in practice, although I haven't tested this and there are many potential special cases. For instance, I wouldn't be surprised if aligned, page-sized writes (or filesystem block sized ones) were atomic in practice on many Unixes.)
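To make what POSIX permits concrete, here is a minimal sketch in C of how you might go looking for a torn cross-process read. The file name, the 64 MB size, and the pass count are all arbitrary choices of mine, and since kernels aren't obliged to tear reads, you may well never see a hit:

    /* torn-read.c: demonstrate that POSIX allows a cross-process
       read() to see only part of a concurrent write(). The file
       starts as all 'B's; a child overwrites it with all 'A's in a
       single large write(); the parent looks for a read() that sees
       a mixture. Seeing one is allowed, not guaranteed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SIZE (64 * 1024 * 1024)   /* 64 MB, to make the write slow */

    int main(void) {
        char *buf = malloc(SIZE);
        if (!buf)
            return 1;
        memset(buf, 'B', SIZE);
        int fd = open("tornfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, SIZE) != SIZE)
            return 1;

        if (fork() == 0) {
            /* Child: one big write() of 'A's over the whole file. */
            memset(buf, 'A', SIZE);
            pwrite(fd, buf, SIZE, 0);
            _exit(0);
        }

        /* Parent: re-read the file and look for a mix of old and
           new bytes within a single read(), i.e. a torn read. */
        for (int pass = 0; pass < 1000; pass++) {
            ssize_t n = pread(fd, buf, SIZE, 0);
            if (n > 0 && memchr(buf, 'A', n) && memchr(buf, 'B', n)) {
                printf("pass %d: read() saw a partial write()\n", pass);
                break;
            }
        }
        wait(NULL);
        return 0;
    }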

If we think about what it would take to implement atomic file writes across processes, this should be unsurprising. Since Unix programs don't expect short writes on files, we can't make the problem simpler by limiting how large a write we have to make atomic and then capping write() to that size; people can ask us to write megabytes or even gigabytes in a single write() call and that would have to be atomic. This is too much data to be handled by gathering it into an internal kernel buffer and then flipping the visible state of that section of the file in one action. Instead this would likely require byte range locking, where write() and read() lock against each other where their ranges overlap. This is already a lot of locking activity, since every write() and every read() would have to participate.

(You could optimize read() from a file that no one has open for writing, but then it would still need to lock the file so that it can't be opened for writing until the read() completes.)
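For a concrete feel of the byte range locking involved, here is a hedged sketch of the cooperative user level version using POSIX advisory record locks (fcntl() with F_SETLKW); the wrapper name is mine. Advisory locks only bind processes that opt in to them, which is exactly why a genuinely atomic write() would need the kernel to do this for everyone:

    /* A sketch of byte range locking with POSIX advisory record
       locks. Every cooperating process would have to wrap read()
       and write() this way (readers would take F_RDLCK); anyone
       who doesn't bypasses the whole scheme. */
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off) {
        struct flock fl = {
            .l_type = F_WRLCK, .l_whence = SEEK_SET,
            .l_start = off, .l_len = (off_t)len,
        };
        if (fcntl(fd, F_SETLKW, &fl) == -1)  /* wait to own the byte range */
            return -1;
        ssize_t n = pwrite(fd, buf, len, off);
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);             /* release the byte range */
        return n;
    }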

But merely locking against read() is not good enough on modern Unixes, because many programs actually read data by mmap()'ing files. If you really want write() to be usefully atomic, you must make these memory mapped reads lock against write() as well, which requires relatively expensive page table manipulation. Worrying about mmap() also exposes a related issue, which is that when people read through memory mapping, write() isn't necessarily atomic even at the level of individual pages of memory. A reader using mapped memory may look at a page while the kernel's write() is only part way through copying bytes into it.

(This may happen even with read() and write(), since they may both access the same page of data from the file in the kernel's buffer cache, but it is probably easier to lock things there.)

On top of the performance issues, there are fairness issues. If write() is atomic against read(), a long write() or a long read() can stall the other side for potentially significant amounts of time. People do not enjoy slow and delayed read() and write() operations. This also provides a handy way to DoS writers of files that you can open for reading; simply set up to read() the entire file in one go (or in as few read()s as possible) over and over again.

However, much of this cost exists because we want cross process atomic write()s, which means that the kernel must be the one doing the locking work. Cross thread atomic write() can be implemented entirely at user level within a single process (provided that the C library intercepts read() and write() operations when threading is active). In a lot of cases you can get away with some sort of simple whole file locking, although the database people will probably not be happy with you. Fairness and stalls are also much less of an issue within a single process, because the only person you're hurting is yourself.

(Most programs do not read() and write() from the same file at the same time in two threads.)
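As a sketch of how little machinery the within-process version needs, here is one way a C library could wrap things with a single reader-writer lock. A real implementation would presumably want a lock per open file rather than one global lock, and the wrapper names are mine:

    /* A sketch of cross-thread atomic read()/write() done purely
       in user space. Writers exclude everyone; readers can run
       together, which matches the all-or-none requirement. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_rwlock_t file_lock = PTHREAD_RWLOCK_INITIALIZER;

    ssize_t atomic_write(int fd, const void *buf, size_t len) {
        pthread_rwlock_wrlock(&file_lock);  /* exclude readers and writers */
        ssize_t n = write(fd, buf, len);
        pthread_rwlock_unlock(&file_lock);
        return n;
    }

    ssize_t atomic_read(int fd, void *buf, size_t len) {
        pthread_rwlock_rdlock(&file_lock);  /* readers may share the lock */
        ssize_t n = read(fd, buf, len);
        pthread_rwlock_unlock(&file_lock);
        return n;
    }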

PS: Note that even writes to pipes and FIFOs are only atomic if they are small enough; large writes explicitly don't have to be atomic (and generally aren't on real Unixes). It would be rather unusual for POSIX to specify limited size atomicity for pipes and unlimited size atomicity for regular files.
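The 'small enough' limit is PIPE_BUF bytes, which POSIX requires to be at least 512; you can ask what it actually is for a given pipe with fpathconf():

    /* Print the atomic-write limit for a pipe. POSIX guarantees
       that a write() of at most PIPE_BUF bytes to a pipe or FIFO
       is atomic, and that PIPE_BUF is at least 512. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) != 0)
            return 1;
        printf("PIPE_BUF here: %ld\n", (long)fpathconf(fds[0], _PC_PIPE_BUF));
        return 0;
    }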

PPS: I would be wary of assuming that any particular Unix actually fully implemented atomic read() and write() between threads. Perhaps I'm being cynical, but I would test it first; it seems like the kind of picky POSIX requirement that people would cut out in the name of simplicity and speed.

unix/WriteNotVeryAtomic written at 23:40:15

Unix doesn't normally do short write()s to files and no one expects it to

A famous issue in handling network IO on Unix is that write() may not send all of your data; you will try to write() 16 KB of data, and the result will tell you that you only actually wrote 4 KB. Failure to handle this case leads to mysteriously lost data, where your sending program thinks it sent all 16 KB but of course the receiver only saw 4 KB. It's very common for people writing network IO libraries on Unix to provide a 'WriteAll' or 'SendAll' operation, or sometimes make it the default behavior.

(Go's standard Write() interface requires full writes unless there was an error, for example.)
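For concreteness, here is roughly what such a helper boils down to in C; this is a sketch, the function name is mine, and it also retries write()s interrupted by signals (EINTR):

    /* The classic 'write all of it' retry loop that network IO
       code needs on Unix. This is what libraries' WriteAll or
       SendAll helpers boil down to. */
    #include <errno.h>
    #include <unistd.h>

    ssize_t write_all(int fd, const void *buf, size_t len) {
        const char *p = buf;
        size_t left = len;
        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;    /* interrupted before writing: retry */
                return -1;       /* real error; earlier bytes may be gone */
            }
            p += n;              /* short write: advance and go again */
            left -= (size_t)n;
        }
        return (ssize_t)len;
    }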

In theory the POSIX specification for write() allows it to perform short writes on anything (without an error), not just network sockets, pipes, and FIFOs. In particular it is allowed to do them for regular files, and POSIX even documents some situations where this may happen (for example, if the process received a signal part way through the write() call). In practice, Unixes do not normally do short write()s to files without an error occurring, outside of the special case of a write() being interrupted by a signal that doesn't kill the process outright.

(If the process dies on the spot, there is no write() return value.)

In theory, because it's possible, every Unix program that write()s to a file should be prepared to handle short writes. In practice, since it doesn't really happen, many Unix programs are almost certainly not prepared to handle it. If you (and they) are lucky, these programs check that the return value of write() matches the amount of data they asked it to write and error out otherwise. Otherwise, they may ignore the write() return value and cheerfully sail on with data lost. Of course they don't actually error out or lose data in practice, because short write()s don't really happen on files.
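The lucky case is simple enough to sketch; the function name here is mine:

    /* The 'check and error out' pattern: treat anything other
       than a full write as a failure instead of sailing on. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void write_or_die(int fd, const void *buf, size_t len) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            perror("write");
            exit(1);
        }
        if ((size_t)n != len) {
            fprintf(stderr, "short write: %zd of %zu bytes\n", n, len);
            exit(1);
        }
    }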

(Some sorts of programs are generally going to be okay because they are already very careful about data loss. I would expect any good editor to be fine, for example, or at least to report an error.)

This difference between theory and practice means that it would be pretty dangerous to introduce a Unix environment that did routinely have short writes to files (whether it was a new Unix kernel or, say, a peculiar filesystem). This environment would be technically correct and it would be uncovering theoretical issues in programs, but it would probably not be useful.

PS: Enterprising parties could arrange to test this with their favorite programs through a loadable shared library that intercepts write() and shortens the write size. I suspect that you could get an interesting undergraduate Computer Science paper out of it.
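Here is a hedged sketch of what such an interposer could look like on systems with LD_PRELOAD; the 1024 byte cap is an arbitrary choice of mine, and it only shortens writes to regular files:

    /* shortwrite.c: an LD_PRELOAD interposer that artificially
       shortens large write()s to regular files, to see which
       programs cope.
       Build: cc -shared -fPIC -o shortwrite.so shortwrite.c
              (add -ldl on older glibc)
       Use:   LD_PRELOAD=./shortwrite.so someprogram */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/stat.h>
    #include <unistd.h>

    ssize_t write(int fd, const void *buf, size_t count) {
        static ssize_t (*real_write)(int, const void *, size_t);
        struct stat st;

        if (!real_write)
            real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");
        /* Only shorten writes to regular files, per the experiment. */
        if (count > 1024 && fstat(fd, &st) == 0 && S_ISREG(st.st_mode))
            count = 1024;
        return real_write(fd, buf, count);
    }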

unix/WritesNotShortOften written at 00:12:50

