2020-11-16
POSIX write() is not atomic in the way that you might like
I was recently reading Evan Jones' Durability: Linux File APIs. In this quite good article, I believe that Jones makes a misstep about what you can assume about write() (both in POSIX and in practice). I'll start with a quote from the article:
The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an addition note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.
Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. [...]
Unfortunately, that writes are atomic in general is not what POSIX is saying and even if POSIX tried to say it, it's extremely likely that no Unix system would actually comply and deliver fully atomic writes. First off, POSIX's explicit statements about atomicity apply only in two situations: when anything is writing to a pipe or a FIFO, or when there are multiple threads in the same process all performing operations. What POSIX says about writes interleaved with reads is much more limited, so let me quote it (emphasis mine):
After a write() to a regular file has successfully returned:
- Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
This does not require any specific behavior for read()s on files that are started by another process before the write() returns (including ones started before the write() began). If you issue such a read(), POSIX allows it to see none, some, or all of the data from the write(). Such a read() is only (theoretically) atomic if you issue it from another thread within the same process.
This definitely doesn't provide the usual atomicity property that everyone sees either all of an operation or none of it, since a cross process read() performed during the write() is allowed to see partial results. We would not call a SQL database that allowed you to see partially complete transactions 'atomic', but that is what POSIX allows for write() to files.
(It is also what real Unixes almost certainly provide in practice, although I haven't tested this and there are many situations to consider. For instance, I wouldn't be surprised if aligned, page-sized writes (or filesystem block sized ones) were atomic in practice on many Unixes.)
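As a demonstration of the sort of thing POSIX permits, here is a minimal sketch (not a rigorous test, and not Jones' code) of a cross-process reader racing a large write(); the file name, buffer size, and fill bytes are arbitrary choices of mine, and you may need to run it many times before the reader happens to catch the writer mid-copy:

    /* Sketch: can another process's read() see a partially complete write()?
     * POSIX allows it to.  Sizes and names here are arbitrary; this is a
     * demonstration of the race, not a reliable reproducer. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFSIZE (64 * 1024 * 1024)   /* 64 MB, so the write takes a while */

    int main(void) {
        char *oldbuf = malloc(BUFSIZE), *newbuf = malloc(BUFSIZE);
        memset(oldbuf, 'o', BUFSIZE);
        memset(newbuf, 'n', BUFSIZE);

        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        write(fd, oldbuf, BUFSIZE);      /* establish the old contents */

        if (fork() == 0) {
            /* Child: read the whole file and look for a mix of old and
               new bytes, which would be a non-atomic (partial) view. */
            char *rbuf = malloc(BUFSIZE);
            int rfd = open("testfile", O_RDONLY);
            ssize_t got = read(rfd, rbuf, BUFSIZE);
            int saw_old = 0, saw_new = 0;
            for (ssize_t i = 0; i < got; i++) {
                if (rbuf[i] == 'o') saw_old = 1;
                if (rbuf[i] == 'n') saw_new = 1;
            }
            if (saw_old && saw_new)
                printf("reader saw a partially complete write()\n");
            _exit(0);
        }

        /* Parent: overwrite the whole file in a single large write(). */
        lseek(fd, 0, SEEK_SET);
        write(fd, newbuf, BUFSIZE);
        wait(NULL);
        return 0;
    }

Whether the reader ever actually sees a mix of old and new bytes depends on the kernel, the filesystem, and timing, which is the point: nothing in POSIX forbids it.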
If we think about what it would take to implement atomic file writes across processes, this should be unsurprising. Since Unix programs don't expect short writes on files, we can't make the problem simpler by limiting how large a write we have to make atomic and then capping write() to that size; people can ask us to write megabytes or even gigabytes in a single write() call and that would have to be atomic. This is too much data to be handled by gathering it into an internal kernel buffer and then flipping the visible state of that section of the file in one action. Instead this would likely require byte range locking, where write() and read() lock against each other where their ranges overlap. This is already a lot of locking activity, since every write() and every read() would have to participate.
(You could optimize read() from a file that no one has open for writing, but then it would still need to lock the file so that it can't be opened for writing until the read() completes.)
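For a sense of what byte range locking looks like when done at user level today, here is a sketch using POSIX advisory record locks (fcntl() with F_SETLKW). This is purely cooperative: it only protects readers that take a matching F_RDLCK on the same range, and the kernel does nothing like it for plain write() and read() on its own. The helper name is mine.

    /* Sketch: cooperative byte range locking with fcntl() advisory record
     * locks.  Readers must take a matching F_RDLCK on the same range for
     * this to protect them; the kernel does not do this automatically for
     * write()/read() on regular files. */
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off) {
        struct flock lk = {
            .l_type = F_WRLCK,            /* exclusive lock for writing */
            .l_whence = SEEK_SET,
            .l_start = off,
            .l_len = (off_t)len,
        };
        if (fcntl(fd, F_SETLKW, &lk) == -1)   /* wait until we own the range */
            return -1;
        ssize_t n = pwrite(fd, buf, len, off);
        lk.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lk);              /* release the range */
        return n;
    }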
But merely locking against read() is not good enough on modern Unixes, because many programs actually read data by mmap()'ing files. If you really want write() to be usefully atomic, you must make these memory mapped reads lock against write() as well, which requires relatively expensive page table manipulation. Worrying about mmap() also exposes a related issue, which is that when people read through memory mapping, write() isn't necessarily atomic even at the level of individual pages of memory. A reader using mapped memory may see a page that's half-way through the kernel's write() copying bytes into it.
(This may happen even with read() and write(), since they may both access the same page of data from the file in the kernel's buffer cache, but it is probably easier to lock things there.)
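To illustrate why mmap() is the awkward case, here is a sketch of a reader that never calls read() at all (the file name is an arbitrary choice); bytes become visible to it as soon as the kernel copies them into the page cache, and there is no system call for a hypothetical write() lock to intercept:

    /* Sketch: a reader that looks at the file purely through mmap().
     * Bytes appear in our address space as the kernel copies them into
     * the page cache; there is no read() call to lock against. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("testfile", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        /* Poll the first and last bytes; a concurrent large write() can
           leave them disagreeing for a noticeable amount of time. */
        for (;;) {
            if (data[0] != data[st.st_size - 1])
                printf("torn view: first=%c last=%c\n",
                       data[0], data[st.st_size - 1]);
            sleep(1);
        }
    }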
On top of the performance issues, there are fairness issues. If write() is atomic against read(), a long write() or a long read() can stall the other side for potentially significant amounts of time. People do not enjoy slow and delayed read() and write() operations. This also provides a handy way to DoS writers of files that you can open for reading; simply set up to read() the entire file in one go (or in as few read()s as possible) over and over again.
However, many of these costs arise because we want cross process atomic write()s, which means that the kernel must be the one doing the locking work. Cross thread atomic write() can be implemented entirely at user level within a single process (provided that the C library intercepts read() and write() operations when threading is active). In a lot of cases you can get away with some sort of simple whole file locking, although the database people will probably not be happy with you. Fairness and stalls are also much less of an issue within a single process, because the only person you're hurting is yourself.
(Most programs do not read() and write() from the same file at the same time in two threads.)
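Here is a sketch of what simple whole file locking within one process might look like, assuming (a big assumption) that every thread funnels its file IO through these hypothetical wrappers; a real C library interception would also have to cover pread(), pwrite(), readv(), and all the other entry points:

    /* Sketch: whole-file locking inside a single process, using a pthread
     * read-write lock.  This only works if every thread goes through these
     * wrappers, and it does nothing for other processes. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_rwlock_t file_lock = PTHREAD_RWLOCK_INITIALIZER;

    ssize_t atomic_write(int fd, const void *buf, size_t len) {
        pthread_rwlock_wrlock(&file_lock);   /* exclude all readers and writers */
        ssize_t n = write(fd, buf, len);
        pthread_rwlock_unlock(&file_lock);
        return n;
    }

    ssize_t atomic_read(int fd, void *buf, size_t len) {
        pthread_rwlock_rdlock(&file_lock);   /* readers may run concurrently */
        ssize_t n = read(fd, buf, len);
        pthread_rwlock_unlock(&file_lock);
        return n;
    }

A single global lock is as crude as it gets; per-file locks or byte range locks are the obvious refinements, which is roughly where the database people start asking for more.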
PS: Note that even writes to pipes and FIFOs are only atomic if they are small enough; large writes explicitly don't have to be atomic (and generally aren't on real Unixes). It would be rather unusual for POSIX to specify limited size atomicity for pipes and unlimited size atomicity for regular files.
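The 'small enough' limit here is PIPE_BUF, which POSIX only requires to be at least 512 bytes (Linux uses 4096). A trivial program to see what your system reports:

    /* Print the atomic-write limit for pipes on this system. */
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        pipe(fds);
        printf("PIPE_BUF: %d\n", PIPE_BUF);
        printf("fpathconf(_PC_PIPE_BUF): %ld\n", fpathconf(fds[0], _PC_PIPE_BUF));
        return 0;
    }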
PPS: I would be wary of assuming that any particular Unix actually fully implements atomic read() and write() between threads. Perhaps I'm being cynical, but I would test it first; it seems like the kind of picky POSIX requirement that people would cut out in the name of simplicity and speed.
Unix doesn't normally do short write()s to files and no one expects it to
A famous issue in handling network IO on Unix is that write() may not send all of your data; you will try to write() 16 KB of data, and the result will tell you that you only actually wrote 4 KB. Failure to handle this case leads to mysteriously lost data, where your sending program thinks it sent all 16 KB but of course the receiver only saw 4 KB. It's very common for people writing network IO libraries on Unix to provide a 'WriteAll' or 'SendAll' operation, or sometimes make it the default behavior.
(Go's standard Write() interface requires full writes unless there was an error, for example.)
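For illustration, here is roughly what such a helper looks like in C; the name write_all is mine, and this is a sketch rather than anyone's production code:

    /* Sketch of the usual 'write everything' helper: keep calling write()
     * until all bytes are accepted or a real error occurs. */
    #include <errno.h>
    #include <unistd.h>

    ssize_t write_all(int fd, const void *buf, size_t len) {
        const char *p = buf;
        size_t left = len;
        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;     /* interrupted before anything was written */
                return -1;        /* a real error */
            }
            p += n;               /* short write: advance past what was taken */
            left -= n;
        }
        return (ssize_t)len;
    }

Retrying on EINTR here also covers the signal interruption case that comes up below for regular files.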
In theory the POSIX specification for write() allows it to perform short writes on anything (without an error), not just network sockets, pipes, and FIFOs. In particular it is allowed to do them for regular files, and POSIX even documents some situations where this may happen (for example, if the process received a signal part way through the write() call). In practice, Unixes do not normally do short write()s to files without an error occurring, outside of the special case of a write() being interrupted by a signal that doesn't kill the process outright.
(If the process dies on the spot, there is no write() return value.)
In theory, because it's possible, every Unix program that write()s to a file should be prepared to handle short writes. In practice, since it doesn't really happen, many Unix programs are almost certainly not prepared to handle it. If you (and they) are lucky, these programs check that the return value of the write() is the amount of data they wrote and error out otherwise. Otherwise, they may ignore the write() return value and cheerfully sail on with data lost. Of course they don't actually error out or lose data in practice, because short write()s don't really happen on files.
(Some sorts of programs are generally going to be okay because they are already very careful about data loss. I would expect any good editor to be fine, for example, or at least to report an error.)
This difference between theory and practice means that it would be pretty dangerous to introduce a Unix environment that did routinely have short writes to files (whether it was a new Unix kernel or, say, a peculiar filesystem). This environment would be technically correct and it would be uncovering theoretical issues in programs, but it would probably not be useful.
PS: Enterprising parties could arrange to test this with their favorite programs through a loadable shared library that intercepts write() and shortens the write size. I suspect that you could get an interesting undergraduate Computer Science paper out of it.
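On Linux, such an interposing library is only a few lines with LD_PRELOAD; here is a sketch, where the 4096 byte cap is an arbitrary choice of mine:

    /* Sketch of an LD_PRELOAD library that artificially shortens write()s,
     * to see which programs cope.  Build with something like:
     *   cc -shared -fPIC -o shortwrite.so shortwrite.c -ldl
     * and run a program as:
     *   LD_PRELOAD=./shortwrite.so someprogram */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    ssize_t write(int fd, const void *buf, size_t count) {
        /* Look up the real write() the first time we are called. */
        static ssize_t (*real_write)(int, const void *, size_t);
        if (real_write == NULL)
            real_write = dlsym(RTLD_NEXT, "write");

        /* Never pass more than 4096 bytes through in one call. */
        if (count > 4096)
            count = 4096;
        return real_write(fd, buf, count);
    }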