POSIX write() is not atomic in the way that you might like
I was recently reading Evan Jones' Durability: Linux File APIs. In this quite good article, I believe that Jones makes a misstep about what you can assume about write() (both in POSIX and in practice). I'll start with a quote from the article:
The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an addition note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.
Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. [...]
Unfortunately, that writes are atomic in general is not what POSIX is saying and even if POSIX tried to say it, it's extremely likely that no Unix system would actually comply and deliver fully atomic writes. First off, POSIX's explicit statements about atomicity apply only in two situations: when anything is writing to a pipe or a FIFO, or when there are multiple threads in the same process all performing operations. What POSIX says about writes interleaved with reads is much more limited, so let me quote it (emphasis mine):
After a write() to a regular file has successfully returned:
- Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
This does not require any specific behavior for read()s on files that are started by another process before the write() returns (including ones started before the write() began). If you issue such a read(), POSIX allows it to see none, some, or all of the data from the write(). Such a read() is only (theoretically) atomic if you issue it from another thread within the same process.
This definitely doesn't provide the usual atomicity property that everyone sees either all of an operation or none of it, since a cross process read() performed during the write() is allowed to see partial results. We would not call a SQL database that allowed you to see partially complete transactions 'atomic', but that is what POSIX allows for write() to files.
(It is also what real Unixes almost certainly provide in practice, although I haven't tested this and there are many situations. For instance, I wouldn't be surprised if aligned, page-sized writes (or filesystem block sized ones) were atomic in practice on many Unixes.)
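One way to test this on a particular system is to race a writer process against a reader and count how many reads return a mix of two writes. The following is only a minimal sketch of such an experiment: the helper names (is_torn, count_torn_reads), the 1 MiB block size, and the alternating A/B patterns are all assumptions made for illustration, and the outcome will depend on the kernel, the filesystem, and the write size.

```python
import os
import tempfile
import time

BLOCK = 1 << 20  # 1 MiB per write(); an arbitrary size picked for this sketch


def is_torn(data: bytes) -> bool:
    """True if the buffer mixes bytes from two different single-byte patterns."""
    return bool(data) and data.count(data[:1]) != len(data)


def count_torn_reads(path: str, seconds: float) -> tuple[int, int]:
    """Fork a writer that alternates all-A and all-B blocks at offset 0 while
    the parent keeps re-reading that block. Returns (torn reads, total reads)."""
    pid = os.fork()
    if pid == 0:  # child: the writer process
        try:
            fd = os.open(path, os.O_WRONLY)
            patterns = (b"A" * BLOCK, b"B" * BLOCK)
            deadline = time.monotonic() + seconds
            i = 0
            while time.monotonic() < deadline:
                os.pwrite(fd, patterns[i % 2], 0)
                i += 1
        finally:
            os._exit(0)  # never fall back into the parent's code

    fd = os.open(path, os.O_RDONLY)
    torn = total = 0
    try:
        # Loop until the writer exits; each pread() races with a pwrite().
        while os.waitpid(pid, os.WNOHANG) == (0, 0):
            data = os.pread(fd, BLOCK, 0)
            total += 1
            if is_torn(data):
                torn += 1
    finally:
        os.close(fd)
    return torn, total


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"A" * BLOCK)  # pre-fill so early reads see a full block
        f.flush()
        torn, total = count_torn_reads(f.name, seconds=0.5)
        print(f"{torn} torn reads out of {total}")
```

A count of zero here would not prove atomicity; it would only mean this particular race was not observed. Any nonzero count, on the other hand, is a direct demonstration of a cross process read() seeing part of a write().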
If we think about what it would take to implement atomic file writes across processes, this should be unsurprising. Since Unix programs don't expect short writes on files, we can't make the problem simpler by limiting how large a write we have to make atomic and then capping write() to that size; people can ask us to write megabytes or even gigabytes in a single write() call and that would have to be atomic. This is too much data to be handled by gathering it into an internal kernel buffer and then flipping the visible state of that section of the file in one action. Instead this would likely require byte range locking, where write() and read() lock against each other where their ranges overlap. This is already a lot of locking activity, since every write() and every read() would have to participate.
(You could optimize read() from a file that no one has open for writing, but then it would still need to lock the file so that it can't be opened for writing until the read() completes.)
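To make the byte range locking idea concrete, here is a rough user-level imitation using POSIX advisory record locks through Python's fcntl.lockf(). This is a sketch and not what a kernel would do internally: the wrapper names (locked_pwrite, locked_pread) are made up for this example, and advisory locks only protect against other processes that also go through the same wrappers.

```python
import fcntl
import os
import tempfile


def locked_pwrite(fd: int, data: bytes, offset: int) -> int:
    """Write data at offset while holding an exclusive lock on just that range."""
    fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
    try:
        written = 0
        while written < len(data):  # keep going if the kernel writes short
            written += os.pwrite(fd, data[written:], offset + written)
        return written
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)


def locked_pread(fd: int, length: int, offset: int) -> bytes:
    """Read a range while holding a shared lock, so overlapping writers wait."""
    fcntl.lockf(fd, fcntl.LOCK_SH, length, offset, os.SEEK_SET)
    try:
        return os.pread(fd, length, offset)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()  # opened read/write, as both lock types need
    try:
        locked_pwrite(fd, b"hello world", 0)
        print(locked_pread(fd, 5, 6))
    finally:
        os.close(fd)
        os.unlink(path)
```

Even in this small form, every read and every write pays for two extra system calls. Also note that POSIX record locks belong to the process, so these wrappers do nothing to keep threads within one process from interleaving with each other.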
But merely locking against read() is not good enough on modern Unixes, because many programs actually read data by mmap()'ing files. If you really want write() to be usefully atomic, you must make these memory mapped reads lock against write() as well, which requires relatively expensive page table manipulation. Worrying about mmap() also exposes a related issue, which is that when people read through memory mapping, write() isn't necessarily atomic even at the level of individual pages of memory. A reader using mapped memory may see a page that's half-way through the kernel's write() copying bytes into it.
(This may happen even with read() and write(), since they may both access the same page of data from the file in the kernel's buffer cache, but it is probably easier to lock things there.)
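The reason mmap() readers are hard to lock against is that they never make a system call to read. The sketch below illustrates this under the assumption of a system with a unified page cache (such as Linux), where a shared mapping and write() operate on the same pages; the function name is invented for the example.

```python
import mmap
import os
import tempfile


def mapped_view_after_write(payload: bytes) -> bytes:
    """Map a file, change it with pwrite(), and return what the mapping shows."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)  # give the file one page of zeros
        f.flush()
        fd = f.fileno()
        view = mmap.mmap(fd, mmap.PAGESIZE,
                         flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        try:
            os.pwrite(fd, payload, 0)
            # Plain memory access: no read() call, so the kernel has no
            # natural point at which to take a lock on the reader's behalf.
            return view[:len(payload)]
        finally:
            view.close()


if __name__ == "__main__":
    print(mapped_view_after_write(b"hello"))
```

Here the write() has already returned before the mapping is examined. A mapped reader in another process that looks at the page while the copy is still in progress has nothing stopping it from seeing a partly updated page.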
On top of the performance issues, there are fairness issues. If write() is atomic against read(), a long write() or a long read() can stall the other side for potentially significant amounts of time. People do not enjoy slow and delayed read() and write() operations. This also provides a handy way to DoS writers of files that you can open for reading; simply set up to read() the entire file in one go (or as few as possible) over and over again.
However, many of these costs arise because we want cross process atomic write()s, which means that the kernel must be the one doing the locking work. Cross thread atomic write() can be implemented entirely at user level within a single process (provided that the C library intercepts read() and write() operations when threading is active). In a lot of cases you can get away with some sort of simple whole file locking, although the database people will probably not be happy with you. Fairness and stalls are also much less of an issue within a single process, because the only person you're hurting is yourself.
(Most programs do not read() and write() from the same file at the same time in two threads.)
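As an illustration of the simple whole file locking approach within one process, here is a minimal sketch of a wrapper that serializes reads and writes on a descriptor with a single mutex. The AtomicFile class and the stress() helper are hypothetical names for this example; a real C library would have to do this interception transparently and per open file.

```python
import os
import tempfile
import threading


class AtomicFile:
    """Serializes pread/pwrite on one descriptor with a single whole-file lock."""

    def __init__(self, fd: int) -> None:
        self._fd = fd
        self._lock = threading.Lock()

    def pwrite(self, data: bytes, offset: int) -> int:
        with self._lock:
            written = 0
            while written < len(data):  # finish short writes under the lock
                written += os.pwrite(self._fd, data[written:], offset + written)
            return written

    def pread(self, length: int, offset: int) -> bytes:
        with self._lock:
            return os.pread(self._fd, length, offset)


def stress(rounds: int = 200, size: int = 64 * 1024) -> int:
    """Race a writer thread against reads; return how many reads were torn."""
    fd, path = tempfile.mkstemp()
    try:
        f = AtomicFile(fd)
        f.pwrite(b"A" * size, 0)

        def writer() -> None:
            for i in range(rounds):
                f.pwrite((b"A", b"B")[i % 2] * size, 0)

        t = threading.Thread(target=writer)
        t.start()
        torn = 0
        for _ in range(rounds):
            data = f.pread(size, 0)
            if data.count(data[:1]) != len(data):  # mixed A and B bytes
                torn += 1
        t.join()
        return torn
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("torn reads:", stress())
```

Because every read and write goes through the same lock, a reader can only ever see the file before or after a complete write. The cost is that one long operation blocks every other thread using the file, which is the stall issue in miniature.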
PS: Note that even writes to pipes and FIFOs are only atomic if they are small enough; large writes explicitly don't have to be atomic (and generally aren't on real Unixes). It would be rather unusual for POSIX to specify limited size atomicity for pipes and unlimited size atomicity for regular files.
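The size limit for atomic pipe writes is PIPE_BUF, which a program can query for a specific pipe. A small sketch, with a helper name that is invented for this example:

```python
import os


def pipe_write_is_atomic(fd: int, nbytes: int) -> bool:
    """True if POSIX promises that a write of nbytes to this pipe or FIFO
    will not be interleaved with writes from other processes."""
    return nbytes <= os.fpathconf(fd, "PC_PIPE_BUF")


if __name__ == "__main__":
    r, w = os.pipe()
    try:
        print("PIPE_BUF for this pipe:", os.fpathconf(w, "PC_PIPE_BUF"))
        print("1 MiB write atomic?", pipe_write_is_atomic(w, 1 << 20))
    finally:
        os.close(r)
        os.close(w)
```

POSIX only requires PIPE_BUF to be at least 512 bytes, so portable code that depends on atomic pipe writes has to keep its messages quite small.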
PPS: I would be wary of assuming that any particular Unix actually fully implemented atomic read() and write() between threads. Perhaps I'm being cynical, but I would test it first; it seems like the kind of picky POSIX requirement that people would cut out in the name of simplicity and speed.