Why true asynchronous file IO is hard, at least for reads

September 4, 2011

Although I believe that this may now have changed, Linux for a long time didn't fully support asynchronous file read IO although it supported some sorts of other async IO. One reason for this is that doing genuine fully asynchronous file IO is a lot more complicated than it looks.

In theory you might think that this is an easy feature to add to a filesystem. After all, the kernel can already do asynchronous reads of a data block; it's just that the normal path for read() issues the asynchronous read and then immediately waits for it to complete. So it looks like you should basically be able to add a flag that propagates down through that code to say 'don't wait, signal here when the read completes' and be mostly done (you also need to translate the notification to user space and so on).

In practice the problem is not the data block read itself, but getting to the point where you can issue the data block read. In order to know where to read the data block from, you need to consult various bits of file metadata, like indirect blocks. Which may mean reading them off disk. In traditional filesystem implementations (including ext2 and its descendents), this was implemented with synchronous IO because that's by far the simplest way to write the code. Revising all of this code to work asynchronously is a lot of work and so didn't get done for a while.

(Async IO implemented with threads (well, faked really) doesn't have this problem because the thread wraps all of that up behind your back. Thread based async IO has the problem that now you need a lot of threads and supporting infrastructure.)

By the way, I believe that similar issues came up with write IO. In a sense write IO is easier because the kernel already does a lot of it asynchronously. However, in order to figure out where to place a new data block the kernel may need to do some metadata IO (as I've vividly seen before), and this was historically often done synchronously. Changing filesystems to what I believe is called delayed write allocation also turned up all sorts of interesting corner cases; for example, you need to track filesystem free space synchronously because otherwise you can accept a write that you'll later discover you have no space for.

(Let's not talk about per-user disk space quotas.)

Written on 04 September 2011.
« How some Unixes did shared libraries in the old days
The real reason why true asynchronous file IO is hard »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Sep 4 23:49:16 2011
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.