Disk write buffering and its interactions with write flushes

March 17, 2024

Pretty much every modern system defaults to having data you write to filesystems be buffered by the operating system and only written out asynchronously or when you specifically request that it be flushed to disk, which leaves you with general questions about how much write buffering you want. Now suppose, not hypothetically, that you're doing write IO that is pretty much always going to be explicitly flushed to disk (with fsync() or the equivalent) before the programs doing it consider the write IO 'done'. You might be in this situation if you're writing and rewriting mail folders, or if the dominant write source is updating a write ahead log.
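
As a concrete sketch of this pattern, here is a minimal write-ahead-log style append in C, where a record only counts as 'done' once fsync() has returned. The file name and record contents are made up for illustration:

  /* Append a record and only report success once it is on disk.
     Error handling is abbreviated for the sketch. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int append_record(int fd, const void *rec, size_t len)
  {
      ssize_t n = write(fd, rec, len);
      if (n < 0 || (size_t)n != len)
          return -1;
      /* The write is only 'done' once the flush succeeds. */
      if (fsync(fd) != 0)
          return -1;
      return 0;
  }

  int main(void)
  {
      /* 'journal.log' is an example name, not anything standard. */
      int fd = open("journal.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
      if (fd < 0)
          return 1;
      const char rec[] = "transaction 1 committed\n";
      return append_record(fd, rec, strlen(rec)) == 0 ? 0 : 1;
  }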

In this situation, where the data being written is almost always going to be flushed to disk, I believe the tradeoffs are a bit different than in the general write case. Broadly, you can never actually write at a rate faster than the write rate of the underlying storage, since in the end you have to wait for your write data to actually get to disk before you can proceed. I think this means that you want the OS to start writing out data to disk almost immediately as your process writes it; delaying the writeout will only cost more time in the long run, unless for some reason the OS can write data out faster when you ask for the flush than it could before then. In theory and in isolation, you may want these writes to be asynchronous (up until the process asks for the disk flush, at which point you have to synchronously wait for them), because the process may be able to generate data faster if it's not stalling waiting for individual writes to make it to disk.

(In OS tuning jargon, we'd say that you want writeback to start almost immediately.)
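
On Linux, a program can also nudge this along for itself: sync_file_range() with SYNC_FILE_RANGE_WRITE starts asynchronous writeback of a range without waiting for it. This is a minimal hedged sketch, not a recommendation; durability still comes from a later fsync():

  /* Start writeback of freshly written data immediately, without
     waiting for it.  Linux-specific; gives no durability guarantee. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  ssize_t write_and_start_writeback(int fd, const void *buf, size_t len,
                                    off_t offset)
  {
      ssize_t n = pwrite(fd, buf, len, offset);
      if (n > 0)
          sync_file_range(fd, offset, n, SYNC_FILE_RANGE_WRITE);
      return n;
  }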

However, journaling filesystems and concurrency add some extra complications. Many journaling filesystems have the journal as a central synchronization point, where only one disk flush can be in progress at once and if several processes ask for disk flushes at more or less the same time they can't proceed independently. If you have multiple processes all doing write IO that they will eventually flush and you want to minimize the latency that processes experience, you have a potential problem if different processes write different amounts of IO. A process that asynchronously writes a lot of IO and then flushes it to disk will obviously have a potentially long flush, and this flush will delay the flushes done by other processes writing less data, because everything is running through the chokepoint that is the filesystem's journal.

In this situation I think you want the process that's writing a lot of data to be forced to delay, to turn its potentially asynchronous writes into more synchronous ones that are restricted to the true disk write data rate. This avoids having a large overhang of pending writes when it finally flushes, which hopefully avoids other processes getting stuck with a big delay as they try to flush. Although it might be ideal if processes with less write volume could write asynchronously, I think it's probably okay if all of them are forced down to relatively synchronous writes with all processes getting an equal fair share of the disk write bandwidth. Even in this situation the processes with less data to write and flush will finish faster, lowering their latency.

To translate this to typical system settings, I believe that you want to aggressively trigger disk writeback and perhaps deliberately restrict the total amount of buffered writes that the system can have. Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.

(This is in contrast to typical operating system settings, which will often allow you to use a relatively large amount of system RAM for asynchronous writes and not aggressively start writeback. This especially would make a difference on systems with a lot of RAM.)
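
On Linux, the usual knobs for this are the vm.dirty_background_bytes and vm.dirty_bytes sysctls (or their *_ratio equivalents), which control when background writeback starts and when writers are forced to wait. A process that knows it will flush can also impose this discipline on itself; here is a hedged sketch with a made-up 8 MB threshold, assuming the file is written sequentially:

  /* Bound our own overhang of buffered writes: after roughly
     OVERHANG_LIMIT bytes, write that range out and wait for it before
     continuing.  The threshold is an arbitrary illustrative value. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define OVERHANG_LIMIT (8 * 1024 * 1024)

  ssize_t bounded_write(int fd, const void *buf, size_t len,
                        off_t *dirty_start, size_t *dirty_bytes)
  {
      ssize_t n = write(fd, buf, len);
      if (n <= 0)
          return n;
      *dirty_bytes += (size_t)n;
      if (*dirty_bytes >= OVERHANG_LIMIT) {
          /* Turn our buffered writes into effectively synchronous
             ones, restricted to the real disk write rate. */
          sync_file_range(fd, *dirty_start, *dirty_bytes,
                          SYNC_FILE_RANGE_WAIT_BEFORE |
                          SYNC_FILE_RANGE_WRITE |
                          SYNC_FILE_RANGE_WAIT_AFTER);
          *dirty_start += *dirty_bytes;
          *dirty_bytes = 0;
      }
      return n;
  }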


Comments on this page:

By George Spelvin at 2024-03-18 10:31:56:

It seems to me that there's a simpler solution: When fsync()ing, empty the buffers before synchronizing with other journal writes.

Just to state explicitly what's implicit in what you wrote, there's a difference between the data being on disk, and the associated metadata being written. The second part is the "commit" which makes the write durable.

All file systems have this distinction, but journaling file systems make commits global, so you have more interference between writers.

Writing n blocks of data takes O(n) time, while the metadata commit is, if not quite O(log n), at least o(n). Large commits aren't themselves a prospect to be feared.

Keeping an overhang in RAM is useful if we have enough buffer space to absorb the write and we won't be synchronizing the write, so we can move on while the OS completes the writes asynchronously.

Given modern RAM sizes, the former threshold is quite generous, but we still need heuristics. It's annoying when one massive writer eats all the available RAM, stalling a lot of other smaller writers which could otherwise have proceeded asynchronously.

But I don't see why we need to make heuristic guesses at the second.

Rather, divide fsync() operations into two phases:

  1. Writing out the data
  2. Committing the metadata

The important part of this idea is that phase 1 does not block journal commits. Multiple other writers may force a journal commit while this lengthy preliminary is in progress. Only once it's on disk do we need to proceed to the associated global journal commit, which requires synchronization with other writers, but is never huge.

Rather than the awkward heuristic of saying "I suspect this process will want to sync its writes, so let's minimize RAM buffering", you wait until you have an fsync() call which tells you unambiguously. But then you flush the buffers without blocking other syncs, just like you would have done had your heuristic triggered on the initial write() call, until the final o(n) metadata update.
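
A userspace approximation of this phase split is possible on Linux (this is an illustrative sketch, not the in-kernel mechanism being proposed): push the file's data out with sync_file_range(), which does not itself force a journal commit, and only then issue the fdatasync() that does:

  /* Phase 1: write the data out and wait for it, without committing
     metadata or involving the filesystem journal.
     Phase 2: the (comparatively small) durable commit. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  int two_phase_flush(int fd, off_t offset, off_t nbytes)
  {
      if (sync_file_range(fd, offset, nbytes,
                          SYNC_FILE_RANGE_WAIT_BEFORE |
                          SYNC_FILE_RANGE_WRITE |
                          SYNC_FILE_RANGE_WAIT_AFTER) != 0)
          return -1;
      return fdatasync(fd);
  }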

By cks at 2024-03-18 16:39:25:

I don't think there's any fundamental obstacle to a filesystem making it so that committing the journal isn't a choke point. But at the same time I don't think very many do it, and I think it's probably easier to implement it as basically a single-threaded process. If you implement journal commit as a concurrent process you need to carefully keep various things separate even if they'd normally be mingled together (for example, allocating new space for new data blocks).

As it turns out, fsync(2) in Linux does work the way George Spelvin has suggested. We first issue writes to the data blocks of the file being fsync'ed, and only then do we commit the metadata. So while the first phase is going on, other writers who are also calling fsync(2) won't have their commits blocked... so long as all of the writes involved are to blocks that have already been allocated, so that we are overwriting existing data blocks. This is often the case in the database world, for example.

Unfortunately things get a bit more complicated when block allocations are involved, since these necessarily require metadata changes --- and most of the time, we don't want the previous contents of data blocks to be unmasked if the system crashes. So in ext4's data=ordered mode, suppose you have a large, freshly created file --- say, an image of a 4GB DVD rip --- which is being fsync'ed. While we are writing out the data blocks for that large file, we need to allocate data blocks for it, and these allocations require making global changes to the file system metadata. Now suppose someone writes a small file and calls fsync(2) on that small file. When we commit the metadata blocks, which include the block allocations for both the large file and the small file, we need to make sure all of the data blocks that have been assigned to the large file are written out first, so that on a crash we don't risk exposing stale data from a previous file to an unauthorized user. And this is where the entanglement comes from.

Now, if you don't care about stale data being potentially exposed after a crash, you can just mount the file system in data=writeback mode, which avoids the entanglement, and so long as your system never crashes or you don't care about exposing stale data, you're golden.

If you do care about this, then it is still soluble, but it requires a lot more complexity in the file system. What you need to do is allocate blocks for the new file, but (a) not actually make the changes in the file system's global metadata, and (b) make sure that, despite the fact that you haven't made the metadata changes, the blocks that have been allocated for file A also won't be allocated for file B. That is, you need to have in-memory state to reserve blocks and to provisionally assign various physical block ranges to an inode's logical block ranges --- but in a way that doesn't involve storing this information in the file system's metadata.
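
To make the 'writes to already-allocated blocks' case from earlier in this comment concrete, here is a hedged sketch of a common database/WAL setup step: pay for the allocation once up front by writing real zeroes through the file and flushing, so that the hot-path writes afterwards are pure overwrites. The file name and 64 MB size are made up:

  /* One-time setup: create a log file whose blocks are all allocated
     and initialized, so later rewrites plus fdatasync() don't need to
     commit new block allocations. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  #define LOG_SIZE (64 * 1024 * 1024)

  int create_preallocated_log(const char *path)
  {
      int fd = open(path, O_RDWR | O_CREAT, 0644);
      if (fd < 0)
          return -1;

      char zeros[65536];
      memset(zeros, 0, sizeof(zeros));
      for (off_t off = 0; off < LOG_SIZE; off += sizeof(zeros)) {
          if (pwrite(fd, zeros, sizeof(zeros), off)
              != (ssize_t)sizeof(zeros)) {
              close(fd);
              return -1;
          }
      }
      /* Flush both the data and the allocation metadata now, while
         nothing latency-sensitive is waiting on the journal. */
      if (fsync(fd) != 0) {
          close(fd);
          return -1;
      }
      return fd;
  }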

By George Spelvin at 2024-03-22 20:16:34:

Thanks, Ted, for the technical details.

I should mention, however, that there's an easier hack that works well specifically in the newly-created-file case you mention: allocate the blocks, but don't increase the on-disk i_size. And have the file system know that the data after i_size must be zeroed before use.

This lets you write out the metadata without writing out the data, and without significant new data structures either.

You do need some new code to keep the in-memory st_size and on-disk i_size separate, with the latter being the offset of the first unwritten byte.

It's limited to appending to a file and not e.g. filling holes in a sparse file, but as you say, that's definitely the common case.

Most filesystems use a metadata log, some log metadata+data, some are themselves a metadata+data log, and BSD UFS uses "soft updates", which is a very careful ordering of updates that never leaves metadata inconsistent or inconsistent with the data. The problem is that soft updates are very difficult to do right and very difficult to change, so most other filesystems either journal or are log based.

https://www.mckusick.com/softdep/ http://www.sabi.co.uk/blog/12-two.html?120222#120222

There is an outline of the history of this here:

https://lists.lugod.org/presentations/filesystems.pdf

There is lots more about this in various posts by several authors about the "O_PONIES" discussion.

By Walex at 2024-03-24 14:33:31:

“To translate this to typical system settings, I believe that you want to aggressively trigger disk writeback and perhaps deliberately restrict the total amount of buffered writes that the system can have. Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately”

My rule is to allow outstanding writes to be no more than 1-2 seconds' worth of the applicable IO rate.

https://www.sabi.co.uk/blog/05-4th.html?051105#051105 https://www.sabi.co.uk/blog/0707jul.html?070701#070701 https://www.sabi.co.uk/blog/14-two.html?141010#141010 https://www.sabi.co.uk/blog/16-one.html?160114#160114
