Why writes to disk generally wind up in your OS's disk read cache

February 3, 2025

Recently, someone was surprised to find out that ZFS puts disk writes in its version of a disk (read) cache, the ARC ('Adaptive Replacement Cache'). In fact this is quite common, as almost every operating system and filesystem puts ordinary writes to disk into their disk (read) cache. In thinking about the specific issue of the ZFS ARC and write data, I realized that there's a general broad reason for this and then a narrower technical one.

The broad reason that you'll most often hear about is that it's not uncommon for your system to read things back after you've written them to disk. It would be wasteful to have something in RAM, write it to disk, remove it from RAM, and then have to more or less immediately read it back from disk. If you're dealing with spinning HDDs, this is quite bad since HDDs can only do a relatively small number of IO operations a second; in this day of high performance, low latency NVMe SSDs, it might not be so terrible any more, but it still costs you something. Of course you have to worry about writes flooding the disk cache and evicting more useful data, but this is also an issue with certain sorts of reads.
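(To make the read-back cost concrete, here is a toy sketch in Python. Everything in it is made up for illustration: ToyDisk, ToyCache, and the cache_writes switch are hypothetical names, and no real OS cache is this simple. It just counts how many disk reads you need when written blocks are kept in a small LRU cache versus when they're dropped.)

```python
from collections import OrderedDict


class ToyDisk:
    """Hypothetical backing store that counts how often it is read."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0

    def read(self, blockno):
        self.reads += 1
        return self.blocks.get(blockno, b"")

    def write(self, blockno, data):
        self.blocks[blockno] = data


class ToyCache:
    """A tiny LRU cache; cache_writes picks the policy being compared."""

    def __init__(self, disk, capacity, cache_writes):
        self.disk = disk
        self.capacity = capacity
        self.cache_writes = cache_writes
        self.lru = OrderedDict()

    def _remember(self, blockno, data):
        self.lru[blockno] = data
        self.lru.move_to_end(blockno)
        while len(self.lru) > self.capacity:
            self.lru.popitem(last=False)  # evict the least recently used block

    def write(self, blockno, data):
        self.disk.write(blockno, data)
        if self.cache_writes:
            self._remember(blockno, data)
        else:
            self.lru.pop(blockno, None)  # drop any stale cached copy

    def read(self, blockno):
        if blockno in self.lru:
            self.lru.move_to_end(blockno)
            return self.lru[blockno]
        data = self.disk.read(blockno)
        self._remember(blockno, data)
        return data


def disk_reads_for(cache_writes):
    """Write four blocks, read them back, and report the disk reads needed."""
    disk = ToyDisk()
    cache = ToyCache(disk, capacity=8, cache_writes=cache_writes)
    for n in range(4):
        cache.write(n, b"data %d" % n)
    for n in range(4):
        assert cache.read(n) == b"data %d" % n
    return disk.reads


if __name__ == "__main__":
    print("writes cached:    ", disk_reads_for(True), "disk reads")
    print("writes not cached:", disk_reads_for(False), "disk reads")
```

(In this toy, every block you read back after writing it costs a disk read unless the write went into the cache. It also shows the flooding worry: with a small capacity, a burst of writes would push everything else out.)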

The narrower technical reason is dealing with issues that come up once you add write buffering to the picture. In practice a lot of ordinary writes to files aren't synchronously written out to disk on the spot; instead they're buffered in memory for some amount of time. This requires some pool of (OS) memory to hold these pending writes, which might as well be your regular disk (read) cache. Putting not yet written out data in the disk read cache also deals with the issue of coherence, where you want programs that are reading data to see the most recently written data even if it hasn't been flushed out to disk yet. Since reading data from the filesystem already looks in the disk cache, you'll automatically find the pending write data there (and you'll automatically replace an already cached version of the old data). If you put pending writes into a different pool of memory, you have to specifically manage it and tune its size, and you have to add extra code to potentially get data from it on reads.
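(Here is a minimal Python sketch of the coherence point, under the assumption of a single cache that also holds pending writes. UnifiedCache, its dirty set, and the plain dict standing in for the disk are all hypothetical; this is not how Linux or ZFS actually implement it.)

```python
class UnifiedCache:
    """Toy cache that holds both clean (read) data and pending writes."""

    def __init__(self, disk):
        self.disk = disk      # a plain dict standing in for the disk
        self.pages = {}       # blockno -> most recent data
        self.dirty = set()    # blocks not yet written out to disk

    def write(self, blockno, data):
        # A buffered write: replace any cached copy and mark it dirty.
        self.pages[blockno] = data
        self.dirty.add(blockno)

    def read(self, blockno):
        # Reads already look here first, so they see pending writes
        # without any extra lookup code.
        if blockno not in self.pages:
            self.pages[blockno] = self.disk[blockno]
        return self.pages[blockno]

    def flush(self):
        # Later writeback: push the dirty blocks out to the disk.
        for blockno in sorted(self.dirty):
            self.disk[blockno] = self.pages[blockno]
        self.dirty.clear()


def demo():
    disk = {0: b"old"}
    cache = UnifiedCache(disk)
    first = cache.read(0)             # caches the old data
    cache.write(0, b"new")            # buffered, not on disk yet
    seen_by_reader = cache.read(0)    # reader sees the pending write
    on_disk_before = disk[0]
    cache.flush()
    on_disk_after = disk[0]
    return first, seen_by_reader, on_disk_before, on_disk_after


if __name__ == "__main__":
    print(demo())
```

(The read path here is the same whether or not a block has pending writes. With a separate pool for pending writes, read() would have to check that pool first, and you'd need a separate size limit for it.)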

(I'm going to skip considering memory mapped IO in this picture because it only makes things even more complicated, and how OSes and filesystems handle it potentially varies a lot. For example, I'm not sure if Linux or ZFS normally directly use pages in the disk cache, or if even shared memory maps get copies of the disk cache pages.)

PS: Before I started thinking about the whole issue as a result of the person's surprise, I would have probably only given you the broad reason off the top of my head. I hadn't thought about the technical issues of not putting writes in the read cache before now.


Comments on this page:

And for the most part, the two reasons are close enough to be the same thing. Most people never write operating systems, databases or file systems.

Even Intel amd64 CPUs have write-back, write-allocate data caches for L1d, L2, and L3, so ordinary stores land in the cache. Write-around is triggered by special non-temporal stores.

And Dan Luu points out in one of his file system and fsync-gate posts that the old (crusty) UNIX/POSIX evolved de facto standards (a.k.a. fsync(2)) don’t make much sense as soon as CPU cache design ideas from the 1990s are taken into account. (fsync is worse-is-better all the way down to lost Postgres data. I read the fsync-gate email threads after the fact and it’s scary. Since then data goes onto FreeBSD/ZFS with simple mirrors and the rest can take a hike due to lack of measurable attention span. Any worse and the only way to have trustworthy computers is by making my own wafers and going from there.)

Written on 03 February 2025.

Last modified: Mon Feb 3 22:44:01 2025