Thinking about how much asynchronous disk write buffering you want
Pretty much every modern system defaults to having data you write to filesystems be buffered by the operating system and only written out asynchronously; you have to take special steps either to make your write IO synchronous or to force it to disk (which can lead to design challenges). When the operating system is buffering data like this, one obvious question is how much data it should let you buffer up before you have to slow down or stop.
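As a small illustration (in Python, using a throwaway temporary file), a plain write returns once the data is in the OS's write buffer, while actually forcing it to disk is a separate, explicit step:

```python
import os
import tempfile

# A normal write() returns as soon as the data is in the OS's write
# buffer; the OS writes it out to disk asynchronously, later.
fd, path = tempfile.mkstemp()
n = os.write(fd, b"some data")   # returns immediately; data is only buffered

# Forcing the data to disk takes an extra, explicit step:
os.fsync(fd)                     # blocks until the data is actually on disk

os.close(fd)
os.unlink(path)
```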
Let's start with two obvious observations. First, if you write enough data, you will always have to eventually slow down to the sustained write speed of your disk system. The system only has so much RAM; even if the OS lets you use it all, eventually it will all be filled up with your pending data, and at that point you can only put more data into the buffer when some earlier data has drained out. That data drains out at the sustained write speed of your disk system. The corollary to this is that if you're going to write enough data, there is very little benefit to letting you fill up a large write buffer; the operating system might as well limit you to the sustained disk write speed relatively early.
Second, RAM being used for write buffers is often space taken away from other productive uses for that RAM. Sometimes you will read back some of the written data and be able to get it from RAM, but if it is purely written (for now) then the RAM is otherwise wasted, apart from any benefits that write buffering may get you. As a corollary of our first observation, buffering huge amounts of write-only data for a program that is going to be limited by disk write speed is not productive (it can't even speed the program up).
So what are the advantages of having some amount of write buffering, and how much do we need to get them?
- It speeds up programs that write occasionally or only once
and don't force their data to be flushed to the physical disk.
If their data fits into the write buffer, these programs can
continue immediately (or exit immediately), possibly giving
them a drastic performance boost. The OS can then write the
data out in the background as other things happen.
(Per our first observation, this doesn't help if the collection of programs involved write too much data too fast and overwhelm the disks and possibly your RAM with the speed and volume.)
- It speeds up programs that write in bursts of high bandwidth. If
your program writes a 1 GB burst every minute, a 1 GB or more
write buffer means that it can push that GB into the OS very fast,
instead of being limited to the (say) 100 MB/s of actual disk
write bandwidth and taking ten seconds or more to push out its
data burst. The OS can then write the data out in the background
and clear the write buffer in time for your next burst.
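The arithmetic of that burst example can be sketched out (the numbers are the illustrative ones above, not measurements of any real system):

```python
# Hypothetical numbers: a 1 GB burst every minute against a disk
# system that sustains 100 MB/s of writes.
burst_mb = 1024          # size of each write burst, in MB
disk_mb_s = 100          # sustained disk write bandwidth, in MB/s
burst_interval_s = 60    # seconds between bursts

# With a big enough write buffer, the program pushes the burst into
# the OS at RAM speed; the OS then drains it in the background.
drain_time_s = burst_mb / disk_mb_s
print(drain_time_s)      # about 10 seconds, well under a minute

# The buffer clears in time for the next burst as long as:
assert drain_time_s < burst_interval_s
```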
- It can eliminate writes entirely for temporary data. If you write data,
possibly read it back, and then delete the data fast enough, the
data need never be written to disk if it can all be kept in the
write buffer. Explicitly forcing data to disk obviously defeats
this, which leads to some tradeoffs in programs that create
temporary files.
- It allows the OS to aggregate writes together for better performance
and improved data layout on disk. This is most useful when your
program issues comparatively small writes to the OS, because
otherwise there may not be much improvement to be had from
aggregating big writes into really big writes. OSes generally
have their own limits on how much they will aggregate together
and how large a single IO they'll issue to disks, which clamps
the benefit here.
(Some of the aggregation benefit comes from the OS being able to do a bunch of metadata updates at once, for example to mark a whole bunch of disk blocks as now used.)
More write buffer here may help if you're writing to multiple different files, because it allows the OS to hold back writes to some of those files to see if you'll write more data to them soon enough. The more files you write to, the more streams of write aggregation the OS may want to keep active and the more memory it may need for this.
(With some filesystems, write aggregation will also lead to less disk space being used. Filesystems that compress data are one example, and ZFS in general can be another, especially on RAIDZ vdevs.)
- If the OS starts writing out data in the background soon enough,
a write buffer can reduce the amount of time a program takes to
write a bunch of data and then wait for it to be flushed to disk.
How much this helps depends partly on how fast the program can
generate data to be written; for the best benefit, you want this
to be faster than the disk write rate but not so fast that the
program is done before much background write IO can be started.
(Effectively this converts apparently synchronous writes into asynchronous writes, where actual disk IO overlaps with generating more data to be written.)
Some of these benefits require the OS to make choices that push against each other. For example, the faster the OS starts writing out buffered data in the background, the more it speeds up the overlapping write-and-compute case, but the less chance it has to avoid flushing data to disk that's written but then rapidly deleted (or otherwise discarded).
How much write buffering you want for some of these benefits depends very much on what your programs do (individually and perhaps in the aggregate). If your programs write only in bursts or fall into the 'write and go on' pattern, you only need enough write buffer to soak up however much data they're going to write in a burst so you can smooth it out again. Buffering up huge amounts of data for them beyond that point doesn't help (and may hurt, both by stealing RAM from more productive uses and by leaving more data exposed in case of a system crash or power loss).
There is also somewhat of an inverse relationship between the useful size of your write buffer and the speed of your disk system. The faster your disk system can write data, the less write buffer you need in order to soak up medium sized bursts of writes because that write buffer clears faster. Under many circumstances you don't need the write buffer to store all of the data; you just need it to store the difference between what the disks can write over a given time and what your applications are going to produce over that time.
(Conversely, very slow disks may in theory call for very big OS write buffers, but there are often practical downsides to that.)
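This 'difference, not total' view of buffer sizing can be sketched as a little arithmetic (the function and numbers here are purely illustrative):

```python
def buffer_needed_mb(write_rate_mb_s, disk_rate_mb_s, burst_seconds):
    """Peak write buffer occupancy for a burst written at write_rate_mb_s
    lasting burst_seconds, while the disks drain it at disk_rate_mb_s
    the whole time (all rates in MB/s, result in MB)."""
    # Only the excess of production over drain rate accumulates.
    backlog_rate = max(write_rate_mb_s - disk_rate_mb_s, 0)
    return backlog_rate * burst_seconds

# A program pushing 500 MB/s for 4 seconds against a 100 MB/s disk
# writes 2000 MB in total, but only 1600 MB ever has to be buffered.
print(buffer_needed_mb(500, 100, 4))

# With faster disks (400 MB/s), the same burst needs only 400 MB.
print(buffer_needed_mb(500, 400, 4))
```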
How we failed at making all our servers have SSD system disks
Several years ago I wrote an entry about why we're switching to SSDs for system disks, yet the other day there I was, writing about how we recycle old disks to be system disks and about maybe switching to fixed size root filesystems to deal with some issues there. A reasonable person might wonder what happened between point A and point B. What happened is not any of the problems that I thought might happen; instead it is a story of good intentions meeting rational but unfortunate decisions.
The first thing that happened was that we didn't commit whole-heartedly to this idea. Instead we decided that even inexpensive SSDs were still costly enough that we wouldn't use them on 'less important' machines; instead we'd reuse older hard drives on some machines. This opened a straightforward wedge in our plans, because now we had to decide if a machine was important enough for SSDs and we could always persuade ourselves that the answer was 'no'.
(It would have been one thing if we'd said 'experimental scratch machines get old HDs', but we opened it up to 'less important production machines'.)
Our next step was that we didn't buy (and continue to buy) enough SSDs to always clearly have plenty of SSDs in stock. The problem here is straightforward; if you want to make something pervasive in the servers that you set up, you need to make it pervasive on your stock shelf, and you need to establish the principle that you're always going to have more. This holds just as true for SSDs for us as it does for RAM; once we had a limited supply, we had an extra reason to ration it, and we'd already created our initial excuse when we decided that some servers could get HDs instead of SSDs.
Then as SSD stocks dwindled below a critical point, we had the obvious reaction of deciding that more and more machines weren't important enough to get SSDs as their system disks. This was never actively planned or decided on (and if it had been planned, we might have ordered more SSDs). Instead it happened bit by bit; if I was setting up a server and we had only (say) four SSDs left, I had to decide on the spot whether my server was that important. It was easy to talk myself into saying 'I guess not, this can live with HDs', because I had to make a decision right then in order to keep moving forward on putting the server together.
(Had we sat down to plan out, say, the next four or five servers we were going to build and talked about which ones were and weren't important, we might well have ordered more SSDs, because the global situation would have been clearer and we would have been deciding further in advance. On-the-spot decision making often focuses on the short term and the immediate perspective instead of the long term and global one.)
At this point we have probably flipped over to a view that HDs are the default on new or replacement servers and a server has to strike us as relatively special to get SSDs. This is pretty much the inverse of where we started out, although arguably it's a rational and possibly even correct response to budget pressures and so on. In other words, maybe our initial plan was always over-ambitious for the realities of our environment. It did help, because we got SSDs into some important servers and thus we've probably made them less likely to have disk failures.
A contributing factor is that it turned out to be surprisingly annoying to put SSDs in the 3.5" drive bays in a number of our servers, especially Dell R310s, because they have strict alignment requirements for the SATA and power connectors, and garden variety SSD 2.5" to 3.5" adaptors don't put the SSDs at the right place for this. Getting SSDs into such machines required extra special hardware; this added extra hassle, extra parts to keep in stock, and extra cost.
(This entry is partly in the spirit of retrospectives.)