Getting high IOPS requires concurrency on modern SSDs and NVMe drives

February 11, 2021

My intuitions (or unthinking assumptions) about disk performance date back far enough that one of them is that a single program acting on its own can get the disk's 'normal' random read performance for plain ordinary reads (which are pretty much synchronous and issued one at a time). This was more or less true on hard drives (spinning rust), where your program and the operating system had more than enough time on their hands to saturate the drive's relatively small 100 to 150 IOPS rate. This is probably not true on modern SSDs, and is definitely not true on NVMe drives.

In order to deliver their full rated performance, modern NVMe drives and the operating system interfaces to them require you to saturate their command queues with constant activity (which means that IOPS ratings don't necessarily predict single request latency). Similarly, those impressive large random IO numbers for SSDs are usually measured at high queue depths. This presents some practical problems for real system configurations, because to get a high queue depth you must have a lot of concurrent IO. There are two levels of issues, the program level and then the system level.
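
As a rough way to see why queue depth matters so much for those rated numbers, you can apply Little's law: sustained IOPS is roughly the number of requests in flight divided by the per-request latency. Here is a minimal sketch of that arithmetic in Python, where the 0.1 ms per-read latency is an assumed, illustrative figure rather than a measurement of any particular drive:

    # Rough Little's law arithmetic: IOPS ~= requests in flight / per-request latency.
    # The 100 microsecond read latency is an assumed, illustrative figure for a
    # flash SSD; real drives and real workloads will differ.
    LATENCY_S = 100e-6

    for queue_depth in (1, 4, 16, 32, 64):
        iops = queue_depth / LATENCY_S
        print(f"queue depth {queue_depth:2d}: ~{iops:,.0f} IOPS")

At that assumed latency, one request at a time tops out around 10,000 IOPS, while spec-sheet numbers in the hundreds of thousands only show up at queue depths of 32 or more.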

On the program level, writes can generally achieve high concurrency if you have a high write volume because most writes are asynchronous; your program hands them to the operating system and then the operating system dispatches them while your program generates the next set of writes. The obvious exception is if you're performing synchronous writes or otherwise actually waiting for the data to be really written to disk. Reads are another matter. If you have a single program performing a single read at a time, you can't get high queue depths (especially if you're only reading a small amount of data). To get higher levels of concurrent read requests, either the program has to somehow issue a lot of separate read requests at once or you need multiple processes active, all reading independently. Often this isn't going to be all that simple.
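
To make this concrete, here is a minimal sketch of one way a single program can keep many small random reads in flight at once, using a thread pool and os.pread(); the file path, read size, and thread count are arbitrary assumptions for illustration:

    # A minimal sketch of issuing many small random reads concurrently from one
    # program, using a thread pool so the kernel sees more than one read in
    # flight at a time. The path, sizes, and thread count are illustrative.
    import os
    import random
    from concurrent.futures import ThreadPoolExecutor

    PATH = "/path/to/some/large/file"   # hypothetical file to read from
    READ_SIZE = 4096
    NUM_READS = 10000
    NUM_THREADS = 32

    def read_random(fd, nblocks):
        # Pick an aligned random offset and read one block from it.
        offset = random.randrange(nblocks) * READ_SIZE
        return len(os.pread(fd, READ_SIZE, offset))

    def main():
        fd = os.open(PATH, os.O_RDONLY)
        try:
            nblocks = os.fstat(fd).st_size // READ_SIZE
            with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
                futures = [pool.submit(read_random, fd, nblocks)
                           for _ in range(NUM_READS)]
                total = sum(f.result() for f in futures)
            print(f"read {total} bytes in {NUM_READS} random reads")
        finally:
            os.close(fd)

    if __name__ == "__main__":
        main()

Because os.pread() releases the GIL while the read is outstanding, plain threads are enough to push the effective queue depth well above one; you could get the same effect with asynchronous IO interfaces or simply with several independent reading processes.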

Once you have enough concurrency at the program level, you need to be in an environment where nothing in the operating system forces that concurrency to be serialized. Unfortunately there are all sorts of things inside filesystems that might partially serialize either writes or reads, especially at volume. For instance, random reads in large files generally require the filesystem to load indirect mapping blocks into memory (to go from a byte offset to a location on disk). If two concurrent reads for separate locations both need the same indirect mapping block to be read into memory, they're both blocked on a single resource. Similarly, writing data out may require loading free space information into memory, or writing updates to it back out to disk.

SSDs and NVMe drives are still very fast for single random IOs at a time (although we don't generally know exactly how fast, since people only rarely measure that and it depends on your operating system). But they aren't necessarily as fast as they look on the specification sheet unless you really load up the rest of your system, and that's a change from the past. Getting really top-notch performance from our SSDs and NVMe drives likely needs a more concurrent, multi-process overall system than we needed in the past. Conversely, a conventional system with limited concurrency may not get quite the huge performance numbers we expect from SSD and NVMe spec sheets, although it should still do pretty well.

(It would be nice to have some performance numbers for 'typical IOPS or latency for minimal single read and write requests' for both SSDs and NVMe drives, just so we could get an idea of the magnitude involved. Do IOPS drop to a half? To a fifth? To a tenth? I don't know, and I only have moderately good ways of measuring it.)
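
One crude way to get a ballpark figure is to time one-at-a-time random reads yourself; the sketch below does this with os.pread(). The file path and sizes are assumptions, and the numbers only mean anything if the reads actually hit the disk instead of the page cache, so the file needs to be much larger than RAM (or the page cache needs to be dropped first).

    # A rough sketch of timing one-at-a-time ("QD1") random reads to estimate
    # single-request latency and the IOPS it implies. The path and sizes are
    # illustrative assumptions; results are only meaningful if the reads miss
    # the page cache and actually go to the drive.
    import os
    import random
    import time

    PATH = "/path/to/some/large/file"   # hypothetical; ideally much larger than RAM
    READ_SIZE = 4096
    NUM_READS = 2000

    fd = os.open(PATH, os.O_RDONLY)
    nblocks = os.fstat(fd).st_size // READ_SIZE

    start = time.monotonic()
    for _ in range(NUM_READS):
        os.pread(fd, READ_SIZE, random.randrange(nblocks) * READ_SIZE)
    elapsed = time.monotonic() - start
    os.close(fd)

    per_read_ms = (elapsed / NUM_READS) * 1000
    print(f"~{per_read_ms:.3f} ms per read, ~{NUM_READS / elapsed:,.0f} IOPS at QD1")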

PS: This may well have been obvious to many people for some time, but it hadn't really struck me until very recently.


Comments on this page:

By Andrew at 2021-02-12 09:22:26:

For ordinary desktop use this could almost be seen as an advantage. It means that if you have one program hitting the disk as hard as it can in a single-threaded fashion, there will still be enough left over to serve all of the little random reads that are needed to keep the system responsive while you do something else like browse the web or answer email.

By Randall at 2021-02-13 13:36:50:

Re: understanding performance without concurrency, the term people testing SSDs use for one-request-at-a-time is "QD1", for "NVMe queue depth 1". And, as an order of magnitude, Flash SSDs mostly answer reads in about 0.1ms, probably bound by the underlying Flash--that doesn't seem to have changed as drastically as the bandwidth or QD64 numbers.

For example, this is not a datacenter drive, but https://www.anandtech.com/show/16087/the-samsung-980-pro-pcie-4-ssd-review/5 has a QD1 graph at the top, and a mixed QD1/2/4 graph below. (The outlier at the top is an Intel Optane drive--the medium is faster to read, but it's also incredibly pricey and hasn't seen the hoped-for price drop.)

Definitely takes threads or AIO or something to fully exploit random access on these drives. Apparently you can get avgqu-sz from Linux's iostat, though sometimes it might be simpler to infer it from overall performance numbers. (The I/O stack does lots for us, but sometimes the lack of transparency is not super convenient...)

Finally, it's consistent with how you phrased the title, but it's worth noting that big sequential reads, or medium-concurrency medium-size reads, can fully exercise the drive much like high-concurrency 4kb reads do: you can think of a 16kb read as taking the same "slots" as four simultaneous 4kb reads.

By Randall at 2021-02-13 13:53:07:

(To clarify, I'm not saying a 16kb read literally uses four NVMe queue slots--just saying it's useful as a rough mental model of how increasing either read size or queue depth can have similar effects on how busy the hardware is.)
