Getting high IOPS requires concurrency on modern SSDs and NVMe drives
My intuitions (or unthinking assumptions) about disk performance date back far enough that one of them is that a single program acting on its own can get the disk's 'normal' random read performance for plain ordinary reads (which are pretty much synchronous and issued one at a time). This was more or less true on hard drives (spinning rust), where your program and the operating system had more than enough time on their hands to saturate the drive's relatively small 100 to 150 IOPS rate. This is probably not true on modern SSDs, and is definitely not true on NVMe drives.
In order to deliver their full rated performance, modern NVMe drives and the operating system interfaces to them require you to saturate their command queues with constant activity (which means that IOPS ratings don't necessarily predict single request latency). Similarly, those impressive large random IO numbers for SSDs are usually measured at high queue depths. This presents some practical problems for real system configurations, because to get a high queue depth you must have a lot of concurrent IO. There are two levels of issues, the program level and then the system level.
On the program level, writes can generally achieve high concurrency if you have a high write volume because most writes are asynchronous; your program hands them to the operating system and then the operating system dispatches them while your program generates the next set of writes. The obvious exception is if you're performing synchronous writes or otherwise actually waiting for the data to be really written to disk. Reads are another matter. If you have a single program performing a single read at a time, you can't get high queue depths (especially if you're only reading a small amount of data). To get higher levels of concurrent read requests, either the program has to somehow issue a lot of separate read requests at once or you need multiple processes active, all reading independently. Often this isn't going to be all that simple.
Once you have enough concurrency at the program level you need to be in an environment where there's nothing in the operating system that's forcing this concurrency to be serialized. Unfortunately there are all sorts of things inside filesystems that might partially serialize either writes or reads, especially at volume. For instance, random reads in large files generally require the filesystem to load indirect mapping blocks into memory (to go from a byte offset to a location on disk). If you have two concurrent reads for separate locations that both need the same indirect mapping block to be read into memory, they've both blocked on a single resource. Similarly, writing data out may require loading free space information into memory, or writing out updates to it back to disk.
SSDs and NVMe drives are still very fast for single random IOs at a time (although we don't generally know how fast, since people only rarely measure that and it's dependent on your operating system). But they aren't necessarily as fast as they look on the specification sheet unless you really load up the rest of your system, and that's a change from the past. Getting really top notch performance from our SSDs and NVMe drives likely needs a more concurrent, multi-process overall system than we needed in the past. Conversely, a conventional system with limited concurrency may not get quite the huge performance numbers we expect from the SSD and NVMe spec sheet numbers, although it should still do pretty well.
(It would be nice to have some performance numbers for 'typical IOPS or latency for minimal single read and write requests' for both SSDs and NVMe drives, just so we could get an idea of the magnitude involved. Do IOPS drop to a half? To a fifth? To a tenth? I don't know, and I only have moderately good ways of measuring it.)
PS: This may well have been obvious to many people for some time, but it hadn't really struck me until very recently.