The benchmarking problems with potentially too-smart SSDs
We've reached the point in building out our new fileservers and iSCSI backends where we're building the one SSD-based fileserver and its backends. Naturally we want to see what sort of IO performance we get on SSDs, partly to make sure that everything is okay, so I fired up my standard basic testing tool for sequential IO. It gave me some numbers, the numbers looked good (in fact pretty excellent), and then I unfortunately started thinking about the fact that we're doing this with SSDs.
Testing basic IO speed on spinning rust is relatively easy because
spinning rust is in a sense relatively simple and predictable. Oh,
sure, you have different zones and remapped sectors and so on, but
you can be all but sure that when you write arbitrary data to disk
that it is actually going all the way down to the platters unaltered
(well, unless your filesystem does something excessively clever).
This matters for my testing because my usual source of test data
to write to disk is
/dev/zero, and data from
/dev/zero is what
you could call 'embarrassingly compressible' (and easily deduplicated
too).
The thing is, SSDs are not spinning rust and thus are nowhere near as predictable. SSDs contain a lot of magic, and increasingly some of that magic apparently involves internally compressing the data you feed them. When I was writing lots of zeros to the SSDs and then reading them back, was I actually testing the SSD read and write speeds, or was I just measuring how fast the embedded processors in the SSDs could recognize zero blocks and recreate them in RAM?
(What matters to our users is the real IO speeds, because they are not likely to read and write zeros.)
Once you start going down the road of increasingly smart devices, the creeping madness starts rolling in remarkably fast. I started out thinking that I could generate a relatively small block of random data (say 4K or something reasonable) and repeatedly write that. But wait, SSDs actually use much larger internal block sizes and they may compress over that larger block size (which would contain several identical copies of my 4K 'simple' block). So I increased the randomness block size to 128K, but now I'm worrying about internal SSD deduplication since I'm writing a lot of copies of this.
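(As an illustration only, not our actual test tool: a minimal Python sketch of the repeated-block approach I describe above. The function name, block size, and total are made-up numbers for the sketch, and a real benchmark would bypass caching with O_DIRECT or the like.)

```python
import os

def write_repeated_block(path, block_size=128 * 1024, total_bytes=16 * 1024 * 1024):
    """Write one random block over and over to a file.

    A compressing SSD can't squeeze a random block, but a deduplicating
    SSD may recognize the identical copies and never hit flash for most
    of them -- which is exactly the worry with this approach.
    """
    block = os.urandom(block_size)  # one random block, reused for every write
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += block_size
    return written
```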
The short version of my conclusion is that once I start down this road, the only sensible approach is to generate fully random data. But if I'm testing high-speed IO in an environment of SSDs and multiple 10G iSCSI networks, I need to generate this random data fast enough that I can be sure it's not the rate-limiting step.
(By the way,
/dev/urandom may be a good and easy source of random
data but it is very much not a high speed source. In fact it's an
amazingly slow source, especially on Linux. This was why my initial
approach was basically 'read N bytes from
/dev/urandom and then
repeatedly write them out'.)
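(To give a rough sense of the gap, here's a hedged Python sketch, assuming Python 3.9+ for random.randbytes, that compares the kernel's CSPRNG against an ordinary userspace PRNG. The buffer sizes and repetition counts are arbitrary, and the numbers you get will vary a lot by machine and kernel version.)

```python
import os
import random
import time

def throughput(make_bytes, n=64 * 1024, reps=256):
    """Rough MB/s for a callable that returns n random bytes."""
    start = time.perf_counter()
    for _ in range(reps):
        make_bytes(n)
    elapsed = time.perf_counter() - start
    return (n * reps) / elapsed / 1e6

# os.urandom draws from the kernel CSPRNG (the same pool behind
# /dev/urandom); random.randbytes is a plain userspace PRNG and is
# typically much faster, at the cost of not being cryptographically
# secure -- which doesn't matter for benchmark filler data.
print(f"os.urandom:       {throughput(os.urandom):8.1f} MB/s")
print(f"random.randbytes: {throughput(random.randbytes):8.1f} MB/s")
```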
PS: I know that I'm ignoring all sorts of things that might affect SSD write speeds over time. Right now I'm assuming that they're going to be relatively immaterial in our environment for hand-waving reasons, including that we can't do anything about them. Of course it's possible that SSDs detect you writing large blocks of zeros and treat them as the equivalent of TRIM commands, but who knows.