The benchmarking problems with potentially too-smart SSDs

August 2, 2014

We've reached the point in building out our new fileservers and iSCSI backends where we're building the one SSD-based fileserver and its backends. Naturally we want to see what sort of IO performance we get on SSDs, partly to make sure that everything is okay, so I fired up my standard basic testing tool for sequential IO. It gave me some numbers, the numbers looked good (in fact pretty excellent), and then I unfortunately started thinking about the fact that we're doing this with SSDs.

Testing basic IO speed on spinning rust is relatively easy because spinning rust is, in a sense, relatively simple and predictable. Oh, sure, you have different zones and remapped sectors and so on, but you can be all but sure that when you write arbitrary data to disk, it actually goes all the way down to the platters unaltered (well, unless your filesystem does something excessively clever). This matters for my testing because my usual source of test data to write to disk is /dev/zero, and data from /dev/zero is what you could call 'embarrassingly compressible' (and easily deduplicated too).
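(As a concrete illustration, this sort of test boils down to something like the following minimal Python sketch; the target path and sizes are made up, and a real tool would worry about things like O_DIRECT and the page cache:)

    import os, time

    CHUNK = 1024 * 1024          # 1 MiB per write
    TOTAL = 1024 * CHUNK         # 1 GiB in total
    zeros = b"\0" * CHUNK        # the same highly compressible data /dev/zero gives you

    fd = os.open("/mnt/ssdtest/zeros.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.time()
    for _ in range(TOTAL // CHUNK):
        os.write(fd, zeros)
    os.fsync(fd)                 # make sure the data has really hit the device
    os.close(fd)
    secs = time.time() - start
    print("%d MiB in %.2fs: %.1f MiB/s" % (TOTAL // (1024 * 1024), secs,
                                           TOTAL / (1024 * 1024) / secs))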

The thing is, SSDs are not spinning rust and thus are nowhere near as predictable. SSDs contain a lot of magic, and increasingly some of that magic apparently involves internal compression of the data you feed them. When I was writing lots of zeros to the SSDs and then reading them back, was I actually testing the SSDs' read and write speeds, or was I just testing how fast the embedded processors in the SSDs could recognize zero blocks and recreate them in RAM?

(What matters to our users is the real IO speeds, because they are not likely to read and write zeros.)

Once you start going down the road of increasingly smart devices, the creeping madness starts rolling in remarkably fast. I started out thinking that I could generate a relatively small block of random data (say 4K or something reasonable) and repeatedly write that. But wait, SSDs actually use much larger internal block sizes and they may compress over that larger block size (which would contain several identical copies of my 4K 'simple' block). So I increased the randomness block size to 128K, but now I'm worrying about internal SSD deduplication since I'm writing a lot of copies of this.
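(A minimal sketch of that approach, again with a made-up target path; every 128K written is byte-for-byte identical, which is exactly what makes internal deduplication a worry:)

    import os

    BLOCK = 128 * 1024
    rand_block = os.urandom(BLOCK)          # one block of random data, generated once

    with open("/mnt/ssdtest/repeat.dat", "wb") as f:
        for _ in range(8192):               # 1 GiB in total
            # every copy is identical, so a deduplicating SSD could in principle
            # store it once and just bump a reference count
            f.write(rand_block)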

The short version of my conclusion is that once I start down this road, the only sensible approach is to generate fully random data. But if I'm testing high-speed IO in an environment of SSDs and multiple 10G iSCSI networks, I need to generate this random data at a pretty high speed to be sure it doesn't become the rate-limiting step.

(By the way, /dev/urandom may be a good and easy source of random data but it is very much not a high speed source. In fact it's an amazingly slow source, especially on Linux. This was why my initial approach was basically 'read N bytes from /dev/urandom and then repeatedly write them out'.)
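(If you want to see just how slow, a quick and dirty measurement like this will do; the 64 MiB figure is arbitrary:)

    import time

    N = 64 * 1024 * 1024                    # 64 MiB; an arbitrary amount
    start = time.time()
    with open("/dev/urandom", "rb") as f:
        left = N
        while left > 0:
            left -= len(f.read(min(left, 1024 * 1024)))
    print("%.1f MiB/s from /dev/urandom" % (64 / (time.time() - start)))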

PS: I know that I'm ignoring all sorts of things that might affect SSD write speeds over time. Right now I'm assuming that they're going to be relatively immaterial in our environment for hand-waving reasons, including that we can't do anything about them. Of course it's possible that SSDs detect you writing large blocks of zeros and treat them as the equivalent of TRIM commands, but who knows.


Comments on this page:

By Zev Weiss at 2014-08-02 05:09:02:

I've often run into the same problem (generating "random-enough" data at high speed). What I usually end up doing is just taking /dev/zero and running it through one of the faster ciphers offered by openssl enc (make sure not to use an ECB mode though!). My new Haswell box can generate ~670MB/s or so this way using aes-128-cbc, for example.

(Relatedly, one thing I've had on my TODO list for some time is to write a simple command-line [P]RNG offering a range of selectable algorithms/sources at varying points along the quality-vs-speed spectrum.)
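(A rough Python analogue of Zev's openssl-over-/dev/zero trick, using the third-party cryptography package instead of the openssl command line; encrypting zeros with AES in CTR mode just produces keystream, which is effectively random and incompressible, at cipher speed. The target path is made up:)

    import os
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    # AES-CTR over zeros is just the keystream: cheap to produce, incompressible
    enc = Cipher(algorithms.AES(os.urandom(16)), modes.CTR(os.urandom(16)),
                 backend=default_backend()).encryptor()

    zeros = b"\0" * (1024 * 1024)
    with open("/mnt/ssdtest/cipher.dat", "wb") as f:    # made-up target path
        for _ in range(1024):                           # 1 GiB in total
            f.write(enc.update(zeros))                  # every chunk is different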

By gigaboo@gmail.com at 2014-08-02 05:50:46:

I conduct my I/O with a large (>5GB) video file read from RAM. This provides a non-compressible data set that is reliably repeatable without using slow pseudo-random tools.

In restricted memory situations, I use smaller segments of the file that fit into the working memory set.
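(In code, that technique amounts to roughly this; the file paths and sizes are made up:)

    # read an existing incompressible file (a video, say) into memory once...
    with open("/path/to/some/video.mkv", "rb") as f:    # hypothetical source file
        data = f.read(512 * 1024 * 1024)                # or less, if memory is tight

    # ...and then reuse it as the write payload as many times as needed
    with open("/mnt/ssdtest/video.dat", "wb") as out:
        for _ in range(4):
            out.write(data)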

By Ewen McNeill at 2014-08-02 05:56:38:

Possible workaround (for now): generate a block of random data that is just longer than the internal block of the device (eg the SSD erase block), say by a few bytes -- 128kB + 2 bytes. Then write that out repeatedly, wrapping around at the end of the data block. The result is that each block begins at a slightly different position in the "random" stream, so trivial block-by-block deduplication is not possible, and each block itself is "random", so trivial in-block compression is not possible either. Compared with /dev/zero there is more work in shuffling partial blocks around, but it's not that much slower, especially with a scatter/gather write interface. And you only pay the "random" overhead once up front.

If that still doesn't seem random enough, you could XOR each block with say the block number as you go (but that will require CPU, not just scatter/gather so could be CPU limited).

Both of these have the advantage that, given the initial 128kB + 2 bytes and the algorithm, the rest is predictable, so you could validate what is read back too.

Ewen
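(A sketch of Ewen's suggestion in Python, assuming a 128kB internal block size and a made-up target path; because the buffer is two bytes longer than each write, successive writes start at different offsets into the random data, so no two blocks on disk are byte-identical:)

    import os

    BLOCK = 128 * 1024                   # assumed internal block size
    buf = os.urandom(BLOCK + 2)          # random data just longer than one block
    pos = 0                              # current position in the wrap-around stream

    def next_block():
        """Return the next BLOCK bytes of the buffer repeated end to end forever."""
        global pos
        out = bytearray()
        while len(out) < BLOCK:
            take = min(BLOCK - len(out), len(buf) - pos)
            out += buf[pos:pos + take]
            pos = (pos + take) % len(buf)
        return bytes(out)

    with open("/mnt/ssdtest/wrap.dat", "wb") as f:      # made-up target path
        for _ in range(8192):                           # 1 GiB in total
            f.write(next_block())                       # each write starts at a new offset in buf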

By Christian Neukirchen at 2014-08-02 10:39:28:

As a high-speed source for random data, you can use my tool "rdd": https://github.com/chneukirchen/rdd (does 677MB/s on a Core i5-2400 here)

I was going to comment, but the comment turned out to be large enough that instead I wrote up a blahg post of my own: http://blahg.josefsipek.net/?p=500

By Zev Weiss at 2014-08-02 15:04:22:

And as a small follow-up to my previous comment, I see aes-128-cbc wasn't even a particularly good choice -- aes-128-xts and aes-128-ctr can each do more like ~1.8GB/s on the same hardware.

You can try writing the sequence of multiples of 7, such as 7, 14, 21, 28, ..., as 4- or 8-byte integers. The point is not randomness, but being incompressible.
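(In code, that suggestion might look roughly like this, packing multiples of 7 as 8-byte integers into a made-up target file:)

    import struct

    # pack multiples of 7 as 8-byte unsigned integers, 131072 values (1 MiB) at a time
    with open("/mnt/ssdtest/sevens.dat", "wb") as f:    # made-up target path
        n = 0
        for _ in range(128):                            # 128 MiB in total
            values = range(7 * (n + 1), 7 * (n + 131072) + 1, 7)
            f.write(struct.pack("<131072Q", *values))
            n += 131072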
