Wandering Thoughts archives

2014-08-24

10G Ethernet is a sea change for my assumptions

We're soon going to migrate a bunch of filesystems to a new SSD-based fileserver, all at once. Such migrations force us to do full backups of the migrated filesystems (to the backup system they appear as new filesystems), so a big move means a sudden surge in backup volume. As part of thinking about how to handle this surge, I had the obvious thought: we should upgrade the backup server that will handle the migrated filesystems to 10G Ethernet now. The 10G transfer speeds plus the source data being on SSDs would make it relatively simple to back up even this big migration overnight during our regular backup period.

Except I realized that this probably wasn't going to be the case. Our backup system writes backups to disk, specifically to ordinary SATA disks that are not aggregated together in any sort of striped setup, and an ordinary SATA disk might write at 160 Mbytes per second on a good day. This is only slightly faster than 1G Ethernet and certainly nowhere near the reasonable speeds of 10G Ethernet in our environment. We can read data off the SSD-based fileserver and send it over the network to the backup server very fast, but that doesn't do us anywhere near as much good as it looks when the entire thing is then brought to a screeching halt by the inconvenient need to write the data to disk on the backup server. 10G will probably help the backup servers a bit, but it isn't going to be anywhere near a great speedup.

What this points out is that my reflexive assumptions are calibrated all wrong for 10G Ethernet. I'm used to thinking of the network as slower than the disks, often drastically, but this is no longer even vaguely true. Even so-so 10G Ethernet performance (say 400 to 500 Mbytes/sec) utterly crushes single disk bandwidth for anything except SSDs. If we get good 10G speeds, we'll be crushing even moderate multi-disk bandwidth (and that's assuming we get full speed streaming IO rates and we're not seek limited). Suddenly the disks are the clear limiting factor, not the network. In fact even a single SSD can't keep up with a 10G Ethernet at full speed; we can see this from the mere fact that SATA interfaces themselves currently max out at 6 Gbits/sec on any system we're likely to use.
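
To put rough numbers on the mismatch, here's a minimal back-of-the-envelope sketch; the 2 TB backup volume is just a made-up example, and the rates are the ones discussed above plus the theoretical line-rate ceilings:

    # Rough arithmetic for how long a backup surge takes to land on disk.
    # The 2 TB figure is hypothetical; the rates are as discussed above.
    RATES_MB_S = {
        "1G Ethernet (wire rate)":    125,   # 1 Gbit/s / 8
        "single SATA disk write":     160,   # on a good day
        "so-so 10G Ethernet":         450,   # the 400-500 Mbytes/sec range
        "SATA 6 Gbit/s link ceiling": 600,   # 6 Gbit/s less 8b/10b encoding
        "10G Ethernet (wire rate)":   1250,  # 10 Gbit/s / 8
    }

    backup_mb = 2 * 1024 * 1024   # a hypothetical 2 TB of new full backups

    for what, rate in sorted(RATES_MB_S.items(), key=lambda kv: kv[1]):
        hours = backup_mb / rate / 3600
        print(f"{what:27s} {rate:5d} MB/s -> {hours:4.1f} hours for 2 TB")

However you slice it, the 160 Mbytes/sec disk write speed is the term that dominates, not the network.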

(I'd run into this before even for 1G Ethernet, eg here, but it evidently hadn't really sunk into my head.)

PS: I don't know what this means for our backup servers and any possible 10G networking in their future. 10G is likely to improve things somewhat, but the dual 10G-T Intel cards we use don't grow on trees and maybe it's not quite cost effective for them right now. Or maybe the real answer is working out how to give them striped staging disks for faster write speeds.

HDsVs10GEthernet written at 23:42:12

2014-08-09

Intel has screwed up their DC S3500 SSDs

I ranted about this on Twitter a few days ago when we discovered it the hard way, but I want to write it down here and then cover why what Intel did is a terrible idea. The simple and short version is this:

Intel switched one of their 'datacenter' SSDs from reporting 512 byte 'physical sectors' to reporting 4096 byte physical sectors in a firmware update.

Specifically, we have the Intel DC S3500 80 GB model in firmware versions D2010355 (512b sectors) and D2010370 (4K sectors). Nothing in the part labeling changed other than the firmware version. Some investigation since our initial discovery has turned up that the 0370 firmware apparently supports both sector sizes and is theoretically switchable between them, and that this apparently applies to both the DC S3500 and DC S3700 series SSDs.
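
(For what it's worth, you don't need vendor tools to see what sector sizes a drive is claiming; on Linux the kernel's view is exposed in sysfs. The following is just a sketch with an example device name; presumably a 0370-firmware drive shows a 4096-byte physical size where a 0355 one shows 512 bytes.)

    #!/usr/bin/env python3
    # Print the logical and physical sector sizes the Linux kernel reports
    # for a disk, straight from sysfs. The device name is just an example.
    import sys

    def sector_sizes(dev):
        base = f"/sys/block/{dev}/queue"
        with open(f"{base}/logical_block_size") as f:
            logical = int(f.read())
        with open(f"{base}/physical_block_size") as f:
            physical = int(f.read())
        return logical, physical

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
        logical, physical = sector_sizes(dev)
        print(f"{dev}: logical {logical} bytes, physical {physical} bytes")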

This is a really terrible idea that should never have passed even a basic smell test in a product that is theoretically aimed at 'datacenter' server operations. There are applications where 512b drives and 4K drives are not compatible; for example, in some ZFS pools you can't replace a 512b SSD with a 4K SSD. Creating incompatible drives with the same detailed part number is something that irritates system administrators a great deal and of course it completely ruins the day of people who are trying to have and maintain a spares pool.

This Intel decision is especially asinine because the 'physical sector size' that these SSDs are reporting is essentially arbitrary (as we see here, it is apparently firmware-settable). The actual flash memory itself is clumped together in much larger units in ways that are not well represented by 'physical sector size', which is one reason that all SSDs report whatever number is convenient here.

There may well be good reasons to make SSDs report as 4K-sector drives instead of 512b drives; if nothing else it is a small bit closer to reality. But having started out with the DC S3500 series reporting 512b sectors, Intel should have kept them that way in their out-of-box state (and made available a utility to switch them to 4K). If Intel felt it absolutely had to change that for some unfathomable reason, it should have at least changed the detailed part number when it updated the firmware; then people maintaining a spares stock would at least have some sign that something was up.

(Hopefully other SSD vendors are not going to get it into their heads to do something this irritating and stupid.)

In related news, we now have a number of OmniOS fileservers for which we literally have no direct spare system disks, because their current system SSDs are the 0355 firmware ones.

(Yes, we are working on fixing that situation.)

IntelDCSSDSectorSizeMistake written at 00:04:43

2014-08-08

Hardware can be weird, Intel 10G-T X540-AT2 edition

Every so often I get a pointed reminder that hardware can be very weird. As I mentioned on Twitter today, we've been having one of those incidents recently. The story starts with the hardware for our new fileservers and iSCSI backends, which is built around SuperMicro X9SRH-7TF motherboards. These have an onboard Intel X540-AT2 chipset that provides two 10G-T ports. The SuperMicro motherboard and BIOS light these ports up at least as early as when you power the machine on and leave it sitting in the BIOS, and maybe earlier (I haven't tested).

On some but not all of our motherboards, the first 10G-T port lights up (in the BIOS) at 1G instead of 10G. When we first saw this on a board, we thought we had a failed board and RMA'd it; the replacement board behaved the same way, but when we booted an OS (I believe a Linux) the port came up at 10G and we assumed that all was well. Then we noticed that some but not all of our newly installed OmniOS fileservers had their first port (still) coming up at 1G. At first we thought we had cable issues, but the cables were good.

In the process of testing the situation out, we rebooted one OmniOS fileserver off a CentOS 7 live CD to see if Linux could somehow get 10G out of the hardware. Somewhat to my surprise it could (and a real full 10G at that). More surprisingly, the port stayed at 10G when we rebooted into OmniOS. It stayed at 10G in OmniOS over a power cycle, and it even stayed at 10G after a full power off where we cut power to the entire case for several minutes. Further testing showed that it was sufficient merely to boot the CentOS 7 live CD on an affected server without ever configuring the interface (although it's possible that the live CD brings the interface up to try DHCP and then takes it down again).
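
(For reference, checking what speed a port has actually negotiated doesn't require anything fancy on the Linux side; the kernel exposes it in sysfs. This is only a sketch, and it skips interfaces that are down or whose driver doesn't report a speed.)

    #!/usr/bin/env python3
    # Report the negotiated link speed, in Mbit/s, for each network
    # interface as exposed by the Linux kernel in sysfs.
    import os

    SYS_NET = "/sys/class/net"

    for iface in sorted(os.listdir(SYS_NET)):
        try:
            with open(os.path.join(SYS_NET, iface, "speed")) as f:
                speed = int(f.read().strip())
        except (OSError, ValueError):
            continue   # down, virtual, or no speed reported
        if speed > 0:
            print(f"{iface}: {speed} Mbit/s")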

There's a lot of weirdness here. It'd be one thing for the Linux driver to bring up 10G where the OmniOS one didn't; then it could be that the Linux driver was more comprehensive about setting up the chipset properly. For it to be so firmly persistent is another thing, though; it suggests that Linux is reprogramming something that stays programmed in nonvolatile storage. And then there's the matter of this happening only on some motherboards and only to one port out of two that are driven by the same chipset.

Ultimately, who knows. We're happy because we apparently have a full solution to the problem, one we've actually carried out on all of the machines now because we needed to get them into production.

(As far as we can easily tell, all of the motherboards and the motherboard BIOSes are the same. We haven't opened up the cases to check the screen printing for changes and aren't going to; these machines are already installed and in production.)

Intel10GTWeirdHardware written at 00:52:22

2014-08-02

The benchmarking problems with potentially too-smart SSDs

We've reached the point in building out our new fileservers and iSCSI backends where we're building the one SSD-based fileserver and its backends. Naturally we want to see what sort of IO performance we get on SSDs, partly to make sure that everything is okay, so I fired up my standard basic testing tool for sequential IO. It gave me some numbers, the numbers looked good (in fact pretty excellent), and then I unfortunately started thinking about the fact that we're doing this with SSDs.

Testing basic IO speed on spinning rust is relatively easy because spinning rust is in a sense relatively simple and predictable. Oh, sure, you have different zones and remapped sectors and so on, but you can be all but sure that when you write arbitrary data to disk it actually goes all the way down to the platters unaltered (well, unless your filesystem does something excessively clever). This matters for my testing because my usual source of test data to write to disk is /dev/zero, and data from /dev/zero is what you could call 'embarrassingly compressible' (and easily deduplicated too).
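
As a quick illustration of just how compressible that test data is, here's a small sketch that runs a zero-filled block and a random block through zlib; zlib is of course not what an SSD controller would use, but it makes the point:

    #!/usr/bin/env python3
    # Compare how well a generic compressor does on zero-filled data versus
    # random data, as a stand-in for whatever an SSD controller might do.
    import os
    import zlib

    BLOCK = 128 * 1024   # 128 Kbytes

    for name, data in (("zeros", b"\0" * BLOCK), ("random", os.urandom(BLOCK))):
        squeezed = zlib.compress(data, 6)
        print(f"{name:7s} {len(data)} bytes -> {len(squeezed)} bytes "
              f"({len(squeezed) / len(data):.1%} of original)")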

The thing is, SSDs are not spinning rust and thus are nowhere near as predictable. SSDs contain a lot of magic, and increasingly some of that magic apparently involves internal compression of the data you feed them. When I was writing lots of zeros to the SSDs and then reading them back, was I actually testing the SSD read and write speeds, or was I really testing how fast the embedded processors in the SSDs could recognize zero blocks and recreate them in RAM?

(What matters to our users is the real IO speeds, because they are not likely to read and write zeros.)

Once you start going down the road of increasingly smart devices, the creeping madness starts rolling in remarkably fast. I started out thinking that I could generate a relatively small block of random data (say 4K or something reasonable) and repeatedly write that. But wait, SSDs actually use much larger internal block sizes and they may compress over that larger block size (which would contain several identical copies of my 4K 'simple' block). So I increased the randomness block size to 128K, but now I'm worrying about internal SSD deduplication since I'm writing a lot of copies of this.

The short version of my conclusion is that once I start down this road, the only sensible approach is to generate fully random data. But if I'm testing high speed IO in an environment of SSDs and multiple 10G iSCSI networks, I need to generate this random data at a pretty high rate in order to be sure that generating it isn't itself the rate limiting step.
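
As a sanity check on that, something like the following sketch measures how fast fresh random blocks can be produced at all; numpy's default PCG64 generator is used purely as an example of a fast non-cryptographic source of incompressible-looking data, not as a recommendation:

    #!/usr/bin/env python3
    # Measure how quickly we can generate fresh random blocks, to check that
    # data generation itself won't be the rate-limiting step in an IO test.
    import time
    import numpy as np

    BLOCK = 128 * 1024        # 128 Kbyte blocks, as discussed above
    TOTAL = 1024 ** 3         # generate 1 GB worth and time it

    rng = np.random.default_rng()
    start = time.monotonic()
    generated = 0
    while generated < TOTAL:
        block = rng.bytes(BLOCK)   # a fresh, unique random block every time
        generated += len(block)
    elapsed = time.monotonic() - start
    print(f"{generated / (1024 ** 2) / elapsed:.0f} Mbytes/sec of random data")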

(By the way, /dev/urandom may be a good and easy source of random data but it is very much not a high speed source. In fact it's an amazingly slow source, especially on Linux. This was why my initial approach was basically 'read N bytes from /dev/urandom and then repeatedly write them out'.)

PS: I know that I'm ignoring all sorts of things that might affect SSD write speeds over time. Right now I'm assuming that they're going to be relatively immaterial in our environment for hand-waving reasons, including that we can't do anything about them. Of course it's possible that SSDs detect you writing large blocks of zeros and treat them as the equivalent of TRIM commands, but who knows.

SSDBenchmarkingConcerns written at 00:08:36

