Some thoughts on performance shifts in moving from an iSCSI SAN to local SSDs
At one level, we're planning for our new fileserver environment to be very similar to our old one. It will still use ZFS and NFS, our clients will treat it the same, and we're even going to be reusing almost all of our local management tools more or less intact. At another level, though, it's very different because we're dropping our SAN in this iteration. Our current environment is an iSCSI-based SAN using HDs, where every fileserver connects to two iSCSI backends over two independent 1G Ethernet networks; mirrored pairs of disks are split between backends, so we can lose an entire backend without losing any ZFS pools. Our new generation of hardware uses local SSDs, with mirrored pairs of disks split between SATA and SAS. This drastic low level change is going to change a number of performance and failure characteristics of our environment, and today I want to think aloud about how the two environments will differ.
(One reason I care about their differences is that it affects how we want to operate ZFS, by changing what's slow or user-visible and what's not.)
In our current iSCSI environment, we have roughly 200 MBytes/sec of total read bandwidth and write bandwidth across all disks (which we can theoretically get simultaneously) and individual disks can probably do about 100 to 150 MBytes/sec of some combination of reads and writes. With mirrors, we have 2x write amplification from incoming NFS traffic to outgoing iSCSI writes, so 100 MBytes/sec of incoming NFS writes saturates our disk write bandwidth (and it also seems to squeeze our read bandwidth). Individual disks can do on the order of 100 IOPs/sec, and with mirrors, pure read traffic can be distributed across both disks in a pair for 200 IOPs/sec in total. Disks are shared between multiple pools, which visibly causes problems, possibly because the sharing is invisible to our OmniOS fileservers so they do a bad job of scheduling IO.
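The arithmetic behind those limits is simple enough to write down. This is only a back-of-the-envelope sketch using the rough figures quoted above; the function names are made up for illustration, and real behavior will be messier than a straight division.

```python
# Rough figures from the text; treat them as estimates, not measurements.
ISCSI_WRITE_MBYTES = 200   # total write bandwidth over both 1G iSCSI networks
MIRROR_WAYS = 2            # every NFS write is sent to both sides of a mirror


def max_nfs_write_mbytes(backend_write_mbytes, mirror_ways):
    """Incoming NFS write rate at which backend write bandwidth saturates."""
    return backend_write_mbytes / mirror_ways


def mirror_read_iops(per_disk_iops, mirror_ways):
    """Best-case read IOPs/sec when reads spread over every side of a mirror."""
    return per_disk_iops * mirror_ways


print(max_nfs_write_mbytes(ISCSI_WRITE_MBYTES, MIRROR_WAYS))  # 100.0
print(mirror_read_iops(100, MIRROR_WAYS))                     # 200
```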
Faults have happened at all levels of this SAN setup. We have lost individual disks, we have had one of the two iSCSI networks stop being used for some or all of the disks or backends (usually due to software issues), and we have had entire backends need to be rotated out of service and replaced with another one. When we stop using one of the iSCSI networks for most or all disks of one backend, that backend drops to 100 MBytes/sec of total read and write bandwidth, and we've had cases where the OmniOS fileserver just stopped using one network so it was reduced to 100 MBytes/sec to both backends combined.
On our new hardware with local Crucial MX300 and MX500 SSDs, each individual disk has roughly 500 MBytes/sec of read bandwidth and at least 250 MBytes/sec of write bandwidth (the reads are probably hitting the 6.0 Gbps SATA link speed limit). The SAS controller seems to have no total bandwidth limit that we can notice with our disks, but the SATA controller appears to top out at about 2000 MBytes/sec of aggregate read bandwidth. The SSDs can sustain over 10K read IOPs/sec each, even with all sixteen active at once. With a single 10G-T network connection for NFS traffic, a fileserver can do at most about 1 GByte/sec of outgoing reads (which theoretically can be satisfied from a single pair of disks) and 1 GByte/sec of incoming writes (which would likely require at least four disk pairs to get enough total write bandwidth, and probably more because we're writing additional ZFS metadata and periodically forcing the SSDs to flush and so on).
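Here is a rough sketch of where "a single pair" and "at least four pairs" come from, using the per-disk figures above. The `overhead` factor is a hypothetical knob standing in for the extra ZFS metadata writes and flushes; the 1.25 used below is an invented number purely for illustration, not something we measured.

```python
import math

DISK_READ_MBYTES = 500    # per-SSD read bandwidth, from the text
DISK_WRITE_MBYTES = 250   # per-SSD write bandwidth (a lower bound), from the text
NFS_LINK_MBYTES = 1000    # roughly 1 GByte/sec over a single 10G-T link


def pairs_for_reads(target_mbytes, per_disk_read_mbytes):
    """Mirrored pairs needed if reads are spread across both disks of a pair."""
    return math.ceil(target_mbytes / (2 * per_disk_read_mbytes))


def pairs_for_writes(target_mbytes, per_disk_write_mbytes, overhead=1.0):
    """Mirrored pairs needed for writes.

    Both disks in a pair receive the same data, so a pair only has the
    write bandwidth of a single disk. 'overhead' is a hypothetical
    multiplier for extra write traffic beyond the NFS data itself.
    """
    return math.ceil(target_mbytes * overhead / per_disk_write_mbytes)


print(pairs_for_reads(NFS_LINK_MBYTES, DISK_READ_MBYTES))          # 1
print(pairs_for_writes(NFS_LINK_MBYTES, DISK_WRITE_MBYTES))        # 4
print(pairs_for_writes(NFS_LINK_MBYTES, DISK_WRITE_MBYTES, 1.25))  # 5
```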
As far as failures go, we don't expect to lose either the SAS or the SATA controllers, since both of them are integrated into the motherboard. This means we have no analog of an iSCSI backend failure (or temporary unavailability), where a significant number of physical disks are lost at once. Instead the only likely failures seem to be the loss of individual disks and we certainly hope to not have a bunch fall over at once. I have seen a SATA-connected disk drop from a 6.0 Gbps SATA link speed down to 1.5 Gbps, but that may have been an exceptional case caused by pulling it out and then immediately re-inserting it; this dropped the disk's read speed to 140 MBytes/sec or so. We'll likely want to monitor for this, or in general for any link speed that's not 6.0 Gbps.
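As a sketch of what such monitoring could look like: if the new fileservers run Linux (an assumption; nothing above says so), the kernel reports the negotiated speed of each SATA link in `/sys/class/ata_link/link*/sata_spd`. The function names and the choice to ignore links reporting `<unknown>` (which I'd expect from ports with nothing attached) are illustrative guesses, and SAS-connected disks would need some other check.

```python
import glob
import os

EXPECTED_SPEED = "6.0 Gbps"


def read_link_speeds(pattern="/sys/class/ata_link/link*/sata_spd"):
    """Return {link name: reported speed}, assuming the Linux sysfs layout."""
    speeds = {}
    for path in glob.glob(pattern):
        link = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                speeds[link] = f.read().strip()
        except OSError:
            # The link may have gone away between the glob and the read.
            continue
    return speeds


def slow_links(speeds, expected=EXPECTED_SPEED, ignore=("<unknown>",)):
    """Pick out links whose speed is neither the expected one nor ignored."""
    return {
        link: spd
        for link, spd in speeds.items()
        if spd != expected and spd not in ignore
    }


if __name__ == "__main__":
    for link, spd in sorted(slow_links(read_link_speeds()).items()):
        print(f"{link}: {spd} (expected {EXPECTED_SPEED})")
```

Run from cron or a metrics agent, any output at all would be the thing to alert on.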
(We may someday have what is effectively a total server failure, even if the server stays partially up after a fan failure or a partial motherboard explosion or whatever. But if this happens, we've already accepted that the server is 'down' until we can physically do things to fix or replace it.)
In our current iSCSI environment, both ZFS scrubs to check data integrity and ZFS resilvers to replace failed disks can easily have a visible impact on performance during the workday and they don't go really fast even after our tuning; this is probably not surprising given both total read/write bandwidth limits from 1G networking and IOPs/sec limits from using HDs. When coupled with our multi-tenancy, this means that we've generally limited how much scrubbing and resilvering we'll do at once. We may have historically been too cautious about limiting resilvers (they're cheaper than you might think), but we do have a relatively low total write bandwidth limit.
Our old fileservers couldn't have the same ZFS pool use two chunks from the same physical disk without significant performance impact. On our new hardware this doesn't seem to be a problem, which suggests that we may experience much less impact from multi-tenancy (which we're still going to have, due to how we sell storage). This is intuitively what I'd expect, at least for random IO, since SSDs have so many IOPs/sec available; it may also help that the fileserver can now see that all of this IO is going to the same disk and schedule it better.
On our new hardware, test ZFS scrubs and resilvers have run at anywhere from 250 MBytes/sec on upward (on mirrored pools), depending on the test pool's setup and contents. With high SSD IOPs/sec and read and write bandwidth (both to individual disks and in general), it seems very likely that we can be much more aggressive about scrubs and resilvers without visibly affecting NFS fileserver performance, even during the workday. With an apparent 6000 MBytes/sec of total read bandwidth and perhaps 4000 MBytes/sec of total write bandwidth, we're pretty unlikely to starve regular NFS IO with scrub or resilver IO even with aggressive tuning settings.
(One consequence of hoping to mostly see single-disk failures is that under normal circumstances, a given ZFS pool will only ever have a single failed 'disk' from a single vdev. This makes it much less relevant that resilvering multiple disks at once in a ZFS pool is mostly free; the multi-disk case is probably going to be a pretty rare thing, much rarer than it is in our current iSCSI environment.)
Remembering that Python lists can use tuples as the sort keys
I was recently moving some old Python 2 code to Python 3 (due to a
recent decision). This particular code is sufficiently old that it
has (or had) a number of my old Python code habits, and in particular
it made repeated use of list.sort() with comparison functions.
Python 3 doesn't support this; instead you have to tell .sort() what
key to use to sort the list.
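For the simple case, the change looks something like the following sketch. The objects and their `size` field are invented for illustration; they aren't from the real code.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical stand-ins for the real objects being sorted.
items = [
    SimpleNamespace(name="b", size=30),
    SimpleNamespace(name="c", size=70),
    SimpleNamespace(name="a", size=10),
]

# Python 2 style, which Python 3 rejects:
#   items.sort(lambda x, y: cmp(x.size, y.size))

# Python 3 style: name the field to sort on instead of comparing pairs.
items.sort(key=attrgetter("size"))

print([i.name for i in items])  # ['a', 'b', 'c']
```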
For a lot of the code the conversion was straightforward and obvious
because it was just using a field from the object as the sort key.
Then I hit a comparison function that looked like this:
    def _pricmp(a, b):
        apri = a.prio or sys.maxint
        bpri = b.prio or sys.maxint
        if apri != bpri:
            return cmp(apri, bpri)
        return cmp(a.totbytes, b.totbytes)
I stared at this with a sinking feeling, because this comparison function wasn't just picking a field, it was expressing logic. Losing complex comparison logic is a long standing concern of mine, so I was worried that I'd finally run into a situation where I would be forced into unpleasant hacks.
Then I remembered something obvious: Python supports sorting on
tuples, not just single objects. Sorting on tuples compares the
two tuples field by field, so you can easily implement the same
sort of tie-breaking secondary comparison that I was doing in
_pricmp. So I wrote a simple function to generate the tuple
of key fields:
    def _prikey(a):
        # Python 3 has no sys.maxint; sys.maxsize is the usual stand-in.
        apri = a.prio or sys.maxsize
        return (apri, a.totbytes)
Unsurprisingly, this just worked (including the tie-breaking, which actually comes up fairly often in this particular comparison). It's probably even somewhat clearer, and it certainly avoids some potential comparison function mistakes.
(It's also shorter, but that's not necessarily a good thing.)
PS: Python has supported sorting tuples for a long time but I don't
usually think about it, so things had to swirl around in my head
for a bit before the light dawned about how to solve my issue.
There's a certain mental shift that you need to make to go from 'the
key= function retrieves the key field' to 'the key= function creates
the sort key, but it's usually a plain field value'.