Wandering Thoughts archives

2011-05-23

An aside on RAID-5/RAID-6 and disk failures

In light of yesterday's entry, one might sensibly ask why RAID-5 (or RAID-6) systems don't try harder to limp along after enough disks fail. In my view, the short answer is that the RAID systems are lazy.

Most RAID-N systems do not bother trying to verify the data on read; if the disk does not report a read error, they just give the data to you. Thus, even if they have lost enough disks to become nominally non-functional, they could still return some data if you read from them. They would have to fail reads that land on the missing disks and reads that hit a disk read error, but they could satisfy everything else with no worse guarantees than they usually provide.
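To make this concrete, here's a rough Python sketch of the per-read decision such a RAID-5 array could make once it has lost more disks than its parity covers: serve reads that land entirely on surviving disks and fail everything else. This is not how any real RAID implementation is structured; the layout, chunk size, disk count, and set of failed disks are all made up for illustration.

    # Simplified rotating-parity RAID-5 layout; all numbers are invented.
    CHUNK_SECTORS = 128          # assumed chunk size, in sectors
    NDISKS = 5                   # assumed five-disk RAID-5
    FAILED = {1, 3}              # two failed disks: nominally fatal for RAID-5

    def disk_for_chunk(chunk_no, ndisks=NDISKS):
        """Map a logical data chunk to the disk that holds it."""
        data_per_stripe = ndisks - 1
        stripe = chunk_no // data_per_stripe
        slot = chunk_no % data_per_stripe
        parity_disk = stripe % ndisks        # parity rotates across disks
        # Data slots skip over whichever disk holds this stripe's parity.
        return slot if slot < parity_disk else slot + 1

    def can_serve_read(start_sector, length):
        """True if every chunk this read touches is on a surviving disk."""
        first = start_sector // CHUNK_SECTORS
        last = (start_sector + length - 1) // CHUNK_SECTORS
        return all(disk_for_chunk(c) not in FAILED
                   for c in range(first, last + 1))

    for start in (0, 256, 1024, 4096):
        print(start, can_serve_read(start, 64))

Some reads still succeed and some have to fail; the point is simply that the answer is not automatically 'fail everything'.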

(ZFS is an exception, because as discussed previously it does verify checksums on reads and needs all of the blocks of a stripe in order to do so. This means that if you lose enough disks on a ZFS raidzN, everything suddenly has unverifiable checksums and ZFS must fail all reads.)

While you could wave your hands about it, RAID-N systems in this state would pretty much have to fail all writes because they now have nowhere to put parity information. You could argue that they could merely fail writes that would wind up on now-gone disks on the basis that this is no worse than allowing writes to a degraded RAID-N (any read errors or disk losses are unrecoverable in either case), but I think that this is pushing it.

So, RAID-N systems could do better; they just don't bother because it's simpler to give up immediately. This laziness is probably boosted by the fact that even if the RAID-N continued to return some data on reads, the filesystem itself is probably toast, since as discussed yesterday very few filesystems deal well when a significant amount of their storage becomes corrupt or unreadable.

(This is the kind of entry that I write at least partly for myself since it forces me to work through the logic.)

OnRAID5Failures written at 01:39:17

2011-05-22

Why losing part of a striped RAID is fatal even on smart filesystems

I wrote recently (here) that losing a chunk of striped or concatenated storage was essentially instantly fatal for any RAID system, smart ones like ZFS included. Once you start thinking about it, this is a bit peculiar for smart systems like ZFS. ZFS is generally self-healing, after all; why can't it at least try to heal from this loss, and why can't it organize itself so that this sort of loss is as unlikely as possible to be unrecoverable?

(In ZFS terms, I'm talking about the total loss of one vdev in the pool. This is a different thing from the failure of a RAID-5 or RAID-6 array when enough disks go bad at once.)

In theory, recovery from a chunk loss seems at least possible. Smart filesystems like ZFS already have a well-developed idea of partial damage, where they can identify that certain files or entire directories are damaged or inaccessible; they could simply mark every piece of the filesystem that depended on the destroyed chunk as damaged and keep going. Of course this might not work, for two different reasons.

First, the filesystem could have important top level metadata on the lost chunk. If you lose metadata, you lose everything under it; if you lose top-level metadata, that's everything in the filesystem. Second, the filesystem could have placed enough data and metadata on the lost chunk that basically everything in the filesystem is damaged to some degree. The extreme situation is classic striping, where any object over a certain small size is distributed over all chunks and so loss of one chunk damages almost all objects in the filesystem.

(If you are lucky, there is a chain of intact metadata that leads to the object so you can at least recover some of the data. But this is getting obscure.)
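To illustrate the classic striping case with a toy Python calculation (the stripe unit, chunk count, and round-robin layout are all invented for this example, not taken from any real filesystem): any file much larger than the stripe unit has at least one block on every chunk, so losing any single chunk damages it.

    STRIPE_UNIT = 128 * 1024     # assumed 128 KB stripe unit
    NCHUNKS = 4                  # assumed four chunks striped together

    def chunks_touched(file_size, first_chunk=0):
        """Which chunks hold at least one byte of a file of this size,
        assuming its stripe units are laid out round-robin."""
        nblocks = -(-file_size // STRIPE_UNIT)   # ceiling division
        return {(first_chunk + i) % NCHUNKS
                for i in range(min(nblocks, NCHUNKS))}

    for size in (4 * 1024, 100 * 1024, 512 * 1024, 10 * 1024 * 1024):
        touched = chunks_touched(size)
        tag = '  (damaged by losing any chunk)' if len(touched) == NCHUNKS else ''
        print(f"{size:>10} bytes -> chunks {sorted(touched)}{tag}")

With these numbers, everything from 512 KB up touches all four chunks.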

So, you say, why not change a filesystem to harden it against this sort of thing? The problem there is what this requires. You can get part of the way by having redundant copies of metadata on different chunks, but this still leaves you losing data from many or all sufficiently large files; since the data is the important thing, this may not really get you all that much. To do a really good job, you need to try to isolate the damage of a lost chunk by deliberately not striping file data across multiple chunks. This costs you performance in various ways.

In practice, no one wants their filesystems to do this. After all, if they want this sort of hardening and are willing to live with the performance impact, the simple approach is to not stripe the storage in the first place and just make separate filesystems.

(With that said, current smart filesystems could do better. ZFS makes redundant copies of metadata by default, but I believe it still simply gives up if a vdev fails rather than at least trying to let you read what's still there. This is sadly typical of the ZFS approach to problems.)

WhyLosingStripeFatal written at 02:42:27

2011-05-19

One limitation of simple bisection searches in version control systems

The usual description of VCS bisection is that it performs a binary search to determine the changeset that introduced a bug. Unfortunately, there is a fundamental limitation with this model of bisection: it only works well when the bug was introduced exactly once. Otherwise you will get results that are technically correct but not really useful.

Suppose what actually happened was that the bug was unknowingly introduced, accidentally fixed, and then later reintroduced. Doing a bisection will 'work' in that it will give you a changeset where the bug appeared, but what you really want is not just any old changeset where the bug appeared but the most recent changeset where it appeared. There is no guarantee that bisection will give you this changeset, because there is no guarantee that binary search will.

Binary search implicitly requires that the set being searched is ordered; when it is not, you get strange results. In cases like bisection where there are effectively only two values ('good' and 'bad'), having the set be ordered means that all of the 'bad' comes after the 'good' (or vice versa). If this is not the case, binary search will find one of the division points between good and bad, but there is no guarantee of which one it is.
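Here's a toy Python illustration of this; it's not how any real VCS implements bisection, just the same binary search idea run over a made-up history where the bug is introduced at revision 3, accidentally fixed at revision 6, and reintroduced at revision 9.

    HISTORY = ['good', 'good', 'good', 'bad', 'bad', 'bad',
               'good', 'good', 'good', 'bad', 'bad', 'bad']

    def bisect(history):
        """Binary search for a good-to-bad transition, bisect-style.
        Assumes history[0] is good and history[-1] is bad."""
        lo, hi = 0, len(history) - 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if history[mid] == 'good':
                lo = mid
            else:
                hi = mid
        return hi          # the first 'bad' revision it happens to find

    print(bisect(HISTORY))

With this particular history the search converges on revision 3, the original introduction, rather than revision 9, the reintroduction that you actually care about; a slightly different history shape could just as easily land somewhere else.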

This situation is difficult to detect (although there are heuristics in extreme cases) and impossible to completely avoid or fix in anything that you actually want to use.

(One obvious heuristic is to see if the changeset has been completely overwritten by later changes. If it's not contributing any code to the current tree, it's not contributing any bugs either.)

PS: this is what I was thinking of yesterday.

VCSBisectProblem written at 01:16:12

2011-05-17

What we could use 10G Ethernet for in the near future

As I've written before, I don't think we're going to see any great use of 10G Ethernet in our immediate future; it's still too expensive to see it routinely start showing up on machines. But I've recently been thinking about what we could feasibly use 10G Ethernet for in the relatively near future, if (for example) we got money to buy a modest amount of 10G gear.

(I think that this is an interesting question in general. For the moderately near future, many places will be able to afford a few 10G connections but only a few will be able to afford a significant number of them. So the question is: what can you use a modest number of 10G connections for that makes sense?)

The nominally obvious place to deploy 10G Ethernet is in a fileserver infrastructure. Ignoring any lack of need for it, the problem is that when you have a SAN, increasing the speed of one network connection doesn't really help because you just wind up bandwidth-constrained on other connections. We could give our fileservers 10G connections to the outside world, but they'd only be able to use a fraction of that, since they only have two 1G connections into the SAN fabric. If we gave them 10G SAN links, the backends would need fast connections too, or we'd max out at four 1G connections (at best). All of this adds up to a lot of 10G connections (and a lot of changes to our systems, but let's ignore that).
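To put rough numbers on the fileserver case (the link speeds are the ones above; everything else is back-of-the-envelope):

    # Best case: the 10G front-end link can only be fed as fast as the
    # SAN links behind it can supply data.
    frontend_gbit = 10.0
    san_links = 2
    san_link_gbit = 1.0

    usable = san_links * san_link_gbit      # 2 Gbit/s of SAN bandwidth, at best
    print(f"at most {usable:.0f} Gbit/s of the {frontend_gbit:.0f} Gbit/s"
          f" front-end link, i.e. {usable / frontend_gbit:.0%}")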

A lot of the other places that we could sprinkle 10G connections over have similar issues; isolated 10G connections between a few machines generally don't do us any good, because we don't have any super-hot, bandwidth-constrained machines. This is in a sense the downside of building a balanced architecture; lifting it up requires improving a bunch of places at once.

However, there is one place that we could usefully deploy a small number of 10G ports. The core of our physical network is a set of switches chained together through pairs of 10G ports (each switch has two). This is an increasingly awkward architecture as time goes by (for reasons beyond the scope of this entry). We would love to move to a star model where there is a central 10G switch that all of the core switches uplink to; at this point even a four-port 10G switch would give us a useful simplification of the topology.

(Given an increasing use of 10G for switch to switch links, I suspect that this will be a major way that 10G works its way into most machine rooms. We can't be alone in having this topology problem.)

Our10GImmediateFuture written at 01:47:28

2011-05-05

Thinking about when not disabling iSCSI's InitialR2T matters

A commenter on a recent iSCSI entry asked me if disabling initial R2T had made any difference. I can't answer the question because we haven't actually done any of my iSCSI tuning ideas yet, but in the process of saying that I wound up thinking about why I didn't expect disabling initial R2T to make much of a difference for us.

Let's backtrack for a moment and ask what performance impact not disabling initial R2T has. After the dust settles, requiring an initial R2T delays every SCSI WRITE by the time it takes to send the first R2T from the target to the initiator (and to have it processed on both ends). It also adds an extra iSCSI PDU to the network from target to initiator (which may or may not result in an actual extra TCP packet, depending on what else is going on at the time). When does this matter?

I will skip to my conclusion: InitialR2T is a setting that only really matters over high-latency WAN connections and perhaps in some exotic situations with synchronous writes to very, very fast storage.

Most writes are asynchronous. Roughly speaking, delaying an asynchronous write is unimportant provided that both ends can handle enough outstanding writes to fill up the available bandwidth; by definition, no one is stalling for a specific asynchronous write to complete, so all we need is to avoid a general stall. So for an initial R2T to make an actual performance difference we need to be dealing with a situation where this is not true, where either the writes are synchronous or the systems cannot handle enough asynchronous writes to fill up the bandwidth.

But neither case is enough by itself, because we also require that waiting for an R2T adds a visible amount of time to the overall request. Although I haven't looked at packet traces to verify this, I expect competent iSCSI targets to generate R2Ts basically instantly when they get an iSCSI WRITE PDU, and the typical local LAN packet latency is on the order of a tenth of a millisecond (assuming that the LAN from the target to the initiator is not saturated). This time is dwarfed by the time it takes to do disk IO with a physical disk (and to transfer significant amounts of information over a gigabit link).

Ergo, the R2T delay only matters when it starts rising to some visible fraction of the time that the rest of the SCSI WRITE takes, both the actual disk IO and the data transfer time. The easiest way to get this is with slow R2T response times, such as you might get over a high-latency WAN link. In theory you might get this with a very fast disk subsystem on the target, but even then I think you'd have to be in an unusual situation for a tenth of a millisecond per write to matter.
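To put rough numbers on this, here's a little Python comparison of the extra R2T wait against the rest of a SCSI WRITE. All of the figures (disk service times, usable link bandwidth, WAN latency) are assumptions of mine for illustration, not measurements.

    def transfer_ms(size_kb, link_mbytes_per_sec=100.0):
        """Time to move the write data over roughly 1G Ethernet."""
        return size_kb / 1024.0 / link_mbytes_per_sec * 1000.0

    cases = [
        # (description,             R2T delay ms, service ms, write size KB)
        ("LAN, spinning disk",       0.1,          8.0,        128),
        ("LAN, very fast SSD",       0.1,          0.2,        4),
        ("40 ms WAN, spinning disk", 40.0,         8.0,        128),
    ]
    for name, r2t_ms, service_ms, size_kb in cases:
        rest = service_ms + transfer_ms(size_kb)
        print(f"{name}: R2T adds {r2t_ms:.1f} ms on top of {rest:.2f} ms"
              f" ({r2t_ms / rest:.0%} overhead)")

With these numbers the LAN-plus-spinning-disk case comes out at around 1% overhead, the WAN case is dominated by the R2T wait, and the fast-SSD case sits in between.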

(It's possible that this could matter if you are doing small random writes to a fast SSD. The smaller the writes are (and the faster they're serviced), the more outstanding writes you need in order to fill up the available bandwidth. I do not feel like doing the math right now to work out actual numbers for this, plus where are you getting more than 100 Mbytes/sec of small writes from?)
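For what it's worth, a back-of-the-envelope version of that math with purely illustrative numbers, using Little's law (outstanding writes = writes per second times per-write latency):

    LINK_MBYTES = 100.0      # assumed usable 1G Ethernet bandwidth
    R2T_MS = 0.1             # assumed extra per-write delay from InitialR2T

    def writes_in_flight(write_kb, service_ms, extra_ms=0.0):
        """Outstanding writes needed to keep the link full (Little's law)."""
        writes_per_sec = LINK_MBYTES * 1024.0 / write_kb
        latency_s = (service_ms + extra_ms) / 1000.0
        return writes_per_sec * latency_s

    for write_kb, service_ms in [(4, 0.2), (16, 0.3)]:
        base = writes_in_flight(write_kb, service_ms)
        with_r2t = writes_in_flight(write_kb, service_ms, R2T_MS)
        print(f"{write_kb}KB writes: ~{base:.0f} outstanding without the extra"
              f" R2T wait, ~{with_r2t:.0f} with it")

Even with these made-up numbers the difference is only a handful of extra outstanding writes, which any reasonable initiator should be able to manage.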

Oh well. I suppose this simplifies our theoretical future iSCSI tuning efforts.

WhenISCSIR2TMatters written at 02:03:24

2011-05-02

The apparent origins of some odd limitations in the iSCSI protocol

The iSCSI protocol has some odd features and defaults; yesterday I grumbled about how InitialR2T defaults to 'yes', for example. In many ways it is not the sort of protocol that you would design if you were going to do a TCP-based remote block access protocol, even setting aside the idea of transporting SCSI commands across the network.

Now, I wasn't there at the time, so I have no idea what the real reasons were for these protocol decisions; all I can do is guess. But what it certainly looks like from the outside is that these decisions were made in order to allow relatively inexpensive, relatively dumb (and possibly hardware-accelerated) iSCSI target implementations. Take the issue of 'Ready to Transfer' (R2T) messages from the target to the initiator. By requiring R2T messages, a target can pre-allocate limited receive buffers and then strictly control the flow of data into them; it knows that it can never receive valid data that it has not already allocated a buffer for, because it allocates the buffer before it sends the R2T. This is a perfect feature for things with limited resources and hardware that wants to do direct DMA transfers, but it's not how most TCP-based protocols work.
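As a heavily simplified Python sketch of that buffer discipline (this is nothing like real iSCSI PDU handling, and the buffer sizes and counts are invented): the target reserves a receive buffer first, then issues an R2T for exactly that many bytes, so write data can never arrive for space it has not already set aside.

    class DumbTarget:
        def __init__(self, buffer_size=64 * 1024, nbuffers=4):
            self.buffer_size = buffer_size
            self.free_buffers = nbuffers
            self.outstanding = {}        # R2T tag -> bytes we agreed to accept
            self.next_tag = 0

        def handle_write_command(self, total_bytes):
            """Return the R2Ts (tag, offset, length) we can issue right now."""
            r2ts = []
            offset = 0
            while offset < total_bytes and self.free_buffers > 0:
                length = min(self.buffer_size, total_bytes - offset)
                self.free_buffers -= 1           # buffer reserved up front
                tag = self.next_tag
                self.next_tag += 1
                self.outstanding[tag] = length
                r2ts.append((tag, offset, length))
                offset += length
            return r2ts                          # the rest waits for free buffers

        def handle_data_out(self, tag, data):
            """Data only ever arrives against an R2T we already issued."""
            expected = self.outstanding.pop(tag) # KeyError = protocol violation
            assert len(data) <= expected
            self.free_buffers += 1               # buffer reusable once 'DMAed' away

The initiator can never overrun such a target; the cost is an extra round trip before each chunk of write data, which, as I understand it, is the delay that disabling InitialR2T (and allowing unsolicited data for the first burst) gets rid of at the start of a write.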

(Of course, this sort of decision harks back to SCSI itself, which also has the 'target tells you when to send write data' feature (among other things). But this was a sensible decision for SCSI, which operated in a quite different and more direct environment than a TCP stream and with very limited hardware on the disks (well, at least initially). In SCSI you really could DMA the incoming data directly from the wire to the data buffers (and then on to disk) without having to do other work. This is not so true in a TCP-based protocol, which has to decode TCP headers and reassemble the TCP stream before it can even start with such things.)

I can see why iSCSI wants to have this sort of feature available (in part, it enables building simple iSCSI target implementations that transport iSCSI commands more or less directly to physical disks). But I really think that iSCSI should have been specified so that these features were not the default; the starting assumption should have been that you had fully intelligent initiators and targets and that you wanted the best performance possible by default. Although I have not looked at the protocol in detail, my guess is that this might also have added some additional features to the protocol, things like dynamic control of 'receive windows' for write data.

PS: I don't think that ATA-over-Ethernet does any better than iSCSI here. While simpler in some respects, AOE has its own protocol issues.

Sidebar: why iSCSI doesn't need R2T and so on for read data

It might strike you that there is an odd asymmetry in iSCSI: write operations require permission from the target before the initiator sends data, but read operations do not require permission from the initiator before the target sends data. The difference is that the initiator already controls the amount and timing of incoming read data, because it made the read request to start with. The equivalent of a read R2T is the read request itself. Write requests are different because the target doesn't initiate them and so can get hit with arbitrary requests with arbitrary amounts of data at random times.

I tend to think that this does have some drawbacks for low-resource initiators (they must artificially fragment a contiguous read stream in order to limit the incoming data), but it makes for a simpler target implementation (the target doesn't have to keep a bunch of buffered data sitting around until the initiator allows it to be offloaded), and I suspect that this was what was on the minds of the people creating the iSCSI protocol.

ISCSIProtocolLimitations written at 01:16:39
