Wandering Thoughts archives

2013-11-30

The case of the disappearing ESATA disk

This is a mystery (ie I have no answers yet), and also a story of what I think is the perversity of hardware (I can't be sure yet). I'm writing it up partly because I rarely see sysadmins writing up our problems, with the result that I think it's easy to underestimate how weird things sometimes get out there.

We have a server with an external SATA disk enclosure. The enclosure has three port multiplier based (E)SATA channels, each with five drive bays on them; we currently have ten disks in the enclosure, all identical, taking up the full capacity of two channels. The server is running 64-bit Ubuntu 12.04. We recently moved the server from our test area to our production machine room, which was when we discovered the mystery: under specific circumstances, exactly one disk is not seen by the server.

If you power off the external enclosure and the server, the first time the server boots it will not see one specific disk bay on the enclosure. This is not just that the disk in the disk bay doesn't respond fast enough; the disk remains invisible no matter how long you let it sit. Rebooting the server will make the disk reappear, as will hotplugging the disk (pulling out its disk sled just enough to cut power, then pushing it back in). This doesn't happen if just the server itself is powered down; as long as the disk enclosure stays powered on, all is fine. So far this could be a whole list of things. Unfortunately this is where it gets weird. First, it's not the disk itself; we've swapped disks between bays and the problem stays with the specific bay. Next, it's not a straightforward hardware failure in the enclosure or anything directly related to it; at this point we've swapped the disk enclosure itself (with a spare), the ESATA cables, and the ESATA controller card in the server.

(To cut a long story short, it's quite possible that the problem has been there all along. Nor do we have any other copies of this model of disk enclosure around that we can be sure don't have the problem (since we have two more of these enclosures in production, this is making me nervous).)
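As an aside, the hotplug poking has at least a partial software analogue. Here is a rough Python sketch (generic Linux sysfs paths, nothing specific to our enclosure; it needs root, and I don't actually know whether a sysfs rescan does the same full link reset that physically reseating a sled does) of listing which disks the kernel currently sees behind each SATA/SCSI host and then asking every host to rescan:

    #!/usr/bin/env python3
    # A sketch only: enumerate what the kernel sees, then poke every
    # SCSI/SATA host with the standard '- - -' wildcard rescan.
    import glob
    import os

    def visible_disks():
        """Map each SCSI host to the sd* devices the kernel sees behind it."""
        seen = {}
        for dev in glob.glob('/sys/block/sd*'):
            target = os.path.realpath(os.path.join(dev, 'device'))
            # The 'hostN' component of the resolved device path names the host.
            host = next((p for p in target.split('/') if p.startswith('host')), '?')
            seen.setdefault(host, []).append(os.path.basename(dev))
        return seen

    def rescan_all_hosts():
        """Ask every host to rescan all channels, targets, and LUNs (needs root)."""
        for scan in glob.glob('/sys/class/scsi_host/host*/scan'):
            with open(scan, 'w') as f:
                f.write('- - -\n')

    if __name__ == '__main__':
        print('before:', visible_disks())
        rescan_all_hosts()
        print('after: ', visible_disks())

If a rescan made the missing disk appear the way a physical hotplug does, it would at least be a less disruptive way to recover from the problem.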

One of the many things that really puzzles me about this is trying to come up with an explanation for why this could be happening. For instance, why does the disk become visible if we merely reboot the server?

I don't usually run into problems like these, which I'm generally very thankful for. But every so often something really odd comes up and apparently this is one of those times.

(Also, I guess power-fail tests are going to have to become a standard thing that we do before we put machines into production. If this kind of fault can happen once it can happen more than once, and we'd really like not to find out about it after the first time we have to power cycle all of this stuff in production.)

PS: Now you may be able to guess why I have a sudden new interest in how modern Linux assembles RAID arrays. It certainly hasn't helped testing that the drives have a RAID-6 array on them that we'd rather not have explode, especially when resyncs take about 24 hours.
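Since I very much do not want to power cycle things in the middle of a resync, the check I want to run before each test is something like this minimal sketch ('md0' is just a placeholder for whatever the array is actually named):

    #!/usr/bin/env python3
    # A sketch of a pre-power-cycle sanity check: is the array fully
    # assembled and idle, or is it degraded or partway through a resync?
    ARRAY = 'md0'   # placeholder name

    def md_state(array):
        base = '/sys/block/{0}/md'.format(array)
        with open(base + '/degraded') as f:
            degraded = f.read().strip() != '0'
        with open(base + '/sync_action') as f:
            sync_action = f.read().strip()   # 'idle' when nothing is running
        return degraded, sync_action

    if __name__ == '__main__':
        degraded, action = md_state(ARRAY)
        if degraded or action != 'idle':
            print('do NOT power cycle: degraded={0}, sync_action={1}'.format(degraded, action))
        else:
            print(ARRAY, 'looks clean and idle')

Eyeballing /proc/mdstat tells you the same thing, of course; the point of scripting it is to make the check impossible to forget.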

Sidebar: Tests we should do

Since I've been coming up with these ideas in the course of writing this entry, I'm going to put them down here:

  • Reorder the ESATA cables (changing the mapping between ESATA controller card ports and the enclosure's channels). If the fault moved to the other channel it would mean that the problem isn't in the enclosure but in something upstream.

  • 'Hotswap' another drive on the channel to see if the invisible disk then becomes visible due to the full channel reset et al.

I'm already planning to roll more recent kernels than the normal Ubuntu 12.04 one on to the machine to see what happens, but that's starting to grasp at straws.

DisappearingESATADisk written at 02:02:01; Add Comment

2013-11-25

Track your disk failures

Here is something that we've been learning the hard way: if you have any sort of fileserver environment with a significant number of disks (and maybe even if you don't), you should be tracking all of your disk failures. What this tracking is for is identifying failure patterns in your environment, things like whether certain sorts of disks fail more often, or disks in certain enclosures, and so on.

The very basic information you should record is full details for every disk failure. What I'd record today is when it happened, what sort of disk it was, what enclosure and bay it failed in, and how it failed (read errors, write errors, total death, IO got really slow, or however it happened). You might also want to track SMART attributes and note if you got any sort of SMART notices beforehand (in the extreme, you'd track SMART notices too). You may also be able to record how old the disk was (based on warranty status and perhaps date of manufacture information). This doesn't need any sort of complicated database system; a text file is fine, but you should record the main information in a form that can be extracted with grep and awk.

(If you have external disk enclosures, keeping such a log may also raise the issue of consistent identification for them. Locally we have swapped some enclosures around when various things happen, which at the very least means you're going to want to note in the log that 'host X had its enclosure swapped here'.)

Once you have the core information logged you should also keep track of some aggregated failure information (instead of just having people generate it on demand from the log). I would track at least failures by disk type and failures by enclosure, because these are the two things that are most likely to show correlations (ie, cases where one sort of disk is bad or one enclosure has a problem you may have overlooked). Update this aggregated information any time you add something to the log, either by hand or by auto-generating the aggregated stats from the log.
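If you go the auto-generation route, the script involved is tiny. Here's a rough Python sketch; the '|'-separated log format in the comment is made up for illustration, so adjust the field handling to whatever you actually record:

    #!/usr/bin/env python3
    # A sketch that assumes an invented one-line-per-failure format with
    # '|'-separated fields, for example:
    #   2013-11-02 | SomeVendor-1TB | enclosure2 bay 7 | read errors, then total death
    import collections
    import sys

    def aggregate(logfile):
        by_type = collections.Counter()
        by_enclosure = collections.Counter()
        with open(logfile) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                # This will blow up on malformed lines, which is arguably a feature.
                _when, disktype, location, _how = [p.strip() for p in line.split('|')]
                by_type[disktype] += 1
                # Assume the first word of the location is the enclosure name.
                by_enclosure[location.split()[0]] += 1
        return by_type, by_enclosure

    if __name__ == '__main__':
        by_type, by_enc = aggregate(sys.argv[1])
        print('failures by disk type:', dict(by_type))
        print('failures by enclosure:', dict(by_enc))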

(This may sound obvious to some people but trust me, it's an easy thing to overlook or just not think about when you're starting out on a grand fileserver adventure.)

TrackYourDiskFailures written at 00:34:42; Add Comment

2013-11-15

Professional knowledge, certification, and regulation

It started with a tweet by Matt Simmons that got a reaction from me. Before I write a bunch of entries on this (instead of trying to cram even more complex thoughts into 140 characters) I want to talk about the significant differences I see between three basic things that are being talked about here (in general).

A body of professional knowledge is what it sounds like; in the case of system administration, it's some version of our accumulated experiences and wisdom. For now let's pretend that this body of knowledge will be descriptive ('when people did X this is what generally happened') instead of prescriptive ('do X instead of Y') and thus basically uncontroversial. I think that accumulating a body of knowledge is a noble endeavour but I also think that it's a lot of work, which means that it's not going to happen unless people find some way to pay for it.

(It's not enough for us to blog; blogging is to an established body of professional knowledge as research papers are to science textbooks. To have a real body of knowledge we need the equivalent of a textbook that people agree on. Putting everything together to create it is where a good part of the thankless hard work is.)

Certification sits on top of some body of professional knowledge. I see three broad levels of certification: certification that you have been exposed to the professional knowledge (this is the 'I got training' level of certification), certification that you know the professional knowledge, and certification that you can properly apply the professional knowledge. The latter sort of certification is the most useful to businesses, hiring managers, and so on (because it's what they really care about), but it also places a demand on the professional knowledge. To be able to meaningfully certify this you must be able to pose questions that have a (theoretically) objective correct answer and an incorrect answer, because that's what it takes to test people on it. This is fine for certain sorts of professional knowledge, where there already really is only one correct answer (eg, 'how do you use command X to do Y'). However it's my view that this is the least interesting thing to certify and what people really want to certify is much higher level and correspondingly much fuzzier about 'correctness' in its natural state.

(At this point I will note that a university degree is not certification in this sense. If it was we would not have all these stories about Computer Science graduates who can't program worth beans.)

Regulation is a significant step beyond mere certification where you basically get punished for not being certified or not doing things in the certified way, whether this is directly (by law) or tacitly (eg by increased liability). Unlike professional knowledge or certification, regulation is not something that can be done purely within a profession; it intrinsically needs the outside world to play along. Generally the outside world is at least a bit sceptical so this takes quite a bit of work one way or the other.

(The easiest way to get the outside world to care is for clearly slipshod work to kill people.)

As I see things there are major gulfs between all three of these things. The gulf between certification and regulation is obvious. The gulf between professional knowledge and strong certification is the distance from having 'best practices' to reaching consensus that some options are never valid.

KnowledgeCertsAndRegulation written at 02:06:37; Add Comment

2013-11-13

The cost of expensive hardware and the benefit of hindsight

One possible response to my entry about our discovery that we had dead chassis fans in our disk enclosures is to say that this is the cost of buying inexpensive enclosures; clearly we should have gotten a better grade of disk enclosures. This is literally true, but to just leave it at that is to miss the big picture (or actually several of them).

The first big picture is that we did not buy these disk enclosures blindly. We bought one, opened it up, looked it over, tested it out, and when we liked what we saw during all of this we bought more. At no time during this evaluation process did it occur to anyone to say 'wait a minute, what happens if the fans start dying? how will we find out?'. As a result we didn't make any sort of conscious choice to live without chassis monitoring; instead it never crossed our minds that we might need it. I rather expect that this is a common thing; among other things, we have all sorts of cognitive biases that make us think that of course things always work.

(If you want to say that you always consider and monitor for fan failure, I'm going to ask you if all of your network switches alarm or report on fan failures, even the little ones and your bulk basic top of rack aggregator switches. Perhaps you buy expensive enough switches that this is true.)

The second big picture is that our disk enclosures work. They have worked for years and they continue to all work even today despite their seized fans. It's hard to avoid the objective conclusion that we made a good choice, and possibly even the right choice, despite this silent fan failure issue. Have we had hard drives die in these disk enclosures? Yes, of course, but it's not clear if any potentially seized fans have caused more than usual to fail.

(Google's famous disk study actually found less correlation between drive temperature and failure rates than you might expect.)

But the most important big picture is about the costs of buying expensive hardware instead of less expensive hardware. Budgets are almost never infinite, so the real cost of more expensive hardware is the opportunity cost. The money you spent on it is money that is not being spent elsewhere; it translates to buying less capacity, or not buying another piece of hardware for something else important, or fewer spares, or, if your budget can be grown to cover all of that, it means that someone else's budget gets cut to make up the shortfall and they get to do without something. More expensive hardware is never free; you are always giving up something to get it even if you can't see it. This is true even or especially if the more expensive hardware is objectively better than the cheaper hardware and objectively worth the extra cost. It's not just about whether you're paying a fair price for the extra benefits, it's about what you're giving up elsewhere to get them. And this is almost always situational; from the outside, you don't necessarily know what someone else's opportunity costs are.

(The most extreme version of this is when you have a fixed budget and a fixed objective and if you can't hit the objective within the budget there's no point in doing anything. At that point the perfect but over budget is very much the enemy of the just good enough but in budget.)

So even if we got to make the decision all over again, this time thinking about and knowing about the potential for fan failure, I'm not sure we'd choose any differently. It would be nice to have monitoring for the fans but I'm not sure it would be worth whatever else it would have cost us at the time.

(This is of course annoying to technical people, especially if we can see a risk that we are not mitigating. We want nice, good hardware. But our feelings of elegance play second fiddle to the needs of the organization and we must never forget that.)

CostOfExpensiveStuff written at 01:21:03; Add Comment

2013-11-10

Are those chassis fans actually still spinning?

This is kind of a sysadmin horror story. We have a number of external port-multiplier eSATA enclosures, used both for our iSCSI backends and for our disk-based backup system. These enclosures are now far from new, so we've had some mysterious failures with some of them. As a result of this we recently opened up a couple of them to poke around, with the hopes of reviving at least one chassis to full health.

These chassis have a number of internal fans for ventilation. What my co-workers found when they opened up these chassis was that some of these fans had seized up completely. At some point in the four or so years these enclosures have been operating, most of their fans had quietly died. We hadn't gotten any warnings about this because these enclosures are very basic and don't have any sort of overall health monitoring (if the fans themselves had been making noises before they died, we never noticed it over the general din of the machine room).

This is what you call an unfortunate unanticipated failure mode. In theory I suppose that we should have anticipated it; we knew from the start that there was no chassis health monitoring and fans do die eventually. In practice fan failures have been very uncommon, at least on hardware where they actually get monitored (either directly or through 'fan failures are so bad the machine explodes'), so we hadn't really thought about this before now.

Now, of course, we have a problem. We have a number of these chassis in live production service and we can't directly check the fans on a chassis without opening it up, which means taking them out of service. We may be able to indirectly observe the state of the fans by looking at hard drive temperatures, but there are a number of potential confounding effects there.
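For what the indirect check is worth, pulling drive temperatures is easy enough to script. Here's a rough Python sketch around smartctl; the attribute name varies by drive model (some report Temperature_Celsius, others Airflow_Temperature_Cel), it needs root, and as said above, ambient temperature, workload, and bay position all confound the numbers:

    #!/usr/bin/env python3
    # A sketch: run 'smartctl -A' against each disk and pull out the raw
    # value from the temperature attribute line (field 10 of the table).
    import glob
    import subprocess

    def drive_temp(dev):
        proc = subprocess.Popen(['smartctl', '-A', dev], stdout=subprocess.PIPE)
        output = proc.communicate()[0].decode('utf-8', 'replace')
        for line in output.splitlines():
            if 'Temperature_Celsius' in line or 'Airflow_Temperature_Cel' in line:
                return line.split()[9]
        return None

    if __name__ == '__main__':
        for dev in sorted(glob.glob('/dev/sd?')):
            print(dev, drive_temp(dev))

Graphing these over time per bay is probably more useful than a single snapshot, but even a snapshot might flag a bay that runs clearly hotter than its identical neighbours.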

The larger scale effect of this is that I'm now nervously trying to think about any other fans that we're not directly monitoring and that we're just assuming are fine because the machine they're in hasn't died.

(Of course there's nothing new under the sun here; this is a variant of the well known 'do you actually get told if a disk dies in your redundant RAID array and your array stops being redundant any more' issue.)

AreYourFansSpinning written at 23:22:28; Add Comment

My views on network booting as an alternative to system disks

In a comment on my entry on the potential downsides of SSDs as system disks, zwd asked if I'd considered skipping the need for system disks by just PXE booting the systems instead (as some Illumos distributions are now recommending). The short answer is no but I have enough thoughts about this to warrant a long answer.

My view is that network booting systems is at its best where you have a large and mostly homogenous set of servers that basically run a constant set of things with little local state or local configuration. In this environment you don't want to bother taking the time to install to the local disks on today's server and it simplifies life if you can upgrade machines just by rebooting them. With little local state the difficulty of having state in a diskless environment doesn't cause too much heartburn in practice and running a constant set of programs generally reduces the load on your 'system filesystem' fileserver and may make it practical to have an all-in-RAM system image.

With that said, in general a diskless environment is almost intrinsically more complicated than local disks in theory and definitely more complicated in practice today. While you have a spectrum of options none of them are as simple and as resilient as local disks; they all require some degree of external support and create complications around things like software upgrades. Some of the options require significant infrastructure. All of them create additional dependencies before your servers will boot. In a large environment the simplifications elsewhere make up for this.

We aren't a large environment. In fact we're a very bad case for netbooting. Our modest number of systems are significantly heterogenous, they have potentially significant local state, a given system often runs a wide variety of software (very wide, for systems users log in to), and we don't want to reboot them at all in normal conditions. Some servers are already dependent on central NFS fileservers but other servers we very much want to keep working even if the fileserver environment has problems and of course the components of the fileserver environment are a crucial central point that we want to work almost no matter what with as few external dependencies as possible (ideally none beyond 'there is a network'). Single points of failure that can potentially take down much of our infrastructure give us heartburn. On top of this, diskless booting is not something that I believe is well supported by the majority of the OSes and Linux distributions that we use; we'd almost certainly be going off the beaten and fully supported path in terms of installation and system management (and might have to build some tools of our own).

In short: we'd save very little (or basically nothing) by using network booted diskless servers and get a whole bunch of problems to go with it. We'd need additional boot servers and relatively heavy duty fileservers to serve up the system filesystems and store 'local' state and we'd have non-standard system management that would be more difficult than we have today. Even if I felt enthused about this (which I don't) it would be a very hard sell to my co-workers; they would rationally ask 'what are we getting for all of this extra complexity and overhead?' and I would have no good answer.

(We don't install or reinstall systems anywhere near often enough that 'faster and easier installs' would be a good answer.)

NetbootingViews written at 02:38:17; Add Comment

2013-11-06

Why you might not want to use SSDs as system disks just yet

I wrote recently about our planned switch to using SSDs as system disks. This move may be somewhat more daring and risky than I've made it sound so far, so today I feel like running down all of the reasons I know that you might not want to do this (or at least do this just yet). Note that not all of this is deeply researched and in fact a bunch of it is taken from ambient stories and gossip that floats around.

The current state of hard drives is that they are a mature technology. As a mature technology the guts are basically the same between all hard drives, across different models and different manufacturers, and the main 'enterprise' drive upsell is often about the firmware (and somewhat about how you connect to them). As such the consumer 7200 RPM SATA drive you can easily buy is mostly the same as an expensive 7200 RPM nearline SAS drive. Which is, of course, why so many people buy more or less consumer SATA drives for all sorts of production use.

My impression is that this is not the case for SSDs. Instead SSDs are a rapidly evolving immature technology, with all that that implies. SSDs are not homogenous; they vary significantly between manufacturers, product lines, and even product generations. Unlike hard drives you can't assume that any SSD you buy from a major player in the market will be decent or worth its price (but on the other hand it can sometimes be an underpriced gem). There are also real and significant differences between 'enterprise' SSDs and ordinary consumer SSDs; the two are not small variants of each other and ordinary consumer SSDs may not be up to actual production usage in server environments.

You can find plenty of horror stories about specific SSDs out there. You can also find more general horror stories about SSD behavior under exceptional conditions; one example I've seen recently is Understanding the Robustness of SSDs under Power Fault [PDF] (from FAST '13), which is about what it says. Let's scare everyone with a bit from the abstract:

Our experimental results reveal that thirteen out of the fifteen tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.

Most of their SSDs were from several years ago so things may be better with current SSDs. Or maybe not. We don't necessarily know and that's part of the problem with SSDs. SSDs are very complex devices and vendors have every reason to gloss over inconvenient details and (allegedly) make devices that lie about things to you so that they look faster or healthier.

(It's widely reported that some SSDs simply ignore cache flush commands from the host instead of dutifully and slowly committing pending writes to the actual flash. And we're not talking about SSDs that have supercapacitors so that they can handle power failures.)

On a large scale level none of this is particularly surprising or novel (not even the bit about ignoring cache flushes). We saw the same things in the hard drive industry before it became a mature field, including manufacturers being 'good' or 'bad' and there being real differences between the technology of different manufacturers and between 'enterprise' and consumer drives. SSDs are just in the early stages of the same process that HDs went through in their time.

Ultimately that's the large scale reason to consider avoiding SSDs for casual use, such as for system drives. If you don't actively need them or really benefit from them, why take the risks that come from being a pioneer?

(This is the devil's advocate position and I'm not sure how much I agree with it. But I put the arguments for SSDs in the other entry.)

SSDsWhyNotSystemDisks written at 23:06:29; Add Comment

2013-11-03

Why we're switching to SSDs for system disks

A commentator on my entry on our future fileserver hardware asked a good question, namely why we're planning to use SSDs for system disks. This is actually likely part of a general shift for us (we've already done it on some new servers). The short version of why is that it is less 'why' and more 'why not'.

SATA hard drives seem to basically have a floor price. 3.5" or 2.5", small or moderately large, you simply can't easily get a decent 7200 rpm drive for less than about $50-$60 (at least in small quantities) no matter how little space you want. The amount of space you get for your $60 has been steadily increasing, but the price for a small amount of space has not dropped (instead, small HDs seem to have progressively disappeared). As the price per GB of SSDs has shrunk, SSDs that are more than big enough to be OS system disks have now reached this magic $60 price point.

Raw speed and random seek times are usually not an issue for system disks (although we have at least one server where they actually do matter; said server is now using SSDs). However our general assumption is that SSDs are likely to be more reliable than HDs because SSDs don't have that spinning rust hurtling around and around (they also don't get as hot). And being fast doesn't exactly hurt. Since we can get big enough SSDs for the same price as more than big enough HDs, we might as well go with SSDs when it's convenient.

(The relative long-term durability of SSDs versus HDs is at least somewhat uncertain (and they're probably going to fail in different ways). Our HDs have generally lasted as long as we could ask and SSDs have issues like write wear and so on. But on the whole it seems worth taking the chance, especially since there are some benefits.)

SSDsAsSystemDisks written at 02:08:03; Add Comment

