Wandering Thoughts archives

2014-04-30

Failover versus sparing in theory and in practice

Suppose that you have a fileserver infrastructure with some number of physical servers, a backend storage network, and some number of logical fileservers embodied on top of all of this. Broadly speaking, there are two strategies you can follow if one of those physical servers has problems. You can fail the logical fileserver the physical server hosts over to another machine, perhaps a hot spare server, or you can replace the physical host in place with some amount of spare hardware, for example by simply removing the system disks and putting them in a new server unit. Let's call these two options 'failover' and 'sparing'.

In theory, failover has a bunch of advantages, like that you can do it without physical access to the machines and that it survives more host failures (eg the system disks dying or the installed system getting corrupted). Also in theory our fileserver environment was deliberately engineered to support failover, for example by having the idea of 'logical fileservers' at all. In practice we've basically abandoned the use of failover; when serious hardware problems emerge our answer is almost always to spare the hardware out. There are at least two reasons for this.

First, failover in our environment is very slow. An ordinary ZFS pool import in an iSCSI environment with multiple pools and many iSCSI disks is impressively slow to start with, plus each fileserver has several pools to bring up, plus the other work of adding IP aliases and so on. In practice a failover takes long enough to qualify as 'achingly slow' and also significantly disruptive for NFS clients.

(I believe that we've also had issues with things like NFS lock state not fully recovering after a failover attempt. Possibly this could be worked around if we did the right things.)
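
To make the slowness concrete, a manual failover amounts to something like this on the takeover machine (a minimal sketch with made-up pool names, IP address, and interface, assuming a Solaris-style ifconfig):

zpool import -f fs1-pool1     # repeat for each of the fileserver's pools;
zpool import -f fs1-pool2     # each import has to taste many iSCSI disks
ifconfig e1000g0 addif 128.100.3.50 netmask 255.255.255.0 up

(Plus whatever else your particular environment needs, such as making sure the NFS exports have come back.)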

Second, our backups are tied to the real hosts instead of the logical fileservers. Failing over a fileserver to a different real host for any length of time means that the backup system needs extensive and daunting mangling (or alternatively we live with it abruptly doing full backups of terabytes of 'new' filesystems, throwing off the backup schedule for existing ones). This makes failover massively disruptive in practice for anything beyond short term things (where by 'short term' I mean 'before the next nightly backups run').

By contrast, swapping fileserver hardware is easy, relatively fast, and pretty much completely reliable unless the installed system has become corrupted somehow. To both the server and the clients it just looks like an extended crash or other downtime, and things recover as well as they ever do from that. So far the only tricky bit about such hardware shifts has been getting the system to accept the 'new' Ethernet devices as its proper Ethernet devices.

We'll probably keep our current design on our new fileserver hardware, complete with the possibility for failover of a logical fileserver. But I don't expect it to work any better than before so we'll probably keep doing physical sparing of problem hardware even in the future.

(One thing writing this entry has pointed out to me is that we ought to work out a tested and documented procedure for transplanting system disks from machine to machine under OmniOS and our new hardware. Sooner or later we'll probably need it.)

FailoverVersusSparing written at 23:20:38

Backup systems, actual hosts, and logical hosts

One of the little but potentially important differences between backup systems is whether they can back up logical hosts or if, for one reason or another, they can only back up actual hosts. Since this sounds like a completely abstract situation, let's set up a concrete one.

Let's suppose that you have three fileserver hosts, call them A, B, and C, and two logical fileservers, fs1 and fs2 (and some sort of movable or shared storage system behind A, B, and C). Actual filesystems are associated with a logical fileserver while each logical fileserver is hosted on a particular machine (with one left over for a spare).

If your backup system will back up logical hosts, you can tell it 'back up fs1:/a/fred and fs2:/b/barney', have this work, and have the backup system associate things like index metadata about what file is in what backup run with these logical names. This is what you want because it means your backup system doesn't care which physical host fs1 and fs2 are on, which in turn makes it much easier to move fs1 from A to C in an emergency. However, if your backup system insists on dealing with real hosts then you must tell it 'back up A:/a/fred and B:/b/barney'; all of the index metadata and so on is associated with A and B, and the backup system will either explode or require manual attention if /a/fred ever winds up on C. This is obviously not really very desirable.
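
To make the difference concrete, here is what the two styles of configuration might look like in a hypothetical disklist-style format (the syntax is made up and not from any particular backup system):

# logical hosts: stays valid no matter which machine hosts fs1
fs1:/a/fred
fs2:/b/barney

# real hosts: needs hand-editing the moment /a/fred moves off of A
A:/a/fred
B:/b/barney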

You might think that of course a backup system will back up logical hosts instead of insisting on real hosts. In practice there are all sorts of ways for a backup system to quietly need real hosts. Does the client software send the local hostname to the server as part of the protocol? Does the client software make network connections to the server, and does the server use the IP address those connections come from to do things like verify access rights or connect incoming backup streams to requested backups? Then your backup system might be implicitly requiring you to use real hosts.

(Even if the backup system theoretically copes with backing up logical hosts it may have limitations that will cause problems if two logical hosts ever wind up on the same real host or if you try to back up both the logical host and some stuff on the real host. This split between logical hosts and real hosts is a corner case and it exposes any number of potential issues.)

BackupHostsRealOrLogical written at 01:19:11

2014-04-14

Chasing SSL certificate chains to build a chain file

Suppose that you have some shiny new SSL certificates for some reason. These new certificates need a chain of intermediate certificates in order to work with everything, but for whatever reason you don't have the right set. In ideal circumstances you'll be able to easily find the right intermediate certificates on your SSL CA's website and won't need the rest of this entry.

Okay, let's assume that your SSL CA's website is an unhelpful swamp pit. Fortunately all is not lost, because these days at least some SSL certificates come with the information needed to find the intermediate certificates. First we need to dump out our certificate, following my OpenSSL basics:

openssl x509 -text -noout -in WHAT.crt

This will print out a bunch of information. If you're in luck (or possibly always), down at the bottom there will be an 'Authority Information Access' section with a 'CA Issuers - URI' bit. That is the URL of the next certificate up the chain, so we fetch it:

wget <SOME-URL>.crt

(In case it's not obvious: for this purpose you don't have to worry if this URL is being fetched over HTTP instead of HTTPS. Either your certificate is signed by this public key or it isn't.)
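
(If you want to pull the URL out mechanically, something like the following works with common versions of openssl and sed, although the exact output formatting isn't guaranteed, so treat it as a sketch:

openssl x509 -noout -text -in WHAT.crt | sed -n 's/.*CA Issuers - URI://p'

This prints just the URL, or nothing if the certificate has no CA Issuers entry.)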

Generally or perhaps always this will not be a plain text file like your certificate is, but instead a binary blob. The plain text format is called PEM; your fetched binary blob of a certificate is probably in the binary DER encoding. To convert from DER to PEM we do:

openssl x509 -inform DER -in <WGOT-FILE>.crt -outform PEM -out intermediate-01.crt

Now you can inspect intermediate-01.crt in the same way to see if it needs a further intermediate certificate; if it does, iterate this process. When you have a suitable collection of PEM format intermediate certificates, simply concatenate them together in order (from the first you fetched to the last, per here) to create your chain file.
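
If you expect to do this more than once, the whole chase can be scripted. The following sketch assumes that your starting certificate is in server.crt, that every fetched certificate is DER encoded, and that each certificate has at most one CA Issuers URI; none of these are guaranteed, so check the results by hand:

#!/bin/sh
# Chase CA Issuers URLs starting from server.crt, producing
# intermediate-01.crt, intermediate-02.crt, and so on in PEM form.
cert=server.crt
n=1
while :; do
    url=$(openssl x509 -noout -text -in "$cert" |
          sed -n 's/.*CA Issuers - URI://p' | head -1)
    [ -z "$url" ] && break        # no parent URL, so we're at the top
    out=$(printf 'intermediate-%02d.crt' "$n")
    wget -q -O "$out.der" "$url"
    openssl x509 -inform DER -in "$out.der" -outform PEM -out "$out"
    cert=$out
    n=$((n + 1))
done
# concatenate in fetch order to make the chain file
cat intermediate-*.crt > chain.crt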

PS: The Qualys SSL Server Test is a good way to see how correct your certificate chain is. If it reports that it had to download any certificates, your chain of intermediate certificates is not complete. Similarly it may report that some entries in your chain are not necessary, although in practice this rarely hurts.

Sidebar: Browsers and certificate chains

As you might guess, some but not all browsers appear to use this embedded intermediate certificate URL to automatically fetch any necessary intermediate certificates during certificate validation (as mentioned eg here). Relatedly, browsers will probably not tell you about unnecessary intermediate certificates they received from your website. The upshot of this can be an HTTPS website that works in some browsers but fails in others, and in the failing browser it may appear that you sent no additional certificates as part of a certificate chain. Always test with a tool that will tell you the low-level details.
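
One always-available low-level tool is openssl itself. This shows every certificate the server actually sent (substitute your own host name, of course):

openssl s_client -connect www.example.org:443 -showcerts </dev/null

Look at the 'Certificate chain' section of the output to make sure all of the intermediate certificates you expect are actually there.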

(Doing otherwise can cause a great deal of head scratching and frustration. Don't ask how I came to know this.)

SSLChasingCertChains written at 22:02:04

2014-04-10

My current choice of a performance metrics system and why I picked it

In response to my previous entries on gathering OS level performance metrics, people have left a number of comments recommending various systems for doing this. So now it's time to explain my current decision about this.

The short version: I'm planning to use graphite combined with some stats-gathering frontend, probably collectd. We may wind up wanting something more sophisticated as the web interface; we'll see.

This decision is not made from a full and careful comparison of all of the available tools with respect to what we need, partly because I don't know enough to make that comparison. Instead it's made in large part based on what seems to be popular among relatively prominent and leading edge organizations today. Put bluntly, graphite appears to be the current DevOps hotness as far as metrics goes.

That it's the popular and apparent default choice means two good things. First, given that it's used by much bigger environments than we are I can probably make it work for us, and given that the world is not full of angry muttering about how annoying and/or terrible it is it's probably not going to be particularly bad. Second, it's much more likely that such a popular tool will have a good ecology around it, that there will be people writing howtos and 'how I did this' articles for it and add on tools and so on. And indeed this seems to be the case based on my trawling of the Internet so far; I've tripped over far more stuff about graphite than about anything else and there seem to be any number of ways of collecting stats and feeding it data.

(That graphite's the popular choice also means that it's likely to be kept up to date, developed further, possibly packaged for me, and so on.)

A side benefit of this reading is that it's shown me that people are pushing metrics into graphite-based systems at relatively high rates. This is exactly what I want to do, given that averages lie and the shorter the period you take them over, the better for avoiding some of those lies.

(I'm aware that we may run into things like disk IO limits. I'll have to see, but gathering metrics say every five or ten seconds is certainly my goal.)
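
(Part of why people can push metrics at high rates is that feeding graphite is almost trivially easy. Carbon's plaintext protocol is just 'metric value timestamp' lines over TCP, by default on port 2003. The metric name and host here are made up:

echo "fileservers.fs1.nfs.read_ops 1234 $(date +%s)" | nc graphite.example.org 2003

Anything that can make a TCP connection can feed it data.)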

Many of the alternatives are probably perfectly good and would do decently well for us. They're just somewhat more risky choices than the current big popular thing and as a result they leave me with various concerns and qualms.

MetricsSystemChoice written at 01:01:20

2014-04-08

My goals for gathering performance metrics and statistics

I've written before that one of my projects is putting together something to gather OS level performance metrics. Today I want to write down what my goals for this are. First off I should mention that this is purely for monitoring, not for alerting; we have a completely separate system for that.

The most important thing is to get visibility into what's going on with our fileservers and their iSCSI backends, because this is the center of our environment. We want at least IO performance numbers on the backends, network utilization and error counts on the backends and the fileservers, perceived IO performance for the iSCSI disks on the fileservers, ZFS level stats on the fileservers, CPU utilization information everywhere, and as many NFS level stats as we can conveniently get (in a first iteration this may amount to 'none'). I'd like to have both a very long history (half a year or more would be great) and relatively fine-grained measurements, but in practice we're unlikely to need fine-grained measurements very far into the past. To put it one way, we're unlikely to try to troubleshoot in detail a performance issue that's more than a week or so old. At the same time it's important to be able to look back and say 'were things as bad as this N months ago or did they quietly get worse on us?', because we have totally had that happen. Long term stats are also a good way to notice a disk that starts to quietly decay.

(In general I expect us to look more at history than at live data. In a live incident we'll probably go directly to iostat, DTrace, and so on.)

Next most important is OS performance information for a few crucial Ubuntu NFS clients such as our IMAP servers and our Samba servers (things like local IO, NFS IO, network performance, and oh sure CPU and memory stats too). These are very 'hot' machines, used by a lot of people, so if they have performance problems we want to know about it and have a good shot at tracking things down. Also, this sort of information is probably going to help for capacity planning, which means that we probably also want to track some application level stats if possible (eg the number of active IMAP connections). As with fileservers a long history is useful here.

Beyond that it would be nice to get the same performance stats from basically all of our Ubuntu NFS clients. If nothing else this could be used to answer questions like 'do people ever use our compute servers for IO intensive jobs' and to notice any servers with surprisingly high network IO that might be priorities for moving from 1G to 10G networking. Our general Ubuntu machines can presumably reuse much or all of the code and configuration from the crucial Ubuntu machines, so this should be relatively easy.

In terms of displaying the results, I think that the most important thing will be an easy way of doing ad-hoc graphs and queries. We're unlikely to wind up with any particular fixed dashboard that we look at to check for problems; as mentioned, alerting is another system entirely. I expect us to use this metrics system more to answer questions like 'what sort of peak and sustained IO rates do we typically see during nightly backups' or 'is any backend disk running visibly slower than the others'.

I understand that some systems can ingest various sorts of logs, such as syslog and Apache logs. This isn't something that we'd do initially (just getting a performance metrics system off the ground will be a big enough project by itself). The most useful thing to have for problem correlation purposes would be markers for when client kernels report NFS problems, and setting up an entire log ingestion system for that seems a bit overkill.

(There are a lot of neat things we could do with smart log processing if we had enough time and energy, but my guess is that a lot of them aren't really related to gathering and looking at performance metrics.)

Note that all of this is relatively backwards from how you would do it in many environments, where you'd start from application level metrics and drill downwards from there because what's ultimately important is how the application performs. Because we're basically just a provider of vague general computing services to the department, we work from the bottom up and have relatively few 'application' level metrics we can monitor.

(With that said, it certainly would be nice to have some sort of metrics on how responsive and fast the IMAP and Samba servers were for users and so on. I just don't know if we can do very much about that, especially in an initial project.)

PS: There are of course a lot of other things we could gather metrics for and then throw into the system. I'm focusing here on what I want to do first and for the likely biggest payoff. Hopefully this will help me get over the scariness of uncertainty and actually get somewhere on this.

StatsGatheringGoals written at 00:45:11

2014-04-03

The scariness of uncertainty

One of the issues that I'm facing right now (and have been for a while) is that being uncertain can be a daunting thing. As sysadmins we deal with uncertainty all of the time, of course, and if we were paralyzed by it in general we'd never get anywhere. It's usually easy enough to overcome uncertainty and move forward in small situations or important situations (for various reasons). Where uncertainty can dig in is in dauntingly big and complex projects that are not essential. If you don't have to have whatever it is, and building it is clearly a lot of work for an uncertain reward, it's very easy to keep deferring action in favour of various stalling measures (or other work).

All of this sounds rather hand waving, so let me tell you about my project with gathering OS level performance statistics. Or rather my non-project.

If you look around, there are a lot of options for gathering, aggregating, and graphing OS performance stats (in tools, full systems, and ecologies of tools). Beyond a certain basic level it's unclear which of them are going to work best for us and which will be crawling failures, but at the same time it's also clear that any of them that look good are going to take a significant amount of work and time to set up and try out (and I'm going to have to try them in production).

As a result I have been circling around this project for literally years now. Every so often I poke and prod at the issue; I read more about some tool or another, I look at pretty pictures, I hear about something new, and so on and so forth. But I've never sat down to really do something. I've always found higher priority things to do or other excuses.

(Here in the academy this behavior in graduate students is well known and gets called 'thesis avoidance'.)

The scariness of uncertainty is not the only reason for this, of course, but it's a significant contributing factor. In a way it raises the stakes for making a choice.

(The uncertainty comes from two directions. One is simply trying to select which system to use; the other is whether or not the whole idea is going to be worthwhile. The latter is a bit stupid since we're probably not going to be left with a white elephant of a system that we ignore and then quietly abandon, but the possibility gnaws at me and feeds other uncertainties and doubts.)

I don't have any answers, but maybe writing this entry has made it more likely that I do something here. And maybe I should embrace the possibility of failure as a sign that I am finally taking enough risk.

(I feel divided about that idea but I need to think about it more and then write another entry on it.)

UncertaintyScariness written at 00:34:47

