Wandering Thoughts archives

2014-03-24

The importance of having full remote consoles on crucial servers

One of our fileservers locked up this evening for completely inexplicable reasons (possibly it had simply been up too long). These fileservers are still SunFire X2200s, and I wound up diagnosing the problem and rebooting the server using the X2200's built-in lights out management and remote console over IP functionality (often known as 'KVM over IP'). While I could have power cycled the machine without the ILOM (it's on a smart PDU that we can also control), having the KVM over IP available did two important things here. The first was that it let me establish that the machine was definitively hung and had not printed any useful messages to the console. The second was that it gave me very strong assurance that I could do almost anything needed to recover the machine if it didn't come up cleanly after the power cycle; not only did I have console access to Solaris, I would also have console access to the GRUB boot menu and the BIOS if necessary (for example, to force the boot device).

I could have gotten some of that with a serial console, perhaps a fair amount of it if the BIOS also supported it. But let's be honest here: even with the BIOS's cooperation, a serial console is not as good or as complete as KVM over IP. And a serial console pretty much lacks the out of band management needed for things like forced power cycles and checking ILOM logs.
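As a concrete sketch of what that out of band management can look like (assuming your service processor speaks IPMI, as I believe the X2200's ILOM does, and using a made-up hostname and credentials), you can read the management event log, force a power cycle, and even reach the serial console over the network with ipmitool:

    # Hypothetical service processor address and credentials.
    SP=fileserver-sp.example.com

    # Check the service processor's event log for anything it
    # noticed before the hang.
    ipmitool -I lanplus -H "$SP" -U admin -P 'secret' sel list

    # Verify the power state, then force a power cycle.
    ipmitool -I lanplus -H "$SP" -U admin -P 'secret' chassis power status
    ipmitool -I lanplus -H "$SP" -U admin -P 'secret' chassis power cycle

    # Attach to the machine's serial console over the LAN.
    ipmitool -I lanplus -H "$SP" -U admin -P 'secret' sol activate

Serial over LAN covers part of this, but it's the full video KVM over IP that gets you the BIOS screens and the GRUB menu no matter how they're configured.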

I've traditionally considered KVM over IP features to be a nice luxury but not really a necessity. After this incident I'm not sure I agree with that position any more. Certainly for many of our servers they're still not really essential; if one of our login or compute servers has problems, well, we have several of them. But for crucial core servers like fileservers, servers that we can't live without, I think it's a different matter. There we want to be able to do as much as possible remotely, and for that KVM over IP is really important. Would I pay extra for it? I'd like to think that I'd now argue that it's worth some extra money per server (either for a server model that offers it or for license keys to enable it, depending on the server).

(I'd be happy to take KVM over IP on all of our servers, but in our money-constrained environment I don't think I'd pay extra for it on many of them.)

I'm now also very happy that our new fileserver hardware has full KVM over IP support for free. It wasn't a criterion when we were evaluating hardware so we got lucky here, but I'm glad that we did.

(And I've used our new hardware's SuperMicro KVM over IP and lights out management, so I can say that it works.)

By the way, my personal opinion is that the importance of KVM over IP goes up if your servers are not at your workplace but instead in a colocation facility or the like. Then any physical visit to the servers is a trek, instead of merely an out-of-hours trip in to the office. In an environment with actual ROI, it shouldn't take many sysadmin-hours spent on trips to the data center to equal the extra cost of KVM over IP capable hardware.

(I've written some praise for KVM over IP before, but back then I was focusing on (re)installs instead of disaster recovery because I hadn't yet had a situation like this happen to me.)

sysadmin/KVMOverIPImportanceII written at 23:19:50

Why I don't trust transitions to single-user mode

When I talked about how avoiding reboots should not become a fetish, I mentioned that I trusted rebooting a server more than bringing it down to single user mode and then back to multiuser. Today I feel like amplifying this.

The simple version is that it's easy for omissions to hide in the 'stop' handling of services if they are not normally stopped and restarted. When you reboot the machine after the 'stop' stuff runs, the reboot hides these errors. If your 'stop' action doesn't quite completely clean up /var/run or reset your state or whatever, well, rebooting the machine wipes all of that away and gives your 'start' scripts a clean slate. Similarly, there are potential issues in that transitioning from single user to multiuser mode doesn't happen in quite the same environment as booting the system or restarting a service in multiuser mode; bugs and omissions could lurk here too.
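To make this concrete, here is a deliberately buggy sketch of the kind of System V style init script I mean; the 'frobd' service and its file names are entirely made up for illustration. Its 'stop' action kills the daemon but never removes the lock file, an omission that a reboot hides because the boot sequence clears /var/run:

    #!/bin/sh
    # Hypothetical init script for an imaginary 'frobd' daemon.
    case "$1" in
    start)
        # Refuse to start if a (possibly stale) lock file is present.
        [ -e /var/run/frobd.lock ] && exit 1
        /usr/sbin/frobd --daemon
        touch /var/run/frobd.lock
        ;;
    stop)
        kill "$(cat /var/run/frobd.pid)"
        # Omission: the lock file is never removed. A reboot wipes
        # /var/run anyway, so this only bites when you go to single
        # user mode and back.
        ;;
    esac

Going to single user mode runs the 'stop' action and leaves the stale lock file behind, so the later 'start' quietly fails; a full reboot never exposes the bug.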

This is a specific instance of a general cautious view I have: there is nothing that forces a multiuser to single user and back to multiuser transition to be correct, since it's not done very often. Therefore I assume that there could at least be omissions. Of course these omissions are bugs, but that's cold comfort if things don't work right.

I also wouldn't be surprised if some services don't even bother to have real 'stop' actions. There are certainly some boot time actions that don't really have a clear inverse, and in general, if you expect a service to never be restarted, it's at least tempting not to go through all of the hassle. Perhaps I'm being biased by some of our local init service scripts, which omit 'stop' actions for exactly this reason.
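As an illustration of that (again with made-up details), a boot-time-only script can look like the following; since setting kernel tunables has no obvious inverse, its 'stop' action is deliberately empty and thus completely untested:

    #!/bin/sh
    # Hypothetical local script: set kernel tunables at boot.
    case "$1" in
    start)
        sysctl -w net.core.somaxconn=1024
        ;;
    stop)
        # Deliberately nothing; no one ever wrote (or tested) an
        # inverse, because this service is only ever started.
        ;;
    esac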

(A related issue with single user mode is an increasing disagreement between various systems about just what services should be running in it. There was a day when single user mode just fsck'd the disks, mounted at least some local filesystems, and gave you a shell. Those days are long over; at this point any number of things may wind up running in order to provide what are considered necessary services.)

sysadmin/SingleUserTransitionDistrust written at 02:50:59

