A sysadmin mistake: shooting your virtual foot off

May 14, 2010

Here is a mistake that we've actually made more than once.

We have NFS fileservers, and (to enable basic NFS server failover) each of them has both its real hostname (and IP) and a virtual IP alias. The real hostname is relatively long (we use the names of cities that start with 'san'); the fileserver's virtual hostname is short ('fsN' for some single-digit N). The result is that when we log into machines, we almost always use the virtual hostname since it's shorter, easier to remember, and what we care about.

(Sometimes when we can't recall which physical machine is which fileserver, we actually work it out by logging in to fsN and seeing what hostname it has. Hey, it's easier and faster than any other method I can think of.)

Suppose that we want to take a virtual fileserver's IP alias off the network, either to move it between physical servers or because we're about to do something that would cause NFS clients to get spurious 'permission denied' error messages on NFS operations if they could actually talk to the fileserver. So along we go; we log in to the machine, do various other prep work, and bring down the virtual IP:

$ ssh root@fs9
[... other stuff ...]
[root@sanwhatever-fs9]# ifconfig e1000g1:1 down

And suddenly our ssh session hangs. People sit around scratching their heads and worrying about the machine crashing until suddenly the light dawns: we just shot our virtual foot off. Well, it's more that we just sawed off the branch that we were standing on.

Oh sure, we were thinking 'we logged in to the machine and took down the virtual IP alias, why did our session hang?'. But that's not what we actually did. We logged in to the virtual IP alias, because that's what the convenient short name maps to. It's just that normally the difference between logging in to the machine via the virtual IP alias and its real IP address doesn't matter, so we forget this picky distinction. However, when you're logged in to an IP address and you take that IP address down, well, yes, you lose your connection.
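One way to catch this before it bites is to check which local address the ssh session actually arrived on. What follows is only a sketch, not something we run: session_uses_ip is a made-up helper name, and it assumes an OpenSSH sshd, which sets SSH_CONNECTION to 'client-IP client-port server-IP server-port' for the session. That variable may not survive su or sudo, so treat a 'no' answer with some suspicion.

```shell
# Hypothetical helper (not part of any tool): succeeds if the current
# ssh session came in via the given local IP address.
# Assumes OpenSSH's SSH_CONNECTION: "client_ip client_port server_ip server_port".
session_uses_ip() {
    target_ip=$1
    # Not in an ssh session that we can see (e.g. console login).
    [ -n "${SSH_CONNECTION:-}" ] || return 1
    # Deliberately unquoted: split the variable into its four fields.
    set -- $SSH_CONNECTION
    # The third field is the server-side address we connected to.
    [ "$3" = "$target_ip" ]
}

# Illustrative use; 192.0.2.9 is a placeholder for the alias's IP:
#   if session_uses_ip 192.0.2.9; then
#       echo "logged in via the alias; reconnect via the real hostname" >&2
#   else
#       ifconfig e1000g1:1 down
#   fi
```

If the check says yes, the fix is the obvious one: log out, ssh to the real hostname instead, and take the alias down from there.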

(From one perspective, this is an example of an abstraction failing under a corner case. Normally we can use 'fsN' and 'sanwhatever' as the same thing, as an abstraction; this is one of the cases where we can't, but it's easy to forget because we're so accustomed to the abstraction.)

Written on 14 May 2010.
Last modified: Fri May 14 02:46:55 2010