2014-03-28
How we wound up with an RFC 1918 IP address visible in our public DNS
This is kind of a war story.
The whole saga started with a modern, sophisticated, Internet enabled projector, one that supports 'network projection' where you use software to feed it things to display instead of connecting to a VGA port or the like. This is quite handy because an increasing number of devices that people want to do presentations from simply do not have a spare VGA port, for example tablets. This network projection requires special software and, as we found out, this software absolutely does not work if there is NAT'ing in the way between your device and the projector. Unfortunately in our environment this is a real problem for wireless devices (such as those tablets) because there is no way off our wireless network without going through a NAT gateway of some sort.
(One of many reasons that this is required is that the wireless network uses RFC 1918 IP address space.)
If getting off the wireless network requires NAT and the software can't work with NAT, the conclusion is simple: we have to put the data projector on the wireless network (on what is amusingly called the 'wired wireless'). Wireless devices can talk to it, wired devices can talk to it by plugging into a little switch next to it, and everything is happy. But what about DNS? People would like to connect to the data projector by name, not just by IP address.
Like many places we have a 'split horizon' DNS setup, with internal DNS and public DNS. People using our VPN to authenticate on the wireless network and get access to internal services use the internal DNS servers, which are already full of RFC 1918 IP addresses for machines in our sandboxes. Unfortunately it's also possible to register wireless devices for what we call the 'airport experience', where we give devices external connectivity to the campus but no special access to our internal networks (as we feel that wireless MAC addresses aren't sufficient authentication for internal network access).
Devices using the airport experience can't use our internal DNS servers, partly because many of the IP addresses that the DNS servers would return can't be used outside our internal networks. Instead they get DNS from general campus recursive DNS servers, which of course use our public DNS data. Yet these devices still need to be able to look up the name for the data projector and get the wireless network's RFC 1918 IP address for it so they can talk to it directly with no NAT'ing. The simplest, lowest overhead way to do this was to put the RFC 1918 wireless IP address for the data projector into our public DNS.
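For illustration, the record in question looks something like this in ordinary zone file format (the name and address here are invented, not our real ones):

    ; an RFC 1918 address sitting in public zone data, purely for illustration
    projector.wireless.example.org.    IN    A    172.16.1.10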
And that is why our public DNS now has a DNS record with an RFC 1918 IP address.
(I confessed to this today on Twitter so I decided that I might as well tell the story here.)
PS: people will probably suggest dnsmasq as a possible solution. It might be one but we aren't already using it, so at a minimum it'd be much more work than adding a DNS entry to our public DNS.
2014-03-26
The DNS TTL problem
It all started with a tweet by @Twirrim:
DNS TTL records exist for a reason. For the love of all that is holy, honour them. Don't presume to think you know better.
On the one hand, as a sysadmin I'm in full agreement with this view. I certainly want all of the DNS caches and recursive DNS servers out there to respect the TTLs we set on our DNS entries and it makes me irritated when people don't. On the other hand I also have to sympathize with the operators of DNS caches out there, because I rather suspect that there are a huge number of mis-set TTLs in practice.
The problem with DNS TTLs is that they are almost always an example of information that doesn't have to be correct, and we all know what eventually happens to such information. Most people's DNS entries change very rarely and are not looked up in any huge volume, so it doesn't really matter what TTLs they have. If they have the minimum TTL you won't notice the extra lookup volume and if they have an absurdly long TTL you won't notice the lingering old entries because you aren't changing your DNS entries anyways.
(And I'm not throwing stones here. We have a number of DNS entries with short TTLs that haven't changed for years in our zones, more or less just because. It would take work to go back through our zones, find them all, verify that we really don't need short TTLs any more, and take them out. It's simpler to let them sit there and it doesn't do us any harm.)
But I bet that operators of large scale DNS caches notice those things. I rather suspect that they get customer complaints when someone updates their DNS but had really long TTLs on the old entries, so now the customers can't get to the new servers because the old entries are still cached. And I suspect that they notice the extra load from short TTLs forcing useful DNS entries to be discarded even when said DNS entries haven't actually changed in the past year. I also suspect that there are more people doing DNS TTLs somewhat wrong than there are people doing them completely right. So I can see the engineering logic in overriding DNS TTLs in your large scale cache, however inconvenient it is for me as a sysadmin.
I don't have any answers to this and in a sense there are no answers. By that I mean that the large scale DNS caches that are currently monkeying around with people's DNS TTLs are not going to change their behavior any time soon, so the most I can do is live with it.
(Then there is the thornier issue of DNS lookups being remembered by long running programs that may have no idea of TTLs at all; instead they did a getaddrinfo() once and have held on to the result ever since. I suspect that web browsers no longer fall into this category, although they once did.)
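To make the pattern concrete, here is a minimal hypothetical sketch in C of a program doing exactly that (the function and variable names are made up); note that nothing in the getaddrinfo() API even tells the caller what the record's TTL was:

    /* Hypothetical sketch of a long-running program that resolves a
     * name once and caches the result forever.  getaddrinfo() exposes
     * no TTL, so this code can't even tell when its answer goes stale. */
    #include <string.h>
    #include <netdb.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static struct sockaddr_storage server_addr;
    static socklen_t server_addrlen;

    /* Called once at startup; the saved address is then used for
     * days or months of connect() calls. */
    static int resolve_server(const char *host, const char *port)
    {
        struct addrinfo hints, *res;

        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        memcpy(&server_addr, res->ai_addr, res->ai_addrlen);
        server_addrlen = res->ai_addrlen;
        freeaddrinfo(res);
        return 0;
    }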
2014-03-24
The importance of having full remote consoles on crucial servers
One of our fileservers locked up this evening for completely inexplicable reasons (possibly it had simply been up too long). These fileservers are still SunFire X2200s and I wound up diagnosing the problem and rebooting the server using the X2200's built-in lights out management and remote console over IP functionality (often known as 'KVM over IP'). While I could have power cycled the machine without the ILOM (it's on a smart PDU that we can also control), having the KVM over IP available did two important things here. The first was that it let me establish that the machine was definitively hung and had not printed any useful messages to the console. The second was that I had very strong assurance that I could do almost anything possible to recover the machine if it didn't come up cleanly after the power cycle; not only did I have console access to Solaris, I would have console access to the GRUB boot menu and the BIOS if necessary (for example to force booting from a specific drive).
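As a sketch of the kind of remote control this gives you: on hardware whose service processor speaks IPMI (the X2200's ILOM does), something like the following works from any machine with network access to the management interface. The hostname and user here are invented:

    # force a power cycle through the service processor
    ipmitool -I lanplus -H fs1-ilom.example.org -U admin chassis power cycle
    # attach to the serial console via serial-over-LAN (the full
    # graphical KVM over IP console goes through the ILOM's web interface)
    ipmitool -I lanplus -H fs1-ilom.example.org -U admin sol activate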
I could have gotten some of that with a serial console, perhaps a fair amount of it if the BIOS also supported it. But let's be honest here; even with the BIOS's cooperation, a serial console is not as good or as complete as KVM over IP. And a serial console by itself pretty much lacks the out-of-band management for things like forced power cycles and checking ILOM logs.
I've traditionally considered KVM over IP features to be a nice luxury but not really a necessity. After this incident I'm not sure I agree with that position any more. Certainly for many of our servers they're still not really essential; if one of our login or compute servers has problems, well, we have several of them. But for crucial core servers like fileservers, servers that we can't live without, I think it's a different matter. There we want to be able to do as much as possible remotely and for that KVM over IP is really important. Would I pay extra for it? I'd like to think that I'd now argue for that and say that it's worth some extra money per server (either for a server model that offers it or for license keys to enable it, depending on the server).
(I'd be happy to take KVM over IP on all of our servers, but in our money-constrained environment I don't think I'd pay extra for it on many of them.)
I'm now also very happy that our new fileserver hardware has full KVM over IP support for free. It wasn't a criterion when we were evaluating hardware, so we got lucky here, but I'm glad that we did.
(And I've used our new hardware's SuperMicro KVM over IP and lights out management, so I can say that it works.)
By the way, my personal opinion is that the importance of KVM over IP goes up if your servers are not at your work but instead in a colocation facility or the like. Then any physical visit to the servers is a trek, not just the out-of-hours ones. In an environment with actual ROI, it shouldn't take many sysadmin-hours spent on trips to the data center to equal the extra cost of KVM over IP capable hardware.
(I've written some praise for KVM over IP before, but back then I was focusing on (re)installs instead of disaster recovery because I hadn't yet had a situation like this happen to me.)
Why I don't trust transitions to single-user mode
When I talked about how avoiding reboots should not become a fetish I mentioned that I trusted rebooting a server more than bringing it to single user mode and then back to multiuser. Today I feel like amplifying this.
The simple version is that it's easy for omissions to hide in the 'stop' handling of services if they are not normally stopped and restarted. When you reboot the machine after the 'stop' stuff runs, the reboot hides these errors. If you don't quite completely clean up /var/run or reset your state or whatever, well, rebooting the machine wipes all of that away and gives your 'start' scripts a clean slate. Similarly, there are potential issues in that transitioning from single user to multiuser mode doesn't have quite the same environment as booting the system or restarting a service in multiuser mode; bugs and omissions could lurk here too.
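As a made-up illustration of the kind of omission I mean (the daemon and its paths are hypothetical), consider an init script whose 'stop' action kills the daemon but forgets its lock file; a reboot papers over the bug, a trip through single user mode does not:

    #!/bin/sh
    # Hypothetical init script fragment; 'mydaemon' and its files are invented.
    case "$1" in
    start)
        # mydaemon writes its own pid file; refuse to start if a stale
        # lock file from a previous run is still present
        [ -e /var/run/mydaemon.lock ] && exit 1
        /usr/sbin/mydaemon
        touch /var/run/mydaemon.lock
        ;;
    stop)
        kill "$(cat /var/run/mydaemon.pid)"
        # BUG: /var/run/mydaemon.lock is never removed here.  A reboot
        # clears /var/run for us so nobody notices; going to single user
        # mode leaves it behind, and the next 'start' then fails.
        ;;
    esac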
This is a specific instance of a general cautious view I have. There is nothing that forces a multiuser to single user and back to multiuser transition to be correct, since it's not done very often. Therefore I assume that there could at least be omissions. Of course these omissions are bugs, but that's cold comfort if things don't work right.
I also wouldn't be surprised if some services don't even bother to have real 'stop' actions. There are certainly some boot time actions that don't really have a clear inverse, and in general if you expect a service to never be restarted it's at least tempting to not go through all of the hassle. Perhaps I'm being biased by some of our local init service scripts which omit 'stop' actions for this reason.
(A related issue with single user mode is an increasing disagreement between various systems about just what services should be running in it. There was a day when single user mode just fsck'd the disks, mounted at least some local filesystems, and gave you a shell. Those days are long over; at this point any number of things may wind up running in order to provide what are considered necessary services.)
2014-03-22
Avoiding reboots should not become a fetish
Unix is designed so that you shouldn't normally need to reboot it to fix problems and in most environments it's considered good practice to stick with this and not reboot Unix machines casually, or even very much at all. People have rightfully mocked the approach in other systems of rebooting as a routine troubleshooting step (often an early one, sometimes the first one). Unfortunately it's quite possible and in fact not uncommon to take this attitude too far and make not rebooting into a fetish. The symptoms of this fetish are fairly straightforward; people afflicted by it would rather do almost anything than reboot a machine, no matter how time consuming, obscure, or difficult it is. They will confidently assert that rebooting is never the right answer and is basically always a last resort, done only after you've exhausted other options.
Reality is a bit different. In reality, sometimes rebooting is the right answer even if it is not mathematically speaking necessary (by which I mean 'essential'). In pragmatic system administration, rebooting can be easier, more reliable, or simply more certain in the face of various forms of uncertainty. Ultimately the 'don't reboot' fetish has confused a means with an end.
The real goal is avoiding user and service disruption, or at least minimizing it. Not rebooting machines is a means to this end, since rebooting disrupts everything for a while. Conversely sometimes rebooting actually is the best means to this end because it's the approach that will result in the shortest disruption. For one example, if your system is swapping itself to death due to temporary excessive memory usage you could wait it out (or play the slow game of 'hunt the memory hog when the system mostly isn't responding') or you could reboot. It's highly likely that rebooting will get your machine back into service the fastest, sometimes by hours.
There are many factors that play into your answer in any particular situation, things like how long a particular approach will take to restore the system to service, how much more disruptive it will be than the current or likely future situation, when good and bad times are for disruptions, and whether there are additional issues like gathering information for further troubleshooting. There is no single universal right (or mostly right) answer. Like much system administration, it's situational.
(In fact sometimes rebooting servers randomly is the right approach. But that's not a common environment, or at least not what I think of as a common environment.)
PS: In the spirit of honesty I must admit that this entry was sparked by my feelings about some reddit reactions to a recent entry. Probably I should have heeded the classic xkcd lesson.
Sidebar: rebooting versus going to single user mode
As a side note, to say that rebooting a server is terrible and you should avoid it by bringing the server into single user mode and then returning it to multi-user mode is to miss the forest for the trees. Going to single user mode is almost always just as disruptive as rebooting a server since you terminate all user processes, bring down all services, stop network routing, and so on.
It's also probably significantly more reliable to reboot a server instead of bringing it to single user mode and then back to multiuser mode. The code paths for bringing services up in a just-booted environment are tested all the time, while the code paths for bringing services up (and down) in a multiuser to single user to back to multiuser environment are tested very, very rarely. Are you absolutely confident that everything cleaned up after itself and fully reset all state when going into single-user mode? I'm not.
(If you're confident I certainly hope that you've tested this extensively and carefully for your particular environment. I certainly don't think that your test results can be generalized.)
2014-03-14
Logins and related things really do change, and for good reasons
Every so often it's popular to say that you will never, ever change a (Unix) login, an assigned email address, or whatever. No direct renamings, no new account to replace the old account, no nothing. Generally this attitude comes with a certain mixture of 'you should have got it right the first time' and 'if your login is less than ideal it doesn't really matter'.
This is wrong (and arrogantly blind). People periodically have excellent, compelling reasons to change their login et al and you are eventually going to have to change them one way or another. If you aggressively stick to your 'no changes' view, it's quite possible that very bad things will happen; one of the least bad ones is that important people will quietly leave your organization.
Let us take the most straightforward and obvious example. Suppose a married woman has a login, and of course when it was created it followed your common pattern of having her last name in it. Oh, and when she married she took on her husband's last name, because this is still common. Then one day she gets divorced and of course changes her last name back to her own. This is an excellent, compelling reason to rename her account, or rather two reasons at once. First, this woman is going to want to be called (in logins, email addresses, etc) by what is now her actual name. Second, she is quite possibly not going to want to be reminded of her ex-marriage and ex-husband every time she logs in, gets email, has to send email, and so on.
If you tell this woman 'sorry, we're still not renaming your login, that's our policy', what you are doing is giving her a great big middle finger. If she has actual power in your organization, your policy is probably not going to last long and you will have created a bunch of bad blood. If she does not, any number of things may happen, such as her quietly resigning so she can go somewhere where she is not frequently reminded of her ex-marriage and how insensitive your organization is.
This is far from the only case where there are excellent reasons to change a login. It's simply an obvious one with a not uncommon situation where hopefully everyone can see the real injury done by not changing the login. People really do have really good reasons to change their login, reasons that they could not possibly have predicted in advance and so avoided. They are not being irrational or picky or any number of other things. And sooner or later you will wind up changing someone's login.
(In a relatively small environment it's possible for this to only happen very infrequently and you might actually never have it happen while you work for a particular place. In a large environment it probably happens relatively frequently.)
The corollary of this is that as much as possible you should design your systems so that they at least accommodate login changes from the start. Don't assume that logins, email addresses, names, and so on are unchanging. If you need an unchanging primary identifier for people, make it a meaningless one (a GUID or a random number is good).
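For instance, here is a minimal schema sketch of what a meaningless primary identifier looks like in practice (PostgreSQL-flavoured SQL; the table and column names are invented for illustration):

    -- hypothetical sketch: the key is meaningless and immutable, while
    -- everything a person might legitimately want to change is plain data
    CREATE TABLE person (
        person_id  UUID PRIMARY KEY,        -- never changes, never shown to people
        login      TEXT UNIQUE NOT NULL,    -- renameable at will
        email      TEXT UNIQUE NOT NULL,    -- also renameable
        full_name  TEXT NOT NULL
    );

    -- everything else references the meaningless key, never the login
    CREATE TABLE mail_alias (
        alias      TEXT PRIMARY KEY,
        person_id  UUID NOT NULL REFERENCES person (person_id)
    );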
2014-03-09
Why we don't change Unix login names for people
Every so often as system administrators we are a bit lazy. Or perhaps you could say that we are a bit sane. One of those cases here is that we do not, ever, change people's Unix login names. If you really want or need a change in login name, what we tell you to do is request a new account with the right login name, then transfer all your files to it and tell us to delete the old login.
(Users can set up their own email redirection from the old login to the new one, assuming they want to.)
In theory changing a Unix login name is easy; all you need to do is edit /etc/passwd to change it (both in the login name and in the home directory), then rename the home directory itself. Except we should probably change the login name in secondary groups in /etc/group. But we're not done, because users have a second home directory on our web server; we need to change that.
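For illustration, even just that 'easy' part scripted with the standard tools looks something like this (assuming Linux shadow-utils; 'olduser' and 'newuser' are placeholders):

    # rename the login itself (updates /etc/passwd and /etc/shadow)
    usermod -l newuser olduser
    # rename and move the home directory to match
    usermod -d /home/newuser -m newuser
    # some usermod versions don't fix member lists in secondary groups,
    # so check /etc/group by hand
    grep olduser /etc/group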
Unfortunately we've only started. Right now we have six separate machines that run Samba, all with separate Samba password files. I'm not exactly sure how you rename a Samba login but we'd have to do it on all of those machines. We also have at least a dozen machines where users might have crontab files (but probably don't). If you rename a login you need to rename the crontab file (as far as I know), so we'd have to check all of them and fix anything we found. The login being renamed might also have a user-managed webserver that uses a URL under the user's web pages; that would need to get renamed.
This is quite a list and I'm not even sure that I've thought of all of the places where the user's login name might be hiding in our environment (and yes, I'm ignoring at jobs for the moment). In theory we could try to do all of this and make sure not to miss a single thing. In practice it is much easier and much more reliable to get people to use our well-honed and frequently used procedures for creating and deleting accounts.
(We make accounts all the time and delete them periodically. We might 'rename' a login once a year.)
Can things still fall through the cracks, especially if the person getting the new login name doesn't notice? Certainly. But one subtle advantage here is that we aren't promising more than we can really deliver. If we promised to rename an account you might reasonably expect that all of this additional state would get transferred. Since we're merely making a new account it's clear (at least in theory) that additional state is something you have to worry about.
PS: A pragmatic side advantage of this approach is that we don't push back against people who want login name changes in the way we might if doing a login rename was a lot of manual work on our part. There actually used to be a policy that we just didn't do login renames short of acts of very high powers; that went away when we decided to do them the easy way. Nowadays it is more 'you want to change your login? well, sure, you'll be doing most of the work' (although we don't say this in our support documentation).