2007-03-28
A surprise to remember about starting modern machines
In the very old days you connected to Unix systems through serial
terminals, which only had their getty processes started once
init had finished processing /etc/rc.
In the old days you connected to Unix systems through rlogin(d),
which was started through inetd, which was still started pretty
much at the end of system startup.
These days you connect to Unix systems through sshd, which is often
started relatively early in the system boot sequence. This means that
you can easily wind up logging into a machine that hasn't finished
booting, and conversely that just because you can ssh into a machine
doesn't mean that it's finished booting.
This mistaken assumption was at the root of my debugging adventure today. We're
switching to a new system of managing NFS mounts on our Ubuntu machines,
and I was seeing a mysterious problem where the test machine would boot
up with its NFS mounts partially or almost completely missing. Due to
local needs we start sshd before doing our NFS mounts, which we have
a lot of, so what was really going on was that I was logging in to the
machine while it was grinding through the NFS mounts. Once I realized
what was actually going on it was a definite forehead-slapping moment
(although a reassuring one, apart from the wasted time, since nothing
was actually wrong).
You can get into really weird states because of this. In the past I've
managed to have init.d scripts hang trying to start something; if
they ran after sshd started, you could still log in to the system, poke
around, and have everything look pretty normal (depending on what was
left in the boot sequence). Except that things like reboot wouldn't
do anything, because as far as init is concerned it was only part way
through transitioning into a runlevel and it wasn't about to let you
change to another one just yet. The whole experience can make you think
that the machine is badly broken, because reboot doesn't complain and
a machine that doesn't reboot on command is usually in serious trouble
(often with things like kernel panics, unkillable stuck processes, and
so on).
(I think what tipped me off back then was the same thing as this time around; I got a process tree dump and saw the startup script still running.)
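A minimal sketch of that sort of check, assuming your system's startup scripts show up in the process list under /etc/init.d or /etc/rc paths (which is an assumption about your particular system's conventions, not a universal rule):

    #!/usr/bin/env python3
    # Sketch: decide whether the boot sequence is probably still running
    # by looking for startup scripts in the process list. The /etc/init.d
    # and /etc/rc path patterns are assumptions about the local system.
    import subprocess

    def boot_scripts_still_running():
        # 'ps -eo pid,args' works on Linux and most other Unixes.
        out = subprocess.run(["ps", "-eo", "pid,args"],
                             capture_output=True, text=True).stdout
        suspects = []
        for line in out.splitlines()[1:]:
            args = line.split(None, 1)[-1]
            if "/etc/init.d/" in args or "/etc/rc" in args:
                suspects.append(line.strip())
        return suspects

    if __name__ == "__main__":
        running = boot_scripts_still_running()
        if running:
            print("boot probably not finished; still running:")
            for proc in running:
                print(" ", proc)
        else:
            print("no startup scripts visible; boot has probably finished")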
2007-03-23
Counterintuitive RAID read performance
While doing performance tests on an iSCSI RAID controller, we recently turned up some unexpected results: the controller could write significantly faster than it could read, to both RAID 5 and RAID 0 targets. In one case, a six-disk RAID 0 target could do streaming writes 20 megabytes/second faster than it could do streaming reads (75 MB/s write versus 55 MB/s read). This surprised me a lot, because I usually expect reads to run faster than writes. (It's certainly the case on single SATA disks, and this was a SATA-based iSCSI controller.)
(We saw this behavior with both Solaris 10 and Linux, just to rule out one variable.)
Someone I talked with online suggested that what's happening is that writes are being implicitly parallelized across the disks by the writeback caches on the controller and the disks (and the operating system delaying write-out), whereas the reads aren't. It's easy to see how the writes can be parallelized and done in bulk this way, but why aren't reads also being parallelized?
There are two places where the whole system can parallelize reads, as far as I can see:
- if the operating system issues large read requests to the array,
the array could immediately issue requests to multiple disks.
(The operating system can also break the single large read up into multiple SCSI commands and use CTQ to issue several of them at once to the array, which can then distribute them around the disks involved.)
- if the operating system does aggressive enough readahead we'd get at least two simultaneously active requests, which would hopefully hit at least two different disks.
We want the OS to do large readaheads and issue single IO requests that are several times the stripe size of the target (ideally the stripe size times the number of disks, since that means one request can busy all of the disks). However, many operating systems have relatively low limits on these, and for iSCSI you have to get the RAID controller at the other end to agree on the big numbers too.
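To put rough numbers on that, here is some back-of-the-envelope arithmetic; the stripe size and readahead window are assumptions for illustration, not the actual configuration of the array we were testing:

    # Assumed numbers: a 64 KB per-disk stripe (chunk) size and a modest
    # 128 KB OS readahead window, on the six-disk RAID 0 target.
    stripe_size_kb = 64
    disks = 6
    readahead_kb = 128

    ideal_request_kb = stripe_size_kb * disks
    print(f"a single read must be {ideal_request_kb} KB to busy all {disks} disks")
    print(f"a {readahead_kb} KB readahead spans at most "
          f"{readahead_kb // stripe_size_kb} of the {disks} disks")

With these assumptions a streaming read needs 384 KB requests to keep all six disks busy at once, while a 128 KB readahead window only ever touches two of them.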
I suppose this is why many vendors ship things with small default stripe sizes; it maximizes the chance that streaming IO from even modestly configured systems (or just programs, for local RAID devices) will span multiple drives. And streaming IO performance is something that people can easily measure, whereas the effects of small stripe size on random IO are less obvious.
(iSCSI performance tuning seems to be one of those somewhat underdocumented areas, which is a bit surprising for something with as many knobs and options as iSCSI seems to have. Tuning up the 'maximum burst size' on the iSCSI controller and the Solaris 10 machine got me up to 60 MBytes/sec on streaming bulk reads, but this is still not very impressive, and it may have made writes worse.)
2007-03-20
On educating users
In a context that's not important, someone on our local sysadmin mailing list recently wrote:
I think the bottom line is end-user education.
I disagree. It is my opinion that any time end-user education appears to be the answer, we have already lost. People do not change their behavior just because we want them to, and they rarely really change their behavior because we threaten them. (Although they are very good at faking it until we are just enough out of sight.)
The only time people really change their behavior is when the new behavior is less work than the old behavior. The only time they like changing their behavior is when you show them a better and easier way to do things; when you make their life better. This is the only time 'user education' really works.
(While you can get people to change by cranking up the pain level on the old way instead of cranking down the pain level on the new way, they are not going to like you.)
One corollary of this is that if you absolutely have to get people to change their behavior, you have to give them no choice. I don't mean 'no choice' as in ordering them to do it on pain of being fired; I mean making it impossible for them to do things any other way than your way. (And then you should be prepared for people's ingenuity.)
Also, it is not enough for the new way to be just as easy as the old way and also superior in some indirect way (such as being less work for other people). To truly get adopted, it must be directly easier and better for the people who have to change; otherwise, inertia will keep a lot of people using the old way.
2007-03-14
The problem of machine startup order dependencies
One of the tricky bits of organizing a sufficiently large group of machines is avoiding circular dependencies in the machine startup order, so that you can actually bring your systems up after things like a complete machine room power outage.
(In our case it was planned; the electricians wanted the master breakers off before they played around in our breaker panel to give us more usable circuits.)
Startup order dependencies come in a variety of flavours. The simple one is a startup script that depends on another machine being up, for example trying to NFS mount filesystems; more advanced, more dangerous, and fortunately much rarer is the sort where a machine will start but malfunction (for example, bounce all email) unless another machine is already up. Things like NFS mounts are easy to see, but sometimes the dependency is more indirect and much less obvious.
Part of the problem is that it's easy for this sort of dependency to creep in unnoticed. Not only is a complete ground-up restart of all of your machines hopefully a rare event, but testing for this sort of thing is difficult to do, especially for machines in the middle of the startup order (where they depend on some other machines but not everything).
(You can always do a testing ground-up restart of everything, but this is sufficiently disruptive that you're probably not going to get to do it very often.)
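Since you rarely get to do that for real, one cheap substitute is to keep an explicit record of which machines each machine needs at boot and check it for cycles mechanically. A minimal sketch, where the machine names and dependencies are invented for illustration:

    # Sketch: detect circular startup dependencies with a depth-first
    # search over a hand-maintained dependency map. All entries here are
    # hypothetical examples.
    deps = {
        "console-server": ["dns1", "ntp1"],
        "dns1": ["fileserver"],
        "ntp1": ["dns1"],
        "fileserver": [],
    }

    def find_cycle(deps):
        WHITE, GREY, BLACK = 0, 1, 2
        state = {m: WHITE for m in deps}

        def visit(m, path):
            state[m] = GREY
            for d in deps.get(m, []):
                if state.get(d, WHITE) == GREY:
                    return path + [m, d]        # back edge: a cycle
                if state.get(d, WHITE) == WHITE:
                    cycle = visit(d, path + [m])
                    if cycle:
                        return cycle
            state[m] = BLACK
            return None

        for m in list(deps):
            if state[m] == WHITE:
                cycle = visit(m, [])
                if cycle:
                    return cycle
        return None

    cycle = find_cycle(deps)
    if cycle:
        print("circular startup dependency:", " -> ".join(cycle))
    else:
        print("no circular startup dependencies found")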
The interesting case that we found recently was machines that try to
set their time on startup with ntpdate, especially our console server
(which is the first machine we start). In the early boot order, none of
our time server machines are alive to respond to ntpdate; fortunately
it has a timeout. But up until that point I hadn't thought of NTP as a
vital core service.
(For bonus fun, what actually timed out on the console server was
ntpdate's DNS lookups, because all of the time servers to synchronize
with had been specified as hostnames instead of IP addresses. Since
the machine had three time servers and two DNS servers listed in
/etc/resolv.conf, this actually took significantly longer than
ntpdate's own query timeout.)
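Some rough arithmetic shows how this adds up; the resolver timeout and attempt counts below are the classic resolver defaults and are assumptions about what that machine was actually configured with:

    # Assumed numbers: classic resolver defaults of a 5 second timeout
    # and 2 attempts per nameserver, with both nameservers unreachable.
    time_servers = 3     # ntpdate was given three time server hostnames
    nameservers = 2      # entries in /etc/resolv.conf, both still down
    attempts = 2         # assumed resolver attempts per nameserver
    timeout_s = 5        # assumed resolver timeout per attempt

    worst_case = time_servers * nameservers * attempts * timeout_s
    print(f"up to {worst_case} seconds spent just waiting on dead DNS")

With these assumptions the failed DNS lookups alone could eat a minute of boot time, which dwarfs the NTP query timeout itself.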
2007-03-13
Machine room archaeology
The Computer Science department has been using its primary machine room for at least 25 years, and it's a proper machine room, complete with a raised floor. (People with the right sort of experience are now wincing.)
The problem is best illustrated by an anecdote: we just recently took out some hybrid 208V plus 120V power circuits that we believe were put in to power old Vaxes (which would make them obsolete for going on 20 years). We didn't have them removed out of any sense of neatness; we pulled them only because we needed the breaker panel space they were taking up for more plain 120V circuits.
(They were taking up three breaker spaces each, because apparently each hot wire uses up a space; 208V uses two hot wires, and the hot wire for the plain 120V was the third. My new job is an education in many things, electrical power issues included.)
The problem with raised floors is that over time all sorts of things accumulate down there below the floor tiles, because they aren't in people's way to be tripped over and thus yanked out. In fact, once you pass a critical snarl point, pulling things out takes more work than leaving them there and just running new wiring over top.
By now, lifting up our floor tiles is an archaeological expedition into the dusty depths of our machine room's past. The CAT-5 tangles are the most recent stratum, then come the carefully tied-down runs of now-obsolete serial cable that once connected to various consoles (we think), and down at the bottom you can still see the faded orange of a loop of thick Ethernet, complete with vampire taps and thicknet cables stretching off to somewhere. We're not sure what stratum the occasional dusty power cables belong to, or whether they're still connected to anything.
(We are still better off than my old job, which once managed to accrete a slowly growing puddle of water until it seeped through a cinderblock wall and started soaking the carpet in my cubicle. Although to be fair, this can happen to anyone at more or less any time. That's the other problem with raised floors: you can't see what's going on underneath them, you just have to trust that nothing interesting is.)
2007-03-12
New warning messages might as well be fatal errors
There's a widespread view that a good way of deprecating something is to have the new version of your project emit warning messages when it runs across code using whatever you're getting rid of. The theory goes that this is pretty harmless; the people using the to-be-removed feature will get notified about the situation but can fix their code at relative leisure, because things still work.
As a working Unix system administrator, allow me to disagree violently. To sysadmins, a program's output or its lack thereof is a vital part of its behavior; adding things to the output is anything but harmless, and often forces us to fix the program right away. Many times the program is effectively no longer working and you might as well have made the warnings be fatal errors instead.
In real life, a program's output is part of its interface, and thus adding 'harmless' deprecation warnings is changing the program's interface. Sometimes you will get lucky and no one will have been depending on that particular bit of the interface, but this is rare if your library, language, or whatever is at all popular, and it is generally out of your control (because it depends on what people have built with your project).
If you want to print deprecation warnings for something, the system administrators of the world will thank you if you make it not the default behavior. That way people who actually care can check when they want to (or run in checking mode all the time), but you won't cause pain for others.
(This applies several times over if you print the warning more than once. Especially if you do not rate limit the warnings.)
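A minimal sketch of the opt-in, warn-once approach; the environment variable and function names are invented for illustration and are not any particular library's API:

    # Sketch: deprecation warnings that are off by default, enabled by an
    # (invented) MYLIB_DEPRECATION_WARNINGS environment variable, and
    # printed at most once per distinct message.
    import os
    import sys

    _seen = set()

    def deprecation_warning(msg):
        if not os.environ.get("MYLIB_DEPRECATION_WARNINGS"):
            return                  # default: stay quiet
        if msg in _seen:
            return                  # rate limit: once per message
        _seen.add(msg)
        print(f"DeprecationWarning: {msg}", file=sys.stderr)

    def old_frobnicate(x):
        # hypothetical deprecated function
        deprecation_warning("old_frobnicate() is deprecated; use frobnicate()")
        return x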
If you want to print deprecation warnings anyways you should treat them as the last step before removing the feature entirely, not the first step. In fact, you should treat the addition of the warning as the removal of the feature, and tell people to consider it an incidental bonus that some programs keep on working anyways. This should hopefully create the right mindset in all parties.
(If it forces you to slow down the feature removal, tough. That's what the schedule should have been to start with.)
2007-03-10
What a sysadmin's machine should be able to do
What a sysadmin's machine should be able to do has been on my mind recently, because my group is working on getting everyone upgraded from random creaky old hardware to modern PCs. (It's a somewhat contentious issue in many quarters, because it's not obvious to a lot of people that sysadmins need much more than a (graphical) terminal.)
At a high level, I feel that a sysadmin's machine should be able to burn DVDs, drive dual displays, and run fully virtualized operating systems at a decent speed. (Well, my actual wording to people here was 'run VMWare', but it's not the only choice for the job.)
(I'm assuming that everything has a basic level of capabilities, like USB and audio and gigabit Ethernet, so I'm only looking at uncommon things.)
At a slightly lower level, you want a 64-bit CPU so that you can run virtualized 64-bit OSes, because if your servers aren't already running 64-bit OSes now, they will be soon. In my opinion, it should be a dual-core CPU, so you can test genuine SMP issues, and it should have hardware virtualization, because that expands your options for what virtualization software you can use.
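As an aside, checking candidate hardware for those last two points is easy on a running Linux system; here is a minimal sketch that counts logical CPUs (not physical cores) and looks for the Intel 'vmx' or AMD 'svm' CPU flags in /proc/cpuinfo:

    # Sketch: report logical CPU count and hardware virtualization
    # support by reading /proc/cpuinfo (Linux-specific).
    def cpu_summary(path="/proc/cpuinfo"):
        cpus = 0
        flags = set()
        with open(path) as f:
            for line in f:
                if line.startswith("processor"):
                    cpus += 1
                elif line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
        return cpus, bool(flags & {"vmx", "svm"})

    if __name__ == "__main__":
        cpus, hw_virt = cpu_summary()
        print(f"{cpus} logical CPUs, hardware virtualization: {hw_virt}")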
(Hopefully DVD burning is a no-brainer. Theoretically you can still save a few dollars per machine by buying CD-only hardware, but so many things don't fit on a single CD any more that it is more than worth the slight extra cost to avoid CD swapping hell. Especially on servers, where each swap may involve a trip to a machine room.)
As a recent convert to the church of the dual display, I would like to say that actually having dual displays should be mandatory. But I think that dual displays are still too often perceived as a luxury and the arguments for them aren't yet solid enough to overcome this, especially for sysadmins.
(I find the whole issue to be amusing from a suitable distance; I can remember back to the era when people paid quite large sums of money for decent 17" CRT monitors for sysadmins without blinking. Even ignoring inflation, two 19" LCD panels now cost less than one of those CRTs did and overall sysadmin workstations have become dirt cheap, so in a sense people are spending an inordinate amount of effort arguing about spare change. (The counter argument is that people spent so much money back then only because they didn't have a choice, and now they do.))
2007-03-02
A story of network weirdness
We have a number of internal networks here. One of them is a port-isolated subnet for general user machines (such as Windows laptops), where the port isolation makes sure that user machines can't talk to each other and thus can't infect each other. One day, an alert user on the port isolated network reported to us that his machine was seeing packets from the outside world destined for a completely different machine.
(One of the cool things about working in a Computer Science department is that we have users that will actually notice and report this sort of thing.)
It turned out that the cause of this failure in port isolation was asymmetrical routing. The target machine had a second interface on another internal subnet, and what happened was:
- the target machine brought up its interface on the port isolated subnet and made an active, long-lived TCP connection. This made its port-isolated IP address the origin of the connection.
- it brought up its interface on the other subnet, and somehow
made the gateway for the other subnet its default route.
Since changing routes doesn't change the origin IP address on established connections, this created an asymmetrical route: outgoing packets for the long-lived TCP connection went out the other subnet, but incoming packets were (properly) routed back through the port isolated subnet.
- since the switches in the port isolated subnet weren't seeing
any outgoing traffic from the target machine, they started
forgetting its Ethernet address to port association.
(But because it was an active connection, the IP address to Ethernet address mapping stayed in the router's ARP cache.)
When a switch doesn't know what port is associated with the destination Ethernet address of an incoming packet, it floods the packet out all of its other ports. In short order, packets for the target machine were being flooded to every port in our entire port isolated subnet, where one alert user noticed the strange traffic.
This wouldn't have happened with a less active connection, because the router's ARP cache would have timed out, forcing an ARP broadcast, causing the target machine to reply over its interface on the port isolated subnet, causing the switches to (re)learn the necessary Ethernet address to port associations.
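If you want to look for this sort of situation by hand, one rough check is to compare the local address a long-lived connection is actually using against the source address the kernel would pick for that destination today; a mismatch means replies are now leaving by a different path than the one the connection was built on. A minimal sketch, where the addresses are made up and the Linux 'ip route get' command is assumed to be available:

    # Sketch: flag a possible asymmetric route for one connection by
    # asking the kernel what source address it would use right now.
    import subprocess

    def current_source_for(remote_ip):
        out = subprocess.run(["ip", "route", "get", remote_ip],
                             capture_output=True, text=True).stdout
        tokens = out.split()
        # output looks like: "<dst> via <gw> dev <if> src <addr> ..."
        return tokens[tokens.index("src") + 1] if "src" in tokens else None

    # Hypothetical addresses for illustration only.
    connection_local_ip = "10.1.1.20"   # address the connection was made from
    remote_ip = "192.0.2.10"            # the long-lived connection's far end

    now = current_source_for(remote_ip)
    if now and now != connection_local_ip:
        print(f"asymmetric: connection uses {connection_local_ip}, "
              f"but new traffic to {remote_ip} would leave from {now}")
    else:
        print("routing to this destination looks symmetric")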