Wandering Thoughts archives

2008-02-26

Something that I do not understand

Here is something that I do not understand: why is the oldest and most clapped-out set of displays that you have always to be found in the machine room, barely working at the best of times?

Yes, I can see the surface logic; machine room displays are probably your least used displays and they spend a great deal of time doing nothing. But every so often you have a crisis and then you really need them to work, because in a crisis the last thing you have time and temper for is trying to get the flaky display working; you need things that work great all the time, because you are already under enough pressure from the crisis without adding more flaky bits of hardware to deal with.

(While I am not suggesting that your machine room displays should be high end huge LCD panels, I would also like to point out the benefits of clear readability in times of stress. Do you really want to be squinting at tiny type on an equally tiny display during a crisis?)

In passing, I note that the really dangerous displays are ones that usually work but flake out every so often, because they tempt you to not get rid of them (especially if they keep working when you test them and only die when you don't have the time to look into it). Do yourself a favour and exile them from your machine room.

MachineRoomMonitors written at 23:22:56

2008-02-25

The best way to shroud IP addresses

I wrote before about the limitations of hashes as privacy protectors, which leaves us with a not entirely theoretical question: suppose that you have to create a traffic accounting system, but you don't want to keep too much information in your logs. What's the best way to protect user privacy in this situation?

Our answer is to drop the first octet of the IP address. This gives much better privacy protection than dropping the last octet or even the last two octets, because it cuts against how IP address space is organized.

(One way to put this is that it often hides much more information to drop the most significant digit than the least significant digit.)

This may not meet everyone's needs, but our particular circumstance is that we don't really care about traffic volume ourselves; we just need to be able to identify the users involved when the campus NOC contacts us with reports of high bandwidth usage. The NOC reports have the full off-campus IP addresses involved, so we can look for high traffic volume to the obscured version of the IP address.
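As a rough illustration of the idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `shroud()` and `users_for()` names, the `x.` placeholder for the dropped octet, and the sample records are hypothetical and not taken from our actual accounting system.

```python
import ipaddress
from collections import defaultdict


def shroud(ip: str) -> str:
    """Drop the first octet of an IPv4 address, keeping the last three."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    octets = str(addr).split(".")
    return "x." + ".".join(octets[1:])


# Hypothetical accounting records: (local user, remote IP, bytes).
records = [
    ("alice", "203.0.113.7", 5_000_000),
    ("bob", "198.51.100.20", 120_000),
    ("alice", "203.0.113.7", 7_000_000),
]

# Only the shrouded form of the remote address is ever stored.
usage = defaultdict(int)
for user, remote, nbytes in records:
    usage[(user, shroud(remote))] += nbytes


def users_for(reported_ip: str) -> dict:
    """Given a full IP from a NOC report, total the traffic per user."""
    key = shroud(reported_ip)
    return {user: total for (user, ip), total in usage.items() if ip == key}
```

With this, `users_for("203.0.113.7")` finds alice's traffic, but the stored key `x.0.113.7` would match 10.0.113.7 just as well, which is the point: the log alone does not say which of the 256 possible hosts a user talked to.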

(Disclaimer: my co-workers came up with this idea, not me. I'm just sharing it because I think it's a useful approach to the problem.)

You do lose information with this, but then you lose information with any scheme that preserves user privacy; we could not do things like conclusively identify users that had visited an infected website (or who had zombied machines that were communicating with some master point). If we needed to do that sort of thing on a one-shot basis we would probably set up additional monitoring; if we needed to do it on an ongoing basis we would have to rethink the system.

(I suspect we'd wind up with hashed IP addresses.)

ShroudingIPs written at 23:09:41

2008-02-21

Wireless, machine rooms, and the Asus Eee PC

I want to note for the record that the combination of an Asus Eee PC and a wireless network in your machine room is a great idea. For best results, make sure that your wireless network does not depend on much of your overall infrastructure.

(Yes, we actually put a wireless access point in our machine room. Since the core wireless gateway is also in the machine room, this is far less crazy than it might seem.)

Also, the Eee is smaller than my current lab notebook, which is 24.1 x 15.2 cm (aka 9.5" x 6"; Canada may be metric, but we're not that metric), and only a bit thicker. The Eee is significantly heavier, though, since we have not yet gotten computers down to the density of paper.

Sure, technically you don't need an Eee to make this a good idea; any wireless laptop will do, since you can work from anywhere in the machine room and wander around as needed. I think that the Eee turns it into a great idea because the Eee is so small that you can park it pretty much anywhere. Big laptops can get precarious in the confined space of a machine room (and at least in our machine room, any flat surface tends to get colonized fairly rapidly).

WirelessMachineRoom written at 23:32:03

2008-02-20

How our automounter replacement works

As I've mentioned in passing, at one point we got so irritated with peculiar automounter issues that we wrote our own replacement that does what we want. It turned out to be fairly simple, because what we want is just to keep each system's mounted NFS filesystems in sync with a master list.

Since this is basically diff, you could clearly do this in a shell script. I opted to write the core of the system in Python so that it would be easier to handle an existing mount changing as well as mounts appearing and disappearing.

(Also, at a certain point you are actually writing a program instead of just a shell script, and the Bourne shell is not a great programming language. Your successors will thank you for writing in something clearer.)

The core Python program reads the master list and mount -t nfs's output and spits out shell commands to reconcile the system's state. For new mounts it mkdir's the mount point if necessary and does the NFS mount; for now-dead mounts it does a umount -f and then rmdirs the mount point if it is one of our special hierarchies. If an NFS mount changes parameters, we unmount the filesystem and then mount it again; while we might be able to do some changes with mount -o remount,..., there are some that we can't do that way (for example, changing the source of a mount point).
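The reconciliation step can be sketched roughly as follows. This is not our actual program, just a minimal illustration under some assumptions: the master list and the current mounts have already been parsed into dictionaries mapping mount point to (source, options), the `reconcile()` name and the `/h/` prefix for "our special hierarchies" are made up, and using a plain `umount` for a changed mount is a guess.

```python
import shlex


def reconcile(wanted, current, managed_prefixes=("/h/",)):
    """Return shell commands that bring `current` in line with `wanted`.

    Both arguments map mount point -> (source, options).
    """
    cmds = []

    # Now-dead mounts: unmount, then remove the mount point if it lives
    # in one of the (hypothetical) managed hierarchies.
    for mp in sorted(current.keys() - wanted.keys()):
        cmds.append(f"umount -f {shlex.quote(mp)}")
        if mp.startswith(managed_prefixes):
            cmds.append(f"rmdir {shlex.quote(mp)}")

    for mp in sorted(wanted):
        source, options = wanted[mp]
        if mp in current:
            if current[mp] == wanted[mp]:
                continue  # already in sync
            # Changed mount: unmount first, then mount again below.
            cmds.append(f"umount {shlex.quote(mp)}")
        else:
            # New mount: make sure the mount point exists.
            cmds.append(f"mkdir -p {shlex.quote(mp)}")
        cmds.append(
            f"mount -t nfs -o {shlex.quote(options)} "
            f"{shlex.quote(source)} {shlex.quote(mp)}"
        )
    return cmds
```

Printing the commands instead of running them is what makes it easy for a wrapper to either feed them to a shell or just show them to a person.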

The core program is wrapped in a shell script that finds the right configuration file and feeds the program's output to a shell, shows it to the sysadmins, or both, depending on options. The shell script is then run from cron every ten minutes or so, and it gets run during boot to get all the NFS mounts set up.

(The shell script also checks local and global flag files that tell it to do nothing, so that we can manually manipulate NFS mounts without having our actions undone in the next ten minutes. You want this feature, trust me.)
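The flag file check itself is tiny. Our wrapper is a shell script, but the logic looks something like this Python sketch; the two paths and the `should_run()` name are placeholders invented for the example, since the real locations don't matter here.

```python
import os

# Hypothetical flag-file locations: one local to the machine and one on
# a shared filesystem so that all machines can be told to hold off.
FLAG_FILES = ("/etc/nfsmounts.disable", "/nfs/global/nfsmounts.disable")


def should_run(flag_files=FLAG_FILES):
    """Return False if any 'do nothing' flag file exists."""
    return not any(os.path.exists(path) for path in flag_files)
```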

Error handling is simple. If an umount or a mount fails, there's no need to do anything special; the next time the script runs, it will notice that things are not in sync and try to fix them.

AutomounterReplacement written at 01:00:57

2008-02-15

A weird routing mystery

Once upon a time, we had a machine that wound up with a default route that pointed straight to the local network, basically what you'd get if you did route add default dev eth0.

(Disclaimer: I have no idea if your system would actually accept that route command or if it would demand a gateway.)

That this worked to some degree is not too surprising in retrospect; there actually is a straightforward meaning to this, namely to arp for all destinations on the local Ethernet, and that's what our machine did. The weirdness comes in what happened next: the machine could still ping another system that was on a completely separate subnet on the same physical network.

On the one hand this makes perfect sense: the machine was arp'ing for the other system's Ethernet address and then just sending the packets to it. On the other hand, this makes no sense: how was the other system replying to those ping packets? The other system was not on the first subnet and had no route to it, so in theory it should have dropped its ping replies as unroutable; instead it just shoveled them back on to the Ethernet.

(My best guess has to do with the first machine being present in the other system's arp cache, but I'm not completely convinced.)
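To make the two halves of the puzzle concrete, here is a simplified Python model of how a host picks what to ARP for. It is only a sketch of the textbook behaviour, not of any real kernel; the `next_hop()` name, the route table representation, and the documentation-range addresses are all made up for the example.

```python
import ipaddress


def next_hop(dest, routes):
    """Return the IP address to ARP for, or None if dest is unroutable.

    `routes` is a list of (network, gateway) pairs; a gateway of None
    means an on-link (device) route, so the destination is ARPed for
    directly.
    """
    dest = ipaddress.ip_address(dest)
    matches = [(net, gw) for net, gw in routes if dest in net]
    if not matches:
        return None
    # Longest prefix wins.
    net, gw = max(matches, key=lambda route: route[0].prefixlen)
    return dest if gw is None else gw


net = ipaddress.ip_network
addr = ipaddress.ip_address

# A normal host: local subnet on-link, default route via a gateway.
normal = [(net("192.0.2.0/24"), None), (net("0.0.0.0/0"), addr("192.0.2.1"))]
# Our odd machine: the default route points straight at the device.
odd = [(net("192.0.2.0/24"), None), (net("0.0.0.0/0"), None)]
# The other system: only its own subnet, no route to the first one.
other = [(net("198.51.100.0/24"), None)]
```

In this model `next_hop("198.51.100.5", odd)` is the destination itself, which matches the first machine ARPing directly for the other system, while `next_hop("192.0.2.10", other)` is None, which is why the replies should in theory have gone nowhere.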

WeirdRoutingMystery written at 23:33:49

2008-02-05

Prewiring experimental racks

I think that you should prewire experimental racks for power. By this I don't mean just making sure that they have some PDUs in them; I mean running actual power cords up the sides, with unconnected ends tied off every so often ready to be plugged into machines. If you turn out to need more power cords for a machine, don't try to run more power cables along the side unless they too are going to be permanent; just dangle them down the back, where they are easy to remove later.

The goal is to keep things neat, instead of winding up with the snarl of power cords that you get from pulling, repulling, and pulling out cables as you move test machines in and out of the rack. Unused but organized power cables may not be ideal, but they are a lot better than the real alternative.

The power cords should have enough slack to reach to the other side of their rack slot, since you never know which side of a machine you're going to need to plug them in to. If you have multiple PDUs in the rack, label the cords with what PDU they're plugged in to; if you have smart PDUs that can power cycle outlets, label both ends with what outlet number the cord is in (or at least should be in).

You may want to do this with network cabling too, depending on the cabling density. This is one case where a smart switch at the top of the rack may be a lot less messy than the alternatives, and you're unlikely to need lots of bandwidth out of your experimental rack unless something very peculiar is going on.

(This thought was prompted by some machine shuffling I did in our experimental rack, which has something of a power cord snarl problem.)

PrewiringTestRacks written at 23:32:21



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.