Wandering Thoughts archives

2011-12-26

Labs versus offices for sysadmins (or at least us)

On the one hand, lab areas are great because they mean that noisy machines aren't in your office. On the other hand, lab areas are bad because they aren't your office, including that they're noisy, often uncomfortable, don't have your system setup, your phone, and so on. Really what you want is machines in the lab that you can fully control from your office with at most occasional in-person visits; sadly we rarely get that.

This means that there's a constant tension between putting test machines in the lab area and putting them in your office. At least around here, what tends to happen is that relatively quiet hardware winds up in people's offices for testing rather than being dropped in one of our lab areas; the annoyance of having the hardware in your office is less than the annoyance of having it not in your office. In turn, this drives our desire for lots of drops (per earlier entries), and when we don't have lots of drops people wind up running network cables between offices because it's still more convenient than trying to rig something up in a lab area.

(Individual tolerances for noise vary; my co-workers are far more tolerant than I am and so they have a lot more stuff in their offices than I do in mine. It's also possible that this means our lab areas aren't set up well; they have no more network drops than our offices do, since the entire building was wired up a long time ago.)

Now, I've kind of given an incomplete view of what we do with hardware. We don't really have a good lab area that's isolated enough for actively loud hardware, like your garden-variety really noisy 1U servers, so they wind up getting shoved into a machine room if they're going to hang around for long. What we mostly wind up using either in offices or in the lab area is things like switches or various desktop machines we use to build test servers and test networks; for example, if we want to build a duplicate of our Samba and CUPS environment we don't do it on actual 1U servers, we just grab a couple of spare desktops and start installing. They're not as powerful as the real thing and they're not quite identical to it, but we can put the same software on them and they're a lot more convenient (quieter, less demanding of power, easier to find space for, etc).

(Some people use virtualization for this. Locally, I'm the only really active user of this approach; my co-sysadmins mostly prefer using real hardware.)

LabsVsOffices written at 00:54:48

More wiring for sysadmins: sysadmins and gigabit networking

In reaction to comments on his entry The Other Way, Matt Palmer wrote in part, responding to my concerns about office switch uplink bandwidth for sysadmin drops (in WiringForSysadminsII):

[...] What I question is the need for constant, sustained gigabit over an extended period to another isolated machine such that you need a dedicated link to them.

I sort of half-agree that sysadmin machines and drops don't need constant, sustained gigabit bandwidth (although I'm not entirely sure about that). But what they do need is occasional periods of real gigabit bandwidth, and real gigabit bandwidth when you can be absolutely sure that the underlying link will deliver gigabit data rates and the only performance limits are those created by the machines, switches, and so on at either end.

If I'm testing how fast hardware and software can go or if I'm trying to investigate network performance problems that have been reported to us, I need an environment that is not artificially contaminated by other networking traffic coming in to my heavily VLAN'd office switch. I know that some amount of contamination is there (I have tcpdumps of our internal networks and some of them are remarkably noisy); it may be enough to be significant, or it may not be. I don't want to have to guess about it and make assumptions. I want a clean gigabit, one that's as close as possible to what machines and users would see in the real environment in our machine rooms or in user offices.
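
As an illustrative sketch of what I mean (iperf is one common tool for this; the host and interface names here are made up), a clean point-to-point test looks something like this:

# on the test machine at the far end of the dedicated drop
iperf -s

# on my desk machine: a 30-second TCP test towards it
iperf -c testbox1 -t 30

# meanwhile, eyeball how much unrelated traffic is hitting the same interface
tcpdump -i eth0 not host testbox1

On a genuinely dedicated link the tcpdump should stay almost silent, which is the whole point.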

WiringForSysadminsIII written at 00:25:47

2011-12-25

Why office switches plus VLANs aren't the answer for sysadmins

More or less in reply to my last entry on wiring offices for sysadmins, I got a tweet:

@thatcks 2 drops is enough. Desk switch per sysadmin + vlans and bob's your uncle

(I can't find the tweet in the person's public stream right now so I'm not going to attribute it. Maybe I should paraphrase it instead of quoting directly, but any paraphrase would probably be longer.)

I thought of this, but there are three reasons why this doesn't work: uplink bandwidth, maintenance overhead, and a network design that doesn't have everything as part of the same VLAN fabric.

The maintenance overhead is pretty straightforward. Any time you want to set up a new network or tear it down, you need to modify a bunch of switches to add or delete VLANs: each sysadmin's switch and then your master uplink switch for all of the sysadmin offices. Even if sysadmin office uplink is the only thing this master switch does, it is a 'production' switch in that other sysadmins are not going to be happy if something goes wrong on it and things suddenly drop off the network. This is a pain at the best of times and can rapidly reach the point where running cables around the office is easier.

The uplink bandwidth issue is twofold. First, your total bandwidth across all VLANs is limited by the switch uplink bandwidth. In all realistic configurations this means you have a 1 Gbit total limit, and yes, this can easily get in the way of certain sorts of testing. Second, the more VLANs you push over a single uplink, the more bandwidth you lose to background chatter on those VLANs and any ordinary traffic your regular machines may be doing (on regular production VLANs). Among other things, this complicates efforts to measure the true network performance of machines; are you getting less than a gigabit because they just can't deal with a gigabit, or is it because of other traffic on your office switch?

You can deal with some of the uplink issue by not propagating currently unnecessary VLANs to your office switch, but then you wind up needing to reconfigure at least your master switch every time your VLAN needs change. We are currently in this situation and, take it from me, it is a pain in the rear that discourages testing.
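
(For contrast, the host side of a new test VLAN is the easy part; on a Linux machine it's a couple of iproute2 commands, sketched here with a made-up VLAN ID, interface name, and address. It's the switch-side churn that actually costs you.)

# create a tagged subinterface for a hypothetical test VLAN 42
ip link add link eth0 name eth0.42 type vlan id 42
ip addr add 10.42.0.1/24 dev eth0.42
ip link set eth0.42 up

# and tear it all down again when the test is over
ip link del eth0.42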

The network design issue is that some of your networks may not be designed to run over your normal VLAN fabric and switches. There are three examples of this here (for background, see the details of our network setup):

  • we have several port isolated internal networks that are kept out of our regular VLAN fabric except at touchdown points for internal firewalls.
  • we have a completely isolated management network that runs over its own dedicated switch fabric. Bringing it into contact with our regular VLAN fabric in order to get it to our office is at least a violation of its design and has potentially bad effects if, for example, some of the switches involved also have management ports that are directly connected to the management network.
  • we have deliberately isolated, non-VLAN'd, non-connected iSCSI networks. We may at some point want a port on an iSCSI network in our office, but we definitely do not want to pull the iSCSI networks into our VLAN fabric; we want a direct port to the iSCSI switch.

Trying to merge together otherwise isolated networks and VLANs on a subset of switches makes me nervous. There is a real possibility for accidental leaks and contamination (as well as weird side effects), and it's especially acute when you're reconfiguring the master office switch often. Of course this also holds for putting new VLANs for test networks on the master office switch, since these new VLANs are not part of your regular VLAN fabric and should never be propagated to it.

Sidebar: what I want in an office setup

The short form version is that I want one port to carry the primary networks for my main machine, one port for a switch with all of our regular VLANs on it so I can connect to them for testing, one port with our port isolated network for user machines, one port for our isolated management network, and several other ports that I can connect to anything as the need arises. This is at least five ports.

(Thus I think the basic need is two ports for static stuff (one for your primary machine, one for the 'has everything' VLAN switch), plus some number of ports for floating things.)

Right now I have, effectively, three ports: one port with a selection of our regular VLANs that feeds through to my main machine and one port with an Ethernet splitter that gives me our port isolated network for internal machines plus our management network. The latter two networks only run at 100 Mbits each but that is not currently a problem. This is not enough ports, and I don't do as much networking work as my co-workers.

WiringForSysadminsII written at 02:18:00

2011-12-24

Wiring offices for sysadmins

Our office full of sysadmins has a network wiring problem: we don't have enough. Watching how we've dealt with this problem has given me some opinions on how you should wire an office area for sysadmins in specific, as opposed to just general usage.

In a conventionally wired office area, all of the drops (network ports) run back to a big wiring closet (generally one to a floor or so) or even all the way back to your machine room. In the wiring closet or machine room, the drops go to a patch panel and are then connected to appropriate switches (and reconnected, as your networking needs change). This is a perfectly sensible arrangement and has the great advantage that you don't need to go into office spaces in order to shuffle what network a port is connected to; assuming that you have an accurate port number you can just go to the wiring area, switch the cabling, and you're done.

However, this is not the right setup for a sysadmin office area. In a sysadmin office area all of the drops should go to a wiring closet area in the office itself, which is also where all of the connections from the main wiring closet or machine room should go. Why?

A sysadmin office area has the unusual requirement that we periodically want to set up new private networks, ones that are mostly or completely disconnected from our regular networks. Going off to the machine room or the floor's wiring closet every time you want to do this is a time-consuming pain; since sysadmins are either lazy or very good at working efficiently (depending on your perspective), the end result is that most of the ad-hoc testing networks will actually be implemented by just running wires around the office. The result of this is wires strung all over the place.

(The exception is any test network that needs to touch servers in the machine room.)

Running sysadmin drops back to something in the office makes it easy to set up these ad-hoc testing networks, in fact easier than grabbing some cabling and running it around. This is what you want if you're going to keep the office in some sort of order.

There are various downsides to this two-stage wiring, with different ones depending on how you've set things up. The top level summary is that, well, you've added another wiring closet and thus another level of indirection in your network. My personal opinion is that it's worth it. If you want to reduce the problems, you could wire the normal office drops straight through to the normal wiring point and then add extra drops (clearly marked) that go only to the in-office wiring area. The drawback of this is that you have to decide how many drops of each sort each spot will need instead of being able to adjust the purpose of drops on the fly.

WiringForSysadmins written at 01:07:21

2011-12-23

Disk space in the modern world

A while back I wrote about someone looking for long-term archives of 10 to 20 Tbytes of data with the conclusion being that you shouldn't try to build archives, just a live fileserver. As it happens, I have a theory about why the question was asked in the first place; I think that many of our sysadmin instincts about disk space are miscalibrated for the modern world.

To put it simply, I suspect that a lot of sysadmins come from a time when 10 or 20 terabytes was a heart-stoppingly big amount of disk space. If you needed multiple terabytes of space, you needed a big solution, something that would take up a bunch of space, cost a lot of money, and call for a bunch of careful planning to design and spec out. 20 TB of disk space wasn't something you could put together casually; it was big iron, at least by the standards of non-enterprise setups. In short, 10 or 20 terabytes were a big deal.

That's no longer the case. In the modern world, 20 terabytes is no longer big iron (although it's still not trivial). It's perfectly sensible and only a little bit expensive to put together an environment with that much disk space, and it no longer needs extensive planning. Our instincts will adjust to this new reality in time, but in the mean time I sometimes have to remind myself that terabytes of disk space aren't a big deal any more.

(This is only a pretty recent development, one created by affordable terabyte-plus hard drives and the general adequacy of SATA drives. Of course, hard drive space isn't the only thing that this is happening to; we've been having a similar effect with RAM for a while.)

ModernDiskSpace written at 00:52:47

2011-12-20

A little script: nssh

(Once again it's been a while since the last little script.)

One of the things that we do reasonably often around here is install and reinstall servers. When we do this, the server's SSH host key changes (either permanently or temporarily, until we can restore its canonical key), and of course ssh then complains about host key mismatches when we log in to the newly reinstalled server.

A while back I got tired of having to deal with this by hand, so I decided to automate it. Enter a script that I call nssh:

#!/bin/sh
# ssh with no host keys: ignore known_hosts, accept whatever key
# the host presents, and skip pubkey authentication
exec ssh -o 'UserKnownHostsFile /dev/null' \
         -o 'PubkeyAuthentication no' \
         -o 'StrictHostKeyChecking no' "$@"

(Okay, my script actually doesn't explicitly set StrictHostKeyChecking because I long ago made it a default in my .ssh/config, on the grounds that this was what I was doing by hand anyways; I always just said 'yes' when ssh prompted me. I have a number of odd behaviors with ssh.)
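
(For illustration, it gets used exactly like ssh itself; the host name here is invented:)

# log in to a machine that was just reinstalled, with no host key complaints
nssh root@newserver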

This is a trivial little script but it's turned out to be very handy, like others before it. Tiny or not, it eliminates an irritating bit of make-work and that makes me happy.

(The need for this script while dealing with machines being reinstalled is an artifact of how our install system works. A more sophisticated install system could arrange for the correct canonical host keys to be installed before you needed to ssh to the new machine.)

LittleScriptsVIII written at 00:39:28

2011-12-14

What makes backups real

DEVOPS_BORAT tweeted today:

Is not about backup, is about restore.

YES. Many, many times yes.

DEVOPS_BORAT is undeniably funny, but sometimes those funny things are also pithily saying something very important where you shouldn't just laugh and move on. This is one of those times.

Until you have tested restores, you do not have backups; you have a superstitious ritual that may or may not write some useful bits to some place. What is important is not making those bits; what is important is getting things back. If you are not testing restores, you are just going through the motions of backups without knowing if they actually work. Restores are what makes backups real instead of cargo cult rituals.

Make your backups real today, before you find out the hard way that you've just been performing a superstitious ritual.

(The ideal test is an end-to-end restoration where you don't just test that you can, say, restore a database's files from backups; you also test that your database software is happy with the files and that all of the information is there.)
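
As a minimal sketch of what such an end-to-end test might look like (this assumes a PostgreSQL dump-based backup; the database, dump file, and table names are all invented), the core of it is something like:

#!/bin/sh
# restore last night's dump into a scratch database and sanity-check it
createdb restoretest
pg_restore -d restoretest /backups/maindb-latest.dump
# the real verification is application specific; a row count is only
# the bare minimum
psql -d restoretest -c 'select count(*) from some_important_table'
dropdb restoretest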

If you want something hair-raising, I've written before about all of the things that can go wrong with backups.

WhatMakesBackupsReal written at 22:15:45

2011-12-12

What debugging info I want from things like SMF and systemd

I've recently been tooling around in both SMF and systemd, so I've developed some strong opinions on what startup systems like this need to provide to help working sysadmins. It will probably not surprise you to know that neither SMF nor systemd delivers entirely useful information today.

There are two sorts of service startup problems, which I'll call the simple and the complex. The simple question is 'why did this service fail to start'. To answer this, sysadmins need to know exactly what was run and what happened to it, ideally including its output and log messages; if the service wasn't started because of missing dependencies, we need to know what they were. SMF and systemd both half-heartedly deliver this today.

(Neither has output that is optimized for making it clear whether a service is not enabled at all, is enabled but has missing dependencies, or was enabled and then failed to start or died later. In fact, distinguishing between 'failed to start' and 'seemed to start fine, died later' is actually fairly important but doesn't tend to be reported well.)

The complex startup problems are ordering problems, such as our recent issues with ZFS pool activation and iSCSI disk discovery. To deal with these issues, you need two things: you need to know the actual order in which services started and finished starting (you need both because in a modern system several services may be starting at once), and you need to know why, i.e. the service dependency graph. In fact you want several views of the service dependency graph, for example the transitive expansion of dependencies both ways for some service: 'everything that has to be up for this service to start' and 'everything that requires this service to be up'. It's also handy to be able to ask questions like 'does service X depend on service Y in some way, and if so how?'

This is mostly missing in SMF and systemd today, as far as I can see. It's possible that systemd can be coaxed into giving you the information, but if so how to do it has been carefully hidden from harried sysadmins.

(As an example of this incompleteness, SMF will give you direct dependencies in both ways but not indirect ones, which leads to a frustrating game of 'chase the dependency'.)
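
(For reference, the SMF commands in question are the direct dependency listings, shown here with an example service; each only goes one level deep, which is what makes it a game of 'chase the dependency'.)

# what this service directly depends on
svcs -d svc:/network/ssh:default

# what directly depends on this service
svcs -D svc:/network/ssh:default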

StartupDependencyInfo written at 02:04:54

2011-12-08

Another reason sysadmins should program

One of the SysAdvent entries I recently read was John Vincent's Always Be Hacking, about how all sysadmins should be able to program. He gives a bunch of sensible, rational, career-related reasons for this, all of which I wholeheartedly agree with. Still, all of that career-related stuff is a bit dry and calculating, and maybe you're not really enthused by it (any more than many Unix sysadmins are enthused by Windows, popular or not).

As it happens, I feel that there's a big motivation that John only touched on in passing:

Being able to program is a great way to make system administration more fun.

And I'm not talking about programming itself being fun (it is to some people, me included, but not necessarily everyone).

System administration is about solving problems (including by building things); that's what gives us our kick. Being able to program (and its mirror twin, being able to understand programs) vastly increases both the number of problems that you can solve and the size of the problems that you can tackle. Every increment of programming capability widens the horizons that you can reach; shell scripts, 'scripting' languages like Python and Ruby and Perl, large system languages like Java and C++, and low-level understanding in C and assembler all add to what you can build and what you can understand.

If you learned shell scripting on your own as a sysadmin, you've probably already experienced the moment where you suddenly realized that you could script all sorts of tedious stuff that you'd been doing by hand. Full scale programming gives you that moment over and over again, mixed with the feeling that you are peering behind the walls of the world and seeing the gears turning (as in this well-known illustration).

You're a system administrator. What's not to like about being able to solve more and bigger problems, about being able to scale more and more of the walls that inevitably appear in your way? Being blocked by obstacles is frustrating and no fun; getting past them (or not even noticing they're there) is great fun.

SysadminProgrammingFun written at 00:38:26

