Wandering Thoughts archives

2008-10-28

One reason why people buy Ethernet taps

There are a number of people who will sell you rather expensive network tap boxes (eg here). Since hearing about them and discovering the actual prices, I've felt that traffic monitoring switches with mirroring ports (despite the VLAN issue) and dual-NIC PCs running bridging software made them pointless except for people with very high end needs; they were neat in a theoretical way, but not something we would ever need in practice, since the alternatives were perfectly good.

(There is a lot of elaborate equipment that would be cool to have around but I must reluctantly admit we don't exactly need.)

Let me retract that blithely optimistic view of mine.

We have lately been attempting to debug a switch issue involving performance problems with traffic between a 100 Mbit machine and a gigabit machine, and we think that part of the problem may be related to inter-switch flow control issues (specifically, who does it and when). As we've been discovering, the problem with monitoring switches and bridges is that they are not completely transparent; both change the layer 2 behavior of the network, in things like how pause frames or STP broadcasts are handled, and often at a level that's too low for you to really monitor or influence.

(At least some switches generate or don't generate pause frames on links based on low-level negotiations with whatever is on the other end of the link; put a bridge in, and you may have just changed what gets negotiated. Plus, pause frames do not pass through bridges, or at least not through the bridge implementations that we've been trying to use.)

Most of the time this doesn't matter and you don't think about it. But right now this matters quite a lot to us, and it has been very frustrating to find out that there is basically nothing we can do to monitor our testing to find out what is going on, because anything we add to the test environment changes the behavior (or at least could be doing so, which means that we can't trust the results).

Let me tell you, network taps are looking awfully tempting right about now. (We probably still can't justify the expense, though; this is hopefully a one-time problem.)

TheNeedForNetworkTaps written at 01:22:38

2008-10-26

Why RAID-1 is the right choice for our new fileservers

Our old SAN was set up in the traditional way: the SAN backend units did RAID-5 internally, and this RAID-5 space was carved up into LUNs and used by the frontend fileservers. For natural reasons all of our fileservers wound up using LUNs from all of our SAN backends. This setup has low overhead, decent resilience, and decent performance. Our new fileservers are set up in an entirely different way, and among other things they use RAID-1 instead of RAID-5. Although we were driven to adopt a RAID-1 approach by other issues, it has turned out to be entirely the right choice (despite the space overhead).

The problem with our old environment was what I will call 'IO contamination'. In practice, any substantial IO to any LUN on any of the RAID-5 arrays touches all of the disks in the array, which means that it contends with and affects any other IO happening to any other LUN on the array. This is especially important because multiple streams of IO are quite likely to force seeks, and the weak point of all current disks is how many separate IO operations per second they can sustain. Thus, since all of our fileservers used each backend, significant IO load on one LUN on one array could slow down many filesystems on all fileservers.

(The most glaring place that this showed up was parallelizing backups. Attempting to balance the IO load was basically impossible, so we had to just hope for the best by telling Amanda to run a few backups per fileserver.)

Did I mention that the different filesystems were owned and used by all sorts of different research groups and professors?

The great advantage of RAID-1 for us is that it makes IO traffic for different things genuinely separate; with only a few exceptions (such as if we max out a network port's bandwidth), IO to one group's space doesn't affect IO to another group's space. In practice, this independence gives everyone significantly better performance (and it has certainly sped up our backups a lot). And if there ever are performance problems, figuring out the cause is going to be much easier, because it will actually be possible to work backwards from 'hot' disks to find whatever is creating the load.

The drawback of RAID-1 is that it does cost more. Fortunately the cost of disks is dropping all the time, especially if you build your SAN out of commodity hardware.

(Since we are basically doing random IO, RAID-1 also has a straightforward performance advantage; it significantly increases how many spindles we have, and the rule for random IO is that the more spindles you have, the better.)
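(To make the contrast concrete, here is a little back-of-envelope model. The numbers in it, roughly 100 random IO operations a second per spindle and twelve disks in a backend, are assumptions of mine for illustration, not measurements from our environment; the point is just the shape of the comparison between one shared RAID-5 array and independent RAID-1 pairs.)

# A rough back-of-envelope model with made-up numbers; nothing here is
# measured from our environment.
IOPS_PER_DISK = 100   # assumed random IO operations/sec a spindle can sustain
DISKS = 12            # assumed number of disks in a backend

# One shared RAID-5 array: every LUN contends for the same pool of spindles,
# and each small random write costs roughly four disk IOs (read and rewrite
# of both data and parity).
raid5_pool_write_iops = (DISKS * IOPS_PER_DISK) / 4.0

# The same disks as separate RAID-1 pairs: each group gets its own pair, a
# mirrored write costs two disk IOs, and reads can be served by either disk.
pairs = DISKS // 2
raid1_write_iops_per_pair = (2 * IOPS_PER_DISK) / 2.0

print("RAID-5: about %d write IOPS shared by everyone" % raid5_pool_write_iops)
print("RAID-1: about %d write IOPS per pair, over %d independent pairs"
      % (raid1_write_iops_per_pair, pairs))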

WhyRAID1IsRight written at 03:24:25

2008-10-25

How we worked out the partition sizes for our SAN

Our new fileserver setup requires that all of the LUNs exported by the backend SAN be exactly the same size, down to the block, so that we can always mirror two arbitrary LUNs together. We wanted our basic LUN size to be around 200 to 250 GB, but that's a large range of possible sizes, and it still left us to pick a sensible exact size.

Our first approach was to take our 750 GB Seagate SATA disks and divide them up into three equal slices. While temptingly simple, this is dangerous, because different 750 GB drives from different vendors are slightly different sizes, and for that matter the 1 TB and 1.5 TB drives are not necessarily going to be exactly 4/3rds or 2 times larger than our current 750 GB drives. We thought about compensating for this by leaving some amount of spare space as a margin for error, but that only changed the problem to how much spare space we should leave.

(At this point it becomes important to remember that drive vendors quote sizes in 'decimal' gigabytes, where one dKB is 1,000 bytes, one dMB is 1,000 dKB, one dGB is 1,000 dMB, and so on.)

For our second attempt we started from first, well, assumptions (I cannot call them principles). The one thing we can be relatively confident of is that disk vendors will not be able to sell a '750 GB' drive that has less than 750 (decimal) GB of capacity, and similarly for 1 TB and 1.5 TB drives (if nothing else, it sails awfully close to misleading advertising). So we can achieve our goals by making our slices be 250 (decimal) GB or slightly under that.

This gave us a starting figure in (real) KB; however, it ignored any overhead from Linux partitioning and had no margin for error. So we rounded it down slightly; specifically, we rounded it down so that it was evenly divisible by 32 MB. We chose 32 MB so that we could create LVM volumes that were exactly this size and thus could use LVM on our SAN backends if we some day needed to.

(LVM volumes must be exact multiples of the LVM physical extent size, which has to be a power of two and wants to be large for efficiency. 16 MB is the default.)

(To save you the math: 250 decimal GB is 244140625 KB, and we rounded down to 244121600 KB. This 'wastes' 18.5 MB per LUN, plus however much a disk is larger than exactly 750 decimal GB, 1 decimal TB, and so on. The disks we've looked at so far are not substantially larger than advertised, so this is basically a trivial amount of space.)
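(If you want to check the arithmetic yourself, here is a sketch in Python. The only assumptions beyond what's described above are binary units, where 1 KB is 1024 bytes and 1 MB is 1024 KB, and 512-byte sectors for the final figure.)

# Reproduce the slice size arithmetic described above.
DECIMAL_GB = 1000 ** 3      # a drive vendor's 'decimal' GB, in bytes
KB = 1024                   # a real (binary) KB, in bytes
MB = 1024 * KB

target_kb = (250 * DECIMAL_GB) // KB      # 250 decimal GB -> 244140625 KB
align_kb = (32 * MB) // KB                # round down to a multiple of 32 MB

slice_kb = (target_kb // align_kb) * align_kb
print(slice_kb)                           # 244121600 KB
print((target_kb - slice_kb) / 1024.0)    # the roughly 18.5 MB 'wasted' per LUN
print(slice_kb * 2)                       # 488243200 512-byte sectors, for partitioning tools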

Trying to create partitions that were exactly this size is where I got to find out about the effects of various fdisk options.

SANPartitionSizes written at 02:17:58

2008-10-22

How Amanda knows what restore program to use

Here's an error message we got from Amanda during a recent restore attempt:

warning: restore program for /slocal/amanda/bin/ufsdump not available.
amrecover couldn't exec: No such file or directory
  problem executing restore
amrecover: amrecover: pipe data reader has quit: Broken pipe

First the background: the filesystem that we were attempting to restore a file from had recently been migrated from our old fileservers to one of our new Solaris fileservers; the backup we were restoring from had been made on an old Solaris fileserver, and was being restored on the new fileserver. (Our old fileservers use a cover script for Solaris ufsdump for local reasons.)

What Amanda is complaining about here is that it doesn't know what program to run to actually restore files from the backup. In fact it wound up taking a guess, but the guess didn't work.

(Amanda doesn't have its own format for backups; instead it relies on things like GNU tar and ufsdump, and just ships around and manages their output. This means it needs to run an outside program, the 'restore program', in order to actually retrieve things.)

When it makes a backup, Amanda also records the 'type' of the backup and the program that was used to generate it. Unfortunately, Amanda only has three types: Samba backups, tar backups, and a general type for 'dumps', of which there are at least four sub-varieties. Amanda guesses which sub-variety of dumps it is dealing with (and thus what program it should use to restore them) by looking at the program that made the backup. If it doesn't know the program at all, Amanda falls back to the hard coded default name 'restore'.

(You can see what any particular copy of Amanda thinks these programs are by examining the debug logfiles for amandad; they're part of the configuration parameters that get reported on startup.)
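As an illustration of the general idea only (this is my sketch of the logic, not Amanda's actual code, and the program paths in the table are hypothetical examples), the guess amounts to a simple lookup with a fallback:

# A sketch of the idea only; the real logic and the real table live inside
# Amanda. The program paths here are made-up examples.
RESTORE_FOR = {
    "/usr/sbin/ufsdump": "ufsrestore",   # Solaris ufsdump backups
    "/sbin/xfsdump": "xfsrestore",       # XFS dump backups
    "/usr/sbin/vxdump": "vxrestore",     # VxFS dump backups
}

def restore_program_for(backup_program):
    """Guess the restore program from the program recorded as having made
    the backup; unknown programs fall back to the hardcoded name 'restore'."""
    return RESTORE_FOR.get(backup_program, "restore")

# Our old fileservers' cover script is not in the table, so the guess falls
# through to 'restore', which isn't in $PATH on the new fileservers:
print(restore_program_for("/slocal/amanda/bin/ufsdump"))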

In our case, Amanda on the new fileservers was never configured to use our old fileserver cover script as the '(plain) dump' program, and so it didn't know that it should use ufsrestore (its 'plain dump' restore program) to restore our backup. There are two options for a solution:

  • reconfigure Amanda on the new fileservers to use the same nominal ufsdump cover script. This is undesirable for various reasons, but is temptingly easy (and we don't need to do UFS dumps on the new fileservers).
  • put a 'restore' program (either a cover script or just a symlink to ufsrestore) somewhere in our $PATH when we do Amanda restores.

I suspect that we'll wind up adopting the first solution.

Update: I was wrong about the second option; you have to do it somewhat differently to make things work. See AmandaRestoreProgramsII.

AmandaRestorePrograms written at 01:43:15

2008-10-09

We've lost the password battle

It's been an article of faith, frequently preached to users, that they should never write down their password or otherwise record it. Your users probably profess to follow this, and may even honestly believe that they are.

But find a user with a machine that recently rebooted (this often doesn't take long) and watch what happens next, as the user re-establishes their environment and restarts their applications. Did they get asked for a password when their IMAP-based mail program started, or is it happily fetching mail? How about their CIFS-based shares, did they get asked for that password when they started talking to your Samba server?

Probably not.

Do you use separate user passwords for each of those services?

Almost certainly not. (At least around here, the users would likely lynch us for trying that. And it wouldn't really matter if we used a separate password for these services than for people's Unix login; the net effect would be to make even fewer people log in to our Unix servers, with no decrease in an attacker's ability to do damage.)

If you are lucky, your users have some sort of master password that unlocks their machine. If you are really lucky, all of their applications are using a single secure password store, instead of putting together various ad-hoc solutions to the problem (or just storing passwords in barely encrypted form and ignoring the issue).

(By the way, try not to think too much about the effects of having a webmail system. You'll sleep better.)

LostPasswordBattle written at 01:06:21


