Wandering Thoughts archives

2010-05-30

The end of university-provided email is probably nigh

One of the things that has happened around the university lately has been a review of our campus-wide student email offering. You can probably guess the result; pretty much everyone involved (especially the students) was very firmly in favour of moving to GMail or some competitor, provided that various side issues could be adequately dealt with.

(Yes, some of these side issues are not trivial. But they are apparently significantly smaller than they used to be, and addressable.)

There's a part of me that thinks this is sad; what has life come to when the university can't offer an attractive and competitive email service to its students? But that's the wrong view, because what's happening is not that the university is being cheap but that email has become a big business. Google, Microsoft Hotmail, and others are pouring far more resources into developing excellent email environments than pretty much any university can afford to spend.

And fundamentally, we should expect Google, Microsoft and so on to be better at software development than the university is (if they are not, something horrible is wrong with the world). It is their business, after all, and they work quite hard to make their business work as well as possible. Including by attracting good programmers, who are far more likely to go to Google than to accept a job hanging around the university.

In short: that the university could provide a competitive email service in the past was just an artifact of the fact that no one was trying very hard to offer a good email service. That's changed, and I can't be too sad about the fact that students are probably going to get a better service out of the result.

(One can argue whether or not students should care about the things that they will be giving up with third party cloud-based email, but in practice they don't seem to care. And the university is not really in a position to forcefully protect them from these potential problems, not unless we want to get really draconian and unpopular (and spend a bunch of money).)

Sidebar: why has this situation changed?

One might ask what's changed to cause people like Google to offer a good email service. I think that a big part of the reason is that web-based email clients got good, and in fact often better than desktop MUAs.

When a user's MUA is their interface to email no matter who provides it, the playing field is relatively level for service providers, and their programming resources can only do so much to make their own service more attractive. Web-based email changes this, because suddenly you can spend programming effort on client improvements that only benefit you. Thus, how much development effort you can throw at your service suddenly matters a lot.

(You can argue about spam filtering, but many webmail providers were historically not very good at it and one can outsource that to a vendor and get quite good results.)

UniversityEmailEnd written at 20:41:11

2010-05-29

UPSes: defense against problems, or sources of them?

Here is something that we have been forced to think about lately: are UPSes really a good insurance policy against power problems, or are they instead an extra source of problems? In short, does using UPSes really increase your net reliability?

The problem with UPSes used by themselves is that they are another piece of machinery to fail (and they are a moderately complicated piece of machinery at that). And UPSes do fail; for example, we recently had an incident where a UPS reset itself out of the blue, briefly dropping power to everything connected to it (and it was not a power overload situation).

(Even when they don't fail outright, UPS batteries eventually age into uselessness and must be replaced, which generally requires you to take the UPS out of service.)

So the real question is what the MTBF of UPSes is compared to the mean time between power failures. For us, the mean time between power failures seems to be very large and visibly larger than the MTBF of our UPSes; since we put our current crop of UPSes into production we have had no power failures and at least one UPS failure. At the moment this appears to make UPSes a net negative, in that we are more likely to have power problems caused by UPSes than by actual power loss.
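To make that comparison concrete, here is a back-of-the-envelope sketch in Python; all of the numbers are invented for illustration (they are not our real failure rates), but it shows the shape of the calculation:

    # All numbers here are invented for illustration, not our real figures.
    power_failures_per_year = 1 / 5.0   # one utility power failure every 5 years
    ups_failures_per_year = 1 / 3.0     # one UPS-caused outage every 3 years

    # Assume the UPS battery rides out this fraction of utility failures;
    # the rest outlast the battery and take machines down anyway.
    ups_covers_fraction = 0.8

    # With the UPS in the critical path, a machine goes down on UPS failures
    # plus the utility failures the battery can't cover.
    outages_without_ups = power_failures_per_year
    outages_with_ups = (power_failures_per_year * (1 - ups_covers_fraction)
                        + ups_failures_per_year)

    print(f"expected outages/year without a UPS: {outages_without_ups:.2f}")
    print(f"expected outages/year with a UPS:    {outages_with_ups:.2f}")

With those made-up numbers the UPS comes out behind, which matches our experience so far; at a site that loses power every few months, the balance flips.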

The way around this is to arrange for the UPS not to be a critical path component, so that if it fails things don't go down. However, this takes extra hardware for every machine; you need dual power supplies or the equivalent, so that you can have the machine still getting power even if the UPS fails. This is generally somewhat expensive.

(You can apparently get external power units that give you dual power sources, so that you can protect even 1U servers, basic switches, and other things that don't normally have an option for dual power supplies.)

When you want to spend extra money, you wind up asking yourself how much extra uptime your money is buying you. If power failures are extremely rare the answer may well be 'not much'. Certainly this issue has given us some things to think about.

(Paying extra for genuine UPS insurance, dual power supplies and all, may be worth it if it lets you run machines in otherwise unsafe configurations for extra performance, for example having disk write caches turned on. But this probably turns it into a question of how much the extra performance is worth to you, not how much the reliability is.)

UPSCausingProblems written at 23:22:30

2010-05-25

Watch out for quietly degrading SATA disks

In an ideal world, disks would either be working completely normally or obviously broken (either producing errors or being completely dead); if a drive wasn't actively reporting problems, you could assume it was working fine. I am here to tell you that sadly we do not live in that world, at least not with SATA drives.

What we've now seen several times is SATA drives that degraded quietly; they didn't particularly report errors, they just started performing terribly (by their usual standards). The most recent case was a 1 TB SATA drive whose sequential read rate off the raw disk dropped from 100 Mbytes/sec to 39 Mbytes/sec, but we've had others (and from multiple vendors), and I've seen similar reports from other people.

(At least in our case there were no warning signs from SMART reports, although the disk did report a read failure recently (not during the speed tests, I'll note). Possibly that counts as a very bad sign these days; I'm certainly aware that write errors are one, as they mean that the disk has exhausted its ability to spare out bad sectors.)

Clearly, sometimes modern disks either fail quietly or just go bonkers. Equally clearly we can no longer count on status and error monitoring to turn up disks with problems; we're going to need to put together some sort of positive health check, where we periodically test disk performance and start raising alarms if any disk comes in below where it should. Making this reliable in the face of regular production IO to the disks will be interesting.
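As a sketch of what such a positive health check might look like, here is the simple version in Python; the device list and the alarm threshold are invented, and a real version would want O_DIRECT (to defeat the page cache) plus some care about when it runs relative to production IO:

    #!/usr/bin/env python3
    # Sketch of a positive disk health check: sequentially read a chunk of
    # each raw disk and raise an alarm if the rate falls below a floor.
    # Device names and the threshold are illustrative, not our real setup.
    import os
    import time

    DISKS = ["/dev/sda", "/dev/sdb"]   # hypothetical device list
    READ_BYTES = 256 * 1024 * 1024     # read 256 MB from each disk
    CHUNK = 1024 * 1024
    MIN_MB_PER_SEC = 60                # alarm threshold in MB/s

    def sequential_read_rate(dev):
        """Return the sequential read rate of dev in MB/s."""
        fd = os.open(dev, os.O_RDONLY)
        try:
            start = time.time()
            got = 0
            while got < READ_BYTES:
                data = os.read(fd, CHUNK)
                if not data:
                    break
                got += len(data)
            elapsed = time.time() - start
        finally:
            os.close(fd)
        return got / (1024.0 * 1024.0) / elapsed

    for dev in DISKS:
        rate = sequential_read_rate(dev)
        status = "ALARM" if rate < MIN_MB_PER_SEC else "ok"
        print("%s: %s at %.1f MB/s" % (status, dev, rate))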

(It's possible that some of our apparently bad disks would be fine after being power-cycled and cooling down and so on. Re-testing the most recent failed disk is on my list of things to do sometime, to see if this issue is persistent. As a transient issue there are all sorts of possible explanations ranging from firmware bugs latching the drive into some peculiar state to excessive vibrations (we're now learning that these can visibly degrade drive performance). As a permanent issue, well, it could be something like too much bad sector sparing in action; I'm not certain if our current SMART monitoring software notices that.)
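On the bad sector sparing question, smartctl will show the relevant raw counters, so a monitoring check could watch them directly. A sketch, with the caveat that the attribute names below are the common ones and some drives report things differently:

    #!/usr/bin/env python3
    # Sketch: warn about drives with non-zero reallocated or pending sector
    # counts, using smartmontools' 'smartctl -A' output.  Attribute names
    # vary somewhat between drive vendors; these are the common ones.
    import subprocess

    WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")
    DISKS = ("/dev/sda", "/dev/sdb")   # hypothetical device list

    def check_smart(dev):
        # Don't check the exit status; smartctl sets various bits in it
        # that aren't fatal for our purposes.
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[1] in WATCHED:
                raw = int(fields[9])
                if raw > 0:
                    print("WARNING: %s %s = %d" % (dev, fields[1], raw))

    for dev in DISKS:
        check_smart(dev)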

QuietSATADegradation written at 23:22:22

2010-05-16

A theory about our jumbo frame switch firmware bug

Last entry I mentioned that I now had a theory about our odd switch failure with jumbo frames, where after a power cycle the switch would start doing jumbo frames remarkably slowly until you went into the configuration system and re-selected the 'do jumbo frames' option. Here is that theory.

As I've mentioned before, modern switches have two parts; a high speed switching core and a slower management processor that handles everything else. If the jumbo frames weren't being handled by the switching core but were instead being passed up to the management processor, you could expect things to work but be very slow, which is just what I saw.

So how could things get that way? My theory is that the code that configured the switching core on boot was doing an incomplete job of enabling jumbo frames; it told the switching core to accept them, but didn't turn on everything that was needed to have the switching core actually switch them. The code that got run when you turned on jumbo frames in the configuration system did do the full setup, which is why explicitly re-enabling jumbo frames in the configuration interface suddenly made them work at full speed.

(This theory also leads to a decent story about how the switch passed the vendor's testing, since most testing starts from factory default settings.)

One of the things that this reinforces for me is that modern hardware is not just hardware; it has a lot of non-trivial software embedded into it. This matters because software generally has much more complicated failure modes than physical hardware, which means that even what we think of as simple hardware can behave very oddly in narrow circumstances.

(The poster child for this is hard drives, which now run a scarily large amount of onboard code to do increasingly sophisticated processing, more or less behind your back. All things considered, I am sometimes impressed that modern HDs work anywhere near as well as they do.)

SwitchJumboTheory written at 00:55:22

2010-05-15

Why we don't use jumbo frames for iSCSI: a cautionary tale on testing

When we were initially designing our iSCSI SAN environment, we planned to use jumbo frames; it was just an obvious thing to do. Then we tried to find an inexpensive 16 or 24 port gigabit Ethernet switch that actually did jumbo frames completely correctly.

To skip to the punchline, we failed. Worse, in the process of failing it became obvious that jumbo frames were a dangerous swamp. (Since jumbo frames did not get us any real performance increase, we didn't then try expensive switches.)

The easiest switches to deal with were the ones that didn't support jumbo frames and said so. More annoying were switches that claimed to support jumbo frames but were lying about it. The worst and most dangerous switches were ones that supported jumbo frames, but badly, and the champion switch at this had a really creative bug.

This particular switch did jumbo frames fine, passing traffic and running just as fast and reliably as you'd expect; I was quite happy, because I thought I'd finally found a winner. Then I power cycled it for an unrelated reason, and immediately jumbo frame TCP performance dropped off a cliff, going from wire speed to under a megabyte a second (but it didn't break completely). The switch configuration claimed that jumbo frames were enabled; re-enabling them suddenly cured the performance issues, at least until the next power cycle.

(This was clearly a firmware bug, and in retrospect I can guess where it was, but it made the switch useless; we couldn't use a switch that would effectively kill our entire iSCSI infrastructure after a power failure until it was fixed by hand.)
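For what it's worth, the 'is the switch actually passing jumbo frames right now' part of this is easy to script. Here is a minimal sketch, assuming Linux iputils ping, a 9000-byte MTU on both ends, and placeholder hostnames; note that it only catches paths that drop jumbo frames outright, so catching the 'works but very slowly' failure above also needs a throughput measurement:

    #!/usr/bin/env python3
    # Sketch: verify that jumbo frames actually pass end to end right now.
    # Assumes Linux iputils ping and a 9000-byte MTU on both hosts: an
    # 8972-byte payload plus 8 bytes of ICMP header plus 20 bytes of IP
    # header fills a 9000-byte packet, and '-M do' forbids fragmentation so
    # a non-jumbo path fails outright instead of being quietly fragmented.
    import subprocess
    import sys

    TARGETS = ["iscsi-backend-1", "iscsi-backend-2"]   # placeholder names

    def jumbo_ok(host):
        r = subprocess.run(["ping", "-c", "3", "-M", "do", "-s", "8972", host],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return r.returncode == 0

    failed = [h for h in TARGETS if not jumbo_ok(h)]
    if failed:
        print("jumbo frame path broken to: %s" % ", ".join(failed))
        sys.exit(1)
    print("jumbo frames passing to all targets")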

My experience with this switch convinced me of two things. First, it was going to be hard to find what we wanted; we were going to spend a lot of time auditioning and testing a lot of switches for not very much gain. Second and more important, jumbo frames were dangerous because they made switches have obscure failure modes. We ran into the power cycle issue mostly through luck; what if the next jumbo-supporting switch had an equally obscure issue that we didn't stumble over in testing and that only blew up in our face after a few months (or years) in production?

(Note that this is one of the cases where hindsight makes everything look obvious. Of course you should test power cycling a switch, it's so obvious. Except that it's not, because a switch that loses part of its configuration over a power cycle should never have gotten into production in the first place.)

What this really points out is the difficulty of fully testing something of even moderate complexity (at least from the outside, in black box testing). Part of it is the sheer amount you need to test, but a large part of it is our natural habit of jumping to assumptions in order to simplify our lives, especially if those assumptions are the way things have to be in order to work right. You can defeat these assumptions and the resulting mental blind spots, but only if you work hard at it, and you will wind up with a lot of work, most of which won't get you anything.

(And really, how much time is it sensible to spend doing tests that just tell you that equipment works the way it's claimed to, down to the obscure corner cases? In most environments this is like chasing many-nines reliability; after a while, you hit the point of rapidly diminishing returns and you have to start making assumptions.)

JumboFramesAndTesting written at 01:49:11

2010-05-12

'Borrowing' IPv4 netblocks to get around address space exhaustion

I recently read a news story speculating that a black market in IPv4 address space would develop as the IPv4 address space became exhausted (which is allegedly happening fairly rapidly). In reading the story, it struck me that we could see an even more interesting and evil trick used by sufficiently desperate and underhanded organizations: just borrowing unrouted netblocks.

The trick goes like this. First, find a suitably sized netblock that is allocated but appears unrouted; then, find yourself a compliant ISP and get them to 'accidentally' announce and route the netblock for you. Who is really going to notice? And if they do, your ISP can always claim that it was an accident, since people screw up routing announcements all the time anyways. You'll have to get a new netblock (or a new ISP or both), but this is better than not being on the Internet at all.

(See, for example, the Renesys blog, which has covered various hair-raising accidents. Sadly, this sort of netblock hijacking is already routine technology; the trick is used by spammers to completely hide their tracks.)

Whether this is a viable trick depends on how much allocated but unused network space there is. My impression is that there is a fair amount of network space that various organizations got back in the early and mid 1990s (when the rules were much easier) that is not actually in use on the public Internet, either because the organization is now defunct or because people are sitting on the allocated address space in case they need it later.

(After all, IPv4 address space is getting scarcer and scarcer; if you were smart enough to get a /24 for yourself back in 1990, would you let it go? My understanding is that ARIN has no way to claw back such old legacy allocations, although I may be wrong by now.)
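As an aside, checking whether a given address is currently covered by any announcement in the global BGP table is something you can script. Here is a sketch against Team Cymru's public IP-to-ASN whois service; I'm not promising anything about their exact output format, and 192.0.2.1 is documentation space standing in for whatever address you're curious about:

    #!/usr/bin/env python3
    # Sketch: ask Team Cymru's IP-to-ASN whois service about an address.
    # An address with no covering BGP announcement generally comes back
    # without an origin AS; treat the output format as theirs to change.
    import socket

    def cymru_lookup(ip):
        """Plain RFC 3912 whois query against whois.cymru.com."""
        with socket.create_connection(("whois.cymru.com", 43), timeout=10) as s:
            s.sendall((" -v %s\r\n" % ip).encode("ascii"))
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode("ascii", "replace")

    # 192.0.2.1 is RFC 5737 documentation space and should never be routed;
    # substitute an address inside the netblock you are wondering about.
    print(cymru_lookup("192.0.2.1"))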

Would this ever get done for real? I honestly don't know. I'd like to think that it wouldn't, but at the same time if the IPv4 address space does get exhausted, there are going to be some desperate people. Sooner or later there will be startups and small companies that care less about doing it right than doing it at all.

BorrowingIPv4Space written at 02:29:09

2010-05-03

Keeping track of filesystem consistency

In light of my last entry, here is an interesting question: when do you know that a filesystem is consistent, and how much work does it take for the system to keep track of this?

First off, there are some easy cases, namely filesystems with journaling, strong ordering guarantees, or copy on write properties.

In general, copy on write and journaling filesystems are supposed to be consistent all of the time unless the kernel has detected that something is wrong and flagged the filesystem as damaged. Instead of these approaches, some regular filesystems carefully order their updates so that they are always consistent or at least sufficiently close to it (so-called 'soft updates'). In all these cases, keeping track of the consistency itself is essentially free; the operating system mostly needs a flag in the filesystem to say that errors have been detected, and this will be rarely updated.

(Technically journaling filesystems are only consistent if you replay the journal after any crashes; if you look just at the filesystem state and ignore the journal, it may be inconsistent. This sometimes causes problems. The problem with soft updates is their complexity and also the need to clean up leaked space at some point, although there are promising ways around that.)

Once you get to regular traditional filesystems, things are much more difficult. The semi-traditional Unix view has been that filesystems are inherently inconsistent if they are mounted read-write; they are only (potentially) consistent if they were cleanly unmounted or mounted read-only. This has the virtue of being easy for the system to maintain.

You can do better than this, but it takes more work and in particular it takes more IO. The simple approach is to maintain a 'filesystem is consistent' flag in the filesystem; the operating system unsets this flag before it begins filesystem-changing IO and sets it again afterwards once things are quiet. However, this is going to happen a lot and each unset-and-set-again cycle adds two seeks to your IO operations, especially at the start (if the filesystem is marked consistent, you absolutely must mark it inconsistent and flush that to disk before you do the other write IO). This is a not insignificant amount of extra work in both code and IO, and adds latency in some situations, which is one reason why I don't believe that any Unix systems have ever tried to do this.
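A minimal sketch of the ordering that the flag approach requires, with an ordinary file standing in for the on-disk superblock flag (the path and function names are invented for illustration; no real filesystem does it this way at the file level):

    #!/usr/bin/env python3
    # Sketch of the 'filesystem is consistent' flag protocol.  The crucial
    # part is the ordering: the flag must be cleared and flushed to disk
    # *before* any metadata-changing IO starts, and only set again (and
    # flushed) once that IO has completely quiesced.  A plain file stands
    # in for the superblock flag here; this is illustration only.
    import os

    FLAG_PATH = "/var/run/fakefs.consistent"   # invented path

    def write_flag(consistent):
        fd = os.open(FLAG_PATH, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"1" if consistent else b"0")
            os.fsync(fd)   # force the flag to disk before continuing;
                           # this synchronous write is the extra cost
        finally:
            os.close(fd)

    def do_metadata_changing_io():
        pass               # stand-in for the actual write activity

    # Mark dirty, flush, do the work, quiesce, mark clean, flush.  Each
    # bracket costs two extra synchronous writes (and their seeks), which
    # is the latency the entry is talking about.
    write_flag(False)
    do_metadata_changing_io()
    write_flag(True)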

(I don't know if other operating systems have tried such schemes. These days I'd expect everyone to just implement a journaling filesystem.)

ConsistentFilesystems written at 01:45:19
