Wandering Thoughts archives

2010-04-25

An observation about Twitter (and Google)

Here is something I've noticed lately: when I want to look for traces of some recent or just-breaking piece of news, such as IPSCa's utter failure or important potential Solaris licensing changes, I no longer bother trying Google or other general search engines; instead, I go to Twitter and search there. And it works. There's a certain amount of noise in the results, but also a lot of signal, more than I could easily get otherwise.

Partly I think that this is because all of the existing blog search engines basically suck; they are some combination of incomplete (some to the point of being jokes), unusable, or overrun by spam. I was about to say that general search engines are mostly useless here, since I specifically want recent material, but at least Google now has options for restricting results to recent updates; sadly, a test search suggests that this is less than entirely useful for anything except very specific terms combined with ordering by date.

In large part, though, it is because the sort of activity that happens on Twitter, and how it's organized, fits well with what I want to learn, provided that it's something people tweet about. Twitter searching is inherently newest-first (at least right now), and the results (mostly) consist only of people actively writing about things, with little or no automated tweeting or spam. It has drawbacks, such as people not necessarily linking to primary sources of information, but right now those are small compared to Twitter's advantages. (And just knowing that people are talking about something can be useful information.)

I'm not certain that a good blog search engine could overcome Twitter's advantages. In fact, I sort of think that a good blog search engine these days would have to search Twitter too, because it seems that Twitter is increasingly where quick link-blogging goes and for my purposes I need to capture both people writing about the subject and people linking to stuff about the subject.

I'm sure that all of this says something about how information is shifting around on the Internet, but I'm not sure exactly what it says. (To the extent that it de-emphasizes blogs, I'm vaguely sad; I like blogs. But I think it's undeniable that common implementations are a bit heavyweight for short things and just noting links.)

TwitterThought written at 02:42:40

2010-04-13

The impact of single-disk slow writes on mirrors (and other RAID arrays)

We had a performance problem today that we fairly rapidly determined came from our mail spool filesystem somehow being too slow. This was pretty impressive, because our mail spool sits on a four-way mirror on our fileservers and we could easily see that the IO rate on its fileserver wasn't particularly large, well below what we knew things could manage.

(Using NFS with a single interface makes IO rate easy to measure directly, no matter how complex your IO system; just look at the network traffic volume.)
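
As a concrete illustration of that measurement, here is roughly how you could sample it on a Linux box by watching the interface byte counters. This is a minimal sketch under assumptions of mine (a Linux-style /proc/net/dev and an interface named eth0), not our actual tooling:

    import time

    def iface_bytes(iface):
        # Parse /proc/net/dev for the interface's RX and TX byte counters.
        with open("/proc/net/dev") as f:
            for line in f:
                name, _, rest = line.strip().partition(":")
                if name.strip() == iface:
                    fields = rest.split()
                    return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
        raise ValueError(f"no such interface: {iface}")

    # Sample the counters over an interval; on the fileserver, transmit
    # bytes are (roughly) reads served and receive bytes are writes,
    # since all of its IO crosses this one interface.
    rx1, tx1 = iface_bytes("eth0")
    time.sleep(10)
    rx2, tx2 = iface_bytes("eth0")
    print(f"reads ~{(tx2 - tx1) / 10 / 1e6:.1f} MB/s, "
          f"writes ~{(rx2 - rx1) / 10 / 1e6:.1f} MB/s")

The numbers include NFS protocol overhead, but for answering 'is the IO rate anywhere near what the disks can do', they're close enough.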

Intuitively, I have traditionally expected RAID mirrors to perform well even if one disk is busy or slower to respond than the others, so that as long as you had enough mirrors you could expect decent performance more or less no matter what happened. While this is true for reads, it is not true for writes; mirror writes wait on the slowest disk, because in a conventional mirror setup a write must hit all disks before it's considered complete. This means that a single disk with slow writes can drag down the (write) performance of an entire mirror array, no matter how wide the mirror is, because eventually all of your write traffic piles up waiting for that single disk to write things out.

(This happens immediately with synchronous write IO but can also happen with asynchronous writeback IO if your sustained write bandwidth is above what the slow disk can handle. You can also be hit with read slowdowns in some situations.)
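
To make the arithmetic concrete, here is a toy Python illustration of why a mirror write is only as fast as its slowest leg; all of the service times are made-up numbers, not measurements from our systems:

    import random

    def mirror_write_time(disk_service_times):
        # A write to an N-way mirror completes only when every leg has
        # acknowledged it, so its latency is the slowest leg's latency.
        return max(disk_service_times)

    # Three healthy disks plus one disk dragged down by unrelated IO
    # (all times hypothetical, in milliseconds).
    healthy = [random.uniform(5, 10) for _ in range(3)]
    slow = random.uniform(50, 200)

    print(f"healthy legs: {[round(t, 1) for t in healthy]} ms")
    print(f"slow leg: {slow:.1f} ms")
    print(f"mirror write: {mirror_write_time(healthy + [slow]):.1f} ms")

Adding more healthy legs to the mirror changes nothing; the max() is still dominated by the one slow disk.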

This is what happened to us today. Various programs make lockfiles in our mail spool, creating a modest amount of write traffic, and one particular disk was being hit by a significant write IO load from another source. It got just slow enough to start backlogging write traffic, and eventually everyone piled up on it in a massive (and surprising) traffic jam.

(This also happens with RAID 5 and RAID 6 arrays. A full-stripe write has to touch every disk, and even a partial write has to update parity, so a slow disk stalls any write whose data or parity lands on it.)

This isn't the first time that this has happened to us, and it can be relatively subtle; we've had filesystems that felt kind of slow, and the culprit turned out to be that one of the disks that they were mirrored to wasn't quite up to the write load that it was being asked to carry. Clearly this is something that I need to remember the next time we have an oddly slow mirror or RAID set.

(The problem can come and go, too, based on how much write load you're putting on the filesystem at the time.)

RAIDWriteImpact written at 00:56:20

2010-04-03

A DVCS advantage for open source development

Recently, an interesting advantage of DVCSes for open source development has occurred to me: their very nature means that whoever initially releases something as open source cannot really reverse that decision.

Suppose that you are a company that might want to retract and de-release something that has been released as open source. With traditional non-distributed version control, you could simply shut down the public source server for the project; while people who already had copies of the source base could in theory put together another public source server, it would be a moderate hassle and they'd lose the project history. With a DVCS, everyone has what you have and setting up a new public source server takes about as long as pushing the source to one of the hosting services such as GitHub.

This much is conventional, but DVCSes give you even stronger protection. Put succinctly, since DVCSes do not allow 'rewriting history', using one means that the project lead cannot commit things to the tree that destroy the history you already have copies of. With a non-DVCS, a clever company could commit things to the tree to wipe things out or otherwise destroy the source tree's usefulness and then wait for people to update; since you only have the current state of the tree, you could be stuck. With a DVCS, even if the project lead first commits total garbage over the tree state and then commits a removal of all files, you can roll your own repository back to before the damage. Even if the master repository is forcibly cleaned out, there is no way for the project lead to reach into your copy of the repository and destroy things.

(Well, in theory. Depending on the specific DVCS, it might be possible to do things like rewrite tags and branch labels, so that while you had the raw data it'd be hard for you to find it.)

All of this is because the entire design of modern DVCSes is about never rewriting or removing things that already exist. There is no way to retract or overwrite things, and while this is sometimes inconvenient and problematic, it does give you a fairly strong immunity from the project lead changing their mind or going crazy.
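
As a toy illustration of that design, here is a deliberately simplified Python model of a hash-chained history. Real DVCSes hash much more than this (file trees, authors, dates), so treat it as a sketch of the idea rather than any system's actual format:

    import hashlib

    def commit_id(parent_id, message):
        # As in real DVCSes, a commit's identity is a hash over its
        # contents and its parent's id; history is a chain of hashes.
        return hashlib.sha1((parent_id + "\n" + message).encode()).hexdigest()

    root = commit_id("", "initial import")
    good = commit_id(root, "last useful state")

    # A destructive commit can only be *appended* to the chain; it
    # cannot alter the ids (or the data behind them) of commits you
    # already hold in your own clone.
    garbage = commit_id(good, "delete all files")

    # Recovery in your clone is just pointing your branch back at the
    # old id; 'good' is untouched by anything committed after it.
    my_branch = garbage
    my_branch = good
    assert my_branch == commit_id(root, "last useful state")

Because every commit's id depends on its parent's id, 'changing' old history can only produce new ids; the commits you already hold keep theirs, which is exactly why your copy survives.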

As an aside, this leads me to feel that the really important thing is thus not the project's source repository but the project's communications infrastructure: its website, its forums, its mailing lists. If all of those were shut down abruptly, sure, you could put up a new master source repository, but how would you tell people where to find it and then get back in touch with the developers? The mailing list you'd use to do so has also been shut down, and you probably don't have the subscriber list to put together a new version.

(If you've kept a local archive of the project mailing lists, you can at least assemble the email addresses of recent or frequent posters; that would help get the word out. Otherwise, well, you'd have to spread the word however you can, although I suppose it's common practice to have usable email addresses in DVCS commits, so you could mine those for developer addresses.)

(I doubt I'm the first person to notice this.)

DVCSAndRetraction written at 02:29:28

