Wandering Thoughts archives

2008-07-23

Retracting blog entries in the face of syndication feeds

Suppose that you have accidentally published a blog entry that you really didn't want to, and now you want to retract, unpublish, disappear, or otherwise remove it (pick your synonym of choice). You could just delete the entry, but this has a problem: your syndication feeds (RSS, Atom, et al).

Specifically, removing an entry from your blog and thus your feed doesn't remove it for people who've already fetched a version of your feed that included the entry. Feed readers keep their own copy of entries that they've seen (up to whatever expiry limit the user has set) and no common feed reader will remove an entry just because it's disappeared from a feed, because entries disappear from feeds all the time (since a feed only contains the N most recent entries).

(Feed readers could be coded to notice that an entry is missing at the front or between two others, instead of at the end, but that would take extra effort. And as far as I know there is no marker in any of the syndication formats for 'remove this entry now'.)

If you are quite fast and very lucky, you can catch the mistake and remove the entry before anyone has pulled a version of your blog's syndication feed that has the entry in it. But you are probably not that fast, especially if you have a popular blog and thus people are pulling your syndication feed all the time.

However, you can take advantage of another feed reader feature: if you change the contents of an entry, pretty much every feed reader will update its copy of the entry with the new contents. So instead of removing the retracted entry, replace its contents with 'this entry has been retracted' or the like. If there are other parts of the entry that need similar retraction (the title, for example), do the same thing with them.

(In theory you could update the entry to have an empty contents and title, but I think that 'this entry has been retracted' looks better to your readers and runs less risk of feed readers deciding that something clearly has gone wrong with your feed and thus they aren't going to update their copy.)
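
To make this concrete, here is a minimal sketch of doing the replacement directly on a static Atom feed file with Python's standard library; the file name and entry ID are invented for the example, and real blog software would of course do this through its own storage and templates. Bumping <updated> along with the contents is what tells feed readers the entry has changed.

    # Minimal sketch: replace an Atom entry's title and content with a
    # retraction notice and bump its <updated> timestamp so that feed
    # readers refresh their stored copy.  The feed file and entry id
    # below are made-up examples; this also assumes plain text contents
    # (entries with XHTML children would need those children removed too).
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    ATOM = "http://www.w3.org/2005/Atom"
    ET.register_namespace("", ATOM)

    def retract_entry(feed_path, entry_id,
                      notice="This entry has been retracted."):
        tree = ET.parse(feed_path)
        for entry in tree.getroot().findall(f"{{{ATOM}}}entry"):
            eid = entry.find(f"{{{ATOM}}}id")
            if eid is None or eid.text != entry_id:
                continue
            for tag in ("title", "summary", "content"):
                elem = entry.find(f"{{{ATOM}}}{tag}")
                if elem is not None:
                    elem.text = notice
            updated = entry.find(f"{{{ATOM}}}updated")
            if updated is not None:
                updated.text = datetime.now(timezone.utc).isoformat()
        tree.write(feed_path, xml_declaration=True, encoding="utf-8")

    retract_entry("feed.atom", "tag:example.org,2008:/blog/oops-entry")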

Ideally there would be a way of publishing an entry just in your feed, so your main blog pages don't have a 'this entry has been retracted' entry. I suspect most blog software doesn't support this, so what you can do is first update the retracted entry with the retraction notice, then wait a day or so for everyone's feed readers to pull this updated version, and then remove the entry entirely.

RetractionAndSyndication written at 00:55:16

2008-07-16

The problem with Usenet

There's recently been a little bit of fuss in the news about the New York Attorney General getting various ISPs to turn off the alt.* hierarchy on their Usenet servers, with a number of people wondering why the ISPs were willing to cave in to this sort of pressure. My guess, based on what I remember from being fairly involved with a Usenet server up until a couple of years ago, is that ISPs were probably just as happy to have a good excuse to cut back on their news servers.

From an ISP's perspective, Usenet has a number of problems:

  • almost all of your customers don't use it.
  • almost all of the customers that do use it are only interested in the binary newsgroups.
  • the binary newsgroups have a huge daily volume. I believe that the last time I heard figures, several years ago, alt.binaries.* was around 500 GB a day and still growing.

    (The Wikipedia Usenet entry cites a figure of 3.8 TB per day as of this April, which is frankly scary.)

  • due to the volume, running a news server that gets a reliable full feed and keeps it for any length of time is quite demanding, as well as being reasonably expensive in hardware and bandwidth.

    (You need a reliable full feed because binaries are generally posted in multiple articles; if your server misses or drops one, the entire sequence is useless.)

  • adding to the problem, much of the volume is various sorts of spam that your customers definitely do not want cluttering up their nice binary newsgroups. Maintaining decent spam filters takes work and expertise.

(Also, allegedly the users that do use your Usenet server will complain vociferously if you don't do an excellent job of carrying a full feed.)

Given all this, it's no surprise that ISPs have been getting out of the Usenet game for some time, bit by bit; many have quietly outsourced it to specialist providers like Supernews or simply stopped providing Usenet entirely. This latest development is merely noisier than usual. Removing all of the alt.* hierarchy is a bit of overkill, but it's probably not as if the ISPs care very much, and handling alt.* has its own set of headaches.

Sidebar: why dropping alt.binaries.* is actually effective

Normally 'censorship' of this nature is ineffective, as the targeted group just relocates to some other forum, in this case some other newsgroup, and resumes their old behavior. This doesn't work on Usenet because it is very easy to recognize and drop binary postings in non-binary newsgroups, and tools to do it have been commonly available for over ten years now.

(One such tool is cleanfeed.)
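
To give a rough sense of why this is easy, here is a toy Python sketch of the kind of heuristic involved; this is not how cleanfeed actually works, just an illustration that binary posts are trivial to spot.

    # Toy heuristic: binary posts are mostly yEnc/uuencode headers or
    # long runs of densely encoded lines, so they stand out immediately.
    # This is an illustration only, not what cleanfeed or any real
    # filter actually does.
    import re

    ENCODED_LINE = re.compile(r"^[A-Za-z0-9+/=]{60,}$")      # base64-ish line
    BINARY_HEADER = re.compile(r"^(=ybegin |begin \d{3} )")  # yEnc / uuencode

    def looks_like_binary(article_body):
        lines = article_body.splitlines()
        if not lines:
            return False
        if any(BINARY_HEADER.match(line) for line in lines):
            return True
        encoded = sum(1 for line in lines if ENCODED_LINE.match(line))
        # A body that is almost entirely long encoded lines is a binary.
        return encoded > 50 and encoded / len(lines) > 0.8

    # A filter would then reject articles where this is true in
    # newsgroups that are not supposed to carry binaries.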

UsenetProblem written at 01:00:50

2008-07-13

The problem with big RAID-5 arrays

Let's start by talking about drive failure modes and how they're measured. Hard drives have both an MTTF (mean time to failure), which measures how often they fail completely, and a UER (unrecoverable error rate), the rate at which they report an unreadable sector. Drive MTTF is expressed in hours, but drive UER is expressed as a function of how much data is read (technically as 'errors per bit read').

(Typical consumer drive UER is apparently 1 per 10^14 bits read; 'enterprise' disks improve this to 1 per 10^15 bits read. This is of course an average figure, just like the MTTF; you can be lucky, and you can be unlucky.)

The looming problem with big RAID-5 sets is that UER has stayed constant as drive sizes have increased, which means the odds of an unrecoverable read error when you read the entire drive keep rising. Or to put it another way, the odds of an error depend only on how much data you read; the more data you read, the higher the odds.

Where this matters is when a drive in your big RAID-5 set fails. Now you need to reconstruct the array onto a spare drive, which means that you must read all of the data on all of the remaining drives. As you have more and more and larger and larger drives, the chance of an unrecoverable read error during reconstruction becomes significant. If you are lucky, your RAID-5 array will report an unreadable sector or stripe when this happens; if you are unlucky, the software will declare the entire array dead.

(To put some actual numbers on this, a UER of 1e-14 errors per bit read means that you can expect on average one error for every 12 or so terabytes read (assuming that I am doing the math right, and I am rounding a bit). This is uncomfortably close to the usable size of modern RAID-5 arrays.)
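
To make the arithmetic explicit, here is the back-of-the-envelope version in Python; the numbers are the same ones quoted above.

    # How much data you can read, on average, before hitting one
    # unrecoverable read error at a given UER (errors per bit read).
    def tb_per_error(uer):
        bits_per_error = 1.0 / uer
        bytes_per_error = bits_per_error / 8
        return bytes_per_error / 1e12          # decimal terabytes

    print(tb_per_error(1e-14))   # consumer drives:     12.5 TB
    print(tb_per_error(1e-15))   # 'enterprise' drives: 125 TB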

The easiest way to deal with this issue is to go to RAID-6, because RAID-6 can recover from a read failure even after you lose a single disk. To lose data, you would need to either lose two disks and have a read failure during the subsequent reconstruction, or lose one disk and have two read failures in the same stripe, which is pretty unlikely. Otherwise, you need to keep your RAID-5 arrays small enough that the chance of a UER during reconstruction is sufficiently low. Unfortunately, as raw disk sizes grow larger and larger this means using fewer and fewer disks, which raises the RAID overhead.
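
To get a rough feel for how array size matters, you can estimate the chance of hitting at least one unrecoverable read error during a RAID-5 reconstruction like this; it assumes independent errors and exactly the quoted UER, both of which are simplifications, and the disk size is just an example.

    # Rough estimate: during a RAID-5 rebuild you must read every bit on
    # every surviving disk, so the chance of at least one unrecoverable
    # read error grows with the number (and size) of disks.  Assumes
    # independent errors, which is a simplification.
    UER = 1e-14          # errors per bit read (consumer drives)
    DISK_TB = 1.0        # example disk size, in decimal TB

    def rebuild_error_chance(surviving_disks, disk_tb=DISK_TB, uer=UER):
        bits_to_read = surviving_disks * disk_tb * 1e12 * 8
        return 1 - (1 - uer) ** bits_to_read

    for disks in (4, 8, 12):     # total disks in the RAID-5 set
        chance = rebuild_error_chance(disks - 1)
        print(disks, "disks:", round(chance, 2))
    # Bigger sets read more data per rebuild, so plain RAID-5 gets
    # riskier as you add (or enlarge) disks; RAID-6 survives one such
    # error because it still has a second parity to reconstruct from.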

(Disclaimer: I learned all of this from discussions on the ZFS mailing list and associated readings; I am writing it down here partly to make sure that I have it all straight in my head. See eg here and here for some interesting reading.)

BigRAID5Problem written at 00:41:43

2008-07-12

When overlapping windows do (and don't) make sense

The more I think about it, the more I think that there are a number of conditions that need to hold in order for overlapping windows to make sense (or at least to be attractive). I also think that these conditions are not necessarily all that prevalent, which may go a long way towards explaining why tabs and maximized windows are so popular.

First, small displays lead to maximized windows (and thus tabs) because there isn't enough space to go around. This isn't just resolution, it is also the physical size of the display, since you need windows that are big enough to read no matter how many (or few) pixels are involved.

Once you have physical space and enough pixels to do something useful with them, you run into a paradox:

The best way to use overlapping windows is to arrange them so that they don't overlap.

It's rare that you need to see only part of a window, which is the only time that significant overlap is useful; at other times you might as well iconify the mostly covered up windows. So to be happy with overlapping windows, you need to have enough space and to set up an organizing scheme so that they don't usually overlap. If you don't, overlapping windows will just irritate you (especially if your window system auto-places them in bad ways), and I suspect that this is one reason that many people don't like and don't use overlapping windows very much.

Or in short: the best way to use overlapping windows is to tile them most of the time, and overlap them only occasionally when you really need it.

Unfortunately, setting up an organizing scheme for this is a lot of work. Not only do you have to figure out where you want things to go and how to fit everything in, but you also have to carefully make sure that all of your windows are the right size and stay the right size, often without very good tools for it.

(I have an entire set of infrastructure to make sure that my terminal windows come out right and for resizing things into the sizes that I have determined tile together the way I want them to. As a result I am fanatical about things like the width of my browser windows, because if they grow too big my tiling scheme breaks down.)
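
For illustration only, here is a sketch of forcing windows into a fixed tiling layout with wmctrl; my own infrastructure is homegrown and works differently, and the window titles and geometries below are invented for the example.

    # Illustration only: push specific windows into a precomputed tiling
    # layout using wmctrl.  The titles and geometries here are invented;
    # a real scheme would use whatever sizes actually tile on your display.
    import subprocess

    # (x, y, width, height) slots that are known to tile together.
    LAYOUT = {
        "main-terminal": (0, 0, 724, 984),
        "side-terminal": (732, 0, 564, 484),
        "browser": (732, 492, 564, 492),
    }

    def place(title, geometry):
        x, y, w, h = geometry
        # wmctrl -r <title> -e gravity,x,y,width,height
        subprocess.run(["wmctrl", "-r", title, "-e", f"0,{x},{y},{w},{h}"],
                       check=False)

    for title, geometry in LAYOUT.items():
        place(title, geometry)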

OverlappingWindowsThoughts written at 01:08:28

2008-07-10

Internet software decays and must be actively maintained

While it is true that in general software doesn't wear out, this is not true for software that interacts with the Internet. Internet software definitely breaks down over time, not because of any analog to mechanical wear but because the Internet environment it operates in keeps changing.

There are three sorts of changes in the Internet environment that matter. First, protocols and other programs that interact with you evolve over time. Second, user expectations also evolve. Third and most importantly, the Internet threat environment itself keeps changing, and not just in that people keep coming up with new attacks on your software; over time there are entirely new sorts of bad things that people try to do.

Whether the first two sorts of changes mean that your software is broken is a matter that people debate, but at a minimum they are likely to make your software old-fashioned and unpopular (at worst they make it unusable). But I maintain that there is no argument about the third sort of change; at best your software becomes less and less usable, at worst it becomes part of the problem.

This means that software that deals with the Internet must be actively maintained. If it is not, it will become more and more useless in practice over time, however much it remains theoretically correct; not because it has bugs or security holes as such, but because the environment it was designed to work in no longer exists and thus the assumptions it was built on are now incorrect.

(Of course, typical software also has actual bugs that will be discovered over time. But let's imagine some software that is either quite simple or thoroughly debugged already.)

InternetSoftwareDecay written at 01:11:35

2008-07-09

Detailed usage charges versus simpler charging models

If you are going to charge for things, one of the eternal questions is whether you should use detailed usage charges or simpler, 'flatter' ones. My personal thoughts run something like this:

If you have a fixed service that you're expanding slowly (if at all), detailed charges theoretically extract the maximum value for it. However, they do so at the cost of driving away users who don't like the unpredictability of their bills, and every so often you will have something go horribly wrong that sticks a user with an absurd bill (and sticks you with bad publicity if you try to make them pay it).

(In some situations this discouragement is fine, because you don't really want people to use the service in the first place and you're only offering it because of a historical accident. This is more or less how one part of the university wound up having probably the highest commercial timesharing rates in the world, before we got management approval to end the service entirely.)

Flatter rates give users more predictability, and there's lots of evidence that they attract people as a result. Under suitable assumptions they also simplify your capacity planning, and thus your growth, because you basically sell capacity in advance; if you get a sudden surge of subscriptions, you can easily see how much more capacity you need. However, I think that overselling your capacity is easier (and certainly more common) with simpler charges, partly because new customers are being actively promised something and partly because you can get caught out if people's usage patterns change.

(For instance, our simple charges for disk space are partly based on how much backing things up costs us, which depends on how fast backed up data changes. If people started buying terabytes of backed up disk space that changed every day, we would have problems.)

PickingChargingModels written at 00:41:45

2008-07-05

How OOXML is a complete failure, even for Microsoft

There are two possible uses for Microsoft's attempt to make OOXML into an international standard: documenting Microsoft Office's file formats so that third parties can more easily write programs that deal with them, and letting Microsoft claim with a relatively straight face that Microsoft Office uses a standardized file format instead of a proprietary one.

As lots of people have pointed out at great length, examination of the draft OOXML specification shows that it does not actually document the Microsoft Office document formats in enough detail for anyone to use it as a reference for reading or writing documents. In practice, it at most gives you a leg up and perhaps points out where you are going to have to reverse engineer whatever the relevant Microsoft products are doing. Thus the first use is a failure.

The problem with the second use is that the ISO process introduced changes from the OOXML draft that Microsoft submitted (Tim Bray covers this here). Since Microsoft released Office 2007 well before the ISO process completed, Office 2007 reads and writes 'draft OOXML', not 'ISO OOXML' (the international standard), and Microsoft can no longer claim to use a standardized file format.

Result: total failure; ISO OOXML is both useless to third parties and useless to Microsoft itself. All that it managed to do was make a lot of people think rather badly of Microsoft (again), which does not help with one of their problems.

(It's so useless to Microsoft that Microsoft Office will add support for ODF, the competing open document format, well before it reads and writes ISO OOXML.)

(This is not a new insight; I just feel like writing it all down in one place.)

OOXMLFailure written at 00:25:13

