Wandering Thoughts archives

2006-03-31

The perfection trap: a lesson drawn from Worse is Better

I've mentioned Richard Gabriel's famous The Rise of "Worse is Better" before (back here), but only recently did one of its important lessons coalesce in my thoughts.

"Worse is Better" contrasts what Gabriel calls the MIT approach, with the cornerstone of 'the right thing', against the New Jersey approach, with a 'worse is better' minimalism (this is a simplified summary). Gabriel argued that despite its flaws, the New Jersey approach had significantly better survival characteristics than the MIT approach, for reasons he described and I'm not going to try to repeat here.

The MIT versus New Jersey divide can be portrayed as a choice between the right thing and a so-so thing that's maybe good enough. When you put it that way, a lot of people will naturally tilt towards the MIT approach; who doesn't want to do the right thing? But this is wrong.

The perfect is the enemy of the good. - Voltaire

In reality it's not actually a choice between right and worse; it's really a choice between nothing, worse, and right. And over and over, aiming for right has been an excellent way to wind up with nothing (for reasons that Richard Gabriel outlines nicely).

The easiest place to see this is computer security, where insistence on perfection (or some excellent approximation) is one of the holy tenets. As a result we have a few very secure systems and a lot of almost completely unsecured ones.

PerfectionTrap written at 17:05:09

The difficulty of punishing people at universities

One of the quiet little secrets of university computing is just how difficult it is to actually punish people for doing bad stuff with computing resources. Really bad stuff, things that are criminal or have serious civil liabilities, can be punished. But mere violations of policies or bad network behavior (including spamming) can run into a series of problems.

Tenured professors might as well be the left hand of God, of course, especially if they bring in grant money. But even students (grad and undergrad both) are heavily protected, because many universities have strict policies on imposing 'academic sanctions'; these almost always cover not just directly docking marks but also taking away anything a student needs to pass the course. That makes removing computer access an academic sanction in many cases, subject to all of those requirements and elaborate procedures.

Staff are theoretically the least protected, except that removing someone's computing access often makes them unable to do their job, which is not popular (to say the least) with their management chain. This can result in the only real options being either a slap on the wrist or a firing, and firings are often a hard sell (and often require their own large set of procedures, time, and repeated incidents).

(To be fair, the staff issue is probably the same for companies.)

This isn't to say that stern computing policies and AUPs aren't useful; if nothing else they can be used to scare people. But for some time I've wondered what we'd be able to do if, for example, someone started spamming for a religion and showed no inclination to stop.

(The more likely scenario is probably an undergrad who likes poking things with sticks; there is certainly no shortage of places to irritate and troll on the Internet.)

UniversityPunishmentProblem written at 01:34:37

2006-03-23

Atom versus RSS

David Heinemeier Hansson in the comments on one of his entries:

Joe: Atom is just RSS without the bugs. [...]

What he said.

The more I've learned about syndication formats, the more thankful I've been that I picked Atom way back when. (I'm not sure why I chose Atom; possibly because it seemed the more up to date of the choices at the time, since it had an RFC in development and all.)

The difference between Atom and RSS is that Atom has a real specification, one good enough that people actually write to it. So I can use the specification to write a useful feed generator, which is pretty much what I did for DWiki (with some help from the feed validator).

With RSS, the formal spec is unclear and incomplete, so in practice RSS is defined by what popular feeds and feed readers do. This has led to various problems and dark corners, where sometimes nothing you can do is going to work for everyone. (I'll stop footnoting that now; I could go on all day.)

In my opinion, voodoo programming is no way to run a railroad. So I am really glad I went with Atom; I would probably have found dealing with all of the RSS issues to be teeth-grindingly frustrating.

(The ongoing RSS soap opera doesn't help either, but mostly it makes me glad I am nowhere near the blast radius.)

AtomVsRSS written at 03:34:20

2006-03-01

Unicode is not simple

Unicode is very big these days, and there are a lot of people who will tell you that Unicode is simple once you take the effort to understand it and that you're a parochial spud if your new program doesn't support it. Unfortunately, only simple Unicode is easy.

The simple vision of Unicode is that once you have your data in Unicode, you're just dealing with characters and your program can just do stuff with them as normal. Okay, case and collation are hard, but there are OS and library services for that, and rendering has some interesting challenges, but again the OS people have done that for you.

The problem is that this simple vision isn't true. It would be true if there were a one-to-one mapping between Unicode codepoints and glyphs, but at least three things I know about get in the way of that:

  1. zero-width formatting characters, which mean some codepoints aren't glyphs.
  2. combining characters, which mean that there are multiple ways to make the same glyph.
  3. Han unification, which means some codepoints have to display significantly different glyphs depending on what language the text is in.

(In Unicode terminology, a codepoint is a single numbered position in Unicode, like U+0061; a character is the abstract thing that this codepoint represents, like LATIN SMALL LETTER A; and a glyph is the (abstract) visual representation of a character, like 'a'.)
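Python's standard `unicodedata` module makes the codepoint/character distinction concrete; as a small illustrative sketch (the glyph itself is, of course, whatever your font draws):

```python
import unicodedata

cp = "\u0061"
# The codepoint is the number; the character is the abstract thing it names.
print(hex(ord(cp)))          # 0x61
print(unicodedata.name(cp))  # LATIN SMALL LETTER A
```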

Zero width formatting characters are things like zero width spaces or text direction markers. They make it harder to divide up or truncate words (you'll get very funny results in some cases) and easier to create strings that are represented differently but look the same to users.
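A tiny demonstration of that last point: a zero-width space (U+200B) renders invisibly, so two strings can look identical to a user while comparing unequal.

```python
# U+200B is ZERO WIDTH SPACE; it displays as nothing at all.
plain = "naive"
sneaky = "na\u200bive"
print(plain == sneaky)           # False
print(len(plain), len(sneaky))   # 5 6
```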

Some glyphs are represented not by a single Unicode character but by a base character plus combining characters such as accents. Many common accented glyphs can be represented more than one way; there is a 'precomposed character' for them, plus one or more decomposed forms. For example, å is both U+00E5 and U+0061 plus U+030A.

Among other things, this means that a correct program must normalize Unicode strings before comparing them, using its choice of four different normalization forms (see, for example, here; read the comments, they're informative). Also, your code can't just blindly lop codepoints off words to do stuff; if you do, you can turn Årne Svensen into A. Svensen, and he may not be too pleased with that.

(See also the Unicode normalization FAQ or Markus Kuhn's FAQ.)
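Both problems are easy to see in Python with `unicodedata.normalize` (which supports all four forms: NFC, NFD, NFKC, and NFKD); here is a small sketch of the å case and of the blind-truncation trap:

```python
import unicodedata

precomposed = "\u00e5"   # å as a single codepoint
decomposed = "a\u030a"   # 'a' plus COMBINING RING ABOVE

# The two spellings look identical but compare unequal until normalized.
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# Blindly lopping off one codepoint from the decomposed form of 'Årne'
# leaves a bare 'A': the combining ring is a separate codepoint.
name = "A\u030arne"
print(name[:1])   # 'A', not 'Å'
```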

The really troublesome one is Han unification. As part of Han unification, the same codepoint was assigned to the same logical character in the CJK languages, even if different languages used somewhat different glyphs for the character. For example, U+8349 is the 'grass' character for all the CJK languages, but Traditional Chinese uses a different glyph for it. Thus, to properly display something that includes U+8349 to users, you must know what language that section of text is written in. This isn't just a theoretical issue; this LiveJournal entry shows the sort of things that do happen in the real world because the language of Unicode text isn't marked.

(Pop quiz: which version of 草 is shown here for you?)

The really dangerous thing about the 'Unicode is simple' meme and simple Unicode handling is that it usually works, especially in Europe. Most of the time you will get Unicode that uses precomposed characters instead of combining characters, or at least has the combining characters in a normalized form. Most of the time Peter Påderson's system will encode his name the same way. Most of the time you will be dealing with monolingual text in the user's own locale, where Han unification issues won't matter. Most of the time you won't get input with deliberately introduced zero width formatting characters.

Most of the time.

Most of the time is no way to tell people to program.

Sidebar: More on Han unification

Han unification is (and was) a politically charged thing, especially as Taiwan and the PRC use different character sets (Traditional versus Simplified Chinese). There is plenty of further information and argument on the whole issue.

UnicodeIsNotSimple written at 17:32:12

