On tabs

July 5, 2006

I rarely have violent visceral reactions to things. But I have a few hot buttons, and here's one of them. Quoting from Joel Spolsky's summary:

Nick Gravgaard: "Rather than saying that a tab character (a "hard tab") will move the cursor until the cursor's position is a multiple of N characters, we should say that a tab character is a delimiter between table cells..."

Wrong. Wrong wrong wrong. The core wrongness is visible in a single line in the original:

The solution then is to redefine how tabs are interpreted by the text editor.

The problem is that it's not just your text editor that matters.

Actual real tab characters have a well-defined meaning, enforced by terminal emulators, pagers like less, many editors, your browser (in <pre> text), and programs that print things out. If you decide, like Humpty Dumpty, to make Control-I mean something else for you, you have adopted an entirely quixotic quest and I want you nowhere near anything I ever have to read.
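
That well-defined meaning is easy to see concretely. As a minimal sketch, the POSIX expand(1) tool performs the same every-8-columns expansion that terminals and pagers do:

```shell
# Standard hard-tab behaviour: each 0x09 advances output to the next
# multiple-of-8 column. expand(1) rewrites the tabs as the spaces
# they stand for, exactly as a terminal would display them.
printf 'a\tbb\tccc\n' | expand
# 'a' is padded out to column 8 and 'bb' to column 16, regardless of
# how much text precedes each tab.
```

Every standard tool that touches the text agrees on those column positions, which is exactly why one editor cannot unilaterally redefine them.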

(Because I certainly do not want to have to use your editor, assuming it is even available on my platform, to read your text in a way that makes it look decent and comprehensible.)

This need not cramp anyone's creativity, since I don't care what the tab key on your keyboard does and I don't care what mixture of characters your editor uses to embody your favorite indentation (provided it agrees with the rest of the world on what a Control-I character does). This leaves a great deal of room in the user interface, as GNU Emacs has been demonstrating for years.

(If you want to be able to edit Unix Makefiles, your editor had better be able to create and preserve real tab characters. Arguably this was a mistake way back in V7, but we're certainly stuck with it now.)
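
To make the Makefile point concrete, here is a small sketch (the file name Makefile.demo is just for illustration) showing that a recipe line begins with a literal 0x09 byte, not with spaces:

```shell
# make(1) distinguishes recipe lines by a leading hard tab (0x09); an
# editor that quietly converts that tab to spaces produces make's
# infamous "missing separator" error. Makefile.demo is a throwaway
# name for this sketch.
printf 'all:\n\techo ok\n' > Makefile.demo
# Count the lines that begin with a real tab character: there is one.
grep -c "^$(printf '\t')" Makefile.demo
```

An editor that cannot create and preserve that byte cannot write a working Makefile at all.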

From this, you may gather that my opinion of vi's tabstop setting and the corresponding GNU Emacs tab-width variable is rather low.

Comments on this page:

From at 2011-02-14 01:24:58:

How about if we use Gravgaard's elastic tabstops idea, but using character 0x1f (unit separator) instead of 0x09 (horizontal tab)? For the sake of consistency we could then use 0x1e (record separator) instead of 0x0a (line feed), too. And 0x1d (group separator) makes more sense than the 0x0a0a (pair of line feeds) kludge currently used to separate function definitions in source code. This also sidesteps the 0x0a vs. 0x0d (carriage return) vs. 0x0d0a issue. The visual formatting markup is then fully separated from the structural markup. In source code files, the visual formatting markup characters can be omitted entirely, since they're no longer necessary.
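
As a rough sketch of what this proposal would look like in practice (purely illustrative; no stock tool displays such files sensibly), awk can already split records on 0x1e and fields on 0x1f using octal escapes:

```shell
# Hypothetical illustration of the comment's scheme: 0x1f (unit
# separator, octal \037) delimits cells and 0x1e (record separator,
# octal \036) delimits records. awk can parse such a stream once told
# about the separators, but ordinary line-oriented tools cannot.
printf 'name\037value\036foo\037bar\036' |
  awk 'BEGIN { RS = "\036"; FS = "\037" } NF { print $1 "=" $2 }'
# prints:
# name=value
# foo=bar
```
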

By cks at 2011-02-14 10:54:43:

All of this shuffling of characters ensures that your program text is unreadable in anything but tools that understand the meaning you assign to all of these special characters. If you want to go that far, you might as well abandon plain text as the storage format for program code.

(If you are going to do that, I think that you should also go with a tool-enforced indentation and layout standard, the way that Go has.)

From at 2011-02-14 20:58:37:

It isn't a meaning that I alone am assigning to those characters, any more than tab is a meaning that you alone are assigning to 0x09. They're the ASCII standard meanings, so using them no more requires abandoning plain text (ASCII) as the storage format for program code than does using 0x41 for the letter A.

Of course, using ASCII does mean that program text will be unreadable in anything but tools that understand ASCII. :)

OK, more seriously: your original post addressed only the conflict of Gravgaard's idea with 0x09, but didn't address the actual problem that his idea is intended to solve. Is there a problem, or isn't there? Assuming you agree that there's a problem, your post at http://utcc.utoronto.ca/~cks/space/blog/unix/UnixFossilizationBad suggests that you'd agree the problem should be solved. Other than the conflict with 0x09 (which can easily be avoided by using 0x1f instead, since nobody uses 0x1f for anything), what do you think of Gravgaard's solution? Yes, solving the problem requires enhancing our tools, but that's not surprising. Do you think it's possible to solve this problem without enhancing our tools?

Even though the conflict with 0x09 is avoided, unenhanced tools will still fail to properly display text which uses 0x1f. Old versions of Solaris which don't support ZFS will also fail to properly list files from ZFS filesystems. Does this mean that ZFS never should have been invented, because old Solaris can't understand it? Of course, the difference is that old Solaris knows that it won't understand it, so it doesn't try. Since it doesn't try, it doesn't screw up.

But unenhanced text processing tools will try to read files containing 0x1f, and they'll screw up, because they don't know that they won't understand it, because unlike the situation with disk headers, Fossilized Unix uses no type information to show whether a "plain text" file is really ASCII, or UTF-8, or Windows-1252, or ASCII-with-Gravgaard's-elastic-tabs. So, what do you really propose? Remaining in the fossilized ghetto of untyped text files, so that it's impossible to solve other problems either, such as Gravgaard's problem, without screwing stuff up?

Consider the following solution: every text file's name should have the character 0x2e appended, followed by the name of the type. We'll have text files named "foo.ascii", "foo.utf8", and "foo.etabs". We'll have HTML files named "foo.html.utf8", and compressed ones named "foo.html.utf8.gz". If a tool is told to process a file which it doesn't understand, it'll just say it doesn't understand, rather than screw up. Old tools which fail to honor type names will blithely read all types of files and screw up, so they'll have to be fixed, but then again, if old Solaris ignored disk headers and blithely tried to mount ZFS filesystems as UFS then it'd have to be fixed too.
Unix already uses this solution sometimes. On Debian:
$ touch foo
$ gzip foo
$ mv foo.gz foo.bar
$ gzip -d foo.bar
gzip: foo.bar: unknown suffix -- ignored

One issue with this solution is that filenames aren't passed with piped data, so if typenames are part of filenames, they'll be missing from piped data. The solution is to put the typename at the beginning of the file instead of at the end of the filename.
Unix already uses this solution sometimes. On Debian:
$ file foo.bar
foo.bar: gzip compressed data, blahblahblah
$ echo -en "\x1f\x8b" | file -
/dev/stdin: gzip compressed data

Using no type information is the worst thing to do. Typename extensions to filenames are second-best. Typename-prefixed files are best. Fossilized Unix does all three, in various cases. Fix it to always and only use the third (and list typenames at the beginning of files properly for nested types like foo.html.utf8.gz), and Gravgaard's problem can then be fixed without screwing anything up.
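
The "typename at the beginning of the file" idea is, of course, just magic numbers, which can be sketched with gzip's own two-byte signature. Unlike a filename suffix, it survives a trip through a pipe:

```shell
# Type information at the front of a stream survives pipes, unlike a
# filename suffix. Every gzip stream begins with the magic bytes
# 0x1f 0x8b (per RFC 1952), which is what file(1) keys on in the
# transcript above.
printf 'hello\n' | gzip | head -c 2 | od -An -tx1
# od shows the two magic bytes: 1f 8b
```
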

BTW, can we please get rid of all the octal in /usr/share/file/magic?

By cks at 2011-02-14 22:42:09:

You've got a choice: you can be compatible with existing tools, or you can be incompatible. If you're going to be compatible, 0x09 has to mean what it means today. If you're going to be incompatible, I think that you might as well go all the way and give up entirely on keeping code in more or less unstructured plain text.

(There are some people who maintain that you can redefine 0x09 and still be compatible with existing tools. They are wrong; there is very little practical difference between the source code looking like unreadable hell and the source code being completely undisplayable.)
