How Usenet used to be a filesystem stress test

September 25, 2010

Once upon a time, there was Usenet. Wait, that's not far enough back. Once upon a time, Usenet software used the simplest, most straightforward way to store articles. Each newsgroup was a separate directory in the obvious directory hierarchy (so rec.arts.anime.misc was rec/arts/anime/misc under the news spool root directory) and each article was a file in that directory. Cross-posted articles were hardlinked between all of the newsgroup directories.

(Given that hardlinks can't cross filesystem boundaries, you may notice an assumption here. Yes, this caused problems in the not too long run.)

Once Usenet started having much volume, this design turned Usenet spool filesystems into a marvelous (or hideous) worst case stress test for filesystem code:

  • active newsgroups might have tens of thousands of articles, which meant tens of thousands of entries in a single directory. At the time when this started happening, all filesystems used linear searches through directory data when looking up names.

    (I believe but am not completely sure that Usenet was a major driving force behind the initial work on non-linear directory lookups.)

  • file creates were usually randomly distributed around these directories, partly because servers generally made no attempt to batch articles from one newsgroup together when they propagated things around.
  • file deletes were semi-random; articles might expire earlier or later than other articles in the same newsgroup for various reasons.

    (The first Usenet software did truly random file deletes; later software at least ordered the article deletions based on what directory they were in.)

  • for a long time, the files were quite small (Usenet spools often needed the inode to data ratio adjusted to create more inodes). Once alt.binaries got active, the size distribution was extremely lumpy; a bunch of small files, a lot of very large ones, and very little in the middle.

  • Usenet was effectively write-mostly random IO (at many sites, most Usenet articles were never read except by the system). Even when read IO was 'sequential' in some sense, as someone read through a bunch of articles in a single newsgroup, it wasn't sequential at the simple OS level because of the small separate files.

    (Just to trip filesystems up, there were some large files that were read sequentially.)
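To see why the first point hurt so much, consider a toy cost model (my illustration, not anything from the original software): with linear directory search, every file create has to scan all the existing entries to check for a name collision, so filling a directory with n articles costs O(n²) name comparisons in total, versus roughly O(n) once filesystems grew hashed or tree-indexed directories:

```python
def create_cost_linear(n):
    # Each create scans the whole directory first: the k-th
    # create compares against k existing entries, so the total
    # is 0 + 1 + ... + (n-1) comparisons.
    return sum(range(n))

def create_cost_indexed(n):
    # With hashed/indexed directories, each lookup is roughly
    # constant cost, so n creates cost about n comparisons.
    return n

# A newsgroup with 10,000 articles: ~50 million comparisons
# to populate linearly, versus ~10,000 with an index.
```

And deletes (article expiry) paid the same linear scan on every removal, which is why huge, churning newsgroup directories were such a pathological case.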

Really, Usenet spools had it all, especially once the alt hierarchy got rolling. Now you may have a better understanding of why I said earlier that an old-style Usenet filesystem would be a ZFS scrub worst case.

(And it is not surprising that the traditional Usenet spool format was eventually replaced by a more optimized storage format in INN.)
