Wandering Thoughts archives

2006-08-29

An interesting filesystem corruption problem

Today we had a fun problem created by a combination of entirely rational find optimizations and a corrupted filesystem.

An important Linux server took some kind of hit that turned some files into directories (with contents, presumably stolen from some other poor directory). We found some of these files, but were pretty sure there were others lurking out there too, and we wanted to do our best to find them. (If only to figure out what we needed to restore from the last good backups.)

As it happens, most of the actual files on this filesystem have some sort of extension, and pretty much none of the directories do. So I made the obvious attempt:

find /hier -name '*.*' -type d -print

Much to my surprise, this didn't report anything, not even the files we already knew about in /hier/foo/bar.

Okay, first guess: I happened to know that find optimizes directory traversal based on directory link counts. On Unix filesystems a directory's link count is normally two plus the number of its subdirectories, so once find has seen that many subdirectories it assumes the remaining entries are 'leaves' (non-directories) and skips stat()ing them; if the link count is off, find will miss directories. A quick check showed that /hier/foo/bar did have the wrong link count (it only had two links, despite now having subdirectories). Usefully, find has a '-noleaf' option to turn this optimization off (it's usually used to deal with non-Unix filesystems that don't necessarily follow this directory link count convention).
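
Retried that way, the command would have looked something like this (a reconstruction of the earlier invocation with -noleaf added, not the exact command line from the day):

find /hier -noleaf -name '*.*' -type d -print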

But that didn't work either. Fortunately I happened to know about the other traversal optimization modern Unixes make possible for find: their directory entries have a field called 'd_type', which records the type of the file (although not its permissions), so find can often tell files from directories without stat()ing them at all. If files had gotten corrupted into directories, it made sense that the d_type in their directory entries would still record their old type, making find skip them as non-directories.
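
To make this concrete, a d_type dumper is only a few lines of C. What follows is a minimal sketch of the idea rather than the actual program I used (it assumes a libc and filesystem that fill in d_type, as Linux's native filesystems do):

#include <stdio.h>
#include <dirent.h>

/* Print the d_type of every entry in a directory, exactly as
   recorded in the directory itself (no stat() calls involved). */
int main(int argc, char **argv)
{
	DIR *dir = opendir(argc > 1 ? argv[1] : ".");
	struct dirent *de;

	if (!dir) {
		perror("opendir");
		return 1;
	}
	while ((de = readdir(dir)) != NULL) {
		const char *t;
		switch (de->d_type) {
		case DT_REG:     t = "file"; break;
		case DT_DIR:     t = "dir"; break;
		case DT_LNK:     t = "symlink"; break;
		case DT_UNKNOWN: t = "unknown"; break;
		default:         t = "other"; break;
		}
		printf("%-8s %s\n", t, de->d_name);
	}
	closedir(dir);
	return 0;
}

Run it on a suspect directory and compare its output with what 'ls -l' (which uses stat()) says about the same entries.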

A quick d_type dumper program showed that this was indeed the case. This also gave us a good way to hunt these files down: walk the filesystem, looking for entries with a mismatch between d_type and what stat(2) returned.
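
The walker itself doesn't need much more than readdir() plus lstat(). Here's a sketch of the approach, not our actual program; it leans on glibc's DTTOIF() macro, which maps a d_type value onto the file-type bits that lstat() puts in st_mode:

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Recursively walk a tree, reporting entries whose d_type (as
   recorded in the directory) disagrees with what lstat(2) says. */
static void walk(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	struct stat st;
	char sub[4096];

	if (!dir) {
		perror(path);
		return;
	}
	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
		if (lstat(sub, &st) < 0) {
			perror(sub);
			continue;
		}
		/* DT_UNKNOWN means the filesystem didn't record a
		   type, so there is nothing to compare against. */
		if (de->d_type != DT_UNKNOWN &&
		    DTTOIF(de->d_type) != (st.st_mode & S_IFMT))
			printf("MISMATCH: %s\n", sub);
		if (S_ISDIR(st.st_mode))
			walk(sub);
	}
	closedir(dir);
}

int main(int argc, char **argv)
{
	walk(argc > 1 ? argv[1] : ".");
	return 0;
}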

In retrospect, I have to thank find for biting us with these optimizations; it led me to a better way to find the problem spots than I otherwise would have had.

(And writing a brute force file tree walker, even in C, turns out to be not as much work as I thought it would be.)

This is of course a great example of leaky abstractions and how knowing the low-level details can really matter. If I hadn't been well-read enough about Unix geek stuff, I wouldn't have known about either find optimization, and things would have been a lot hairier. (I might have found -noleaf with sufficient study of the manpage, but that wouldn't have been enough.)

linux/FSCorruptionAndDType written at 16:36:57

Documentation should be cheap

Although documentation is not free, it should be cheap. By that, I mean that documentation should cost as little as possible to produce, so that you get as much of it as possible for your budget. Again, the major cost is in people's time, so you want writing documentation to be as fast (and easy) as possible.

The golden rule is that time that people spend doing anything except writing down the actual content is overhead. You get the most bang for the buck by minimizing this overhead. And remember, the perfect is the enemy of the good.

There are two sides to this: the technical and the social. On the technical side, cheap documentation needs to be as simple to write as possible. To me, this means that it should use a simple markup language that is very close to plaintext, in a decent editor. (Web browsers do not qualify.)

(Ideally you want something where you can pretty much write basic text paragraphs and have them come out right. I think that you need some formatting, because some things really need it; ASCII art diagrams are just sad, and ASCII tables need a lot of futzing, especially if you have to revise them.)

On the social side, cheap documentation needs a tolerance for ad-hoc material. Not everything has to be ad-hoc, but there should be room for people to just dump a couple of paragraphs in a file somewhere. Adopt the Google approach to finding things: just search everything. Then you can add more structure on top in various ways.

(In practice, many organizations use archived email lists for this purpose.)

Unfortunately, despite what I said about documentation needing testing, cheap documentation also calls for a tolerance for various forms of inaccuracy, whether that's outright mistakes or just information that is now out of date. One way to deal with this is to have multiple levels of documentation, ranging from carefully vetted operations manuals down to scribbled back-of-the-text-file notes. People can still be steered wrong, but at least they're not being misled about how trustworthy the information is.

(I feel that the problem isn't inaccurate information; it's that people trust it too much. I even like outdated historical stuff, because it gives me useful and sometimes fascinating insights into how things evolved. But then, I'm a geek.)

There's an important secondary reason for making documentation cheap: it increases the chances that you'll be able to capture knowledge while it's still fresh in people's minds. The faster it is to write things, the more likely it is that people will have the time to write something down right after they've actually done it. (This is another reason for the popular 'send email to this mail alias to describe what you just did' approach to change documentation.)

sysadmin/DocumentationNeedsToBeCheap written at 00:04:59

