2011-06-04
Sometimes having a system programmer around is the right answer
In SystemProgrammerDanger, I wrote about the danger of having a system programmer around (to summarize, we automatically reach for things like the source code). But there's a flip side to that, handily illustrated by my recent entry on getting stale file errors for local files; sometimes the system programmer approach is the right one.
When my colleague brought up this odd issue he was having with local files giving 'stale filehandle' errors, my first reaction was to grep through the kernel source for places that returned ESTALE errors; I figured that there couldn't be too many of them, since ESTALE is a very specific error (I was mostly correct about this). Reading through the code the grep found soon pointed me to the likely issue, especially once my colleague also reported that the system had disk errors and he'd been seeing odd stat() results for other files. All of this took me only a few minutes (partly because I already had kernel source available).
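For anyone who wants to try the same sort of search without grep handy, here is a rough Python sketch of the idea. It isn't the command I ran; the function name find_symbol, the choice to look only at .c and .h files, and the example path are all assumptions made for illustration.

```python
import os
import re


def find_symbol(root, symbol, exts=(".c", ".h")):
    """Return (path, line_number, text) for every line that mentions
    `symbol` as a whole word in source files under `root`.

    Hypothetical helper: a stand-in for a recursive, whole-word grep.
    """
    # \b keeps us from matching longer identifiers that merely contain
    # the symbol (for example SOMETHING_ESTALEX).
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            # Source trees sometimes contain odd bytes; don't die on them.
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits
```

You would point it at wherever your kernel source lives, for example find_symbol("/usr/src/linux/fs", "ESTALE") (that path is only a guess at a typical location), and then read through the handful of hits by hand. The reading is the part that actually matters.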
I'm pretty sure that this was the fastest way I could have found the answer. And I found it by taking a system programmer's path.
(Okay, web searches do suggest that other people have run into this before and have identified it as being caused by disk corruption and sometimes fixed by fsck. This is decent operational advice but doesn't tell you what's really going on. My personal view is that knowing what's really going on is important because it gives you confidence that you've dealt with the real problem instead of just papered over a symptom.)