2007-01-09
Solving an automounter timeout problem with brute force
Our central mail machine runs various cron jobs as part of its work. Recently, every now and then a cron job (or a command run out of an alias) has started dying at random with an error like:
sh: /cs/foo/adm/script: cannot execute
(Where /cs/foo is NFS mounted through the automounter, and the cron entry just runs that script.)
I am pretty sure that this is a gift from the Solaris 8 automounter.
Our central mail machine is pretty old and pokey, and we recently switched to a new method of authenticating NFS mounts that requires an ssh callback. So my operating theory is that this is the charmingly non-specific error you get when the NFS mount reply is too slow in coming and the automounter just gives up.
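If that theory is right, simply trying again a little later might succeed, since the second attempt gives the mount another chance to complete. The following is only a sketch of that idea, not something from the original setup: retry_exec, RETRY_MAX and RETRY_DELAY are made-up names, it assumes a POSIX shell, and it assumes the failure surfaces as exit status 126 (what POSIX shells return for a command that is found but cannot be executed).

```shell
#!/bin/sh
# retry_exec: hypothetical helper, not part of the original setup.
# Re-runs a command only when the shell reports exit status 126
# ("found but could not be executed"). That this is how the automounter
# failure shows up is an assumption.
# RETRY_MAX and RETRY_DELAY are invented tunables.
retry_exec() {
    max=${RETRY_MAX:-3}
    delay=${RETRY_DELAY:-10}
    attempt=1
    while :; do
        status=0
        "$@" || status=$?
        # Any status other than 126 is the command's own result; pass it on.
        if [ "$status" -ne 126 ]; then
            return "$status"
        fi
        # Out of attempts: report the failure to the caller.
        if [ "$attempt" -ge "$max" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A cron entry could then source this and run `retry_exec /cs/foo/adm/script` instead of the bare script. Whether a retry actually gets past the slow mount depends on the theory being correct.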
My current brute force solution is a little script I call 'keepmounted':
# Park a long-lived shell in each directory so the filesystem always looks busy.
for i in "$@"; do
    nohup sh -c 'cd "$1" && while :; do sleep 604800; done' keepmounted "$i" >/dev/null 2>&1 </dev/null &
done
(The sleep value is more or less arbitrary.)
Then I just ran it for every automounted filesystem that we saw problems with and moved on to other fires. (Yes, at some point I need a better solution, but the machine is rebooted only rarely and we're working on replacing it anyways.)
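Running it for several filesystems at once is easier to trust if each directory is checked first and the keeper's PID is reported. Here is a minimal sketch of that, assuming a POSIX shell. keep_all and KEEP_SLEEP are hypothetical names, and the sleep defaults to the same one-week value as keepmounted.

```shell
#!/bin/sh
# keep_all: hypothetical wrapper around the keepmounted idea.
# KEEP_SLEEP is an invented tunable; it defaults to the one-week sleep
# used by keepmounted.
keep_all() {
    for dir in "$@"; do
        # Entering the directory in a subshell triggers the automount, if any,
        # and tells us whether a keeper could get in at all.
        if (cd "$dir") >/dev/null 2>&1; then
            # The keeper's working directory stays inside $dir, so the
            # filesystem is always in use.
            nohup sh -c 'cd "$1" && while :; do sleep "$2"; done' \
                keeper "$dir" "${KEEP_SLEEP:-604800}" >/dev/null 2>&1 </dev/null &
            echo "keeping $dir (pid $!)"
        else
            echo "cannot enter $dir, skipped" >&2
        fi
    done
}
```

For example, `keep_all /cs/foo /cs/bar` (where /cs/bar is a made-up second filesystem) prints one "keeping" line per directory it could enter and complains on standard error about any it could not. Note that killing a reported PID stops only the keeper shell; its current sleep child still sits in the directory until that sleep finishes.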
(This sort of cheap hack is a surprisingly common occurrence in system administration. Sometimes a bandaid is really the best solution.)