Wandering Thoughts archives

2018-08-20

It's worth testing that obvious things actually do work

We've reached the point in putting together our future ZFS on Linux NFS fileservers where we believe we have everything built, and now we're testing it to make sure that it works and to do our best to verify that there are no hidden surprises. In addition to the expected barrage of NFS client load tests and so on, my co-worker decided to verify that NFS locks worked. I would not have bothered, because of course NFS locks work; they are a well-solved problem, and it has been many years since NFS locks (on Linux or elsewhere) had any chance of not working. This goes to show that my co-worker is smarter than I am, because when he actually tried it (using a little lock testing program that I wrote years ago), well:

$ ./locktest plopfile
Press <RETURN> to try to get a flock shared lock on plopfile:
Trying to get lock...
  flock lock failure: No locks available

With some digging we were able to determine that this was caused by rpc.statd not being started on our (Linux) fileserver. We're using NFSv3, which requires some extra daemons to handle aspects of the (separate) locking protocol, and presumably NFSv3 is unfashionable enough these days that systems no longer bother to start them by default.

(Perhaps I'm making excuses for Ubuntu 18.04 here.)
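For illustration, here is a minimal sketch of the kind of check a lock tester can make. This is not the locktest program from the transcript above; the function name try_flock and its messages are made up for this example. It assumes Python on a Unix system and simply attempts a non-blocking flock() on the file, reporting ENOLCK (the 'No locks available' error) separately from ordinary lock contention.

```python
import errno
import fcntl
import os


def try_flock(path, exclusive=False):
    """Try a non-blocking flock() on path and return (ok, message).

    Hypothetical helper for illustration; not the original locktest.
    """
    op = (fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH) | fcntl.LOCK_NB
    # Open read-write so both shared and exclusive locks are permitted.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, op)
        except OSError as e:
            if e.errno == errno.ENOLCK:
                # What the transcript shows: 'No locks available'.
                return False, "no locks available"
            if e.errno in (errno.EWOULDBLOCK, errno.EAGAIN):
                # Locking works, but someone else holds a conflicting lock.
                return False, "lock is held by someone else"
            return False, "flock failed: " + os.strerror(e.errno)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True, "lock obtained and released"
    finally:
        os.close(fd)
```

Run against a file on the NFS mount from a client, something like try_flock("/some/nfs/mount/plopfile") should report success if locking is healthy. The path here is only a placeholder.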

Had we taken the fileserver into production without discovering this, the good news is that important things like our mail system would probably have failed safe by refusing to proceed without locks. But we would certainly have had a fun debugging experience, and under more stress than we did in testing. So I'm very glad that my co-worker was carefully thorough here.

The obvious moral I take from this is that it's worth testing that the obvious things do work. The obvious things are probably not broken in general (otherwise you would hopefully have heard about it during system research and design), but there's always the possibility of setup or configuration mistakes, or that you have a sufficiently odd system that you're falling into a corner case. You may not want to test truly everything, but it's certainly worth testing important but obvious things, such as NFS locking.

(There's also the unpleasant possibility that you've wound up with some fundamental misunderstanding about how the system is designed to work. This is going to force some big changes, but it's better to find this out before you try to take your mistake into production, rather than afterward as things are exploding.)

How much and how thoroughly you test in general depends on your resources and the importance of what you're doing. Some places might find and run a test suite that verified that their new NFS fileservers were delivering full POSIX compatibility (or as much as you can get on NFS in general), for example. Making a point of testing the obvious is only an issue if you're only going to do partial tests, and so you might otherwise be tempted to skip the 'it's so obvious it must work' bits in the interests of time.

You may also want to skip explicitly testing the obvious in favour of doing end-to-end tests that will depend on the obvious working. For example, we might set up an end-to-end test of mail delivery and (IMAP) mail reading, and if we had, that would almost certainly have discovered the locking issue. There are trade-offs involved in each level of testing, of course.

(The short version is that end-to-end testing can tell you that it works but it can't tell you why, and it can be dangerous to infer the why yourself. If you actually want a low-level functionality test, do the test directly.)

Sidebar: The smoking gun symptom

The fileserver's kernel logs had a bunch of messages reporting:

lockd: cannot monitor <host>

This comes from kernel code that attempts to make an upcall to rpc.statd, which led us to look at ps to make sure that rpc.statd was there before we went digging further.
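Looking at ps is what we did, but the same check can be scripted. The sketch below is one assumed way of doing it, not something from our actual setup: it parses the output of 'rpcinfo -p' and reports which of the NFSv3 locking related RPC services are not registered. It relies on rpc.statd registering under the service name 'status' and the kernel lock manager under 'nlockmgr'; the function names are invented for this example.

```python
import subprocess


def registered_rpc_services(rpcinfo_output):
    """Return the set of service names found in 'rpcinfo -p' style output."""
    names = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data lines look like: program vers proto port service
        # The header line is skipped because its first field is not a number.
        if len(fields) >= 5 and fields[0].isdigit():
            names.add(fields[4])
    return names


def missing_lock_services(rpcinfo_output, needed=("status", "nlockmgr")):
    """Return the needed services that are absent from the rpcinfo output."""
    have = registered_rpc_services(rpcinfo_output)
    return [name for name in needed if name not in have]


def check_host(host="localhost"):
    """Run 'rpcinfo -p HOST' and return the missing lock services.

    Assumes the rpcinfo command is installed and the portmapper is reachable.
    """
    result = subprocess.run(
        ["rpcinfo", "-p", host],
        capture_output=True, text=True, check=True,
    )
    return missing_lock_services(result.stdout)
```

On a fileserver in the state described here, check_host() would be expected to include 'status' in its result, since rpc.statd was not running to register itself.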

TestTheObvious written at 01:17:56

