Why you should rate-limit messages that outside things can cause

March 20, 2008

Modern versions of NFS have a variety of authentication methods, and so one of the errors that an NFS server can give a client is 'your authentication method is too weak'; for example, the client could be sending plain old Unix UIDs and GIDs to a server that requires Kerberos to get strong distributed filesystem authentication. When this happens, the Linux kernel helpfully prints an error message about it:

call_verify: server somehost.cs requires stronger authentication.

In fact, it prints this message every time it gets an RPC reply with this error. (Some of you are wincing already.)

Our current NFS servers are creaky old Solaris 8 machines. One part of that creakiness is that every so often the kernel loses its mind and decides that some or all clients aren't using a strong enough authentication method to talk to some or all filesystems. When this happens, all NFS IO from the affected clients to the affected filesystems suddenly gets 'authentication too weak' errors.

If we are unlucky, this IO is being done by something active that doesn't notice IO errors and just keeps retrying. When this happens, the client machine is basically dead, almost entirely consumed by dumping this message to the console over and over again as fast as the console can print, and it is time for a magic SysRq reboot because nothing else works.

(We've lost more than one major server to this. It's not fun.)

I expect that with a properly behaving NFS server, you'd get this error once at mount time and the mount would fail. But as my example illustrates, you can't count on the outside world to work properly all the time, and that is exactly why you should rate-limit error messages that can be produced by the outside world.
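To make the idea concrete, here is a minimal user-space sketch of this kind of rate limiting. It is not the kernel's code or any proposed patch; the names RateLimiter and report, the default burst and interval, and the wording of the summary line are all made up for illustration. It lets through a small burst of messages per time window, counts everything else, and prints one summary line when the next window opens.

```python
import time


class RateLimiter:
    """Allow at most `burst` messages per `interval` seconds; count the rest.

    Hypothetical helper for illustration only.
    """

    def __init__(self, burst=10, interval=5.0, clock=time.monotonic):
        self.burst = burst
        self.interval = interval
        self.clock = clock          # injectable so the limiter can be tested
        self.window_start = clock()
        self.emitted = 0
        self.suppressed = 0

    def allow(self):
        """Return (ok, missed).

        ok says whether the caller may print now; missed is how many
        messages were dropped during the window that just ended.
        """
        now = self.clock()
        missed = 0
        if now - self.window_start >= self.interval:
            # A new window starts: hand back the drop count and reset.
            missed = self.suppressed
            self.window_start = now
            self.emitted = 0
            self.suppressed = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True, missed
        self.suppressed += 1
        return False, missed


def report(limiter, message, out=print):
    """Print `message` through `limiter`, noting any suppressed messages."""
    ok, missed = limiter.allow()
    if missed:
        out(f"({missed} similar messages suppressed)")
    if ok:
        out(message)
```

The important property is that the cost of a suppressed message is a clock read and a couple of integer operations, not a trip to the console. You still find out that something is badly wrong, and roughly how badly, without the reporting itself taking the machine down.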

Note that this doesn't just apply to the kernel, and it applies even if you are dumping messages to syslog. While syslogd will do rate-limiting of a sort, you and it will burn a bunch of CPU in the process.
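For an ordinary program that logs through Python's standard logging module, one way to do this is to drop the excess before it ever reaches a handler. What follows is a sketch under assumptions, not a recommendation of specific numbers: RateLimitFilter is a hypothetical class, and limiting per message template (record.msg, before arguments are filled in) is a design choice of mine so that one noisy message can't silence unrelated ones.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Pass at most `burst` records per `interval` seconds for each
    distinct message template. Hypothetical helper for illustration.
    """

    def __init__(self, burst=5, interval=10.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.interval = interval
        self.clock = clock
        # message template -> [window_start, count]; this grows with the
        # number of distinct templates, which a real version should bound.
        self.windows = {}

    def filter(self, record):
        key = str(record.msg)   # the unformatted template, not the final text
        now = self.clock()
        window = self.windows.get(key)
        if window is None or now - window[0] >= self.interval:
            window = [now, 0]
            self.windows[key] = window
        window[1] += 1
        # Returning False tells logging to drop the record.
        return window[1] <= self.burst
```

You would attach it with something like logger.addFilter(RateLimitFilter()) on the logger that also has a logging.handlers.SysLogHandler. Because the filter runs first, a flood costs you a dictionary lookup per message instead of formatting the message and sending it to syslogd, which then has to do its own work to throw it away.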

(Yes, I'm going to try to report this to the Linux NFS people; if I can, I'll even try to create a patch. Unfortunately it probably won't help us, because we're running Ubuntu 6.06 and the Ubuntu people will probably not accept or backport such a specialized fix.)
