2024-04-01
The power of being able to query your servers for unpredictable things
Today, for reasons beyond the scope of this entry, we wanted to find out how much disk space /var/log/amanda was using on all of our servers. We have a quite capable metrics system that captures the amount of space filesystems are using (among many other things), but /var/log/amanda wasn't covered by this because it wasn't a separate filesystem; instead it was just one directory tree in either the root filesystem (on most servers) or the /var filesystem (on a few fileservers that have a separate /var). Fortunately we don't have too many servers in our fleet and we have a set of tools to run commands across all of them, so answering our question was pretty simple.
This isn't the first time we've wanted to know some random thing about some or all of our servers, and it won't be the last time. The reality of life is that routine monitoring can't possibly capture every fact you'll ever want to know, and you shouldn't even try to make it do so (among other issues, you'd be collecting far too much information). Sooner or later you're going to need to get nearly arbitrary information from your servers, using some mechanism.
This mechanism doesn't necessarily need to be SSH, and it doesn't even need to involve connecting to servers, depending in part on how many of them you have. Perhaps you'll normally do it by peering inside one of your immutable system images to answer questions about it. But on a moderate scale my feeling is that 'run a command on some or all of our machines and give me the output' is the basic primitive you're going to wind up wanting, partly because it's so flexible.
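To make that primitive concrete, here is a minimal sketch in Python of 'run a command on these machines and give me the output'. It is not the tooling we actually use. The function names, the worker count, and the timeouts are all assumptions made for illustration, and it assumes that the stock `ssh` client is on your PATH and that key-based authentication is already set up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run `command` on `host` with the ssh client; return (exit_code, output)."""
    try:
        proc = subprocess.run(
            # BatchMode=yes makes ssh fail instead of prompting for a password.
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        # Report unreachable hosts (or a missing ssh binary) instead of raising.
        return (255, str(exc))
    return (proc.returncode, proc.stdout.strip() or proc.stderr.strip())


def run_on_hosts(hosts, command, runner=ssh_run, max_workers=10):
    """Run `command` on every host in parallel; return {host: (exit_code, output)}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `hosts`.
        results = pool.map(lambda h: runner(h, command), hosts)
        return dict(zip(hosts, results))


# Example use (the host names here are made up):
#   for host, (code, out) in run_on_hosts(["apps0", "fs1"], "du -sk /var/log/amanda").items():
#       print(host, code, out)
```

The `runner` parameter exists so that the transport can be swapped out, which fits the point that the mechanism doesn't have to be SSH.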
(One advantage of using SSH for this is that SSH has a mature, well understood, and thoroughly hardened authentication and access control system. Other mechanisms for what is fundamentally remote command or code execution may not be so solid and trustworthy. And if you want to, you can aggressively constrain what an SSH connection can do through additional measures like forcing it to run in a captive environment that only permits certain things.)
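One way to build such a captive environment is OpenSSH's forced command feature: a `command="..."` option on a key's line in `authorized_keys` makes sshd run that program no matter what the client asked for, and puts the client's request in the `SSH_ORIGINAL_COMMAND` environment variable. What follows is a minimal sketch of an allowlisting wrapper that could serve as the forced command. The allowlist contents are hypothetical, and I'm not claiming this is how anyone's production setup works.

```python
import os
import shlex
import subprocess
import sys

# Hypothetical allowlist: the exact argument vectors this key may run.
ALLOWED = {
    ("du", "-sk", "/var/log/amanda"),
    ("df", "-k"),
}


def permitted(original_command, allowed=ALLOWED):
    """Return the parsed argv tuple if the request is allowlisted, else None."""
    try:
        argv = tuple(shlex.split(original_command))
    except ValueError:
        # Unbalanced quotes and the like: refuse rather than guess.
        return None
    return argv if argv in allowed else None


def main():
    # sshd sets SSH_ORIGINAL_COMMAND to whatever the client asked to run.
    argv = permitted(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    if argv is None:
        print("command not permitted", file=sys.stderr)
        return 1
    # Run without a shell so nothing in the request is re-interpreted.
    return subprocess.run(list(argv)).returncode
```

To use it as a script you would end the file with `sys.exit(main())` and point the key's `command="..."` option at wherever you installed it. Adding the `restrict` option on the same `authorized_keys` line also turns off port forwarding, PTY allocation, and similar features for that key. Matching whole argument vectors is deliberately strict; anything with extra arguments or shell metacharacters simply fails to match.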
PS: The direct answer is that on everything except our Amanda backup servers, /var/log/amanda is at most 20 Mbytes or so, and often a lot less. After the Amanda servers, our fileservers have the largest amount of data there. In our environment, this directory tree is only used for what are basically debugging logs, and I believe that on clients, the amount of debugging logs you wind up with scales with the number of filesystems you're dealing with.