The power of being able to query your servers for unpredictable things

April 1, 2024

Today, for reasons beyond the scope of this entry, we wanted to find out how much disk space /var/log/amanda was using on all of our servers. We have a quite capable metrics system that captures the amount of space filesystems are using (among many other things), but /var/log/amanda wasn't covered by this because it wasn't a separate filesystem; instead it was just one directory tree in either the root filesystem (on most servers) or the /var filesystem (on a few fileservers that have a separate /var). Fortunately we don't have too many servers in our fleet and we have a set of tools to run commands across all of them, so answering our question was pretty simple.

This isn't the first time we've wanted to know some random thing about some or all of our servers, and it won't be the last time. The reality of life is that routine monitoring can't possibly capture every fact you'll ever want to know, and you shouldn't even try to make it do so (among other issues, you'd be collecting far too much information). Sooner or later you're going to need to get nearly arbitrary information from your servers, using some mechanism.

This mechanism doesn't necessarily need to be SSH, and it doesn't even need to involve connecting to servers, depending in part on how many of them you have. Perhaps you'll normally do it by peering inside one of your immutable system images to answer questions about it. But on a moderate scale my feeling is that 'run a command on some or all of our machines and give me the output' is the basic primitive you're going to wind up wanting, partly because it's so flexible.
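As a rough illustration of that primitive, here is a minimal Python sketch that fans a single command out over SSH and collects the results per host. It is not our actual tooling; the function names (`ssh_run`, `run_everywhere`, `du_kbytes`) are made up for this example, and it assumes key-based SSH access is already set up for every host you give it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run command on host with the ssh client; return (status, stdout, stderr)."""
    # BatchMode=yes makes ssh fail instead of stopping to prompt for a password.
    argv = ["ssh", "-o", "BatchMode=yes", host, command]
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return (None, "", "timed out")
    return (proc.returncode, proc.stdout, proc.stderr)


def run_everywhere(hosts, command, runner=ssh_run, workers=8):
    """Run command on every host in parallel; return {host: (status, stdout, stderr)}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as hosts.
        results = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, results))


def du_kbytes(stdout):
    """Extract the size from 'du -sk' output such as '20480<TAB>/var/log/amanda'."""
    return int(stdout.split()[0])
```

For the question in this entry, you would call something like `run_everywhere(hosts, "du -sk /var/log/amanda")` with your own host list and feed the standard output of each successful host to `du_kbytes()`. The `runner` parameter exists only so that the SSH step can be swapped out, for example for a different transport or for testing.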

(One advantage of using SSH for this is that SSH has a mature, well understood, and thoroughly hardened authentication and access control system. Other mechanisms for what is fundamentally remote command or code execution may not be as solid and trustworthy. And if you want to, you can aggressively constrain what an SSH connection can do through additional measures, like forcing it to run in a captive environment that only permits certain things.)
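One way to build such a captive environment is an OpenSSH forced command: a `command="..."` option on the key's `authorized_keys` line makes sshd run a fixed program no matter what the client asks for, and sshd passes the client's requested command to that program in the `SSH_ORIGINAL_COMMAND` environment variable. The following is a minimal sketch of such a wrapper, not something we necessarily run; the allowlist and its request names are hypothetical.

```python
import os
import shlex
import sys

# Hypothetical allowlist: the name a client may request -> the argv actually run.
ALLOWED = {
    "amanda-log-size": ["du", "-sk", "/var/log/amanda"],
    "uptime": ["uptime"],
}


def resolve(original_command, allowed=ALLOWED):
    """Return the argv for a requested command, or None if it is not permitted."""
    try:
        words = shlex.split(original_command or "")
    except ValueError:
        # Unbalanced quotes and similar malformed input are simply refused.
        return None
    # Only a single bare name is accepted; client-supplied arguments are rejected.
    if len(words) != 1:
        return None
    return allowed.get(words[0])


def main():
    # With a forced command in effect, sshd sets SSH_ORIGINAL_COMMAND to
    # whatever command the client originally asked to run.
    argv = resolve(os.environ.get("SSH_ORIGINAL_COMMAND"))
    if argv is None:
        print("command not permitted", file=sys.stderr)
        return 1
    # Replace this process with the permitted command; no shell is involved.
    os.execvp(argv[0], argv)
```

Installed as a script that finishes with `sys.exit(main())`, this lets a client run `ssh host amanda-log-size` while refusing everything else. The fixed argv lists mean nothing from the client ever reaches a shell, which keeps the permitted surface small at the cost of flexibility.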

PS: The direct answer is that on everything except our Amanda backup servers, /var/log/amanda is at most 20 Mbytes or so, and often a lot less. After the Amanda servers, our fileservers have the largest amount of data there. In our environment, this directory tree is only used for what are basically debugging logs, and I believe that on clients, the amount of debugging logs you wind up with scales with the number of filesystems you're dealing with.

Written on 01 April 2024.


Last modified: Mon Apr 1 23:04:41 2024