2022-05-08
Checking if a machine is 'up' for scripts, well, for rsync
One of the ways that we propagate our central administrative filesystem to machines is through rsync; rather than NFS-mounting the filesystems, some machines instead use rsync to pull a subset of the filesystem to their local disk. This gives these machines resilience in case our NFS fileservers aren't available. This rsync is done (in a script) from cron, and so we'll get emailed any problems or odd output. This all works great until the rsync master machine is down, at which point we can count on a blizzard of emails from all of the rsync machines complaining that they can't talk to the master.
On the one hand, we do want to hear about specific rsync problems, so we can't just throw the rsync output and its exit status away. On the other hand, it's not useful to be told at great length that the rsync master is down; we probably already know that because our monitoring and alerting system will have told us. It would be nice to get email from the cron rsync job only about novel problems, and for things to be silent if the script determines that the rsync master is down.
This opens two cans of worms, both of which are tractable in our specific case. The first can of worms is what it means for a machine to be up. Does it ping? Does it respond to SSH? Are its services healthy? And so on. In our specific case, the rsync is done over SSH (of course) and our monitoring and alerting system is already monitoring the SSH port. So we can say with confidence that if the SSH port on the rsync master isn't responding, our rsync isn't going to work and we're also going to get an alert about it. This means we could use a check like what I use in my 'sshup' script, using nc to see if a TCP connection to the SSH port on the rsync master succeeds.
(This approach can be adapted to check if any particular port is responding, although it has to be a port that's harmless to just connect to and then drop the connection.)
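To make the idea concrete, here is a minimal sketch of a TCP connection check in Python. It is an illustration only, not the 'sshup' script itself (which uses nc); the function name `port_is_up` and the five-second default timeout are assumptions made for the example.

```python
import socket


def port_is_up(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    The connection is closed immediately, so this should only be pointed at
    ports where connecting and then dropping the connection is harmless.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable networks, and
        # name resolution failures, all of which count as "not up" here.
        return False
```

A short timeout matters here, because a machine that is powered off usually produces no response at all rather than a prompt connection refusal.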
The second can of worms is that for a client machine, the rsync master being down is indistinguishable from a network problem in reaching the rsync master (or the master's SSH port). What helps us here is that pretty much all of our servers are on one subnet; it would be a pretty interesting network problem that left our client server unable to talk to the rsync master but able to talk to our mail sending machine and our metrics machine (so that it looks healthy in metrics). If this were a concern, one approach would be to publish a metric for whether or not the rsync was successful and then alert if any machine had been unsuccessful for too long. This does require us to be collecting metrics from the machine, but we probably are.
(You're probably already alerting if you can't collect metrics from the machine.)
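As a rough sketch of publishing such a metric, the following assumes a Prometheus-style setup where a host agent (for example node_exporter's textfile collector) picks up metrics from a file in the Prometheus text format. Nothing above says which metrics system is in use, so the format, the metric names, and the file path argument are all assumptions for illustration.

```python
import os
import tempfile
import time
from typing import Optional


def format_rsync_metric(success: bool, now: float) -> str:
    """Render two hypothetical metrics in the Prometheus text format."""
    return (
        f"rsync_pull_success {1 if success else 0}\n"
        f"rsync_pull_last_run_timestamp_seconds {int(now)}\n"
    )


def write_rsync_metric(path: str, success: bool, now: Optional[float] = None) -> None:
    """Atomically write the metrics file so a reader never sees a partial file."""
    if now is None:
        now = time.time()
    directory = os.path.dirname(path) or "."
    # Write to a temporary file in the same directory, then rename over the
    # target; os.replace is atomic when both paths are on one filesystem.
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(format_rsync_metric(success, now))
    os.chmod(tmp.name, 0o644)  # temporary files are created owner-only
    os.replace(tmp.name, path)
```

An alert rule could then fire when `rsync_pull_success` has been 0 for longer than some chosen threshold, which would catch a client that can't reach the master even when nothing else looks wrong.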
PS: Even if we can make a TCP connection to the rsync master's SSH port, there are a bunch of general failure modes that could stop all machines from being able to pull stuff via rsync and thus cause a blizzard of complaining emails. However, for us they're vanishingly infrequent failure modes compared to the rsync master just being down, and so we could eliminate almost all of these noise emails with a simple TCP connection check.
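The overall shape of the cron job could look something like the sketch below: stay silent and exit successfully if the master looks down, and otherwise run rsync and pass through its output and exit status so that cron still emails about real problems. The function name, the hostname, and the rsync arguments in the usage comment are hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def run_unless_master_down(master_is_up: Callable[[], bool],
                           command: Sequence[str]) -> int:
    """Run command only if master_is_up() returns True.

    When the master looks down, print nothing and return 0 so that cron has
    nothing to email. Otherwise the command inherits stdout and stderr, so
    its output and exit status reach cron unchanged.
    """
    if not master_is_up():
        return 0
    return subprocess.run(command).returncode


# Hypothetical usage from a cron script, with a TCP check on the SSH port:
#
#   import socket, sys
#
#   def master_up() -> bool:
#       try:
#           socket.create_connection(("rsyncmaster.example.org", 22), timeout=5).close()
#           return True
#       except OSError:
#           return False
#
#   sys.exit(run_unless_master_down(
#       master_up,
#       ["rsync", "-a", "rsyncmaster.example.org:/some/source/", "/some/dest/"]))
```

Taking the check as a callable keeps the policy (what counts as 'up') separate from the wrapper, so a different check could be swapped in without touching the rsync part.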