Checking if a machine is 'up' for scripts, well, for rsync

May 8, 2022

One of the ways that we propagate our central administrative filesystem to machines is through rsync; rather than NFS-mounting the filesystems, some machines instead use rsync to pull a subset of the filesystem to their local disk. This gives these machines resilience in case our NFS fileservers aren't available. This rsync is done (in a script) from cron, and so we'll get emailed any problems or odd output. This all works great until the rsync master machine is down, at which point we can count on a blizzard of emails from all of the rsync machines complaining that they can't talk to the master.

On the one hand, we do want to hear about specific rsync problems, so we can't just throw the rsync output and its exit status away. On the other hand, it's not useful to be told at great length that the rsync master is down; we probably already know that because our monitoring and alerting system will have told us. It would be nice to get email from the cron rsync job only about novel problems, and for things to be silent if the script determines that the rsync master is down.

This opens two cans of worms, both of which are tractable in our specific case. The first can of worms is what it means for a machine to be up. Does it ping? Does it respond to SSH? Are its services healthy? And so on. In our specific case, the rsync is done over SSH (of course) and our monitoring and alerting system is already monitoring the SSH port. So we can say with confidence that if the SSH port on the rsync master isn't responding, our rsync isn't going to work and we're also going to get an alert about it. This means we could use a check like what I use in my 'sshup' script, using nc to see if a TCP connection to the SSH port on the rsync master succeeds.

(This approach can be adapted to check if any particular port is responding, although it has to be a port that's harmless to just connect to and then drop the connection.)
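As an illustration of the TCP connection check, here is a minimal sketch in Python rather than nc. The function name, the default port of 22, and the five second timeout are my assumptions for the example; it is not the actual 'sshup' script.

```python
import socket


def port_is_up(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # Connect and immediately close again. We never speak the SSH
        # protocol, so this is only suitable for ports that tolerate a
        # connection being opened and then dropped.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks,
        # and hostname resolution failures.
        return False
```

Treating every OSError as 'down' is deliberate here: from the client's point of view, all of these failures mean the rsync is not going to work.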

The second can of worms is that for a client machine, the rsync master being down is indistinguishable from a network problem in reaching the rsync master (or the master's SSH port). What helps us here is that pretty much all of our servers are on one subnet; it would be a pretty interesting network problem that left our client server unable to talk to the rsync master but able to talk to our mail sending machine and our metrics machine (so that it looks healthy in metrics). If this were a concern, one approach would be to publish a metric for whether or not the rsync was successful and then alert if any machine had been unsuccessful for too long. This does require us to be collecting metrics from the machine, but we probably are.

(You're probably already alerting if you can't collect metrics from the machine.)

PS: Even if we can make a TCP connection to the rsync master's SSH port, there are a bunch of general failure modes that could stop all machines from being able to pull stuff via rsync and thus cause a blizzard of complaining emails. However, for us they're vanishingly infrequent failure modes compared to the rsync master just being down, and so we could eliminate almost all of these noise emails with a simple TCP connection check.
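Putting the pieces together, a cron wrapper along these lines might look like the sketch below. It is only an illustration under assumptions: the function names are hypothetical, and the policy of checking again after a failed rsync (to cover the master going down partway through) is my addition rather than something described above.

```python
import socket
import subprocess
import sys


def port_is_up(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def sync_quietly(cmd, host, port=22, is_up=port_is_up, run=subprocess.run):
    """Run the rsync command, staying silent if the master looks down.

    Returns the exit status the cron job should use. Anything written
    to stdout here is what cron would email.
    """
    if not is_up(host, port):
        # Master unreachable: monitoring will already be alerting on
        # this, so produce no output and report success to cron.
        return 0
    proc = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    if proc.returncode != 0 and not is_up(host, port):
        # The master went away while we were running; treat it the
        # same as being down from the start.
        return 0
    if proc.stdout:
        # Pass through any problems or odd output untouched.
        sys.stdout.write(proc.stdout)
    return proc.returncode
```

The script would call sync_quietly() with its usual rsync command line as a list plus the master's hostname, and exit with the returned status. The is_up and run parameters exist only so the decision logic can be exercised without a real master or a real rsync.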

Written on 08 May 2022.
Last modified: Sun May 8 23:24:18 2022