Another piece of my environment: running commands on multiple machines

August 5, 2014

It's my belief that every sysadmin who has a middling number of machines (more than one or two and less than a large fleet) sooner or later winds up with a set of tools to let them run commands on each of those machines or on subsets of those machines. I am no exception, and over the course of my career I have adopted, used, and built several iterations of this.

(People with large fleets of machines generally never run commands on them by hand but have some sort of automated fleet management system based around Puppet, Chef, CFEngine, Salt, Ansible, or the like. Sometimes these systems come with the ability to do this for you.)

My current iteration is based around simple shell scripts. The starting point is a shell script called machines; it prints out the names of machines that fall into various categories. The categorization of machines is entirely hand maintained, which has both problems and advantages. As a result the whole thing looks like:

mach() {
  for i in "$@"; do
    case "$i" in
    apps) echo apps0 apps1 apps2 apps3 testapps;;
    ....
    ubuntu) echo `mach apps comps ...` ...;;
    ....
    *) echo "$i";;
    esac
  done
}

mach "$@" | tr '\012' ' '; echo

(I put all of the work in a shell function so that I could call it recursively, for classes that are defined partly in terms of other classes. The 'ubuntu' class here is an example of that.)
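A minimal runnable sketch of the same idea follows. The class names and host names here are made-up placeholders (the real lists are elided above), and the wrapper function name `machines` is an assumption standing in for the script itself.

```shell
# Sketch only: every class and host name below is a hypothetical placeholder.
mach() {
  for i in "$@"; do
    case "$i" in
    apps)   echo apps0 apps1 testapps ;;
    comps)  echo comp0 comp1 ;;
    # A class defined partly in terms of other classes: recurse via mach.
    # The command substitution is left unquoted so its output is word-split.
    ubuntu) echo $(mach apps comps) mailhost ;;
    # Anything unrecognized is treated as a literal machine name.
    *)      echo "$i" ;;
    esac
  done
}

# Turn the newline-separated output into one space-separated line.
machines() {
  mach "$@" | tr '\012' ' '; echo
}
```

With these definitions, `machines ubuntu otherhost` expands the class recursively and passes the unknown name through unchanged.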

So far we have few enough machines and few enough categories of machines that I'm interested in that this approach has not become unwieldy.

(There is also a script called mminus for doing subtractive set operations, so I can express 'all X machines except Y machines' or the like. This comes in handy periodically.)
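The post does not show mminus or its command line, so the following is only a guess at how such a subtraction could work. The `INCLUDE... -- EXCLUDE...` interface is an assumption made for this sketch, not the real script's syntax.

```shell
# Hypothetical interface: mminus INCLUDE... -- EXCLUDE...
# Prints the INCLUDE names that do not appear among the EXCLUDE names.
# The body runs in a subshell so its variables do not leak to the caller.
mminus() (
  include="" exclude="" seen_sep=no
  for arg in "$@"; do
    if [ "$seen_sep" = no ] && [ "$arg" = "--" ]; then
      seen_sep=yes
    elif [ "$seen_sep" = no ]; then
      include="$include $arg"
    else
      exclude="$exclude $arg"
    fi
  done

  result=""
  for m in $include; do
    keep=yes
    for x in $exclude; do
      if [ "$m" = "$x" ]; then keep=no; fi
    done
    if [ "$keep" = yes ]; then result="$result $m"; fi
  done
  # Unquoted on purpose: collapses the leading space.
  echo $result
)
```

For example, `mminus apps0 apps1 testapps -- testapps` would print `apps0 apps1`.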

The main script for actually doing things is called oneach, which does what you might think: given a list of machines, it runs a command line on each of them via ssh. You can ask it to run the command in a pseudo-tty and without any special output handling, but normally it just runs the command with 'ssh machine command' and prefixes all output with the name of the machine; you can see an example of that in this awk-based reformatting problem (an oneach run produced the input for my problem). Because I like neat formatting, oneach has an option to align the starting column of all output and I usually use that option (via a cover script called onea, because I'm lazy). The oneach script doesn't try to do anything fancy with concurrent execution or the like; it just does one ssh after the other.
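The core loop described here can be sketched roughly as below. This is not the actual oneach: the calling convention (machine list as the first argument), the always-on alignment, and the exact prefix format are all assumptions for illustration.

```shell
# Hypothetical interface: oneach "machine1 machine2 ..." command [args...]
# Runs the command on each machine in turn and prefixes each output line
# with the machine name, padded so the output starts in the same column.
oneach() (
  machines=$1; shift

  # Find the longest machine name so the prefixes can be aligned.
  width=0
  for m in $machines; do
    if [ "${#m}" -gt "$width" ]; then width=${#m}; fi
  done
  width=$((width + 1))   # leave room for the trailing colon

  for m in $machines; do
    # -n stops ssh from reading our stdin; stderr is folded into stdout.
    ssh -n "$m" "$@" 2>&1 |
      while IFS= read -r line || [ -n "$line" ]; do
        printf "%-${width}s %s\n" "$m:" "$line"
      done
  done
)
```

Something like `oneach "apps0 testapps" uptime` would then print one prefixed line per machine, one machine after the other, with no concurrency.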

Finally, I've found it useful to have another script that I call replicate. Replicate uses rsync to copy one or more files to destination machines or machine classes (it can also use scp for some obscure cases). replicate is handy for little things like pushing changes to dotfiles or scripts out to all of the machines where I have copies of them.
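A rough sketch of an rsync-based push in this spirit is below. The interface and the choice of rsync options are assumptions, not a description of the real replicate; in particular it assumes the files are named relative to the current directory and should land at the same relative paths under the remote home directory.

```shell
# Hypothetical interface: replicate "machine1 machine2 ..." file [file...]
replicate() (
  machines=$1; shift
  for m in $machines; do
    # -a preserves permissions and times; -R (--relative) recreates each
    # path as given under the destination, here the remote home directory.
    rsync -aR "$@" "$m:" || echo "replicate: rsync to $m failed" >&2
  done
)
```

Run from the home directory, `replicate "apps0 apps1" .profile bin/somescript` would push both files to the same relative locations on each machine.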

As a side note, machines has become a part of my dmenu environment. I use its list of machines as one of the inputs to dmenu's autocompletion (both for normal logins and for special '@<machine>' logins as root), which makes it really quick and convenient to log into most of our machines (this large list of machines is part of the things I hide from dmenu's initial display of completion in a little UI tweak that turned out to be quite important for me).

Note that I don't necessarily suggest that you adopt my approach for running commands on your machines, which is one reason I'm not currently planning to put these scripts up in public. There are a lot of ways to solve this particular problem, many of them better and more scalable than what I have. I just think that you should get something better than manual for loops (which is what I was doing before I gave in and wrote machines, oneach, and so on).


Comments on this page:

By Lev at 2014-08-05 02:33:59:

Have you looked at pdsh? Great minds think alike - you pretty much replicated it ;-)

By Ewen McNeill at 2014-08-05 06:44:37:

FTR, ansible implements basically that functionality (including using ssh as its transport), and (AFAICT) will also dispatch jobs to machines in parallel. It uses a Windows-style .ini file to define host groups, which is... not the worst format they could have chosen. IIRC the only thing you might have to replicate by hand is the "this group except these".

You can, if you want, build "machine should be like this" style recipes (they call them "playbooks") on top of that. But IME it's almost more useful as an ad-hoc tool without the playbooks; I found it a bit unwieldy for building non-trivial "machine should be like this" recipes, and ssh as a transport is rather slow, even reusing the same connection channel.

That said I do agree, the days of manual "for" loops should be long gone.

Ewen

By opk at 2014-08-05 09:59:27:

With a half-dozen or so machines, my favourite way to do this is with tmux's synchronize-panes feature. This makes it easier to see the output on each machine and react appropriately to any errors. For any more machines, I use dsh which is a bit like your script.

I also use one list of machines to generate configuration for dmenu, dsh and some other things. I recently redid it from shell to ruby.

By dozzie at 2014-08-05 11:30:11:

@cks:

People with large fleets of machines generally never run commands on them by hand but have some sort of automated fleet management system based around Puppet, Chef, CFEngine, Salt, Ansible, or the like.

Not quite, at least regarding the first three. CFEngine and descendants are mainly intended for keeping configuration as intended, but there are cases when you want to run some command that is not a part of typical, scheduled maintenance. You want to run the command synchronously or semi-synchronously.

It's a different operation mode and serves a different purpose.

By Nobody at 2014-08-05 14:47:23:

Seconding pdsh.

SIMULTANEOUSLY (not waiting for each one to complete) run commands over SSH on hundreds and hundreds of systems.

Scripting for the intermingled, line-by-line output takes a little getting used to, but works great.

Written on 05 August 2014.

Last modified: Tue Aug 5 00:00:04 2014