My view of the difference between 'pets' and 'cattle'

March 2, 2015

A few weeks ago I wrote about how all of our important machines are pets. When I did that I did not strongly define how I view the difference between pets and cattle, partly because I thought it was obvious. Subsequent commentary in various places showed me that I was wrong about this, so now I'm going to nail things down.

To me the core distinction is not in whether you hand-build machines or have them automatically configured. Obviously when you have a large herd of cattle you cannot hand-build them, but equally obviously the current best practice is to use automated setups even for one-off machines and in small environments.

Instead the real distinction is how much you care about each individual machine. In the cattle approach, any individual machine is more or less expendable. Does it have problems? Your default answer is to shoot it and start a new one (which your build automation and scaling systems should make easy). In the pet approach each individual machine is precious; if it has problems you attempt to nurse it back to health, just as you would with a loved pet, and building a new one is only a last resort even if your automation means that you can do this rapidly.

If you don't have build automation and so on, replacing any machine is a time-consuming thing, so you wind up with pets by default. But even if you do have fast automated builds, you can still have pets because of things like machines having local state of some sort. Sure, you have backups and so on of that state, but you resort to hand care because restoring a machine to full service takes longer than a plain rebuild that just gets the software back up.

(This view of pets versus cattle is supported by, eg, the discussion here. The author of that email clearly sees the distinction not in how machines are created but in significant part in how machines with problems are treated. If machines are expendable, you have cattle.)

It's my feeling that there are any number of situations where you will naturally wind up with a pet model unless you're operating at a very big scale, but that's another entry.


Comments on this page:

A useful way I've found to look at this is to refer to this article, specifically Figure 4 further down the page, as popularized in TPoSaNA:

https://www.usenix.org/legacy/publications/library/proceedings/lisa97/full_papers/20.evard/20_html/main.html

Every arrow on the diagram defaults to being manual in a "Pets" scenario.

Depending on the configuration management system, some of those arrows become automatic. For example, the "Initialize" step, and to some extent the "Update" and "Rebuild" steps, frequently become automated.

What you're basically saying here is that, instead of doing the "Entropy/Debug" cycle as you would on "Pets", on "Cattle" you just jump straight to "Retire" or "Rebuild".

The problem with this is that it makes it easy to avoid doing anything but the most basic failure analysis, so for any complex issue you don't know why it failed. There's a balancing act here: how much is "knowing what went wrong" worth in these situations?

"building a new one is only a last resort even if your automation means that you can do this rapidly."

I suppose my question then is "why would you spend the time nursing the pet back to health if you have the means to rebuild rapidly?" (I'm assuming that rapid rebuilding means that the appropriate data is preserved.)

There's the "figure out what went wrong so you can prevent it from happening again" answer that a previous comment alluded to. Perhaps you include that in your description of "nursing", but if not, it seems like any effort expended on restoring a machine to service after you know what went wrong (or even before if you can analyze the failure offline) is wasted.

By cks at 2015-03-02 13:53:23:

It's an issue of the (mean) time to repair, and I think there are two aspects. First, if the TTR of a from-scratch rebuild is non-trivial, it's worth at least some work to look at the machine. If a rebuild takes an hour and you can diagnose and fix in half an hour, you're clearly ahead by looking. Second, I think that this fundamentally changes your calculations. In a cattle environment with fast spinup, 'kill and restart' is probably your frontline problem resolution and you only think about looking at the machine if this doesn't solve things. In a slow(er) rebuild environment, 'kill and restart' is not going to be your first move because it has non-trivial costs and, as always, it might not actually solve the problem. That the rebuild is not a small thing makes it a relatively late resort instead of a first resort.
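The trade-off above can be sketched as a simple expected-time comparison. This is only an illustration with hypothetical numbers and a made-up helper function, not anything from the post:

```python
# Sketch: choosing between "diagnose and fix" and "kill and rebuild"
# based on time to repair. All numbers here are hypothetical.

def expected_repair_minutes(diagnose_minutes, fix_probability, rebuild_minutes):
    """Expected cost of diagnosing first: you always spend the diagnosis
    time, and fall back to a rebuild when the attempted fix fails."""
    return diagnose_minutes + (1 - fix_probability) * rebuild_minutes

# Cattle environment: rebuilds are fast, so diagnosing first rarely pays.
fast_rebuild = 5
print(expected_repair_minutes(30, 0.8, fast_rebuild))  # ~31 minutes vs a flat 5 to rebuild

# Pet environment: rebuilds are slow, so a likely fix comes out ahead.
slow_rebuild = 60
print(expected_repair_minutes(30, 0.8, slow_rebuild))  # ~42 minutes vs a flat 60 to rebuild
```

The point of the sketch is only that the same diagnosis effort flips from a loss to a win as the rebuild time grows, which is why fast spinup pushes 'kill and restart' to the front of the queue.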

(Longer rebuilds by themselves don't make things into pets. I used to work in a setting with about a hundred lab Linux workstations and even though our automated install took some time to run it was our first troubleshooting step because it took almost no work for us to do. Those machines were very much cattle although their rebuild was quite slow by modern standards.)


