There's a spectrum of 'pets versus cattle' in servers

April 12, 2016

One of the memes in modern operations is that of pets versus cattle. I've written about this before, but at the time I accepted the more or less binary pets versus cattle split that's usually put forward. I've now come to feel that there is a spectrum running between pets and cattle, so today I'm going to write down four spots I see on that line.

Total pets (classical pets) are artisanal servers, each one created and maintained completely by hand. You're lucky if there's any real documentation on what a machine's setup is supposed to be; there probably isn't. Losing a server probably means restoring configuration files from backups in order to get it back into service. This is the traditional level that a small or disorganized place operates at (or at least is stereotyped to operate at).

One step along the line, you have a central, global store of all configuration information and build instructions; for instance, you keep the master copy of every changed configuration file in that central place, with a rule that you always modify the master version and then copy it to a server. However, you still build and maintain machines almost entirely by hand (although following your build documents and so on). You can recreate servers easily, but you troubleshoot them instead of reinstalling them, and users will definitely notice if one suddenly vanishes. Servers may also have local state that has to be backed up and restored.

(This is how we build and maintain machines.)
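As a rough illustration of the 'modify the master version and copy it to a server' rule, here is a minimal Python sketch. Everything in it is made up for the example: the function name push_master_files is hypothetical, and the 'server' is just a local directory standing in for a real machine's filesystem (in real life you would be copying over the network by hand or with a tool).

```python
import shutil
from pathlib import Path


def push_master_files(master_root: Path, server_root: Path) -> list:
    """Copy every file under master_root to the same relative path under
    server_root, and return the relative paths that were actually changed.

    Assumption for this sketch: server_root is a local directory that
    stands in for the server's filesystem.
    """
    changed = []
    for master in sorted(master_root.rglob("*")):
        if not master.is_file():
            continue
        rel = master.relative_to(master_root)
        target = server_root / rel
        # Leave files alone if they already match the master copy.
        if target.is_file() and target.read_bytes() == master.read_bytes():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(master, target)
        changed.append(rel)
    return changed
```

The point of the sketch is the direction of flow: the central store is the only thing that gets edited, and the copy on the server is only ever overwritten from it. Running it a second time with nothing changed copies nothing.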

Moving one more step towards cattle is when you have fully automated configuration management and more or less fully automated builds, but you still care about specific servers. You need to keep server <X> up, diagnose it when it has problems, and so on; you cannot simply deal with problems by 'terminate it and spin up another', and people will definitely notice if a given server goes down. One sign of this is that your servers have names and people know them.
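To make this spot a bit more concrete, here is a toy Python sketch of what automated configuration management does for a named server: it repairs the machine in place instead of replacing it. This is only a model under my own assumptions; the converge function and the dictionary-of-settings representation are hypothetical and are not taken from any real configuration management system.

```python
def converge(desired: dict, actual: dict) -> list:
    """Bring the `actual` settings of one named server in line with
    `desired`, modifying `actual` in place. Returns the sorted list of
    setting names that had to be changed or removed.
    """
    changed = []
    # Fix settings that are missing or have drifted from the desired value.
    for key, value in desired.items():
        if actual.get(key) != value:
            actual[key] = value
            changed.append(key)
    # Remove settings that the desired state does not mention at all.
    for key in list(actual):
        if key not in desired:
            del actual[key]
            changed.append(key)
    return sorted(changed)
```

The server keeps its identity throughout; what gets automated is the repair, not the replacement. A second run on an already converged server reports no changes.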

Total cattle is achieved when essentially all servers can be 'fixed' by simply terminating them and spinning up another copy made from scratch, and your users won't notice this. Terminate and restart is your default troubleshooting method and you may even make servers immutable once spun up (so maintaining a server is actually 'terminate this instance and spin up an updated instance'). Certainly maintenance is automated. You never touch individual servers except in truly exceptional situations.

(Total cattle is kind of an exaggeration. Even very cattle-ish places seem to accept that there are situations where you want to troubleshoot weird problems instead of trying to assume that 'terminate and restart' can be used to fix everything.)

Written on 12 April 2016.