All of our important machines are pets and special snowflakes

February 5, 2015

One of the general devops mantras that I've seen go around is the pets versus cattle metaphor for servers: pets are lovingly curated servers that you care about individually, while cattle are a mass herd where you don't really care about any single member. My perception is that a lot of current best practices are focused on dealing with cattle and converting pets into cattle. Unfortunately, this leaves me feeling relatively detached from these practices, because essentially all of our important machines are pets and are always going to stay that way.

This is not particularly because of how we manage them or even how we think of them. Instead it is because in our environment, people directly use specific individual machines on a continuous basis. When you log into comps3 and run your big compute job on it, you care very much if it suddenly shuts down on you. We can't get around this by creating, say, a Hadoop cluster, because a large part of our job is specifically providing general purpose computing to a population of people who will use our machines in unpredictable ways. We have no mandate to squeeze people down to using only services that we can implement in some generic, distributed way (and any attempt to move in that direction would see a violent counter-reaction from people).

We do have a few services that could be generic, such as IMAP. However in practice our usage is sufficiently low that implementing these services as true cattle is vast overkill and would add significant overhead to how we operate.

(Someday this may be different. I can imagine a world where some container and hosting system has become the dominant way that software is packaged and consumed; in that world we'd have an IMAP server container that we'd drop into a generic physical server infrastructure, and we could probably also easily have a load balancer or something that distributed sessions to multiple IMAP server containers. But we're not anywhere near that level today.)

Similarly, backend services such as our fileservers are in effect all pets. It matters very much whether or not fileserver <X> is up and running happily, because that fileserver is the only source of a certain amount of our files. I'm not convinced it's possible to work around this while providing POSIX compatible filesystems with acceptable performance, but even if it is, it's beyond our budget to build the amount of redundancy necessary to make things into true cattle where the failure of any single machine would be a 'no big deal' thing.

(This leads into larger thoughts but that's something for another entry.)

Comments on this page:

The pets-versus-cattle discussion often focuses on N identical instances (and particularly web applications, which are, as you pointed out, much different from a POSIX file server), but that focus misses the point. Even your special snowflakes are probably more alike than they are different. Cattle-izing the common parts allows you to spend your effort on the parts that truly do need to be pets.

As an example, a former colleague told me about the environment he came into when he took a new job. This group maintained ~125 Unix and Linux systems with absolutely no configuration management. This led to tiny differences in things like the way machines were listed in /etc/hosts, to the point where he wrote wrapper scripts to detect which machine he was on so that various commands would do the right thing.

By cks at 2015-02-05 14:36:40:

I may be biased because of how we work, but I see the pets versus cattle distinction in terms of what you do when machines have problems or otherwise need to be taken out of service. We capture all important system state in various off-machine ways and have as much common setup as is practical, but that still doesn't mean we can do things like disruption-free rolling upgrades or deal with machine problems by 'destroy and recreate'. So on a deep level I see big differences between how we manage servers and how the common best practices in DevOps are moving.

(Many machines also have local state for things like user crontabs, which we allow users to create and which we have to back up, restore, copy around on upgrades, and so on. Local state is another enemy of the 'just destroy and recreate' approach to problem solving.)

If you can easily isolate and enumerate the exact places you have user state, it's not necessarily breaking the "cattle" mindset. You can (e.g.) bind-mount them all from a separate (mutable) location, and have all of your "server state" (binaries and config that you can destroy and recreate at will) separate.

But nonetheless, there are cases where "cattle" isn't (yet) the right mindset. It sounds like you might be in one; strictly splitting user data from admin data can still be a useful step to take.
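As a rough illustration of the bind-mount idea, here is a minimal Python sketch that turns a list of user-state locations into fstab-style bind-mount entries. Everything specific in it is an assumption made for the example, not anything from this environment: the paths in USER_STATE_PATHS, the /persist location, and the bind_mount_entries helper are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical list of places where users can create state on a machine.
# The crontab path is the Debian-style location; adjust for your system.
USER_STATE_PATHS = [
    "/var/spool/cron/crontabs",
    "/home",
]

# Hypothetical mutable location that is kept across a machine rebuild.
PERSISTENT_ROOT = "/persist"


def bind_mount_entries(paths, persistent_root=PERSISTENT_ROOT):
    """Return fstab-style lines that bind-mount each path from persistent_root."""
    root = PurePosixPath(persistent_root)
    entries = []
    for path in paths:
        target = PurePosixPath(path)
        # Mirror the original path under the persistent root,
        # for example /home becomes /persist/home.
        source = root / target.relative_to("/")
        entries.append(f"{source} {target} none bind 0 0")
    return entries


if __name__ == "__main__":
    for line in bind_mount_entries(USER_STATE_PATHS):
        print(line)
```

The sketch only generates text; it mounts nothing. Its point is that once the user-state locations are written down in one list, the rest of the machine can be treated as disposable.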

By James (trs80) at 2015-02-06 23:22:55:

This is also another answer to CanWeUseCloud - the cloud is best used for cattle, not pets. Amazon EC2 reserves the right to kill your VM at any time, and you are expected to architect your application to deal with that. Which is definitely not what you want for pets. See also AWS Tips I Wish I'd Known Before I Started.

Written on 05 February 2015.

Last modified: Thu Feb 5 01:10:51 2015