All of our important machines are pets and special snowflakes

February 5, 2015

One of the general devops mantras that I've seen go around is the pets versus cattle metaphor for servers: pets are lovingly curated servers that you care about individually, while cattle are a mass herd where you don't really care about any single member. My perception is that a lot of current best practices are focused on dealing with cattle and converting pets into cattle. Unfortunately, this leaves me feeling relatively detached from these practices, because essentially all of our important machines are pets and are always going to stay that way.

This is not particularly because of how we manage them or even how we think of them. Instead it is because in our environment, people directly use specific individual machines on a continuous basis. When you log into comps3 and run your big compute job on it, you care very much if it suddenly shuts down on you. We can't get around this by creating, say, a Hadoop cluster, because a large part of our job is specifically providing general purpose computing to a population of people who will use our machines in unpredictable ways. We have no mandate to squeeze people down to using only services that we can implement in some generic, distributed way (and any attempt to move in that direction would see a violent counter-reaction from people).

We do have a few services that could be generic, such as IMAP. However, in practice our usage is sufficiently low that implementing these services as true cattle is vast overkill and would add significant overhead to how we operate.

(Someday this may be different. I can imagine a world where some container and hosting system has become the dominant way that software is packaged and consumed; in that world we'd have an IMAP server container that we'd drop into a generic physical server infrastructure, and we could probably also easily have a load balancer or something that distributed sessions to multiple IMAP server containers. But we're not anywhere near that level today.)

Similarly, backend services such as our fileservers are in effect all pets. It matters very much whether or not fileserver <X> is up and running happily, because that fileserver is the only source of a certain amount of our files. I'm not convinced it's possible to work around this while providing POSIX compatible filesystems with acceptable performance, but even if it is, it's beyond our budget to build the amount of redundancy necessary to make things into true cattle where the failure of any single machine would be a 'no big deal' thing.
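As a rough illustration of the difference, here is a minimal sketch of which filesystems stay reachable when one fileserver goes down. Everything in it is made up for the example: the host names, the filesystem paths, the `reachable` helper, and the idea that the 'cattle' layout keeps two copies of everything. It is not a description of our actual setup.

```python
# Hypothetical sketch: which filesystems stay reachable when a server fails.
# All host and filesystem names here are invented for illustration.

def reachable(placement, down):
    """Return the filesystems that still have at least one live server."""
    return {fs for fs, servers in placement.items() if set(servers) - set(down)}

# Pets: each filesystem lives on exactly one fileserver.
pets = {"/h/100": ["fs1"], "/h/101": ["fs2"], "/h/102": ["fs3"]}

# Cattle: each filesystem has copies on two servers. This is an assumption;
# it needs extra hardware and some way of keeping the copies in sync.
cattle = {
    "/h/100": ["fs1", "fs2"],
    "/h/101": ["fs2", "fs3"],
    "/h/102": ["fs3", "fs1"],
}

print(sorted(reachable(pets, down=["fs2"])))    # /h/101 is gone
print(sorted(reachable(cattle, down=["fs2"])))  # everything is still there
```

The second layout is where the cost comes from: every filesystem needs a second copy on other hardware, which is the redundancy that is beyond our budget.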

(This leads into larger thoughts but that's something for another entry.)

Written on 05 February 2015.

Last modified: Thu Feb 5 01:10:51 2015