The 'cattle' model for servers is only a good fit in certain situations

March 30, 2015

To start with, let me define my terms. When I talk about 'cattle' servers, my primary definition is expendable servers that you don't need to care about when something goes wrong. A server is cattle if you can terminate it, start a new one, and be fine. A server is a pet if you actually care about that specific server staying alive.

My contention is that to have cattle servers, you either need to have a certain service delivery model or be prepared to spend a lot of money on redundancy and (HA) failover. This follows from the obvious consequence of the cattle model: in order to have a cattle model at all, people can't care what specific server they are currently getting service from. The most extreme example of not having this is when people ssh in to login or compute servers and run random commands on them; in such an environment, people care very much if their specific server goes down all of a sudden.

One way to get this server independence is to have services that can be supplied generically. For example, web pages can be delivered this way (given load balancers and so on), and it's often easy to do so. A lot of work has gone into creating backend architectures that can also be used this way (often in pursuit of horizontal scalability), with multiple redundant database servers (for example) and clients that distribute DB lookups across a cluster. Large scale environments are often driven to this approach because they have no choice.
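To make 'clients that distribute DB lookups around a cluster' a bit more concrete, here is a minimal sketch of the client side of a generic service. Everything in it is an assumption for illustration: the `lookup` function, the `AllBackendsFailed` exception, and the idea that each backend is just a callable that raises `ConnectionError` when its server has gone away. It is not any particular database client's API.

```python
import random


class AllBackendsFailed(Exception):
    """Raised when no backend in the pool could answer (hypothetical name)."""


def lookup(key, backends, rng=random):
    """Ask interchangeable backends for `key`, in random order.

    The caller never learns which backend answered, which is the whole
    point: any one of them can be terminated without the caller caring.
    """
    candidates = list(backends)
    rng.shuffle(candidates)  # spread lookups around the pool
    failures = 0
    for backend in candidates:
        try:
            return backend(key)
        except ConnectionError:
            failures += 1  # this server is gone; try another one
    raise AllBackendsFailed(f"no backend answered for {key!r} ({failures} failures)")


# Illustrative stand-ins for real servers (assumptions, not a real API).
def healthy(key):
    return f"value-for-{key}"


def terminated(key):
    raise ConnectionError("server was terminated")


if __name__ == "__main__":
    print(lookup("user:42", [terminated, healthy, terminated]))
```

The sketch only works because the backends are genuinely interchangeable. Nothing like it helps someone who has ssh'd in to one specific login server.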

The other way to get server independence is to take what would normally be a server-dependent thing, such as NFS fileservice, and apply enough magic (via redundancy, failover, front end load balancer distribution, and so on) to turn it into something that can be supplied generically from multiple machines. In the case of NFS fileservers, instead of having a single NFS server you would create an environment with a SAN, multiple fileservers, virtual IP addresses, and transparent failover (possibly fast enough to count as 'high availability'). Sometimes this can be done genuinely transparently; sometimes this requires clients to be willing to reconnect and resume work when their existing connection is terminated (IMAP clients will generally do this, for example, so you can run them through a load balancer to a cluster of IMAP servers with shared backend storage).
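The 'reconnect and resume' behaviour can be sketched roughly as follows. This is a toy model under stated assumptions, not how any real IMAP client is written: `FlakyConnection` is a hypothetical stand-in for a connection made through a load balancer to some server in the cluster, the shared `items` list plays the role of the shared backend storage, and `fetch_all` and `max_reconnects` are names invented for the example.

```python
class FlakyConnection:
    """Hypothetical connection that is terminated after `fail_after` fetches."""

    def __init__(self, items, fail_after):
        self.items = items          # shared backend storage, same for every server
        self.remaining = fail_after

    def fetch(self, index):
        if self.remaining == 0:
            raise ConnectionError("backend went away")
        self.remaining -= 1
        return self.items[index]


def fetch_all(connect, count, max_reconnects=5):
    """Fetch items 0..count-1, reconnecting and resuming on dropped connections."""
    results = []
    conn = connect()
    reconnects = 0
    while len(results) < count:
        try:
            # len(results) is the resume point: we never re-fetch finished work.
            results.append(conn.fetch(len(results)))
        except ConnectionError:
            if reconnects == max_reconnects:
                raise  # give up; something is wrong beyond one dead server
            reconnects += 1
            conn = connect()  # whichever server we land on this time is fine
    return results


if __name__ == "__main__":
    mailbox = ["msg0", "msg1", "msg2", "msg3", "msg4"]
    print(fetch_all(lambda: FlakyConnection(mailbox, fail_after=2), len(mailbox)))
```

The client gets to be this casual only because its state (how far it has got) lives on the client side and the data lives in shared storage. A service that keeps per-connection state on one specific server does not get this for free.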

(These categories somewhat overlap, of course. You usually get generic services by doing some amount of magic work to what initially were server-dependent things.)

If you only have to supply generic services or you have the money to turn server-dependent services into generic ones, the cattle model is a good fit. But if you don't, if you have less money and few or no generic services, then the cattle model is never going to fit your operations particularly well. You may well have an automated server setup and management system, but when one fileserver or login server starts being flaky the answer is probably not going to be 'terminate it and start a new instance'. In this case, you're probably going to want to invest much more in diagnostics and so on than someone in the cattle world.

(This 'no generic services' situation is pretty much our situation.)


Last modified: Mon Mar 30 01:53:04 2015