It's useful to think about a 'ground up' recovery of your environment

November 15, 2022

One of the things that many system administrators don't like to think about is a total loss scenario for their entire environment. For people who run physical hardware, this is what you'd do if your machine room or data center had a fire; for people with virtual hardware, it's what you'd do if your entire cloud (short of your 'offsite backups') was wiped or deleted. Often we file this under Disaster Recovery and then punt on it, because a real DR plan is both a lot of detailed work and also something that often doesn't survive contact with reality unless you really, really care about DR (care enough to budget for it and test your plans and so on). However, I'd like to advocate for the exercise of thinking through what a ground up recovery would take in your environment.

(A ground up recovery is similar to but not the same as a 'cold start'. In a cold start, you have all of the systems ready to go but none of them are running. In a ground up recovery you start with nothing except backups and have to build things up layer by layer.)

To put it simply, thinking through this scenario and perhaps testing bits of it is a great way to discover and sort out your system and service dependency graph. What you work out doesn't have to be completely correct in order for this to be useful, and you may well discover surprises in the process of charting things out. You may even want to fix some of those surprises, for example to break dependency cycles.

(Not all dependency cycles need to be broken just because of ground up recovery. Sometimes the right answer is (or would be) a temporary hack just to get your initial environment off the ground, and then you'll re-establish the dependency cycle for the full production environment.)
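As a rough illustration of what charting this out can look like, here is a minimal Python sketch that takes a hand-written dependency map, reports a dependency cycle if there is one, and otherwise prints the layers you could rebuild in order. The service names and their dependencies are invented for the example, and the helper names (recovery_layers, find_cycle, without_edges) are hypothetical; only graphlib's TopologicalSorter and CycleError come from the standard library (Python 3.9 or later). The 'bootstrap hack' is modeled as temporarily ignoring one dependency edge, which is an assumption about how you might represent it, not a description of how we did it.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical environment: each service maps to what must exist before it.
DEPENDS_ON = {
    "backup-restore": set(),
    "dns": {"vm-host"},        # DNS servers run as virtual machines
    "vm-host": {"dns"},        # ...but the VM host looks things up in DNS
    "fileserver": {"dns"},
    "login": {"fileserver", "dns"},
}

# Assumed temporary hack: bring up the VM host without DNS (say, with
# hard-coded IP addresses), then restore the dependency afterward.
BOOTSTRAP_IGNORE = {("vm-host", "dns")}


def recovery_layers(depends_on):
    """Return groups of services that can be rebuilt together, in order."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


def find_cycle(depends_on):
    """Return one dependency cycle as a list of services, or None."""
    try:
        recovery_layers(depends_on)
    except CycleError as err:
        # graphlib puts the cycle in args[1]; first and last node match.
        return err.args[1]
    return None


def without_edges(depends_on, ignored):
    """Copy the graph, dropping the (service, dependency) pairs in ignored."""
    return {
        svc: {dep for dep in deps if (svc, dep) not in ignored}
        for svc, deps in depends_on.items()
    }


if __name__ == "__main__":
    print("cycle in production graph:", find_cycle(DEPENDS_ON))
    bootstrap = without_edges(DEPENDS_ON, BOOTSTRAP_IGNORE)
    for number, layer in enumerate(recovery_layers(bootstrap), start=1):
        print(f"layer {number}: {', '.join(layer)}")
```

Even a toy map like this tends to be useful mostly for the arguments it starts about whether an edge is real. The production graph keeps its cycle; only the bootstrap copy has it removed.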

Of course, you can only do this if your core environment is small enough for you to chart out and doesn't change so fast that working out the state of things today is useless because it'll all be different in a week. This describes our modest environment but by no means all places. If you're not sure how bad your environment is here because you've never looked at it that way, well, trying to think about a ground up recovery will give you the chance to find out. One way or another you'll know more afterward.

We had to go through this exercise a while back for reasons beyond the scope of this entry. We grumbled about it a bit at the time, but I think it wound up being valuable (especially because we did do a ground up recovery test run, which showed that we more or less had the general approach right and thus confirmed our understanding of our systems).

PS: If you determine that you can't do a ground up recovery, maybe this means you've identified some critical components that absolutely can't be lost and so should be specially protected, replicated, or something similar.

Written on 15 November 2022.

Last modified: Tue Nov 15 21:22:25 2022