A thought on giving custom redundant storage systems some history
Suppose that you're building some custom storage backend that is simply too big to be backed up, so you have only redundancy; this is probably common if you're building a cloud-style environment or are otherwise dealing with a huge volume of data. This leaves you with the redundancy history problem, where you're protected against hardware failures but any mistakes are 'instantly' replicated to the redundant copies.
Suppose that you want to do better than this; you somehow want to give your redundant storage system some history without going all the way to backups.
The approach that occurs to me is to base your storage system around a 'copy on write' model for updates; instead of updating in place, you write new versions and change references (which seems like it would be handy for a distributed system anyway). Then instead of immediately removing unreferenced objects, you let them sit around for a certain amount of time (hours, days, or weeks, depending on how much extra storage you can afford and what your update volume is).
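To make the idea concrete, here is a minimal in-memory sketch in Python. Everything in it is an assumption made for illustration: the `CowStore` class, its method names, the use of content hashes as object IDs, and the single `grace_seconds` knob are all hypothetical, and a real distributed store would look quite different.

```python
import hashlib
import time


class CowStore:
    """Toy copy-on-write store with delayed garbage collection (hypothetical)."""

    def __init__(self, grace_seconds, clock=time.time):
        self.grace_seconds = grace_seconds
        self.clock = clock         # injectable so the grace period can be tested
        self.objects = {}          # object id -> bytes, never modified in place
        self.refs = {}             # name -> object id of the current version
        self.unreferenced_at = {}  # object id -> time it lost its last reference

    def put(self, name, data):
        """Write a new version of `name` and repoint the reference at it."""
        oid = hashlib.sha256(data).hexdigest()
        self.objects[oid] = data
        self.unreferenced_at.pop(oid, None)  # this content is referenced again
        old = self.refs.get(name)
        self.refs[name] = oid
        # The old version is not deleted, only marked with the time it
        # became unreferenced (and only if no other name still uses it).
        if old is not None and old != oid and old not in self.refs.values():
            self.unreferenced_at[old] = self.clock()
        return oid

    def get(self, name):
        return self.objects[self.refs[name]]

    def collect(self):
        """Remove only objects unreferenced for at least the grace period."""
        cutoff = self.clock() - self.grace_seconds
        expired = [oid for oid, t in self.unreferenced_at.items() if t <= cutoff]
        for oid in expired:
            del self.objects[oid]
            del self.unreferenced_at[oid]
        return expired
```

The important part is that `put()` never destroys anything; only `collect()` does, and only for objects that have been unreferenced for the whole grace period. How long that period can be is exactly the storage versus update volume tradeoff.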
What this gives you is time. If you make a mistake, you have time to panic, go digging in your datastore, and pull out the now unreferenced objects that correspond to how things used to be. Building tools to help with this ahead of time is probably recommended.
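As a rough idea of what such a tool might look like, here is a small Python sketch. It assumes something the approach itself doesn't require: that the store also keeps a log of reference changes, here a hypothetical `ref_log` mapping each name to a time-sorted list of `(timestamp, object_id)` pairs. The function names and data layout are made up for illustration.

```python
import bisect


def version_at(ref_log, name, when):
    """Return the object id that `name` pointed at as of time `when`, or None.

    ref_log is a hypothetical dict: name -> list of (timestamp, object_id),
    sorted by timestamp, with one entry per reference change.
    """
    entries = ref_log.get(name, [])
    times = [t for t, _ in entries]
    # Find the last reference change made at or before `when`.
    i = bisect.bisect_right(times, when)
    return entries[i - 1][1] if i else None


def recover(ref_log, objects, name, when):
    """Fetch the data for `name` as of `when`, if it hasn't been removed yet."""
    oid = version_at(ref_log, name, when)
    if oid is None:
        raise KeyError(f"{name!r} did not exist at time {when}")
    if oid not in objects:
        raise LookupError(f"object {oid} has already been garbage collected")
    return objects[oid]
```

With a log like this, recovery can ask for any point in time within the retention window, not just a few pre-chosen ones. Without such a log you would be left digging through unreferenced objects by hand, which is the 'worse tools' end of the tradeoff.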
I think that this has two advantages over an actual snapshot feature. First, it has lower overhead in exchange for worse tools to access the 'snapshot' (which is a good tradeoff if you expect to make mistakes only very, very rarely). Second, you aren't restricted to looking only at the points in time where you happened to make snapshots, as you effectively make continuous snapshots.
Written on 31 May 2009.