Wandering Thoughts archives

2009-05-27

Hosted servers, cloud computing, and backups

For all that we roll our eyes when people have only online backups or redundancy and suffer data loss as a result, I don't think that doing things the 'right' way is necessarily as easy as we make it sound, or even feasible at all in some situations.

Suppose that you are a modern, efficient Internet-based business, and so you are running your servers the smart way; instead of building your own machine room in a corner of your office space and bringing in expensive network connections, you've just got servers in someone's hosting center. Further suppose that you have a non-trivial amount of data.

So, how are you supposed to get offline backups? To do them, you need either frequent physical access to your hosting center in order to keep going in to swap disks (or tapes), or you need high bandwidth to your office in order to copy the data over to an office machine that you have convenient physical access to. Both of these options can be expensive, sometimes substantially so, which may mean that online backups are your only feasible choice.

(You can't really automate offline backups; if the backup media can be brought back online by remote control, it is not really 'offline'.)

And if you have enough data, backups probably become completely infeasible. For example, I doubt that the various cloud storage operators like Google (for Google Mail et al), Amazon S3, and so on even have online backups of the data in the cloud, much less offline ones; I suspect that they are dealing with so much data that all they have is redundant copies spread across datacenters.

(The advantage of live redundant copies is that they are immediately and directly useful, which makes them more affordable than pure backups.)

HostedBackups written at 00:00:16

2009-05-25

Backups versus redundancy

By now, everyone knows that redundancy, for example RAIDed disks, is not the same thing as backups (or at least I hope they do). But I think it's worthwhile to talk about why, and thus the fundamental difference between the two.

(Note that there are a lot of ways of getting redundancy, far beyond mere RAID. If you are a large operation like Google or a bank, you're interested in redundancy across systems and even data centers.)

The two are quite similar in that both protect you against physical hardware failures, which are usually the major failure mode that people worry about. Yes, you can argue that redundancy commonly provides less protection against major disasters like fires, but this is not intrinsic to redundancy, just to how it's usually set up (and how backups are best set up); it's certainly possible to do things like have your RAID mirrors physically remote, or for that matter to put your backups on a shelf in the machine room. (Or in your office in the same building.)

The crucial difference is that backups provide history and redundancy does not. This means that while redundancy protects you against hardware failures, it does not help you against mistakes. To recover from a mistake, you need the ability to reach back in time to before the mistake was made; ie, you need history.

(History, or the lack of it, is thus the dividing line between whether you have a backup system or merely a system for (potentially delayed) redundancy.)

So, do you actually need history? That's a serious question. There are a lot of systems where the answer is no for various reasons; for instance, you might already have the history in some other form. Consider a web server where everything is deployed from a version control system; backing up the web server instead of just giving it however much redundancy it needs is unnecessary overhead.

(The tricky issue to worry about here is that corruption is a form of 'mistake'. But you may already be taking backups of the version control system itself.)

BackupsVsRedundancy written at 22:58:15

2009-05-24

An interesting bit of ssh and sshd behavior

We have an ssh keypair that's used to let an automated script have very limited access to a remote system. As usual, we set up a whole host of restrictions in the target account's authorized_keys; we force a specific command, we only accept the key from the host we expect it from, and we specify the whole raft of no-* options, including no-pty. The command that gets forced for this particular keypair reads various things that it needs from standard input (ie, from the script).
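For illustration, such a restricted authorized_keys entry has this general shape (the command path, host name, key material, and comment here are made-up placeholders, not our actual values):

```
command="/opt/scripts/limited-task",from="client.example.com",no-pty,no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-rsa AAAAB3NzaC1... automation-key
```

With the command= option in place, sshd ignores whatever command the client asked to run and runs the forced command instead.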

Recently, we wound up doing a plain 'ssh login@host' as part of trying to debug a problem. My expectation was that this would behave just like the normal 'ssh login@host nominal-command' (since the command was being forced on the remote end anyways). Instead, the connection stalled, apparently doing nothing; you could type at it and nothing would happen. In fact, nothing even appeared (your typing wasn't echoed).

What turned out to be happening is this: ssh doesn't notice if the remote end refuses to create a pty. Instead it carries on exactly as if it were talking to a pty, so it puts the local terminal into raw mode and then sends your untranslated input to the other end (character by character). Plain 'ssh login@host' tries to do a login session, which asks for a pty, but the remote end refused to set one up (because of no-pty) and forced the command (instead of running any sort of shell).

When this happens, you get no visible output from your typing because in pty mode ssh leaves echoing up to the remote end. You also generally get no visible reaction to what you've entered, because when you hit 'return', ssh sends the raw return (as a \r) instead of the cooked newline (\n) that the other end is looking for. So in our case, the remote command thought that we were just typing a really, really long single line of input that we hadn't finished yet.
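A tiny sketch can make this concrete (the input bytes here are invented for illustration):

```python
# What a line-oriented reader on the remote end sees when the local ssh
# is in raw mode: pressing Return sends a bare carriage return (b"\r"),
# but the reader is waiting for a newline (b"\n") that never arrives.

typed = b"status\rquit\r"  # two 'lines' typed at the stalled ssh session

# A reader that splits input on newlines finds no complete line at all:
complete_lines = typed.split(b"\n")[:-1]
print(complete_lines)  # [] -- still one long, unfinished line

# Typing Control-J sends b"\n" directly, which finally ends the line:
typed += b"\n"
complete_lines = typed.split(b"\n")[:-1]
print(complete_lines)  # [b'status\rquit\r']
```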

(Trivia: if you ever want to see if this is happening to you, type a Control-J; this sends \n directly. This is also useful to know if your terminal winds up in raw mode because a program crashed.)

SshNoPtyBehavior written at 23:08:27

2009-05-17

The crucial difference between online and offline backups

At one level, the difference between online backups and offline backups is that online backups are, well, online; you can make them and get at them without having to load tapes (or hard drives), connect your external USB hard drive, or whatever. This gives online backups two advantages: they involve little or no physical shuffling around of things, and they make for very rapid restores of data.

These advantages should not be understated. It's much easier to automate your backups and make sure that they happen all the time, even weekends and holidays and when you are insanely busy, if they don't require anyone to actually do anything physical. And making restores fast and easy keeps them from draining valuable staff time, especially if the most common restore request is for just a small amount of data.

(With large restores, most of the time can be taken up with writing data back to disk. But with small restores that write only a little bit of data to disk, almost all of the time goes to overhead instead, so reducing the overhead can make a drastic difference.)

However, the crucial difference between online backups and offline ones is that online backups can easily be destroyed, whether by accident or malice. By contrast, destroying offline backups takes actual physical work and is much harder to do by accident (although not impossible). It's thus a good idea to have at least some offline backups, just in case, even if online backups are so much easier and more convenient.

OnlineVsOfflineBackups written at 00:44:45

2009-05-10

What affects how fast you can restore backups

I was asked today if I thought that a disk-based backup system could do restores faster than a tape-based system. My best answer was a 'maybe', because it really depends on what the limiting factor is in your restores. Let's look at all of the things that have to happen in a restore:

  1. find and load the tape into the tape drive or tape library, if it's not already there.
  2. if you have a tape library, have it load the tape into the tape drive.
  3. position the tape to the right dump image.
  4. read through the dump image until you get to the data that you want to restore.
  5. transport the data over the network to the target system that you're restoring on.
  6. write the data out on the target system.

Of these activities, a disk-based backup system makes the second and third basically instantaneous and may speed up reading the dump image from the media. It can't do anything about the speed of the network or how fast the target system can write things to disk, or how long you take to find and retrieve the right media.

It's also worth noting that your backup system can make a difference in this, depending on what the limiting factors are. For example, Amanda runs the restore command on the target system, which means that step four requires transporting the entire dump image over the network to the target system.
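A back-of-the-envelope model of the steps above shows why the limiting factor matters so much. All the numbers here are invented for illustration, not measurements of any real system:

```python
# Rough restore-time model: total time is the sum of per-step costs, so
# speeding up one step only helps if that step was a big part of the total.

def restore_time(data_mb, overhead_s, read_mb_s, net_mb_s, write_mb_s):
    """Seconds to restore data_mb of data, given fixed overhead and rates."""
    return (overhead_s                # find/load media, position to the image
            + data_mb / read_mb_s     # read the data off the media
            + data_mb / net_mb_s      # move it over the network
            + data_mb / write_mb_s)   # write it out on the target system

# Small restore: the fixed overhead (load + seek) dominates, so a
# disk-based system's near-zero positioning time is a huge win.
small_tape = restore_time(10, overhead_s=120, read_mb_s=80, net_mb_s=100, write_mb_s=60)
small_disk = restore_time(10, overhead_s=1, read_mb_s=80, net_mb_s=100, write_mb_s=60)

# Large restore: network and target-disk write speed dominate, and the
# choice of backup media barely changes the total.
big_tape = restore_time(100_000, overhead_s=120, read_mb_s=80, net_mb_s=100, write_mb_s=60)
big_disk = restore_time(100_000, overhead_s=1, read_mb_s=80, net_mb_s=100, write_mb_s=60)
```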

And speaking of media read speeds, one advantage disks have is that it's less tricky to get good read speeds from them. Because of mechanical issues, tape drives often have a minimum read speed that's necessary to get good performance; if you drop below that speed, read performance drops significantly because the tape drive has to stop and restart all the time (known as 'shoe-shining').

BackupRestoreSpeeds written at 23:43:32

2009-05-09

Our disk-based backup system

Our solution to the tape backup cost problem has been to move to a disk-based backup system. The most important enabler for this is that when we built the latest version of our iSCSI backends, we discovered that we could get a pretty nice 12-bay ESATA-based enclosure reasonably cheaply, one that even has 'tray-less' drive bays that can be hotswapped.

(Tray-less drive bays are drive bays where you simply slide the bare drives in and out; you do not need to mount them in some sort of a carrier or a tray beforehand.)

From the right angle, a swappable disk is essentially the same as a tape, which makes a 12-bay enclosure essentially the same as a tape library (except much cheaper). 1 TB SATA disks are probably more expensive on a cost per gigabyte basis than LTO tapes, but they're not so much more expensive as to make this infeasible (at the medium scale that we operate at). And with only a bit of persuasion, Amanda is happy to treat filesystems on disks as tapes.

(In some ways it is too happy to treat them as tapes; it turns out that you have to give Amanda a staging disk in order to get parallel dumps, even when the dumps are going to disk anyways.)

So that is our current disk-based backup system. Each backup server is essentially an iSCSI backend that is running different software: a 1U server, with one system disk and one Amanda staging disk, connected to a 12-bay ESATA enclosure that's loaded with 1 TB SATA drives. Each of the 1 TB drives is divided up into three logical 'tapes' (to reduce the amount of space wasted when we have a slow dump day), and Amanda cycles through them exactly as if they were real physical tapes. Periodically we exhaust the 'tapes' loaded into the enclosure, so we pull the oldest hard drives out, set them aside, and stick more in.
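As a sketch of the Amanda side, virtual tapes on disk look roughly like this in amanda.conf (the paths, sizes, and counts here are invented for illustration, and the exact option syntax varies between Amanda versions):

```
# amanda.conf fragment: treat directories on a disk as 'tapes'
tpchanger "chg-disk:/amanda/vtapes"  # directory containing slotN/ subdirs
tapetype  HARDDISK
tapecycle 36                         # e.g. 12 drives x 3 'tapes' each

define tapetype HARDDISK {
    length 300 gbytes                # about a third of a 1 TB drive
}

# Amanda still needs a staging (holding) disk to run dumps in parallel,
# even though the 'tapes' are themselves on disk.
holdingdisk hd1 {
    directory "/amanda/holding"
}
```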

(We found a source for nice plastic 3.5" HD carry cases, so we don't have to stack the bare drives up. They look something like VHS tape cases.)

Having what is effectively a tape library gives us a significant boost in backup capacity all by itself (plus it gives us much better coverage during weekends and holidays). If we need still more we can add another backup server for what I believe works out to be at most half the cost of a tape drive alone.

We're keeping our existing tape-based backup system for periodic long term archival backups. After all, we have a lot of perfectly good LTO tapes, and they're probably more durable over several years than SATA HDs. (Long-term durability of SATA HDs is not a concern for the disk-based system, since with daily backups a given disk will get reused within at most a few months.)

(Necessary disclaimer: this backup system is the hard work of a number of people here, and my contribution was relatively small.)

Sidebar: the disk replacement schedule

If you are nervous about disaster recovery, you will want to pull and replace disks as soon as they're (fully) used, so that you can move them to your immediate offsite location, and you may want to accept wasted space and use only one 'tape' per disk. Locally, we accept the greater potential loss in a disaster like a machine room fire in exchange for faster and easier restores of recently deleted files.

DiskBackupSystem written at 02:28:14
