Using iSCSI and AOE to create artificial disk errors

July 24, 2007

One of the nice things you can do with iSCSI and AOE is use them to test how your system (volume management, filesystem, programs, etc.) really deals with low-level disk errors. All sorts of interesting issues can come crawling out of the woodwork when you do this; it is very educational and occasionally rather alarming.

(Testing this sort of thing is otherwise fairly difficult, because few people have controllable error-producing hard drives sitting around, especially hard drives that will repeatedly run fine for a while and then start spewing errors. Since there are all-software implementations of iSCSI and AOE, they are much more controllable.)

In my experience, it's easier to do this with AOE than with iSCSI. Neither set of target drivers directly supports this, but most AOE target drivers are small and run in user space, so it is easy to understand, modify, and run them. (My current tool of choice is something called aoedisk, which comes with the beta Solaris drivers you can get from Coraid.)

There are lots of interesting things to test:

  • turn a disk read-only
  • start returning errors on all IO, or all read IO, or all write IO
  • start returning errors randomly, or only for requests for some sectors, or the like
  • return corrupted data without reporting an error, either consistently or randomly
  • change a disk's serial number, either while live or while idle, optionally zeroing the contents; this simulates swapping a physical disk (without the upper layers getting any disk-changed hotswap events)
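To make these failure modes concrete, here is a minimal sketch of the sort of fault-injection logic you might bolt onto a user-space target. It is not code from aoedisk or any real target; FaultyDisk, DiskError, and every knob on them are hypothetical names made up for illustration, and the 'disk' is just an in-memory buffer with an assumed 512-byte sector size.

```python
import random

SECTOR = 512  # bytes per sector (an assumption for this sketch)


class DiskError(Exception):
    """Stands in for whatever error reply a real target would send."""


class FaultyDisk:
    """Hypothetical in-memory disk with switchable failure modes."""

    def __init__(self, sectors, seed=0):
        self.data = bytearray(sectors * SECTOR)
        self.rng = random.Random(seed)  # seeded so test runs are repeatable
        self.read_only = False          # reject all writes
        self.fail_reads = False         # error on every read
        self.fail_writes = False        # error on every write
        self.error_rate = 0.0           # chance of a random error per request
        self.bad_sectors = set()        # sectors that always fail
        self.corrupt_rate = 0.0         # chance of silently corrupting a read
        self.healthy_requests = 0       # requests served before faults start
        self.requests = 0

    def _check(self, sector, is_write):
        """Raise DiskError if this request should fail.

        Returns True once the healthy period is over, False before that.
        """
        self.requests += 1
        if sector < 0 or sector >= len(self.data) // SECTOR:
            raise DiskError("sector out of range")
        if is_write and self.read_only:
            raise DiskError("disk is read-only")
        if self.requests <= self.healthy_requests:
            return False  # still behaving like a good disk
        if (self.fail_writes if is_write else self.fail_reads):
            raise DiskError("forced IO error")
        if sector in self.bad_sectors:
            raise DiskError("bad sector %d" % sector)
        if self.rng.random() < self.error_rate:
            raise DiskError("random IO error")
        return True

    def read(self, sector):
        faults_active = self._check(sector, is_write=False)
        start = sector * SECTOR
        block = bytearray(self.data[start:start + SECTOR])
        if faults_active and self.rng.random() < self.corrupt_rate:
            # Flip one byte but still report success to the caller.
            block[self.rng.randrange(SECTOR)] ^= 0xFF
        return bytes(block)

    def write(self, sector, block):
        if len(block) != SECTOR:
            raise ValueError("block must be exactly one sector")
        self._check(sector, is_write=True)
        start = sector * SECTOR
        self.data[start:start + SECTOR] = block
```

Setting healthy_requests to a few thousand and then turning on fail_reads gives you a disk that works long enough to be put into use and then starts failing. In a real target the same checks would sit in the request handling path, and DiskError would become whatever error response the protocol defines.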

While you can also test what happens when a target device goes away or when requests start timing out, it's less useful because it's hard to be sure that the iSCSI or AOE initiator driver is behaving in the same way that the driver for the physical disk would. Of course, test away if you plan to run iSCSI or AOE, because then it's directly relevant; you may someday have a target device drop off the net or the like.

Of these, the Linux iSCSI target driver I'm familiar with can only change serial numbers and make a disk go read-only. None of the AOE tools have direct support for introducing errors after the disk has been running for a while, but this is relatively easy to add to their code. You can always force disk corruption by scribbling on bits of the backing store on the target's host, and you can always make the disk or the entire target host go away.

(Disks that fail immediately when the system tries to look at them are less interesting than disks that work long enough for the initiator to mount the filesystem and start doing IO.)
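Scribbling on the backing store needs no changes to the target at all. As a rough sketch, assuming the backing store is an ordinary file or device node that you can open on the target's host, something like the following will do; scribble and its parameters are hypothetical names for illustration, not part of any iSCSI or AOE tool.

```python
import os
import random


def scribble(path, offset, length, seed=None):
    """Overwrite `length` bytes at `offset` in `path` with random junk.

    Returns the original bytes so that the damage can be undone later.
    """
    rng = random.Random(seed)  # pass a seed for repeatable damage
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(length)
        if len(original) != length:
            raise ValueError("range extends past the end of the backing store")
        junk = bytes(rng.randrange(256) for _ in range(length))
        f.seek(offset)
        f.write(junk)
        f.flush()
        os.fsync(f.fileno())  # push the damage out to the underlying storage
    return original
```

Keeping the returned bytes lets you write them back afterwards to see how the upper layers react when a block 'heals'. Whether the initiator sees the damage right away depends on what caching the target and the initiator do, so it may take a fresh read of those sectors to show up.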

Written on 24 July 2007.

Last modified: Tue Jul 24 14:59:07 2007