How I am doing randomized read IO to avoid ZFS prefetching

October 30, 2012

If only so that I never have to carefully reinvent this code again, here is how I'm doing randomized read IO to avoid ZFS prefetching. Since ZFS prefetching is the most superintelligent form of prefetching I've ever seen, I expect that this approach would also avoid prefetching on other filesystems and OSes.

The following code is in Python and assumes you have a readat() function that does the basic read (and also does whatever time tracking and so on you want):

import random

KB = 1024
FSBLKSIZE = (128 * KB)
READSIZE = (4 * KB)

CHUNKSIZE = (FSBLKSIZE * 2)

def readfile(fd, bytesize):
  """Do randomized reads from fd at
  offsets from 0 to more or less
  bytesize."""
  chunks = bytesize // CHUNKSIZE

  # Constant seed for repeatability.
  # This is a random number.
  random.seed(6538029369423517174)

  # Create a list of every chunk offset.
  chunklist = list(range(chunks))

  # Shuffle the chunks into random order
  random.shuffle(chunklist)

  # Read from every chunk
  for chunk in chunklist:
    bytepos = chunk * CHUNKSIZE
    readat(fd, bytepos, READSIZE)

(Disclaimer: this is not exactly the code I'm using, which is messier in various ways, but it is the same approach.)
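The readat() function is assumed rather than shown. As a minimal sketch of what it might look like, assuming Python 3 on a Unix system (where os.pread() is available) and assuming that timing is the only bookkeeping wanted; the read_times list is a hypothetical name, not something from my real code:

```python
import os
import time

# Hypothetical bookkeeping: elapsed seconds for each read, in call order.
read_times = []

def readat(fd, bytepos, size):
    """Read up to size bytes from fd at offset bytepos and time the read."""
    start = time.perf_counter()
    # os.pread() reads at an explicit offset without moving the file position.
    data = os.pread(fd, size, bytepos)
    read_times.append(time.perf_counter() - start)
    return data
```

Here fd is a raw file descriptor, such as one returned by os.open(path, os.O_RDONLY), not a Python file object.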

FSBLKSIZE is your filesystem's block size, the minimum size that the filesystem actually reads from the disk (on a file that was written sequentially). On ZFS this is the recordsize property and is usually 128 KB. We double this to create our chunk size; to avoid ever doing ordinary sequential IO (forward or backwards) we'll only ever do one read per chunk, i.e. we'll only ever read every second filesystem block. READSIZE is the size of actual reads we'll do. It should be equal to or less than FSBLKSIZE, so that a read never spills over into the next filesystem block, and I prefer it to be clearly smaller if possible.
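To make the "every second filesystem block" arithmetic concrete, here is a small sketch that works out which filesystem block each chunk's read lands in. The blocks_touched() helper is purely illustrative and the constants are simply repeated from the code above:

```python
KB = 1024
FSBLKSIZE = 128 * KB
READSIZE = 4 * KB
CHUNKSIZE = FSBLKSIZE * 2

def blocks_touched(bytesize):
    """Return the filesystem block index hit by the read in each chunk."""
    chunks = bytesize // CHUNKSIZE
    return [(chunk * CHUNKSIZE) // FSBLKSIZE for chunk in range(chunks)]

# A 1 MB file has four 256 KB chunks, so only blocks 0, 2, 4 and 6 are read.
print(blocks_touched(1024 * KB))  # [0, 2, 4, 6]
```

Since every read starts at the beginning of an even-numbered block and is no larger than one block, no two reads ever touch adjacent blocks, whatever order they happen in.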

(The one tricky bit about FSBLKSIZE is that you need to push it up if your filesystem ever helpfully decides to read several blocks at once for you. ZFS generally does not, but others may.)

Rather than repeatedly read at randomized positions until we feel that we've done enough reads, we generate a list of all (chunk) positions and then shuffle them into a random order. If your standard libraries don't have a routine for this, remember to use a high-quality algorithm. Shuffling this way ensures that we never re-read the same block and that we read from every one of the chunks, doing a predictable amount of IO (depending on the basic size we tell readfile() to work on).
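If you do have to write the shuffle yourself, one well-known unbiased choice is the Fisher-Yates shuffle. The following is a minimal sketch of it, not a replacement for random.shuffle(); the function name and the rng parameter are my own illustrative choices:

```python
import random

def fisher_yates_shuffle(items, rng=random):
    """Shuffle items in place so that every ordering is equally likely."""
    # Walk from the last index down to 1, swapping each element with a
    # randomly chosen element at or before it.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # randint() includes both endpoints
        items[i], items[j] = items[j], items[i]

chunklist = list(range(8))
fisher_yates_shuffle(chunklist, random.Random(6538029369423517174))
```

The important detail is that j is drawn from 0 to i inclusive; drawing it from the whole list on every step is the classic way to get a subtly biased shuffle.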

Since random numbers are a little bit unpredictable, you should always check the amount of prefetching that your filesystem is actually doing when you run this program. On ZFS this can be done by watching the ARC stats. Note that even unsuccessful prefetching may distort any performance numbers you get by adding extra, potentially unpredictable IO load.
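As a rough sketch of how that check might be automated, the following compares the prefetch-related ARC counters before and after a run. It assumes the ARC stats are available as text in the "name type value" row layout that ZFS on Linux exposes at /proc/spl/kstat/zfs/arcstats; on Solaris-derived systems you would get the same counters through kstat instead. The function names are hypothetical, and rather than hard-coding particular counter names it just picks out anything with "prefetch" in its name:

```python
# Assumption: the ZFS on Linux location of the ARC stats; adjust for your OS.
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"

def prefetch_counters(text):
    """Pick out the prefetch-related counters from ARC stats text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like "name type value"; skip headers and anything else.
        if len(fields) == 3 and "prefetch" in fields[0] and fields[2].isdigit():
            counters[fields[0]] = int(fields[2])
    return counters

def prefetch_delta(before, after):
    """Return how much each prefetch counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}

def read_arcstats(path=ARCSTATS_PATH):
    """Read the raw ARC stats text (only works where that file exists)."""
    with open(path) as f:
        return f.read()
```

The idea is to call prefetch_counters(read_arcstats()) once before readfile() and once after it, and then look at prefetch_delta() of the two. On an otherwise idle machine, counters that grow noticeably during the run suggest that prefetching is still happening.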

This still leaves you exposed to things like track caching that are done by your hard drive(s), but it's very hard to avoid or even predict that level of caching. You just have to do enough IO and work on a broad enough range of file blocks (and thus disk blocks) that you're usually not doing IOs that are close enough to each other to hit any such caches.

Last modified: Tue Oct 30 00:51:34 2012