How ZFS deals with 'advanced format' disks with 4 Kb physical sectors

April 19, 2013

These days it's very hard or impossible to buy new SATA disks that don't have 4 Kb physical sectors. This makes the question of how ZFS deals with them a very interesting one and I'm afraid that the answer is 'not well'.

First, the high speed basics. All ZFS vdevs have an internal property called 'ashift' (normally visible only through zdb) that sets the fundamental block size that ZFS uses for that vdev (the actual value is the base-2 logarithm of that block size; a 512 byte block size is an ashift of 9, a 4 KB one is an ashift of 12). The ashift value for a new vdev is normally set based on the physical sector sizes reported by the initial disk(s). The ashift for a vdev can't be changed after the vdev is created and since vdevs can't be detached from a pool, it's permanent after creation unless and until you destroy the pool. ZFS on Linux allows you to override the normal ashift with a command line argument ('-o ashift=12', for example). Illumos ZFS only allows you to set the low-level physical block size reported for disks (see here for details) and thus indirectly control the ashift for new vdevs.
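To make the relationship concrete, here's a small Python sketch of how sector sizes map to ashift values (the function name is mine, purely for illustration; this is not actual ZFS code):

```python
import math

def ashift_for(sector_size):
    """Return the ashift for a sector size in bytes. The ashift is
    the base-2 logarithm of the vdev's fundamental block size."""
    ashift = int(math.log2(sector_size))
    if 2 ** ashift != sector_size:
        raise ValueError("sector size must be a power of two")
    return ashift

# 512 byte sectors give an ashift of 9; 4 KB sectors give 12.
```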

It turns out that the basic rule of what ZFS will and won't allow is that you cannot add a disk to a vdev if the disk has a larger physical sector size than the vdev's ashift block size. Note that this is the physical sector size, not the logical sector size. In concrete terms you cannot add a properly reporting 4K disk to an existing old vdev made from 512 byte disks, including replacing a 512b drive with a 4K drive. It doesn't matter to ZFS that the new 4K disk is still addressable in 512-byte sectors and would work if ZFS didn't know it was a 4K disk; ZFS will generously save you from yourself and refuse to allow this. In practice this means that existing pools will have to be destroyed and recreated when you need to replace their current disks with 4K drives, unless you can find some way to lie to ZFS about the physical block size of the new disks.
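The rule reduces to a one-line comparison; here is an illustrative sketch (the function name is mine, not anything from the ZFS source):

```python
def can_add_disk(vdev_ashift, disk_physical_sector_size):
    """A disk may join a vdev only if its *physical* sector size does
    not exceed the vdev's fundamental block size (2 ** ashift).  The
    disk's logical sector size plays no part in the check."""
    return disk_physical_sector_size <= 2 ** vdev_ashift

# A 4K disk can't join an old ashift=9 vdev, but 512b and 4K disks
# can both join an ashift=12 vdev.
```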

(Sufficiently old versions of Solaris are different because they know about ashift but do not know about physical sector sizes; they only notice and know about logical sector sizes. The good news is that you can replace your 512 byte disks with 4K disks and have things not explode. The bad news is that there is no way to create new vdevs with ashift=12.)

Since a 512b to 4K transition is probably inevitable in every disk drive technology, you now want to create all new vdevs with ashift=12. A vdev created with at least one 4K drive so that it gets an ashift of 12 can thereafter freely mix 512b drives and 4K drives; as far as I know you can even replace all of the 4K drives in it with 512b drives. On Illumos the only way to do this is to set the reported physical sector size of at least one disk in the new vdev to 4K (if they aren't 4K disks already), at which point you become unable to add them to existing pools created with 512-byte disks. On old versions of Solaris (such as the Solaris 10 update 8 that we're still running) this is impossible.

(The conflicting needs for disks to report as 4K sector drives or 512b sector drives depending on what you're doing with them is why the Illumos 'solution' to this problem is flat out inadequate.)

The other issue is one of inherent default alignment in normal operation. Many current filesystems will basically align almost all of their activity on 4 KB or larger boundaries even if they think the disk has 512b sectors, which means that they'll actually be issuing aligned full block writes on 4K drives if the underlying partitions are properly aligned. Unfortunately ZFS is not one of these filesystems. Even though it normally writes a lot of data in 128 KB records ZFS will routinely do unaligned writes (even for these 128 KB records), including writes that start on odd (512b) block numbers. If you do mix a 4K physical sector drive into your old vdevs in one way or another this means that you'll be doing a lot of unaligned partial writes.

(The performance penalty of this will depend on your specific setup and write load.)
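To see why unaligned writes hurt on 4K drives, count how many physical sectors a write touches (a sketch of the arithmetic only, not anything ZFS actually does):

```python
def sectors_touched(offset, length, sector=4096):
    """Number of physical sectors covered by a write of `length`
    bytes starting at byte `offset`."""
    first = offset // sector
    last = (offset + length - 1) // sector
    return last - first + 1

# An aligned 128 KB write covers exactly 32 4K sectors, all of them
# fully.  Shift the same write to start on an odd 512b block and it
# touches 33 sectors, and the drive must read-modify-write the
# partially covered first and last ones.
aligned = sectors_touched(0, 128 * 1024)      # 32
unaligned = sectors_touched(512, 128 * 1024)  # 33
```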

I'm not particularly pleased by all of this. From my perspective the ZFS developers have done a quite good job of destroying long term storage management under ZFS because as we turn over our disk stock we're going to be essentially forced to destroy and recreate terabytes of pools with all of the attendant user disruption. With more planning and flexibility on the part of ZFS this could have been a completely user-transparent non-issue. As it is, forcing us to migrate data due to a drive technology change is the exact opposite of painless long term storage management.

Disclaimer: this is primarily tested on current versions of Illumos, specifically OmniOS. It's possible that ZFS on Linux or Solaris 11 behaves differently and more sensibly, allowing you to replace 512b disks with 4K disks and so on. Commentary is welcome.

(All of these bits of information are documented or semi-documented on various web pages and mailing list threads around the Internet but I couldn't find them all in one place and I couldn't find anything that definitively and explicitly documented how 4K and 512b disks interacted with vdevs with various ashift settings.)

Sidebar: what ZFS should do

Three things immediately and two over the longer range:

  • allow 4K disks with a 512b logical sector size to be added to existing ashift=9 vdevs. Possibly this should require a 'force' flag and some sort of warning message. Note that this is already possible if you make the disk lie to ZFS; the only thing this flag does is remove the need for the lies.

  • create all new vdevs with ashift=12 by default, because this is the future-proof option, and provide a flag to turn this off for people who really absolutely need to do this for some reason.

  • allow people to specify the ashift explicitly during vdev creation. Ideally there would be a pool default ashift (or the default ashift for all new vdevs in a pool should be the largest ashift on an existing vdev).

  • change the block allocator so that even on ashift=9 pools as much as possible is kept aligned on 4Kb boundaries.

  • generalize this to create a new settable vdev or pool property for the preferred alignment. This would be useful well beyond 4K disks; for example, SSDs often internally have large erase block sizes and are much happier with you if you write full blocks to them.

(Some of this work may already be going on in the ZFS world, especially things that would help SSDs.)

Comments on this page:

From at 2013-05-07 13:18:32:

It is a completely different story on Solaris 11.1. It puts down an aligned partition table, and it allows you to replace and/or attach disks with larger sector sizes.

See bugs: 15785157 15796184 15802949

By cks at 2013-05-07 16:04:44:

Note that aligned partition tables aren't good enough in Solaris 10 ZFS to avoid unaligned IO, as far as I can tell (my S10U8 test environment appears to properly align the partitions nicely).

I'm glad that Solaris 11.1 has fixed this (in part or in whole, depending on how much they've done). Maybe the open source ZFS implementations will duplicate it someday.

From at 2013-06-25 17:38:29:

According to current illumos ZFS implementation (as per use in FreeBSD as well):

"When adding 4KB physical sector size disks (ashift = 12) to a pool containing 512 byte physical sector disks (ashift = 9), or vice-versa, then the resulting pool contains mixed sector size top-level vdevs. ZFS functions properly with mixed-size top-level vdevs.

Note: attempting to replace disks with 512 byte physical sectors (or attach into a mirror made from such disks) with disks that only support 4KB logical sectors can fail, leading to operational issues with stocking spares."

... so almost there ;)

By cks at 2013-06-26 00:50:05:

Note that adding a new vdev (with a new 4k disk) to a ZFS pool is much different than replacing an existing 512b disk in an existing vdev with a new 4k disk (or attaching it as a mirror). The latter is what Illumos doesn't handle now and is the far more important issue.
