2013-04-19
How ZFS deals with 'advanced format' disks with 4 Kb physical sectors
These days it's very hard or impossible to buy new SATA disks that don't have 4 Kb physical sectors. This makes the question of how ZFS deals with them a very interesting one and I'm afraid that the answer is 'not well'.
First, the high-speed basics. All ZFS vdevs have an internal property called 'ashift' (normally visible only through zdb) that sets the fundamental block size that ZFS uses for that vdev. The actual value is the base-two logarithm of that block size: a 512-byte block size is an ashift of 9, a 4 KB one is an ashift of 12. The ashift value for a new vdev is normally set based on the physical sector sizes reported by the initial disk(s). The ashift for a vdev can't be changed after the vdev is created, and since vdevs can't be detached from a pool, it's permanent after creation unless and until you destroy the pool. Linux ZFS allows you to override the normal ashift with a command line argument; Illumos ZFS only allows you to set the low-level physical block size reported for disks (see here for details) and thus indirectly control the ashift for new vdevs.
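For concreteness, here's one way to see the ashift of an existing pool's vdevs with zdb ('tank' is a placeholder pool name, and the exact output formatting varies between ZFS versions):

    # Dump the pool's cached configuration and pick out each
    # top-level vdev's ashift.
    zdb -C tank | grep ashift
    #        ashift: 9        <- a 512b-sector vdev
    #        ashift: 12       <- a 4K-sector vdev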
It turns out that the basic rule of what ZFS will and won't allow is that you cannot add a disk to a vdev if the disk has a larger physical sector size than the vdev's ashift. Note that this is the physical sector size, not the logical sector size. In concrete terms, you cannot add a properly reporting 4K disk to an existing old vdev made from 512-byte disks, and that includes replacing a 512b drive with a 4K drive. It doesn't matter to ZFS that the new 4K disk is still addressable in 512-byte sectors and would work if ZFS didn't know it was a 4K disk; ZFS will generously save you from yourself and refuse to allow this. In practice this means that existing pools will have to be destroyed and recreated when you need to replace their current disks with 4K drives, unless you can find some way to lie to ZFS about the physical block size of the new disks.
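As an illustration, here is roughly what the refusal looks like in practice. This is a sketch: the device names are made up and the exact error wording varies between ZFS implementations (this message is approximately what ZFS on Linux prints):

    # c1t2d0 is an old 512b drive in an ashift=9 vdev; c1t6d0 is a
    # new drive that reports 4K physical sectors.
    zpool replace tank c1t2d0 c1t6d0
    # cannot replace c1t2d0 with c1t6d0: devices have different
    # sector alignment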
(Sufficiently old versions of Solaris are different because they know about ashift but do not know about physical sector sizes; they only notice and know about logical sector sizes. The good news is that you can replace your 512-byte disks with 4K disks and have things not explode. The bad news is that there is no way to create new vdevs with ashift=12.)
Since a 512b to 4K transition is probably inevitable in every disk drive technology, you now want to create all new vdevs with ashift=12. A vdev created with at least one 4K drive, so that it gets an ashift of 12, can thereafter freely mix 512b drives and 4K drives; as far as I know you can even replace all of the 4K drives in it with 512b drives. On Illumos the only way to do this is to set the reported physical sector size of at least one disk in the new vdev to 4K (if they aren't 4K disks already), at which point you become unable to add them to existing pools created with 512-byte disks. On old versions of Solaris (such as the Solaris 10 update 8 that we're still running) this is impossible.
(The conflicting need for disks to report as 4K sector drives or 512b sector drives, depending on what you're doing with them, is why the Illumos 'solution' to this problem is flat out inadequate.)
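To make the creation side concrete, here is a sketch of both the Linux and Illumos approaches to getting ashift=12 vdevs. The pool and disk names are placeholders, and the sd-config-list vendor/product string must match your actual drives (the vendor field is space-padded to eight characters):

    # ZFS on Linux: force ashift=12 at pool creation, regardless of
    # what the disks report.
    zpool create -o ashift=12 tank mirror sda sdb

    # Illumos: in /kernel/drv/sd.conf, have the sd driver report a
    # 4K physical block size for matching disks, then reboot (or
    # reload the driver) and create the pool normally.
    sd-config-list = "ATA     ST4000DM000", "physical-block-size:4096";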
The other issue is one of inherent default alignment in normal operation. Many current filesystems will align almost all of their activity on 4 Kb or greater boundaries even if they think the disk has 512b sectors, which means that they'll actually be issuing aligned full-block writes on 4K drives if the underlying partitions are properly aligned. Unfortunately ZFS is not one of these filesystems. Even though it normally writes a lot of data in 128 Kb records, ZFS will routinely do unaligned writes (even for those 128 Kb records), including writes that start on odd (512b) block numbers. If you do mix a 4K physical sector drive into your old vdevs in one way or another, this means you'll be doing a lot of unaligned partial writes.
(The performance penalty of this will depend on your specific setup and write load.)
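To see the arithmetic behind the penalty, here's a little shell sketch of mine (not ZFS code) that counts how many 4K physical sectors a 128 Kb write touches depending on its starting 512-byte logical block address:

    #!/bin/sh
    # Count the 4K physical sectors spanned by a 128 KB write that
    # starts at the given 512-byte LBA.
    lba=${1:-0}
    start=$(( lba * 512 ))
    end=$(( start + 131072 ))        # a 128 KB write
    first=$(( start / 4096 ))
    last=$(( (end - 1) / 4096 ))
    echo "start LBA $lba: $(( last - first + 1 )) physical 4K sectors"

An aligned start (LBA 0, 8, 16, and so on) spans exactly 32 physical sectors; an odd start spans 33, and the drive has to do a read-modify-write for the partial sector at each end.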
I'm not particularly pleased by all of this. From my perspective the ZFS developers have done a quite good job of destroying long term storage management under ZFS because as we turn over our disk stock we're going to be essentially forced to destroy and recreate terabytes of pools with all of the attendant user disruption. With more planning and flexibility on the part of ZFS this could have been a completely user-transparent non-issue. As it is, forcing us to migrate data due to a drive technology change is the exact opposite of painless long term storage management.
Disclaimer: this is primarily tested on current versions of Illumos, specifically OmniOS. It's possible that ZFS on Linux or Solaris 11 behaves differently and more sensibly, allowing you to replace 512b disks with 4K disks and so on. Commentary is welcome.
(All of these bits of information are documented or semi-documented on various web pages and mailing list threads around the Internet, but I couldn't find them all in one place and I couldn't find anything that definitively and explicitly documented how 4K and 512b disks interacted with vdevs with various ashift settings.)
Sidebar: what ZFS should do
Three things immediately and two over the longer range:
- allow 4K disks with a 512b logical sector size to be added to
  existing ashift=9 vdevs. Possibly this should require a 'force'
  flag and some sort of warning message. Note that this is already
  possible if you make the disk lie to ZFS; the only thing this flag
  does is remove the need for the lies.
- create all new vdevs with ashift=12 by default, because this is
  the future-proof option, and provide a flag to turn this off for
  people who really, absolutely need to do this for some reason.
- allow people to specify the ashift explicitly during vdev
  creation. Ideally there would be a pool default ashift (or the
  default ashift for all new vdevs in a pool should be the largest
  ashift on an existing vdev).
- change the block allocator so that even on ashift=9 pools as much
  as possible is kept aligned on 4 Kb boundaries.
- generalize this to create a new settable vdev or pool property for
  the preferred alignment. This would be useful well beyond 4K
  disks; for example, SSDs often internally have large erase block
  sizes and are much happier with you if you write full blocks to
  them.
(Some of this work may already be going on in the ZFS world, especially things that would help SSDs.)
How I want storage systems to handle disk block sizes
What I mean by a storage system here is anything that exports what look like disks through some mechanism, whether that's iSCSI, AoE, FibreChannel, a directly attached smart controller of some sort, or something I haven't heard of. As I mentioned last entry, I have some developing opinions on how these things should handle the current minefield of logical and physical block sizes.
First off, modern storage systems have no excuse for not knowing about logical block size versus physical block size. The world is no longer a simple place where all disks can be assumed to have 512 byte physical sectors and you're done. So the basic behavior is to pass through the logical and physical block sizes of the underlying disk that you're exporting. If you're exporting something aggregated together from multiple disks, you should obviously advertise the largest block size used by any part of the underlying storage.
(If the system has complex multi-layered storage it should try hard to propagate all of this information up through the layers.)
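On the client side, you can at least verify what a storage system is actually advertising to you. On Linux, for example (the device name is a placeholder; both utilities are standard):

    # Logical sector size and physical block size as the kernel
    # sees them for one device.
    blockdev --getss --getpbsz /dev/sdc
    # Or for all block devices at once.
    lsblk -o NAME,LOG-SEC,PHY-SEC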
You should also provide the ability to explicitly configure what logical and physical block sizes a particular piece of storage advertises. You should allow physical block sizes to be varied both up and down from their true value, and logical block sizes to be varied up (and down too, if you can make that work). It may not be obvious why people need all of this, so let me mention some scenarios:
- you may want to bump the physical block size of all your storage
  to 4 Kb regardless of the actual disks used so that your
  filesystems et al will be ready and optimal when you start
  replacing your current 512-byte disks with 4 Kb disks. (Possibly)
  wasting a bit of space now beats copying terabytes of data later.
- similarly, you may be replacing 512-byte disks with 4 Kb disks
  (because they're all that you can get) but your systems really
  don't deal well with this, so you want to lie to them about it.
  There are other related scenarios that I'll leave to your
  imagination.
- you may want to set a 4 Kb logical sector size to see how your
  software copes with it in various ways. Sometime in the future,
  setting it will also be a future-proofing step (just as setting a
  4 Kb physical block size is today).
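For that testing scenario, one convenient way to experiment without real 4K hardware is a virtual machine. For example, QEMU lets you set both sizes on an emulated disk; this is a sketch, with the image name as a placeholder:

    # Present the guest OS with a disk that has 4K logical and
    # physical sectors.
    qemu-system-x86_64 \
        -drive file=test.img,if=none,id=d0,format=raw \
        -device virtio-scsi-pci \
        -device scsi-hd,drive=d0,logical_block_size=4096,physical_block_size=4096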
It would be handy if storage systems had both global and per-whatever settings for these. Global settings are both easier and less error prone for certain things; with a global setting, for example, I can make sure that I never accidentally advertise a disk as having 512 byte physical sectors.
(Why this now matters very much is the subject for a future entry.)