Wandering Thoughts archives

2013-04-19

How ZFS deals with 'advanced format' disks with 4 KB physical sectors

These days it's very hard, if not impossible, to buy new SATA disks that don't have 4 KB physical sectors. This makes the question of how ZFS deals with them a very interesting one, and I'm afraid the answer is 'not well'.

First, the high speed basics. All ZFS vdevs have an internal property called 'ashift' (normally visible only through zdb) that sets the fundamental block size that ZFS uses for that vdev; the actual value is the base-2 logarithm of the block size, so a 512 byte block size is an ashift of 9 and a 4 KB one is an ashift of 12. The ashift value for a new vdev is normally set based on the physical sector sizes reported by the initial disk(s). The ashift for a vdev can't be changed after the vdev is created, and since vdevs can't be removed from a pool it's permanent unless and until you destroy the pool. Linux ZFS allows you to override the normal ashift with a command line argument. Illumos ZFS only allows you to set the low-level physical block size reported for disks (see here for details) and thus indirectly control the ashift for new vdevs.
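
You can check what ashift your existing vdevs have with zdb, and on ZFS on Linux you can set it explicitly at creation time. As a sketch (the pool name 'tank' and the device names are placeholders, and zdb's exact output format varies somewhat by version):

  $ zdb -C tank | grep ashift
              ashift: 9

  # ZFS on Linux exposes the override directly:
  $ zpool create -o ashift=12 tank mirror sda sdb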

It turns out that the basic rule of what ZFS will and won't allow is that you cannot add a disk to a vdev if the disk has a larger physical sector size than the vdev's ashift. Note that this is the physical sector size, not the logical sector size. In concrete terms, you cannot add a properly reporting 4K disk to an existing old vdev made from 512 byte disks, and that includes replacing a 512b drive with a 4K drive. It doesn't matter to ZFS that the new 4K disk is still addressable in 512-byte sectors and would work if ZFS didn't know it was a 4K disk; ZFS will generously save you from yourself and refuse to allow this. In practice this means that existing pools will have to be destroyed and recreated when you need to replace their current disks with 4K drives, unless you can find some way to lie to ZFS about the physical block size of the new disks.
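
If you try it anyway, the refusal is blunt. From what I've seen it looks something like this (the exact wording may vary between ZFS versions):

  $ zpool replace tank c0t2d0 c0t5d0
  cannot replace c0t2d0 with c0t5d0: devices have different sector alignment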

(Sufficiently old versions of Solaris are different: they know about ashift but not about physical sector sizes, so they only look at logical sector sizes. The good news is that you can replace your 512 byte disks with 4K disks and have things not explode. The bad news is that there is no way to create new vdevs with ashift=12.)

Since a 512b to 4K transition is probably inevitable for every disk drive technology, you now want to create all new vdevs with ashift=12. A vdev created with at least one 4K drive (so that it gets an ashift of 12) can thereafter freely mix 512b drives and 4K drives; as far as I know you can even replace all of the 4K drives in it with 512b drives. On Illumos the only way to do this is to set the reported physical sector size of at least one disk in the new vdev to 4K (if they aren't 4K disks already), at which point those disks can no longer be added to existing pools created with 512-byte disks. On old versions of Solaris (such as the Solaris 10 update 8 that we're still running) this is impossible.

(The conflicting need for disks to report as 4K sector drives or 512b sector drives depending on what you're doing with them is why the Illumos 'solution' to this problem is flat out inadequate.)
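
For completeness, here's a sketch of the Illumos override mechanism mentioned above. The disk identity is an example; you'd substitute your drive's actual SCSI vendor and product strings, space-padded to 8 and 16 characters respectively (check your Illumos release's documentation for the details):

  # in /kernel/drv/sd.conf:
  sd-config-list =
      "ATA     WDC WD10EARS-00Y", "physical-block-size:4096";

  # then have sd re-read its configuration (or just reboot):
  # update_drv -vf sd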

The other issue is one of inherent default alignment in normal operation. Many current filesystems will align almost all of their activity on 4 KB or greater boundaries even if they think the disk has 512b sectors, which means that they'll actually be issuing aligned full-block writes on 4K drives provided that the underlying partitions are properly aligned. Unfortunately ZFS is not one of these filesystems. Even though it normally writes a lot of data in 128 KB records, ZFS will routinely do unaligned writes (even for these 128 KB records), including writes that start on odd (512b) block numbers. If you do mix a 4K physical sector drive into your old vdevs in one way or another, this means that you'll be doing a lot of unaligned partial writes.

(The performance penalty of this will depend on your specific setup and write load.)
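
To make the alignment problem concrete: a write that starts on an odd 512b block number can never be 4 KB aligned, because its byte offset is an odd multiple of 512. For example, a 128 KB record written at 512b block 2049 starts 512 bytes into a physical sector:

  $ echo $(( (2049 * 512) % 4096 ))
  512

That record then spans 33 physical 4 KB sectors instead of 32, and the drive has to read-modify-write the partial sectors at both ends.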

I'm not particularly pleased by all of this. From my perspective the ZFS developers have done quite a good job of destroying long term storage management under ZFS, because as we turn over our disk stock we're going to be essentially forced to destroy and recreate terabytes of pools, with all of the attendant user disruption. With more planning and flexibility on the part of ZFS this could have been a completely user-transparent non-issue. As it is, forcing us to migrate data due to a drive technology change is the exact opposite of painless long term storage management.

Disclaimer: this is primarily tested on current versions of Illumos, specifically OmniOS. It's possible that ZFS on Linux or Solaris 11 behaves differently and more sensibly, allowing you to replace 512b disks with 4K disks and so on. Commentary is welcome.

(All of these bits of information are documented or semi-documented on various web pages and mailing list threads around the Internet but I couldn't find them all in one place and I couldn't find anything that definitively and explicitly documented how 4K and 512b disks interacted with vdevs with various ashift settings.)

Sidebar: what ZFS should do

Three things immediately and two over the longer range:

  • allow 4K disks with a 512b logical sector size to be added to existing ashift=9 vdevs. Possibly this should require a 'force' flag and some sort of warning message. Note that this is already possible if you make the disk lie to ZFS; the only thing this flag does is remove the need for the lies.

  • create all new vdevs with ashift=12 by default, because this is the future-proof option, and provide a flag to turn this off for people who really, absolutely need ashift=9 for some reason.

  • allow people to specify the ashift explicitly during vdev creation. Ideally there would be a pool default ashift (or the default ashift for all new vdevs in a pool should be the largest ashift on an existing vdev).

  • change the block allocator so that even on ashift=9 pools as much as possible is kept aligned on 4 KB boundaries.

  • generalize this to create a new settable vdev or pool property for the preferred alignment (see the sketch just below). This would be useful well beyond 4K disks; for example, SSDs often internally have large erase block sizes and are much happier with you if you write full blocks to them.
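
To illustrate that last item, here's what such a property might look like. This is purely invented syntax for the sake of discussion, not anything that any ZFS implementation actually supports:

  # hypothetical: tell the allocator that this pool's SSDs prefer
  # writes aligned to their 512 KB erase blocks
  $ zpool set preferalign=512K tank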

(Some of this work may already be going on in the ZFS world, especially things that would help SSDs.)

solaris/ZFS4KSectorDisks written at 15:10:13

How I want storage systems to handle disk block sizes

What I mean by a storage system here is anything that exports what look like disks through some mechanism, whether that's iSCSI, AoE, FibreChannel, a directly attached smart controller of some sort, or something I haven't heard of. As I mentioned in the last entry, I have some developing opinions on how these things should handle the current minefield of logical and physical block sizes.

First off, modern storage systems have no excuse for not knowing the difference between logical and physical block sizes. The world is no longer a simple place where all disks can be assumed to have 512 byte physical sectors and you're done. So the basic behavior should be to pass through the logical and physical block sizes of the underlying disk that you're exporting. If you're exporting something aggregated together from multiple disks, you should obviously advertise the largest block size used by any part of the underlying storage.

(If the system has complex multi-layered storage it should try hard to propagate all of this information up through the layers.)
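
On Linux, for example, you can see what a disk (or an exported LUN) is currently advertising to the system; /dev/sda here is a placeholder:

  $ blockdev --getss --getpbsz /dev/sda
  512
  4096

  # or via sysfs:
  $ cat /sys/block/sda/queue/logical_block_size
  512
  $ cat /sys/block/sda/queue/physical_block_size
  4096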

You should also provide the ability to explicitly configure what logical and physical block sizes a particular piece of storage advertises. You should allow physical block sizes to be varied up and down from their true value, and logical block sizes to be varied up (and down too, if you can make that work). It may not be obvious why people need all of this, so let me mention some scenarios:

  • you may want to bump the physical block size of all your storage to 4 KB regardless of the actual disks used, so that your filesystems et al will be ready and optimal when you start replacing your current 512 byte disks with 4 KB disks. (Possibly) wasting a bit of space now beats copying terabytes of data later.

  • similarly you may be replacing 512 byte disks with 4kb disks (because they're all that you can get) but your systems really don't deal well with this so you want to lie to them about it. There are other related scenarios that I'll leave to your imagination.

  • you may want to set a 4 KB logical sector size to see how your software copes with it in various ways. Sometime in the future, setting it will also be a future-proofing step (just as setting a 4 KB physical block size is today).

It would be handy if storage systems had both global and per-whatever settings for these. Global settings are both easier and less error prone for certain things; with a global setting, for example, I can make sure that I never accidentally advertise a disk as having 512 byte physical sectors.
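
As a sketch of the sort of configuration I have in mind (this is entirely invented syntax for a hypothetical storage system, not any real product's configuration language):

  # global default: never advertise a physical block size below 4 KB
  defaults {
      advertised-physical-block-size = 4096
  }

  # per-LUN override for one legacy client that can't cope with 4K
  export legacy-lun-3 {
      advertised-physical-block-size = 512
  }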

(Why this now matters very much is the subject for a future entry.)

tech/SANAdvertisingBlocksizes written at 02:19:21

