2013-04-19
How ZFS deals with 'advanced format' disks with 4 Kb physical sectors
These days it's very hard or impossible to buy new SATA disks that don't have 4 Kb physical sectors. This makes the question of how ZFS deals with them a very interesting one and I'm afraid that the answer is 'not well'.
First, the high speed basics. All ZFS vdevs have an internal property
called 'ashift' (normally visible only through zdb) that sets the
fundamental block size that ZFS uses for that vdev (the actual value
is the power of two of that block size; a 512 byte block size is an
ashift of 9, a 4 KB one is an ashift of 12). The ashift value
for a new vdev is normally set based on the physical sector sizes
reported by the initial disk(s). The ashift for a vdev can't be
changed after the vdev is created and since vdevs can't be detached
from a pool, it's permanent after creation unless and until you destroy
the pool. Linux ZFS allows you to override the normal ashift with
a command line argument. Illumos ZFS only allows you to set the
low-level physical block size reported for disks (see here for details)
and thus indirectly control the ashift for new vdevs.
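For illustration, here is roughly how you can look at a vdev's ashift and, on ZFS on Linux, force it at pool creation time. This is a sketch: the pool name 'tank' and the device names are made up, and the exact zdb output format varies between versions.

    # Dump the cached pool configuration and look for the per-vdev ashift.
    zdb -C tank | grep ashift

    # ZFS on Linux only: create a pool whose vdevs get ashift=12 regardless
    # of what the disks report.
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb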
It turns out that the basic rule of what ZFS will allow and not allow is that you cannot add a disk to a vdev if it has a physical sector size larger than the block size implied by the vdev's ashift. Note that this is
the physical sector size, not the logical sector size. In
concrete terms you cannot add a properly reporting 4K disk to an
existing old vdev made from 512 byte disks, including replacing
a 512b drive with a 4K drive. It doesn't matter to ZFS that the new
4K disk is still addressable in 512-byte sectors and it would work
if ZFS didn't know it was a 4K disk; ZFS will generously save you
from yourself and refuse to allow this. In practice this means
that existing pools will have to be destroyed and recreated when
you need to replace their current disks with 4K drives, unless you
can find some way to
lie to ZFS about the physical block size of the new disks.
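To illustrate the failure mode (the pool and device names here are invented, and the exact error text differs between platforms and versions), this is roughly what you run into:

    # 'tank' is an old ashift=9 pool; c0t5d0 is a new 4K-sector drive.
    zpool replace tank c0t2d0 c0t5d0
    # fails with something along the lines of:
    #   cannot replace c0t2d0 with c0t5d0: devices have different sector alignment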
(Sufficiently old versions of Solaris are different because they know
about ashift but do not know about physical sector sizes; they only
notice and know about logical sector sizes. The good news is that you
can replace your 512 byte disks with 4K disks and have things not
explode. The bad news is that there is no way to create new vdevs with
ashift=12.)
Since a 512b to 4K transition is probably inevitable in every disk
drive technology, you now want to create all new vdevs with
ashift=12.
A vdev created with at
least one 4K drive so that it gets an ashift of 12 can thereafter
freely mix 512b drives and 4K drives; as far as I know you can even
replace all of the 4K drives in it with 512b drives.
On Illumos the only way to do this is to set the
reported physical sector size of at least one disk in the new vdev
to 4K (if they aren't 4K disks already), at which point you become
unable to add them to existing pools created with 512-byte disks.
On old versions of Solaris (such as the Solaris 10 update 8 that
we're still running) this is impossible.
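For reference, the Illumos override is normally done through sd.conf. A sketch only: the disk inquiry strings are placeholders you'd have to replace with your drives' actual vendor and product strings, and the exact property name and matching rules may differ between versions.

    # /kernel/drv/sd.conf: claim a 4 Kb physical sector size for matching
    # disks, then reboot or run 'update_drv -vf sd' for it to take effect.
    sd-config-list = "VENDOR  PRODUCT", "physical-block-size:4096";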
(The conflicting needs for disks to report as 4K sector drives or 512b sector drives depending on what you're doing with them is why the Illumos 'solution' to this problem is flat out inadequate.)
The other issue is one of inherent default alignment in normal operation. Many current filesystems will basically align almost all of their activity on 4Kb or greater boundaries even if they think the disk has 512b sectors, which means that they'll actually be issuing aligned full block writes on 4K drives if the underlying partitions are properly aligned. Unfortunately ZFS is not one of these filesystems. Even though it normally writes a lot of data in 128 Kb records, ZFS will routinely do unaligned writes (even for these 128 Kb records), including writes that start on odd (512b) block numbers. If you do mix a 4K physical sector drive into your old vdevs in one way or another, this means that you'll be doing a lot of unaligned partial writes.
(The performance penalty of this will depend on your specific setup and write load.)
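To make the alignment arithmetic concrete (the block number here is invented purely for illustration): a write that starts on an odd 512b block number necessarily begins partway into a 4 Kb physical sector, which forces the drive into a read-modify-write cycle for that sector.

    # How far into a 4 Kb physical sector a write starting at 512-byte
    # LBA 1001 begins; anything nonzero means a partial, unaligned write.
    echo $(( (1001 * 512) % 4096 ))
    # -> 512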
I'm not particularly pleased by all of this. From my perspective the ZFS developers have done a quite good job of destroying long term storage management under ZFS because as we turn over our disk stock we're going to be essentially forced to destroy and recreate terabytes of pools with all of the attendant user disruption. With more planning and flexibility on the part of ZFS this could have been a completely user-transparent non-issue. As it is, forcing us to migrate data due to a drive technology change is the exact opposite of painless long term storage management.
Disclaimer: this is primarily tested on current versions of Illumos, specifically OmniOS. It's possible that ZFS on Linux or Solaris 11 behaves differently and more sensibly, allowing you to replace 512b disks with 4K disks and so on. Commentary is welcome.
(All of these bits of information are documented or semi-documented on
various web pages and mailing list threads around the Internet but I
couldn't find them all in one place and I couldn't find anything that
definitively and explicitly documented how 4K and 512b disks interacted
with vdevs with various ashift settings.)
Sidebar: what ZFS should do
Three things immediately and two over the longer range:
- allow 4K disks with a 512b logical sector size to be added to existing ashift=9 vdevs. Possibly this should require a 'force' flag and some sort of warning message. Note that this is already possible if you make the disk lie to ZFS; the only thing this flag does is remove the need for the lies.
- create all new vdevs with ashift=12 by default, because this is the future-proof option, and provide a flag to turn this off for people who really absolutely need to do this for some reason.
- allow people to specify the ashift explicitly during vdev creation. Ideally there would be a pool default ashift (or the default ashift for all new vdevs in a pool should be the largest ashift on an existing vdev).
- change the block allocator so that even on ashift=9 pools as much as possible is kept aligned on 4Kb boundaries.
- generalize this to create a new settable vdev or pool property for the preferred alignment. This would be useful well beyond 4K disks; for example, SSDs often internally have large erase block sizes and are much happier with you if you write full blocks to them.
(Some of this work may already be going on in the ZFS world, especially things that would help SSDs.)
2013-04-11
Something I'd like to be easier in Solaris's IPS
IPS is the 'Image Packaging System', which seems to be essentially the default packaging system for Illumos distributions. Or at least it's the packaging system for several of them (most importantly OmniOS) and for Oracle's Solaris 11, if you care about the latter. IPS is in some ways very clever and nifty, but as a sysadmin there are some bits I wish it did differently, or at least made easier. In particular, I wish that it made it easier to download and archive complete packages.
You may be wondering how a package system can possibly make that hard. I'm glad you asked. You see, IPS is not a traditional package system; if you want an extremely crude simplification it's more like git. In this git-like approach, the files for all packages are stored together in a hash-based content store and 'packages' are mostly just indexes of what hash identifier goes where with what permissions et al. This has various nominal advantages but also has the drawback that there is no simple package blob to download, the way there is in other packaging formats.
There are two related ways to get copies of IPS packages for yourself,
both using the low-level pkgrecv command (instead of the higher-level
pkg command). The most obvious way is to have pkgrecv just write
things out into a pkg(5) file ('pkgrecv -a -d ...'). The drawback
of this is that it really does write out everything it downloaded to
a single file. This is fine if you're just downloading one package but
it's not so great if you're using the -r switch to have pkgrecv
download a package and its dependencies. The more complex way is to
actually create your own local repo (which is a directory tree) with
'pkgrepo create /your/dir', then use pkgrecv (without -a) to
download packages into that repo. This gives you everything you want at
the cost of, well, having that repo instead of simple package files that
you can easily copy around separately and so on.
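As a concrete sketch of both approaches (the repository URL and package name are placeholders, and depending on your pkg version there may be extra details such as publisher setup):

    # Way one: write a single pkg(5) archive file, pulling in dependencies.
    pkgrecv -s http://pkg.example.com/release -a -d wget.p5p -r wget

    # Way two: create a local file-based repo and pull packages into it.
    pkgrepo create /var/tmp/myrepo
    pkgrecv -s http://pkg.example.com/release -d /var/tmp/myrepo -r wget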
(Both pkgrecv variants also have the drawback that you have to give
them an explicit repository URL. Among other things this makes it hard
to deal with cross-repository dependencies, for example if a package
from an additional repository needs some new packages from the core
distribution repo.)
What I'd like is a high-level pkg command (or a command option) that
handled all of this complexity for me and wrote out separate pkg(5)
files for each separate package.
(In theory I could do this with a shell script if various pkg subcommands had stable and sufficiently machine-parseable output. I haven't looked into pkg enough to know whether they do; right now I'm at the point where I'm just poking around OmniOS.)
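If it turns out that 'pkg list -H' output is stable enough, the sort of script I have in mind would look roughly like this (the repository URL is a placeholder, and I haven't verified the details of how package names map to sensible file names):

    #!/bin/sh
    # Fetch each installed package from the repo into its own pkg(5)
    # archive. 'pkg list -H' prints one package per line with no header;
    # the first field is the package name.
    repo=http://pkg.example.com/release
    pkg list -H | awk '{print $1}' | while read name; do
        file=$(echo "$name" | tr / _).p5p
        pkgrecv -s "$repo" -a -d "$file" "$name"
    done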
Sidebar: why sysadmins care about getting copies of packages
The simple answer is because sometimes we want to be able to (re)build exact copies of some system, not 'the system but with some or all of the packages updated to current versions'. We also don't want to have to depend on a remote package source staying in operation or keeping those packages around for us, because we've seen package sources go away (or decide that they need to clean up before their disk space usage explodes).
2013-04-08
Why ZFS still needs an equivalent of fsck
One of the things that is more or less a FAQ in ZFS circles is why ZFS
doesn't need an equivalent of fsck and why people asking for it are
wrong. Unfortunately, the ZFS people making that argument are, in the
end, wrong because they have not fully understood the purpose of fsck.
Fsck has two meta-purposes (as opposed to its direct purposes). The obvious one is checking and repairing
filesystem consistency when the filesystem gets itself into an
inconsistent state due to sudden power failure or the like; this is the
traditional Unix use of fsck. As lots of people will tell you, ZFS
doesn't need an external tool to do this because it is all built in.
ZFS even does traditional fsck one better in that it can safely do the
equivalent of periodic precautionary fscks in normal operation, by
scrubbing the pool.
(Our ZFS pools are scrubbed regularly and thus are far more solidly intact than traditional filesystems are.)
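For what it's worth, the scrubbing itself is trivial to schedule; a minimal sketch, with the pool name and timing invented:

    # crontab entry: scrub the 'tank' pool early every Sunday morning.
    0 3 * * 0 /usr/sbin/zpool scrub tank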
The less obvious meta-purpose of fsck is putting as much of
your filesystem as possible back together when things explode
badly. ZFS manifestly needs something to do this job because
there are any number of situations today where ZFS will simply throw
up its hands and say 'it sucks to be you, I'm done here'. This is
not really solvable in ZFS either, because you really can't put
these sorts of serious recovery mechanisms into the normal kernel
filesystem layer; in many cases they would involve going to extreme
lengths and violating the guarantees normally provided by ZFS (cf). This means external user-level
tools.
(zdb does not qualify here because it is too low-level a tool. The
goal of fsck-level tools for disaster recovery is to give you a
relatively hands-off experience and zdb is anything but hands-off.)
PS: despite this logic I don't expect ZFS to ever get such a tool. Writing it would be a lot of work, it probably would not be popular with ZFS people, and telling people 'restore from your backups' is much simpler and more popular. And if they don't have (current) backups, well, that's not ZFS's problem, is it?
(As usual that is the wrong answer.)