2016-01-20
Illumos's ZFS prefetching has recently become less superintelligent than it used to be
Several years ago (in 2012) I wrote How ZFS file prefetching seems to work, which discussed how ZFS prefetching worked at the time. As you may have guessed from the title of this entry, things have recently changed, at least in Illumos and other things built on the open source ZFS code (which includes the very latest ZFS on Linux). The basic change is Illumos 5987 - zfs prefetch code needs work, which landed in mainstream Illumos in early September of 2015, appears to have made it into FreeBSD trunk shortly afterwards, and made it into ZFS on Linux only in late December.
The old code detected up to 8 streams (by default) of forward and reverse reads that were either straight sequential or strided (eg 'read every fourth block'). The new code still has 8 streams, but each stream now only matches sequential forward reads. This makes ZFS prefetching much easier to avoid and makes the code much easier to follow. I suspect that it won't have much effect on real workloads, although you never know; maybe there's real code that does strided forward reads or the like.
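To make the difference concrete, here's a rough sketch using dd (the file name is made up). Only the first, plain sequential pattern will be matched by the new code; the second, strided pattern is the sort of thing the old code would also have prefetched:

    # Sequential forward read: the new code detects this stream and
    # prefetches ahead of it (up to zfetch_max_distance bytes).
    dd if=/tank/fs/bigfile of=/dev/null bs=128k

    # Strided read (every fourth 128 KB block): the old code would have
    # recognized this pattern; the new code will not prefetch for it.
    for i in $(seq 0 4 1023); do
        dd if=/tank/fs/bigfile of=/dev/null bs=128k count=1 skip=$i 2>/dev/null
    done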
(There is also a tunable change; zfetch_max_distance replaces
zfetch_block_cap as the limit on the amount of data that will
be prefetched for a single stream. It's in bytes and defaults to
8 MBytes.)
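For what it's worth, here's a sketch of where this tunable lives and how you might change it; the paths and mechanisms here are what I believe they are, so verify them on your own system before trusting them:

    # ZFS on Linux: zfetch_max_distance is a module parameter, in bytes.
    cat /sys/module/zfs/parameters/zfetch_max_distance
    echo $((16 * 1024 * 1024)) > /sys/module/zfs/parameters/zfetch_max_distance

    # Illumos: set it in /etc/system and reboot (or poke the kernel
    # variable live with mdb -kw):
    #   set zfs:zfetch_max_distance = 0x1000000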
Unfortunately the largest single drawback of ZFS prefetching still remains: prefetching (still) doesn't notice if the data it read in gets discarded from the ARC before it could be used. Just as before, as long as you're reading sequentially from the file, it will keep prefetching more and more data. Nor do streams time out if the file hasn't been touched at all in a while; each ZFS dnode may have up to eight of them hanging around basically forever, waiting patiently to match against the next read and restart prefetching (perhaps very large prefetching, as the amount of data to be prefetched never shrinks as far as I can see).
(That streams are per dnode instead of per open file handle does help explain why ZFS wants up to eight of them, since the dnode is shared across everyone who has the file open. If multiple people have the same file open and are reading from it sequentially (perhaps in different spots), it's good if they all get prefetching.)
2016-01-11
The drawback of setting an explicit mount point for ZFS filesystems
ZFS has three ways of getting filesystems mounted and deciding where
they go in the filesystem hierarchy. As covered in the zfs manpage,
you have a choice of automatically putting the filesystem below the
pool (so that tank/example is mounted as /tank/example), setting
an explicit mount point with mountpoint=/some/where, or marking the
filesystem as 'legacy' so that you mount it yourself through whatever
means you want (usually /etc/vfstab, the legacy approach to filesystem
mounts). With either of the first two options, ZFS will automatically
mount and unmount filesystems as you import and export pools or do
various other things (and will also automatically share them over NFS if
set to do so); with the third, you're on your own to manage things.
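To make that concrete (with made-up pool and filesystem names), the three options look something like this:

    # 1: the default; tank/example inherits its mount point from the
    #    pool and winds up mounted as /tank/example.
    zfs create tank/example

    # 2: an explicit mount point that's independent of the pool name.
    zfs create -o mountpoint=/some/where tank/example

    # 3: legacy mounts; ZFS stays out of it and you mount it yourself,
    #    eg through /etc/vfstab (Solaris/Illumos) or /etc/fstab (Linux).
    zfs create -o mountpoint=legacy tank/example
    mount -F zfs tank/example /some/where     # Solaris/Illumos
    # mount -t zfs tank/example /some/where   # Linux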
The first approach is ZFS's default scheme and what many people
follow. However, for reasons that are in large part historical, we
haven't used it; instead we've explicitly specified our mount points
with mountpoint=/some/where on our fileservers.
When I set up ZFS on Linux on my office
workstation I also set the mount points explicitly, because I was
migrating existing filesystems into ZFS and I didn't feel like
trying to change their mount points (or add another layer of bind
mounts).
For both our fileservers and my workstation, this has turned out to
sometimes be awkward. The largest problem comes if you're in the
process of moving a filesystem from one pool to another on the same
server using zfs send and zfs recv. If mountpoint was unset,
both versions of the filesystem could coexist, with one as
/oldpool/fsys and the other as /newpool/fsys. But with mountpoint
set, they both want to be mounted on the same spot and only one can
win. This means we have to be careful to use 'zfs recv -u' and even
then we have to worry a bit about reboots.
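The move itself looks something like this (with made-up names); the -u on the receive is what keeps the new copy from trying to grab the mount point right away:

    # Copy the filesystem to the new pool without mounting the new copy.
    zfs snapshot oldpool/fsys@move1
    zfs send oldpool/fsys@move1 | zfs recv -u newpool/fsys

    # Later, catch up with an incremental send, still unmounted.
    zfs snapshot oldpool/fsys@move2
    zfs send -i @move1 oldpool/fsys@move2 | zfs recv -u newpool/fsys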
(You can set 'canmount=off' or clear the 'mountpoint' property
on the new-pool version of the filesystem for the time when the
filesystem is only part-moved, but then you have a divergence between
your received snapshot and the current state of the filesystem and
you'll have to force further incremental receives with 'zfs recv
-F'. This is less than ideal, although such a divergence can happen
anyways for other reasons.)
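In concrete terms (again with made-up names), the workaround and its consequence look like this:

    # Keep the part-moved copy from fighting over the mount point.
    zfs set canmount=off newpool/fsys

    # Since newpool/fsys has now diverged from its received snapshot,
    # further incremental receives have to be forced.
    zfs send -i @move1 oldpool/fsys@move2 | zfs recv -u -F newpool/fsys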
On the other hand, there are definite advantages to not having the mount point change and to having mount points be independent of the pool the filesystem is in. There's no particular reason that either users or your backup system needs to care which pool a particular filesystem is in (such as whether it's in a HD-based pool or an SSD-based one, or a mirrored pool instead of a slower but more space-efficient RAIDZ one); in this world, the filesystem name is basically an abstract identifier, instead of the 'physical location' that normal ZFS provides.
(ZFS does not quite do 'physical location' as such, but the pool plus the position within the pool's filesystem hierarchy may determine a lot about stuff like what storage the data is on and what quotas are enforced. I call this the physical location for lack of a better phrase, because users usually don't care about these details or at least how they're implemented.)
On the third hand, arguably the right way to provide an 'abstract identifier' version of filesystems (if you need it) is to build another layer on top of ZFS. On Solaris, you'd probably do this through the automounter with some tool to automatically generate the mappings between logical filesystem identifiers and their current physical locations.
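As a hypothetical sketch (the map names and the lofs trick are just one way of doing it), an indirect automounter map could give everyone stable paths under /data no matter which pool things currently live in:

    # /etc/auto_master entry handing /data over to the auto_data map:
    #   /data   auto_data
    #
    # auto_data map, probably generated from 'zfs list' output,
    # loopback-mounting the real locations:
    #   fsys1   -fstype=lofs   :/newpool/fsys1
    #   fsys2   -fstype=lofs   :/oldpool/fsys2
    #
    # Moving fsys2 to another pool then only means regenerating the map.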
PS: some versions of 'zfs receive' allow you to set properties
on the received filesystem; unfortunately, neither OmniOS nor ZFS
on Linux currently supports that. I also suspect that doing this
creates the same divergence between received snapshot and received
filesystem that setting the properties by hand does, and you're
back to forcing incremental receives with 'zfs recv -F' (and
re-setting the properties and so on).
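Where it is supported, I believe it looks something like this (the -o property=value form; treat the exact spelling as an assumption and check your own zfs manpage):

    # Set properties as part of the receive itself, so the new copy
    # never tries to mount at the old location in the first place.
    zfs send oldpool/fsys@move1 | zfs recv -u -o canmount=off newpool/fsys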
(It's sort of a pity that canmount is not inherited, because
otherwise you could receive filesystems into a special 'newpool/nomount'
hierarchy that blocked mounts and then activate them later by using
'zfs rename' to move them out to their final place. But alas,
no.)
2016-01-05
Illumos's problem with its VCS commit messages
Quite a number of years ago I wrote an entry on the problem with the OpenSolaris source repository, where I called out Sun for terrible commit practices. At the time I thought that the public OS source repository had to be just a series of code snapshots turned into an external repository, but someone from Sun showed up in the comments to assure me that no, the terrible commit practices really were how they worked. I am glad to say that Illumos has fixed this problem in the Illumos master repository.
Well, mostly. Illumos does not routinely bundle multiple unrelated changes together into one commit the way that Sun used to, and (unlike Sun) their bug reports and so on are clearly visible. But they still have one problem with their commits. To show you what it is, here is a typical commit message:
    6434 sa_find_sizes() may compute wrong SA header size
    Reviewed-by: Ned Bass <...>
    Reviewed-by: Brian Behlendorf <...>
    [...]
    Approved by: Robert Mustacchi <...>
That is the entire commit message. To know anything more, you must know how to look up the Illumos issue associated with this. Unless you do this, or are sufficiently knowledgeable about Illumos internals, it is probably not obvious that this is a ZFS bug; if you were scanning the commit logs to look for potentially important things for a ZFS fileserver environment, for example, this commit might not jump out at you as something you'd care about.
Minimal commit messages like this are not what you'd call best practices. Pretty much everyone else has settled on a style where you at least describe a bit about the issue and the changes you're making. This lets people follow along just from the commit logs alone and provides a point in time snapshot of things; external bug reports may get updated or edited later, for example.
Beyond just the ability of people to follow the commit logs, this means that the Illumos commit history is not complete by itself. Since all the real content is in the Illumos issue tracker, the commit logs are crucially dependent on it. Lose the issue tracker (or just lose access to it) and you will be left to reconstruct scraps of understanding.
And, as far as I know, the Illumos issue tracker is not a distributed, replicated resource. There is one of it, and you cannot clone its data the way you can clone the Illumos repo itself.
(I'm sure it's backed up and there's multiple people involved. But there's still centralization here, and we've had things happen to centralized open source resources before. If nothing else, life on the Internet has taught me that almost everything shuts down sooner or later.)
At one point I thought it would be nice to at least include the URL of the Illumos issue in the commit message. I'm not sure of that any more, although I'm sure it'd help some people. It feels like a half-hearted bandaid, though. On the other hand, ZFS on Linux does put in URL references when porting Illumos changes into ZoL (eg) and I do like it, although it's a somewhat different situation.
(I don't expect this part of Illumos development culture to change. I'm sure the people doing Illumos development have heard all of these arguments before, and since they're doing what they're doing they're clearly happy with doing it their way.)