2013-09-23
ZFS filesystem compression and quotas
ZFS filesystem compression is widely seen as a universally good thing (unlike deduplication); turning it on almost always gives you a clear space gain for what is generally a minor cost. Unfortunately it turns out to have an odd drawback in our environment because of how it interacts with ZFS's disk quotas. Put simply, ZFS disk quotas limit the physical space consumed by a filesystem, not the logical space. In other words they limit how much post-compression disk space a filesystem can use instead of the pre-compression space. This has two drawbacks.
The first drawback is simply the user experience. In some situations writing 10 GB to a filesystem with 10 GB of quota space left will fill it up; in other situations you'll be left with a somewhat unpredictable amount of space free afterwards. Similarly, if you have 10 GB free and rewrite portions of an existing file (perhaps you have a database writing and rewriting records), your free space can go down. Or up. All of this can be explained but generally not predicted, and I think it's going to be at least a bit surprising to people.
(Of course these user experience problems exist even without quotas, because your pool only has so much space and how that space gets used becomes unpredictable.)
The more significant problem for us is that we primarily use quotas to limit how much data we have to back up for a single filesystem. Here the space usage we care about and want to limit is actually the raw, pre-compression space usage. We don't care how much space a filesystem takes on disk, we care how much space it will take on backups (and we generally don't want to compress our backups for various reasons). Quotas based on logical space consumed would be much more useful to us than the current ZFS quotas.
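As an illustration, reasonably modern ZFS versions expose both the physical and the logical space consumption of a filesystem as properties, which makes the gap between the two numbers easy to see. A quick sketch (the pool and filesystem names are made up, and 'logicalused' may not exist on older ZFS versions such as our Solaris 10 servers):

    # turn compression on and set a quota as usual
    zfs set compression=on tank/homes
    zfs set quota=100G tank/homes

    # 'used' is post-compression space, which is what the quota limits;
    # 'logicalused' is the pre-compression size of the data, which is
    # roughly what we'd have to back up.
    zfs get used,logicalused,compressratio,quota tank/homes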
(Since we have to recreate all of our pools anyways I've been thinking about whether we want to change our standard pool and filesystem configurations. My tentative conclusion is that we don't want to turn compression on, largely because of the backup issue combined with it probably not saving people significant amounts of space.)
2013-09-02
A little bit more on ZFS RAIDZ read performance
Back in this entry I talked about how all levels of ZFS RAIDZ had an unexpected read performance hit: they can't read less than a full stripe, so instead of the IOPS of N disks you get the IOPS of one disk. Well, it was recently pointed out to me that this is not quite correct. It is true that ZFS reads all of a data block's stripe when it reads the block; however, ZFS does not read the parity chunks (unless the block does not checksum correctly and needs to be repaired).
In normal RAIDZ pools the difference between 'all disks' and 'all disks except the parity disks' is small. If the parity for the stripes you're reading bits of is evenly spread over all of the disks, you might get somewhat more than one disk's IOPS in aggregate. Where this can matter is in very small RAIDZ pools, for example a four-disk RAIDZ2 pool. Here half your drives are parity drives for any particular data block and you may get something more like two disks of IOPS.
(A four-disk RAIDZ2 vdev is actually an interesting thing and potentially useful; it's basically a more resilient but potentially slower version of a two-vdev set of mirrors. You lose half of your disk space, as with mirroring, but you can withstand the failure of any two disks (unlike mirroring).)
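For concreteness, a hypothetical four-disk RAIDZ2 vdev looks like this (the pool and device names are made up and will differ on your system):

    # four-way RAIDZ2: any two of the four disks can fail
    zpool create tank raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0

    # the roughly comparable two-vdev mirror layout; it also gives you
    # half the raw space but only survives losing the 'right' two disks
    # zpool create tank mirror c0t1d0 c0t2d0 mirror c0t3d0 c0t4d0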
To add some more RAIDZ parity trivia: RAIDZ parity is read and verified during scrubs (and thus likely resilvers), which is what you want. Data block checksums are verified as well, of course, which means that reads during scrubs genuinely busy all drives.
Sidebar: small write blocks and read IOPS
Another way that you can theoretically get more than one disk's IOPS from a RAIDZ vdev is if the data was written in sufficiently small blocks. As I mentioned in passing here, ZFS doesn't have a fixed 'stripe size' and a small write will only put data (and parity) on fewer than N disks. In turn, reading this data back will need fewer than N (minus parity) disks, meaning that with luck you can read another small block from the other drives at the same time.
Since 'one sector' is the minimum amount of data to put on a single drive, this is probably much more likely now in the days of disks with 4096-byte sectors than it was on 512-byte sector drives. If you have a ten-disk RAIDZ2 on 4k disks, for example, it now takes a 32 KB data block to wind up on all 8 possible data drives.
(On 512-byte sector disks it would have only needed a 4KB data block.)
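The arithmetic behind those numbers is simply the number of data drives times the sector size; a trivial sketch using the figures from the example above:

    # a ten-disk RAIDZ2 has 8 data drives and 2 parity drives
    data_drives=8
    sector=4096                          # bytes, for 4K-sector disks
    echo $(( data_drives * sector ))     # 32768 bytes, ie a 32 KB block
    # with 512-byte sectors the same layout needs only 8 * 512 = 4096 bytes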
2013-08-16
SSDs may make ZFS raidz viable for general use
The classic problem and surprise with ZFS's version of RAID-5+ (raidz1, raidz2, and so on) is that you get much less read IO from your pool than most people expect. Rather than N disks' worth of read IOPS you get one disk's worth for small random reads (more or less). To date this has mostly made raidz unsuitable for general use; you need to be doing relatively little random read IO or have rather low performance requirements to avoid being disappointed.
(Sequential read IO is less affected. Although I haven't tested or measured it, I believe that ZFS raidz will saturate your available disk bandwidth for predictable read patterns.)
Or rather, this has made raidz unsuitable because hard drives have such low IOPS rates (generally assumed to be around 100 a second) that having only one disk's worth is terrible. But SSDs have drastically higher IOPS for reads; one SSD's worth of reads a second is still generally an impressively high number. While a raidz pool of SSDs will not have as high an IOPS rate as a bunch of mirrored SSDs, you'll get a lot more storage for your money. And a single SSD's worth of IOPS may well be enough to saturate other parts of your system (or at least more than satisfy their performance needs).
(There are other tradeoffs, of course. A raidzN will protect you from any arbitrary N disks dying, unlike mirrors, but can't protect you from a whole controller falling over the way a distributed set of mirrors can.)
This didn't even occur to me until today because I've been conditioned to shy away from raidz; I 'knew' that it performed terribly for random reads and hadn't thought through the special implications of changing raidz from HDs to SSDs. I don't think this will change our general plans (we value immunity from a single iSCSI backend failing) but it's certainly something I'm going to keep in mind just in case.
A peculiar use of ZFS L2ARC that we're planning
In our SAN-based fileserver infrastructure we have a relatively small but very important and very busy pool. We need to be able to fail over this pool to another physical fileserver, so its data storage has to live on our iSCSI backends. But even with it on SSDs on the backends, going over the network with iSCSI adds latency and probably reduces bandwidth somewhat. We're not willing to move the pool to local storage on a fileserver; it's much more important that the pool stay up than that it be blindingly fast (especially since it's basically fast enough now). Oh, and it's generally much more important that reads be fast than writes.
But there is a way around this, assuming that you're willing to live with failover taking manual work (which we are): a large local L2ARC plus the regular SAN data storage. This particular pool is small enough that we can basically get all of its data into an affordable L2ARC SSD (and certainly all of the active data). A local L2ARC gives us local (read) IO for speed and effectively reduces the actual backend data storage to a persistence mechanism.
What makes this work is that a pool will import and run without its L2ARC device(s). Because L2ARC is only a cache, ZFS is willing to bring up a pool with missing L2ARC devices. If we have to fail over the pool to another fileserver it will come up without L2ARC and be slower, but at least it will come up.
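Mechanically this is just a pool with an ordinary cache device added; a sketch with made-up pool and device names:

    # add a local SSD as an L2ARC (cache) device
    zpool add tank cache c2t0d0

    # cache devices are never required for import, so the pool can still
    # be imported on a fileserver that lacks the SSD; it just runs slower.
    # they can also be removed again at any time:
    zpool remove tank c2t0d0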
(A local L2ARC plus SAN data storage works for any pool and is what we're planning in general when we renew our fileserver infrastructure (hopefully soon). But it may have limited effectiveness for large pools, based on usage patterns and so on. What makes this particular pool special is that it's small enough that the L2ARC can basically store all of it. And the L2ARC doesn't need to be mirrored or anything expensive.)
PS: given that this pool is already on SSDs, I don't think that there's any point to a separate log device. Since a SLOG is essential to the pool, it would have to live in the SAN and be mirrored; we couldn't get away with a local SLOG plus the data in the SAN.
2013-07-11
The ZFS ZIL's optimizations for data writes
In yesterday's entry on the ZIL I mentioned that the
ZIL has some clever optimizations for large write()s. To understand
these (and some related ZFS filesystem properties), let's start with the
fundamental problem.
A simple, straightforward filesystem journal includes a full copy
of each operation or transaction that it's recording. Many of these full
copies will be small (for metadata operations like file renames), but
for data writes you need to include the data being written. Now suppose
that you are routinely writing a lot of data and then fsync()'ing it.
This will wind up with the filesystem writing two copies of that large
data, one copy recorded with the journal and then a second copy written
to the actual live filesystem. This is inefficient and, worse, it costs
you both disk seeks (between the location of the journal and the final
location of the data) and write bandwidth.
Because ZFS is a copy-on-write filesystem where old data is never
overwritten in place, it can optimize this process in a straightforward
way. Rather than putting the new data into the journal it can directly
write the new data to its final (new) location in the filesystem and
then simply record that new location in the journal. However, this
is now a tradeoff; in exchange for not writing the data twice you're
forcing the journal commit to wait for a separate (and full) data write,
complete with an extra seek between the journal and the final location
of the data. For sufficiently small amounts of data this tradeoff is not
worth it and you're better off just writing an extra copy of the data to
the journal without waiting.
In ZFS, this division point is set by the global tuneable variable
zfs_immediate_write_sz. Data writes larger than this size will be
pushed directly to their final location and the ZIL will only include a
pointer to it.
Actually that's a lie. The real situation is rather more complicated.
First, if the data write is larger than the file's blocksize it is
always put into the on-disk ZIL (possibly because otherwise the ZIL
would have to record multiple pointers to its final location since it
will be split across multiple blocks, which could get complicated). Next,
you can set filesystems to have 'logbias=throughput'; such a
filesystem writes all data blocks to their final locations (among other
effects). Finally, if you have a separate log device (with a normal
logbias) data writes will always go into the log regardless of their
size, even for very large writes.
So in summary zfs_immediate_write_sz only makes a difference if you
are using logbias=latency and do not have a separate log device,
which can basically be summarized as 'if you have a normal pool without
any sort of special setup'. If you are using logbias=throughput it
is effectively 0; if you have a separate log device it is effectively
infinite.
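On Solaris and Illumos you can look at (or change) this tuneable the same way as other global kernel variables; this is a sketch rather than a recommendation (and ::print needs kernel CTF data to work):

    # print the current value on a live system
    echo 'zfs_immediate_write_sz::print' | mdb -k

    # or set it persistently in /etc/system (takes effect on reboot);
    # 32768 here is an arbitrary example value, not advice.
    # set zfs:zfs_immediate_write_sz = 32768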
Update (October 13 2013): It turns out that this description is not quite complete. See part 2 for an important qualification.
Sidebar: slogs and logbias=throughput
Note that there is no point in having a separate log device and setting
logbias=throughput on all of your filesystems, because the latter
makes the filesystems not use your slog. This is implicit in the
description of throughput's behavior but may not be clear enough.
'Throughput' is apparently intended for situations where you want
to preserve your slog bandwidth and latency for filesystems where
ZIL commit latency is very important; you set everything else to
logbias=throughput so that they don't touch the slog.
If you have an all-SSD pool with no slogs it may make sense to set
logbias=throughput on everything in it. Seeks are basically free on
the SSDs and you'll probably wind up using less overall bandwidth to the
SSDs since you're writing less data. Note that I haven't measured
or researched this.
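Setting this is just a normal per-filesystem property; for example, with hypothetical filesystem names:

    # filesystems where ZIL commit latency really matters keep the default
    zfs set logbias=latency tank/mail

    # everything else gets pushed off the slog
    zfs set logbias=throughput tank/scratch

    # see what you've ended up with across the pool
    zfs get -r logbias tank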
ZFS transaction groups and the ZFS Intent Log
I've just been digging around in the depths of the ZIL and of ZFS transaction groups, so before I forget everything I've figured out I'm going to write it down (partly because when I went looking I couldn't find any really detailed information on this stuff). The necessary disclaimer is that all of this is as far as I can tell from my own research and code reading and thus I could be wrong about some of it.
Let's start with transaction groups. All write operations in ZFS are part of a transaction and every transaction is part of a 'transaction group' (a TXG in the ZFS jargon). TXGs are numbered sequentially and always commit in sequential order, and there is only one open TXG at any given time. Because ZFS immediately attaches all write IO to a transaction and thus a TXG, ZFS-level write operations cannot cross each other at the TXG level; if two writes are issued in order either they are both part of the same TXG or the second write is in a later TXG (and TXGs are atomic, which is the core of ZFS's consistency guarantees).
(An excellent long discussion of how ZFS transactions work is at the start of txg.c in the ZFS source.)
ZFS also has the ZIL aka the ZFS Intent Log. The ZIL exists because
of the journaling fsync() problem:
you don't want to have to flush out a huge file just because someone
wanted to fsync() a small one (that gets you slow fsync()s and
unhappy people). Without some sort of separate log all ZFS could do to
force things to disk would be to immediately commit the entire current
transaction group, which drags all uncommitted write operations with it
whether or not they have anything to do with the file being fsync()'d.
One of the confusing things about the ZIL is that it's common to talk about 'the ZIL' when there is really no single such thing. Each filesystem and zvol actually has its own separate ZIL, and these are all written to and recovered separately from each other (although if you have separate log devices the ZILs are all normally stored on the slog devices). We also need to draw a distinction between the on-disk ZIL and the in-memory 'ZIL' structure (implicitly for a particular dataset). The on-disk ZIL has committed records while the in-memory ZIL holds records that have not yet been committed (or expired because their TXG committed). A ZIL commit is the process of taking some or all of the in-memory ZIL records and flushing them to disk.
Because ZFS doesn't know in advance what's going to be fsync()'d,
the in-memory ZIL holds a record of all write operations done to
the dataset. The ZIL has the concept of two sorts of write operations,
'synchronous' and 'asynchronous', and two sorts of ZIL commits,
general and file-specific. Sync writes are always committed when
the ZIL is committed; async writes are not committed if the ZIL is
doing a file-specific commit and they are for a different file. ZFS
metadata operations like creating or renaming files are synchronous
while data writes are generally but not always asynchronous. For
obvious reasons fsync() does a file-specific ZIL commit, as do
the other ways of forcing synchronous write IO.
If the ZIL is active for a dataset the dataset no longer has strong
write ordering properties for data that is not explicitly flushed
to disk via fsync() or the like. Because of a performance hack
for fsync() this currently extends well beyond the obvious case
of writing one file, writing a second file, and fsync()'ing the
second file; in some cases write data will be included in a ZIL
commit even though it has not been explicitly flushed.
(If you want the gory details, see: 1, 2, 3. This applies to all versions of ZFS, not just ZFS on Linux.)
ZIL records, both in memory and on disk, are completely separate from the transactions that are part of transaction groups and they're not read from either memory or disk in the process of committing a transaction group. In fact under normal operation on-disk ZIL records are never read at all. This can sometimes be a problem if you have separate ZIL log devices because nothing will notice if your log device is throwing away writes (or corrupting them) or can't actually read them back.
(I believe that pool scrubs do read the on-disk ZIL as a check but I'm not entirely sure.)
Modern versions of ZFS support a per-filesystem 'sync=' property.
What I've described above is the behavior of the 'default' setting for
it. A setting of 'always' forces a ZIL commit on every write operation
(and as a result has a strong write order guarantee). A setting of
'disabled' disables ZIL commits but not the in-memory ZIL, which
will continue to accumulate records between TXG commits and then drop
the records when a TXG commits. A filesystem with 'sync=disabled'
actually has stronger write ordering guarantees than a filesystem with
the ZIL enabled, at the cost of lying to applications about whether data
actually is solidly on disk at all (in some cases this may be okay).
(Presumably one reason for keeping the in-memory ZIL active for
sync=disabled is so that you can change this property back and have
fsync() immediately start doing the right thing.)
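For reference, the property is set per filesystem in the usual way (the filesystem names here are made up):

    zfs set sync=always tank/dbdata     # every write operation forces a ZIL commit
    zfs set sync=disabled tank/scratch  # fsync() et al lie to applications
    zfs inherit sync tank/scratch       # back to the default behavior
    zfs get -r sync tank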
Under some circumstances the on-disk ZIL uses clever optimizations so
that it doesn't have to write out two copies of large write()s (one to
the ZIL log for a ZIL commit and then a second to the regular ZFS pool
data structures as part of the TXG commit). A discussion of exactly
how this works is beyond the scope of this entry,
which is already long enough as it is.
(There is a decent comment discussing some more details of the ZIL at the start of zil.c.)
2013-07-09
How we want to recover our ZFS pools from SAN outages
Last night I wrote about how I decided to sit on my hands after we had a SAN backend failure, rather than spring into sleepy action to swap in our hot spare backend. This turned out to be exactly the right decision for more than the obvious reasons.
In a SAN environment like ours it's quite possible to lose access to a whole bunch of disks without losing the disks themselves. This is what happened to us last night; the power supply on one disk shelf appears to have flaked out. We swapped out the disk shelf for another one, transplanted the disks themselves back into the new shelf, and the whole iSCSI backend was back on the air. ZFS had long since faulted all of the disks, of course (since it had spent hours being unable to talk to them), but the disks were still in their pools.
(Some RAID systems will actively eject disks from storage arrays if they are too faulted or if they disappear. ZFS doesn't do this. Those disks are in their pools until you remove them yourself.)
With the disks still in their pools, we could use 'zpool clear'
to re-activate them (it's an underdocumented side effect of clearing
errors). ZFS was smart enough to know that the disks already had
most of the pool data and just needed relatively minimal resilvering,
which is a lot faster than the full resilvering that pulling in spares
needs. Once we had the disks powered up again it took perhaps an
hour until all of the pools had their redundancy back (and part of
that time was us being cautious about IO load).
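The actual recovery sequence is pleasantly anticlimactic; roughly this, with a made-up pool name:

    # see which pools are degraded and which disks are faulted
    zpool status -x

    # with the shelf replaced and the disks visible again, clearing the
    # errors also brings the faulted disks back into active service
    zpool clear tank

    # then watch the (relatively minimal) resilver run to completion
    zpool status tank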
In some environments this alone might be sufficient, but we've had
prior experience that it isn't good enough;
we also need to 'zpool scrub' each pool until it reports no errors
(this is now in progress). Doing scrubs takes rather a while but
at least all the pools have (relatively) full redundancy in the
mean time.
(Part of the reason for needing to scrub our disks is that our disks probably have missing writes due to abruptly losing power.)
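The scrubbing side of this is equally simple, if slow; something like:

    zpool scrub tank
    # check progress and any errors found; repeat the scrub until it
    # completes with no errors reported
    zpool status -v tank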
This sort of recovery is obviously a lot faster, less disruptive, and safer than resilvering terabytes of data by switching over to our hot spare backend (especially if we actively detach the disks from the 'failed' backend before the resilvering has finished). In the future I think we're going to want to recover failed iSCSI backends this way if at all possible. It may be somewhat more manual work (and it requires hands-on attention to swap hardware around) but it's much faster and better.
(In this specific case delaying ten hours or so probably saved us at least a couple of days of resilvering time, during which we would have had several terabytes exposed to single disk failures.)
2013-07-05
ZFS deduplication is terribly documented
One of the things that makes ZFS deduplication so dangerous and so infuriating is that it is terribly documented. My example today is what should be a simple question: does turning ZFS deduplication on irreversibly taint the pool and/or filesystem(s) involved such that you'll have performance issues even if you delete all of the deduplicated data, or can you later turn ZFS deduplication off and, with enough work, return the pool to its pre-dedup state of good performance?
You can find sources on the Internet that will give you both answers.
Oracle's own online documentation is cheerfully silent about this
(at least the full documentation does contain warnings about the
downsides of dedup, although the zfs(1)
manpage still doesn't). The only way to know for sure is to either
read kernel source or find a serious ZFS expert and ask them.
(I don't know the answer, although I'd like to.)
This should not be how you find answers to important questions about ZFS dedup. That it is demonstrates how bad the ZFS dedup documentation is, both the official Oracle documentation and most especially the Illumos manpages (because with Illumos, the manpages are mostly it).
By the way, I'm picking on ZFS dedup because ZFS dedup is both a really attractive sounding feature (who doesn't want space savings basically for free, or at least what sounds like free) and probably the single biggest way to have a terrible experience with ZFS. The current state of affairs virtually guarantees a never-ending stream of people blowing their feet off with it and leaving angry.
(The specific question here is very important if you find that dedup is causing you problems. The answer is the difference between having a reasonably graceful and gradual way out or finding yourself facing a potentially major dislocation. And if there is a graceful way out then it's much safer to experiment with dedup.)
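While the documentation won't answer this particular question, you can at least see how much dedup state a pool is carrying around; a hedged sketch with a made-up pool name:

    # is dedup turned on anywhere, and what is the pool-wide dedup ratio?
    zfs get -r dedup tank
    zpool list -o name,size,alloc,dedupratio tank

    # summary statistics for the dedup table (DDT); entries only exist
    # for data written while dedup was turned on
    zpool status -D tank
    # a more detailed DDT histogram, if you want the gory numbers
    # zdb -DD tank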
2013-06-25
Balancing Illumos against ZFS on Linux
Every so often I poke at some aspect of our fileserver replacement project (where we need to replace our current Solaris 10 Update 8 servers with something modern enough to handle 4K sector disks), but at the moment things are moving slowly. One reason for this slowness is that I hope that things will get clearer as time goes on.
Currently I'm looking at Illumos and ZFS on Linux. With Illumos, I brought up most of our environment on an OmniOS VM and it all worked (including some tricky bits) and almost all of it worked just like Solaris. With ZFS on Linux I have the core basics up but I'm slowly chasing an NFS performance issue. And in a way this encapsulates the overall issue for me.
The risk with ZFS on Linux is issues integrating the ZFS codebase with Linux. My NFS write performance issue is clearly an issue at this join point and I have few concrete ideas for how to either troubleshoot it or to resolve it. It's probably not the only such integration issue out there and the only way to smoke them out (or to be sure that they aren't going to affect us) may be to run ZoL in production in our environment.
(I admit that that's the pessimistic view.)
The risk with Illumos is the same as it always has been: that we won't be able to find an Illumos distribution that is mature and supported for a long time, or at least not a distribution that has what we want. OmniOS has what we want and tracking it over time will tell me something about the other attributes. Not huge amounts, though, so I think I am going to have to start following some mailing lists so I can get an informed idea of how things are going.
(A project's mailing lists often give you a somewhat too pessimistic view of how healthy the project is because they often attract people with problems or gripes instead of all of the people who are happy. But seeing what the problems and gripes are is itself interesting, as is finding out what the explosive political issues are. It's just that mailing lists are time consuming and it's hard to sustain interest if you don't care about the problems, you're just there to get a sense of the land.)
Given that our fileservers are going to be locked-down appliances that we rarely update or even touch, my somewhat reluctant current belief is that any Illumos distribution is probably going to wind up less risky than ZFS on Linux. In practice we can have much more confidence in the core ZFS, NFS, iSCSI, and multipathing environment on Illumos because basically all of it comes from Solaris and we have plenty of experience with most of the Solaris bits. If the worst comes to the worst, lack of updates is not a huge drawback once we freeze the production system.
2013-06-11
The good and bad of IPS (as I see it)
IPS (the 'Image Packaging System') is the new packaging system used in
Solaris 11 and (more importantly) many Illumos-derived distributions; it
replaces Solaris 10 packages and patches. I have previously described IPS as being more or less like git; it puts all
files together in a hash-based content store and then has 'packages'
that are basically just indexes into the store. This contrasts with the
traditional Linux approach to packaging where each package is an archive
of some sort that contains all actual files in the package.
The attractive part of IPS is what the content store approach does for repositories and for package updates. If files are the same between two versions of a package (or between multiple packages), the repository only needs to store one copy and the package update or install process can detect that you already have the needed file installed. This mimics the practical behavior of Solaris 10 patches, which only included changed files (as opposed to the Linux approach, where changing just one file in a package causes you to re-issue an entire second copy of the whole package).
(This also minimizes what needs to be digitally signed. Much as in git, you don't need to digitally sign the files themselves, just the package index data. The all-in-one Linux package format means that you generally need to sign and verify large blobs of data.)
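To make the content-store idea concrete, here is a toy sketch of the general technique; it is purely illustrative (IPS does not actually lay out repositories this way, and the file and package names are made up):

    # store each file under the hash of its contents, so identical files
    # are kept only once no matter how many packages include them
    store_file() {
        hash=$(sha256sum "$1" | awk '{print $1}')
        mkdir -p repo/files repo/manifests
        cp "$1" "repo/files/$hash"
        # a package 'manifest' is then just a list of path-to-hash mappings
        echo "$2 $hash" >> "repo/manifests/$3"
    }

    store_file ./build/ls usr/bin/ls coreutils@1.0
    store_file ./build/cp usr/bin/cp coreutils@1.0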
The bad part of IPS is what it does to downloading and storing packages. As far as I know, files are downloaded from IPS repositories in the same way that they're stored; you ask for them one by one and they then dribble in bit by bit. As we've learned the hard way, this is not a great way to do things on the modern Internet (or in general) because each separate fetch requires a new connection (or at least a new request) and that has various consequences.
(IPS packages are normally fetched over HTTP or HTTPS but I don't know if the IPS client and server are smart enough to take advantage of HTTP connection reuse.)
I'm also not enthused about how this makes package repositories harder to manage and exposes them to subtle forms of breakage (such as a file that's listed in package manifests but not present in the repository). Pruning old packages is now necessarily a whole-repository operation, since you can't just remove their files without seeing if any other package uses them.
I suspect that Sun developed IPS this way to preserve the small sizes and small installation changes of Solaris 10 patches (which transfer and install only the changed files instead of the whole package). I prefer the simpler approach of Linux packages (and I note that Linux package updates themselves can optimize both transfer size and install time actions).