2008-03-22
Why NFS writes to ZFS are sometimes (or often) slow
It's a relatively well known issue that writing lots of small files over NFS to a ZFS filesystem is slow, but I was surprised to discover a significant slowdown even when doing large bulk streaming writes to single files. Discovering this got me curious enough to dig into things.
Like most recent filesystems, ZFS is journaled, using what the ZFS people call the ZIL (ZFS Intent Log). Also like other journaled filesystems, ZFS has the fsync problem. So where do the syncs come from?
The first version of NFS required all writes to be synchronous, with the
server not allowed to reply to them until the data was on disk, which
was soon widely acknowledged as a terrible idea for performance. NFS
v3 fixed this by allowing asynchronous writes and introducing a new
operation, COMMIT, to force the server to flush some of your async
writes to disk. If the server can't do this, for example because it has
rebooted and lost some of your async writes, it will tell you and it's
your obligation to resend the writes.
NFS v3 COMMITs are a form of fsync(), and so they force ZFS to flush
the ZIL, with the resulting performance hit. One of the times that NFS
v3 clients send a COMMIT is when you close() a file, which is why
writing lots of small files is slow on ZFS; there's an expensive sync
after every file.
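To make this concrete, here is a minimal sketch of the kind of small-file workload that runs into this; the mount point and the file count are made-up stand-ins for an NFS v3 mount backed by ZFS. Each dd opens, writes, and closes its own file, and if the client COMMITs on close() as described above, every pass through the loop is gated on a ZIL flush rather than on how fast the disks can stream data:

    cd /nfs/zfspool/scratch     # hypothetical NFS v3 mount of a ZFS filesystem
    time sh -c 'i=0
    while [ $i -lt 1000 ]; do
        # write and close a small file; the close() is what triggers the COMMIT
        dd if=/dev/zero of=small.$i bs=4k count=1 2>/dev/null
        i=`expr $i + 1`
    done'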
What is going on with large files is the corollary of async writes and
COMMIT: if you have not COMMITed a range of writes, the server is
free to lose them. Which means that you must be able to resend those
writes, and thus have to keep the data sitting around in your writeback
cache until you get a positive reply to your COMMIT. Thus, every so
often the client has to send a COMMIT to the NFS server so that it can
free up some of its writeback cache.
(Indeed, this is what I see when looking at NFS server stats; there are several hundred COMMITs over the course of writing a 10 GB file.)
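If you want to watch this yourself, something like the following crude sketch will do; the path is hypothetical and the exact nfsstat output format varies, but the idea is just to stream out a large file from a client and then look at the server's NFS v3 operation counts, where the 'commit' counter will have climbed along with 'write':

    # on the NFS client: stream a 10 GB file to the ZFS-backed mount
    dd if=/dev/zero of=/nfs/zfspool/bigfile bs=1024k count=10240
    # afterwards, on the Solaris NFS server: server-side NFS statistics,
    # including per-operation counts for NFS v3 (look for 'commit')
    nfsstat -s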
All of this says nothing about whether the NFS write slowdown actually matters to you; that's something that depends on your usage patterns and what sort of performance you need. The performance I've measured in our test environment, while not stellar, is probably good enough for us.
2008-03-17
Why I expect more from Solaris
One of the asides I cut from my recent entry about Solaris iSCSI support was a comment that while Linux iSCSI support also has its flaws, I expect better from Solaris. It's worthwhile to expand on why this is the case.
(It's not because we pay Sun for Solaris; after all, the university pays Red Hat too, for RHEL (which may wind up being our preferred server Linux; we are not too happy with Ubuntu 6.06 LTS).)
Fundamentally, it is because we have no compelling inherent reason to run Solaris instead of Linux. Thus, if Solaris is merely on par with Linux, there is no reason for us to use it; if we are going to use Solaris, it needs to exceed Linux.
(This is especially the case because there are a number of areas where Solaris is inferior to Linux, for example in driver support. And we have compelling reasons to run Linux; for example, our users are interested in various bits of software that runs first and best on Linux.)
You can argue that we could and should use Solaris even if it is on par with Linux, since there are some areas where it handily beats Linux. Unfortunately, running an additional operating system is non-trivial extra work, and the areas that Solaris beats Linux in only really count if they address problems that we're encountering.
2008-03-13
Another problem with iSCSI on Solaris 10
In addition to my earlier issues, here's a significant problem I've run into with how Solaris does iSCSI: there's no good way to get Solaris 10 to re-probe an iSCSI connection, especially if something goes wrong (for example, if Solaris loses its connection to a target for long enough to give up on it, which takes less time than you might think).
There is no explicit command to restart or re-probe a specific target
connection, nor does it happen implicitly if you run devfsadm. This
leaves doing it by side effect, of which there are four approaches:
- often you can make Solaris do this by redundantly enabling the appropriate, already-enabled target discovery method (for example, 'iscsiadm modify discovery -s enable' if you're using static configuration). However, this hasn't always worked for me. (Under at least some circumstances this will also pick up new LUNs and removed LUNs on existing targets; all three of the non-reboot approaches are sketched as concrete commands after this list.)
- if you are using static configuration, you can 'iscsiadm remove' the specific target and then 'iscsiadm add' it back again. If you are using SendTargets or iSNS discovery and you get multiple targets from a single discovery address, you are out of luck, since removing the discovery address will log you out of all targets found through that address.
- you can disable and then enable the entire discovery method.
- you can throw up your hands and reboot the machine.
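To pull the non-reboot approaches together, here is roughly what they look like as commands. The target name and portal address are invented, and while 'iscsiadm modify discovery -s enable' is what I've actually used, the exact static-config syntax is something you should check against the iscsiadm manpage rather than take from me:

    # approach 1: redundantly re-enable the already-enabled discovery method
    # (static configuration in this example)
    iscsiadm modify discovery -s enable

    # approach 2: remove the specific static-config target and add it back
    # (the target name and portal address here are made up)
    iscsiadm remove static-config iqn.1986-03.com.example:tgt0,192.168.1.10:3260
    iscsiadm add static-config iqn.1986-03.com.example:tgt0,192.168.1.10:3260

    # approach 3: bounce the entire discovery method, which logs you out of
    # every target found through it
    iscsiadm modify discovery -s disable
    iscsiadm modify discovery -s enable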
Reading between the lines of the iscsiadm manpage, the approach of redundantly re-enabling a discovery method is sort of documented. Of course, I don't really trust that documentation because it claims that disabling a discovery method has no effect on targets already discovered by that method, which is a blatant lie.
All of this leaves me rather unhappy about the state of iSCSI in Solaris 10, because in a SAN environment, good management tools should not be a badly documented afterthought, they should be a core feature.
2008-03-02
How ZFS's version of RAID-5 can be better than normal RAID-5
ZFS's 'raidz' and 'raidz2' storage methods are single and dual parity, but they are not exactly the same as normal RAID-5 and RAID-6. The difference is in how each handles partial stripe writes, such as what happens when you write only a small amount to disk.
In a conventional RAID implementation, a partial write to a stripe also has to update the parity, which means additional disk IO (at least a disk read and a disk write). Even if this disk IO doesn't delay the nominal completion of the write, it puts more activity on your disks in general, and disks only support so many IO operations a second.
By contrast ZFS effectively avoids partial stripe writes, because ZFS doesn't rewrite data in place. Even when you update an existing file, ZFS writes new data blocks for the new data, and when it writes the new data blocks it can write new parity blocks for them as well. As a corollary, ZFS doesn't have to have a fixed stripe size (or a fixed chunk size); it just has to make sure that it has enough parity blocks on each separate write.
(This does raise interesting questions of how you make sure that parity doesn't use too much of your disk space if you're doing lots of separate small writes, since such small stripes may not span all of the disks in your pool.)
ZFS can do this sort of thing because it knows what areas of the disk do and don't have data, so it can wander around doing intelligent updates. Disk-level RAID has to assume that all data blocks are live and so has to always update parity any time one of them is touched; the only saving it gets is not having to do a read-modify-write cycle if you write a full stripe.
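As a toy illustration of the conventional small-write penalty (this is the textbook RAID-5 update path, not anything ZFS-specific, and the 'blocks' here are just single byte values): to change one data block in an existing stripe, the array has to read the old data and the old parity, recompute the parity, and write both back, so one logical write turns into two reads and two writes.

    # stand-in contents for one data block and the stripe's parity block
    old_data=37; new_data=99; old_parity=52
    # a partial-stripe update must first READ old_data and old_parity, then:
    new_parity=$(( old_parity ^ old_data ^ new_data ))
    echo "new parity: $new_parity"
    # ...and then WRITE both new_data and new_parity: two reads and an extra
    # parity write on top of the one data write you asked for.  ZFS's raidz
    # sidesteps this by writing the new data, and fresh parity for it, to
    # unused space instead of updating the stripe in place.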
(There are also some reliability advantages of never doing partial stripe rewrites; see Jeff Bonwick.)