Why NFS writes to ZFS are sometimes (or often) slow
It's a relatively well known issue that writing lots of small files over NFS to a ZFS filesystem is slow, but I was surprised to discover a significant slowdown even when doing large bulk streaming writes to single files. Discovering this made me curious enough to dig into what was going on.
Like most recent filesystems, ZFS is journaled, using what the ZFS people call the ZIL (the ZFS Intent Log). Also like other journaled filesystems, ZFS has the fsync problem: every fsync() forces a synchronous flush of the ZIL, which is expensive. So where do the syncs come from?
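To make the fsync problem concrete, here is a minimal sketch of the write-then-sync pattern that triggers it. This is ordinary POSIX file I/O, nothing ZFS-specific; on ZFS the os.fsync() call is what forces the synchronous ZIL commit, and over NFS the client-side equivalent is a COMMIT (the path is just a temporary directory for illustration):

```python
import os
import tempfile

def write_durably(path: str, data: bytes) -> None:
    """Write data and don't return until it is on stable storage.

    On a ZFS filesystem, the os.fsync() here forces a synchronous
    commit of the pending data via the ZIL.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # blocks until the data is durable on disk
    finally:
        os.close(fd)

tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, "zil-demo.txt")
write_durably(demo, b"hello\n")
```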
The first version of NFS required all writes to be synchronous, with the
server not allowed to reply to them until the data was on disk, which
was soon widely acknowledged as a terrible idea for performance. NFS
v3 fixed this by allowing asynchronous writes and introducing a new
COMMIT operation, to force the server to flush some of your async
writes to disk. If the server can't do this, for example because it has
rebooted and lost some of your async writes, it will tell you and it's
your obligation to resend the writes.
NFS v3 COMMITs are a form of fsync(), and so they force ZFS to flush
the ZIL, with the resulting performance hit. One of the times that NFS
v3 clients send a COMMIT is when you
close() a file, which is why
writing lots of small files is slow on ZFS; there's an expensive sync
after every file.
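You can see this per-file cost by timing a batch of small file writes. The sketch below is illustrative (the file count and sizes are arbitrary, and it only shows the expensive part when the directory is actually an NFS mount backed by ZFS): each implicit close() is where an NFS v3 client sends a COMMIT, and thus where the server pays for a ZIL flush.

```python
import os
import tempfile
import time

def write_small_files(directory: str, count: int, size: int = 4096) -> float:
    """Write `count` small files and return the elapsed seconds.

    Over NFS v3 to a ZFS-backed server, the close() at the end of
    each `with` block triggers a COMMIT, so this pays one ZIL flush
    per file rather than one for the whole batch.
    """
    payload = b"x" * size
    start = time.monotonic()
    for i in range(count):
        path = os.path.join(directory, f"file-{i:04d}")
        with open(path, "wb") as f:
            f.write(payload)
        # the implicit close() above is where the COMMIT is sent
    return time.monotonic() - start

batch_dir = tempfile.mkdtemp()
elapsed = write_small_files(batch_dir, 10)
```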
What is going on with large files is the corollary of async writes and
COMMIT: if you have not COMMITed a range of writes, the server is
free to lose them. Which means that you must be able to resend those
writes, and thus have to keep the data sitting around in your writeback
cache until you get a positive reply to your COMMIT. Thus, every so
often the client has to send a COMMIT to the NFS server so that it can
free up some of its writeback cache.
(Indeed, this is what I see when looking at NFS server stats; there are several hundred COMMITs over the course of writing a 10 GB file.)
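The client-side bookkeeping described above can be sketched as a toy model (this is an illustrative simulation of the protocol obligation, not real NFS client code): async writes stay pinned in the writeback cache, a failed-server COMMIT means resending them, and only a successful COMMIT lets the client free the cached data.

```python
class NfsClientCache:
    """Toy model of an NFS v3 client's writeback cache for one file."""

    def __init__(self):
        # offset -> data, pinned here until a COMMIT succeeds
        self.uncommitted = {}

    def async_write(self, offset: int, data: bytes) -> None:
        # The write goes to the server asynchronously, but we must keep
        # a copy: the server is free to lose it before it is COMMITed.
        self.uncommitted[offset] = data

    def commit(self, server_lost_writes: bool) -> list:
        """Simulate a COMMIT; return the writes that had to be resent."""
        resent = []
        if server_lost_writes:
            # The server rebooted and lost our async writes, so it is
            # our obligation to resend them before COMMITing again.
            resent = sorted(self.uncommitted.items())
        # Once the COMMIT succeeds, the pinned data can finally be
        # freed, which is why clients send COMMITs every so often.
        self.uncommitted.clear()
        return resent

cache = NfsClientCache()
cache.async_write(0, b"a" * 4096)
cache.async_write(4096, b"b" * 4096)
resent = cache.commit(server_lost_writes=True)
```

The point of the model is the last step: writeback cache space is only reclaimed on a successful COMMIT, which is what drives the steady stream of COMMITs during a large streaming write.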
All of this says nothing about whether the NFS write slowdown actually matters to you; that's something that depends on your usage patterns and what sort of performance you need. The performance I've measured in our test environment, while not stellar, is probably good enough for us.