2013-10-21
NFS's problem with (concurrent) writes
If you hang around distributed filesystem developers, you may hear them say grumpy things about NFS's handling of concurrent writes and writes in general. If you're an outsider, this can be a little bit opaque. I didn't fully remember the details until I was reminded about them recently, so in my usual tradition I am going to write down the core problem. To start with, I should say that the core problem is with NFS the protocol, not with any particular implementation.
Suppose that you have two processes, A and B. A is writing to a file and B is reading from it (perhaps they are cooperating database processes or something). If A and B are running on the same machine, the moment that A calls write() the newly-written data is visible to B when it next does a read() (or it's directly visible if B has the file mmap()'d).
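(As a concrete illustration of the same-machine case, here's a minimal sketch in Python with a made-up path; fork() and waitpid() are just a simple way to order the two processes. The point is that once A's write() returns, B's next read() through the same kernel sees the new data, with no flushing or reopening needed.)

    import os

    path = "/tmp/shared-file"     # made-up path for illustration

    pid = os.fork()
    if pid == 0:
        # Process A: write a record; write() returning is all it takes.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        os.write(fd, b"new record\n")
        os._exit(0)
    else:
        os.waitpid(pid, 0)
        # Process B: sees A's data immediately, because both processes go
        # through the same kernel and the same page cache.
        fd = os.open(path, os.O_RDONLY)
        print(os.read(fd, 4096))
        os.close(fd)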
Now we put A and B on different machines, sharing access to the file over NFS. Suddenly we have a problem, or actually two problems.
First, NFS is silent on how long A's kernel can hold on to the write() before sending it to the NFS server. If A close()s or fsync()s the file, the kernel must ship the writes off to the NFS server, but before then it may hang on to them for some amount of time at its convenience. Second, NFS has no protocol for the server to notify B's kernel that there is updated data in the file. Instead B's kernel may be holding on to what is now old cached data that it will quietly give to B, even though the server has new data. Properly functioning NFS clients check for this when you open() a file (and discard old cached data if necessary); I believe that they may check at other times too, but it's not guaranteed.
The CS way of putting this is that this is a distributed cache invalidation problem and NFS has only very basic support for it. Basically NFS punts and tells you to use higher-level mechanisms to make this work, mechanisms that mean A and B have to be at least a bit NFS-aware. Many modern distributed and cluster filesystems have much more robust support that guarantees processes A and B see a result much closer to what they would if they ran on the same machine (some distributed FSes probably guarantee that it's basically equivalent).
(Apparently one term of art for this is that NFS has only 'close to open' consistency, ie you only get consistent results among a pool of clients if A closes the file before B opens it.)
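(In code terms, the NFS-aware version of A and B looks something like this rough Python sketch, with a made-up path: the writer pushes its data to the server with fsync()/close(), and the reader reopens the file each time so that its client kernel revalidates its cache instead of quietly handing back stale pages.)

    import os

    path = "/nfs/shared/state"    # made-up path for illustration

    def writer_update(data):
        # Process A: write, then force the data out to the NFS server
        # instead of letting the client kernel sit on it.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, data)
        os.fsync(fd)
        os.close(fd)

    def reader_poll():
        # Process B: open the file fresh each time rather than keeping it
        # open, so the client checks with the server for new data.
        fd = os.open(path, os.O_RDONLY)
        data = os.read(fd, 65536)
        os.close(fd)
        return data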
Thinking about how I want to test disk IO on an iSCSI backend
We're in the process of renewing the hardware for our fileserver infrastructure and we've just got in the evaluation unit for what we hope will be the new backend hardware. One important part of evaluating it will be assessing how it does disk IO, so this entry is me thinking out loud about a high level view of what I want to test there.
In general we need to find out two things: how well the hardware performs, and whether it explodes under high load or other abnormal conditions. Since the disks will ultimately be exported as iSCSI targets, the local filesystem performance is broadly uninteresting; I might as well test with raw disk access when possible to remove filesystem-level effects.
For performance tests:
- Streaming read and write bandwidth to an individual drive. I should
test drives in all slots of the enclosure to check for slot-dependent
performance impacts. (Ideally with the same drive, but that may be too
much manual work.)
- Streaming read and write bandwidth to multiple drives at once. What
aggregate performance can we get, where does it seem to level
off, and what are the limiting factors? I would expect at least
part of this to correlate with controller topology; since we have
two controllers in the system, I should also make sure that they
perform more or less the same.
- Single-drive random IOPS rates, then how the IOPS rate scales as multiple drives are driven simultaneously. In theory I may need some SSDs to really test this, but on the other hand we don't really care what the real limit is if we can drive all of the HDs at their full IOPS rate. (There's a rough sketch of both the streaming and the IOPS tests below.)
I should probably also test for an IOPS decay on one drive when other drives are being driven at full speed with streaming IO, in case there are controller limits (or OS limits) in effect there.
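In practice I'd probably reach for an existing benchmark tool here, but as a rough sketch of what the streaming and random IOPS tests are actually measuring, here's some Python for Linux; the device paths, sizes, and durations are all placeholders. O_DIRECT is there to bypass the page cache so that I'm measuring the disks and controllers rather than RAM.

    import mmap, os, random, time

    def stream_read_mbs(dev, blocksize=1 << 20, seconds=30):
        # Sequential 1 MiB reads from the start of the raw device.
        fd = os.open(dev, os.O_RDONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, blocksize)   # page-aligned, as O_DIRECT wants
        done = 0
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            done += os.preadv(fd, [buf], done)
        os.close(fd)
        return done / (time.monotonic() - start) / 1e6

    def random_read_iops(dev, devbytes, iosize=4096, seconds=30):
        # Single-threaded 4 KiB random reads at a queue depth of one, so
        # this is a floor for the drive's IOPS rate, not a ceiling.
        fd = os.open(dev, os.O_RDONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, iosize)
        blocks = devbytes // iosize
        ops = 0
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            os.preadv(fd, [buf], random.randrange(blocks) * iosize)
            ops += 1
        os.close(fd)
        return ops / (time.monotonic() - start)

Running several of these at once, one process per drive, is then the multi-drive scaling version of the same tests.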
All of these performance tests will also check that the system keeps working correctly under basic high IO load, but there are other aspects to proper functionality. What I can think of now is:
- Basic hotplugging of drives. Both inserted and removed drives should
be recognized, and promptly. Insertion and removal of multiple drives
should work.
- Test the effects of hotplugging drives on IO being done to other drives
at the same time. The hardware topology involved should (I believe) make
this a non-issue but we want to test this.
- Test the effects of flushing write caches under high load, both to the
same disk and to other disks; there's a rough sketch of this below.
Again this should be a non-issue, but, well, 'should' is the important
word here.
- As a trivial test, make sure that a fully dead disk doesn't cause any
controller problems.
- Test how the controller behaves for a 'failing but not dead' disk, one that gives erratic results or read errors or both. IO to other disks should continue working without problems while we should get clear errors on the affected disk.
(I think that we have such failing disks around to test with, since we've had a run of failures and semi-failures recently. Hopefully I can find a properly broken disk without too much work.)
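For the write cache flushing test from the list, here's a minimal sketch under the same assumptions as the earlier Python, again with a placeholder device name; fsync() on a block device fd makes the kernel push out dirty data and send a cache flush to the drive. Note that this scribbles on the device, so it's only for scratch data disks.

    import mmap, os, time

    def flush_hammer(dev, blocksize=1 << 16, seconds=30):
        # Destructive: repeatedly rewrite the start of a scratch disk and
        # force a write cache flush after every write, while other drives
        # are kept busy with streaming IO from the earlier sketch.
        fd = os.open(dev, os.O_WRONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, blocksize)
        buf.write(b"\xa5" * blocksize)
        flushes = 0
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            os.pwritev(fd, [buf], 0)
            os.fsync(fd)
            flushes += 1
        os.close(fd)
        return flushes / (time.monotonic() - start)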
I'm probably missing some useful things that I'll come up with later, but just writing this list down now has made me realize that I want to do tests with a failing disk.
(Note that this is not testing network level things or how the iSCSI software will work on this hardware and so on. That will get tested later; for the first pass I'm interested only in the low-level disk performance because everything else depends on that.)
Sidebar: the rough test hardware
This evaluation unit has 8x SAS on the motherboard and we've added another 8x SAS via an LSI board (I don't have the exact model number handy right now). The (data) disks are 7200 RPM 2TB SATA HDs, directly connected without a SAS expander. One obvious choke point is the single PCIE board with 8 drives on it; another one may be how the motherboard SAS ports are connected up. This time around I should actually work out the PCIE bandwidth limits as best I can (well, assuming that eight drives going at once deliver less than the expected full bandwidth).
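(As a back-of-the-envelope starting point, with numbers that are assumptions rather than anything measured on this hardware: if each 7200 RPM SATA drive streams at roughly 150 Mbytes/sec and the LSI board sits in a PCIe 2.0 x8 slot at roughly 500 Mbytes/sec per lane per direction, eight drives together come nowhere near the slot's raw bandwidth, so a shortfall there probably points at the controller or the OS rather than PCIE itself.)

    # Back-of-the-envelope numbers; all of these are assumptions, not
    # measurements of this particular hardware.
    drives = 8
    mbs_per_drive = 150            # rough streaming rate for a 7200 RPM SATA HD
    lanes, mbs_per_lane = 8, 500   # assuming a PCIe 2.0 x8 slot

    disk_aggregate = drives * mbs_per_drive   # ~1200 Mbytes/sec from the disks
    pcie_link = lanes * mbs_per_lane          # ~4000 Mbytes/sec raw link bandwidth
    print(disk_aggregate, pcie_link)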
(The system disks are separate from all of this.)