2014-04-25
A Unix semantics issue if your filesystem can snapshot arbitrary directories
A commentator on my entry on where I think btrfs went wrong mentioned that HammerFS allows snapshots of arbitrary directories instead of requiring you to plan ahead and create subvolumes or sub-filesystems as btrfs and ZFS do. As it happens, allowing this raises a little question about Unix semantics. Let's illustrate it with some hypothetical commands:
    mkdir a b
    touch a/fred
    ln a/fred b/
    snapshot --readonly a@test
    ls -li a/fred .snap/a@test/fred
    rm -f b/fred
    ls -li a/fred .snap/a@test/fred
Here is the question: what are the inode numbers and link counts shown
for .snap/a@test/fred in the two ls's?
Clearly the real a/fred retains the same inode number and goes
from a link count of 2 to a link count of 1 after b/fred is
removed. If the snapshotted version also changes its link count, what
we have is an observable change in filesystem state in a theoretically
read-only snapshot, and it's also not clear how you'd actually implement
this. If the link count doesn't change, it's technically incorrect, since
you can't find the theoretical second link for .snap/a@test/fred
anywhere. But at least it's easy to implement.
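To make the question concrete, here is a little Python sketch of what you'd look at, assuming the made-up 'snapshot' command and the .snap/a@test layout from above actually existed:

    import os

    def show(path):
        st = os.stat(path)
        print("%s: inode %d, %d link(s)" % (path, st.st_ino, st.st_nlink))

    # Before removing b/fred: the live a/fred should show two links.
    show("a/fred")
    show(".snap/a@test/fred")

    os.unlink("b/fred")

    # Afterwards the live a/fred drops to one link. What does the
    # supposedly read-only snapshot copy report now?
    show("a/fred")
    show(".snap/a@test/fred")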
(As far as filesystem boundaries go, the only sane choice is to
have .snap/a@test be a separate filesystem with a separate 'device'
and so on. That avoids massive problems around inode numbers.)
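As a sketch of what a separate 'device' buys you, st_dev is the thing that would differ, again assuming the hypothetical layout above:

    import os

    # If .snap/a@test is its own filesystem, it shows up with a different
    # st_dev than the live directory, so reused inode numbers in the two
    # trees can't be confused with each other.
    print("live file:     dev", os.stat("a/fred").st_dev)
    print("snapshot file: dev", os.stat(".snap/a@test/fred").st_dev)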
All of this is avoided if you force a to be a separate filesystem
because then you can't create this link in the first place. You're
guaranteed that all hard links in the snapshot are to things that
are also in the snapshot and so won't change.
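Here is a minimal sketch of why: a hard link across a filesystem boundary is simply refused by the kernel (the paths are the ones from the example, with a assumed to be its own filesystem):

    import errno, os

    try:
        # With 'a' as a separate filesystem, this is the link that could
        # never have been created in the first place.
        os.link("a/fred", "b/fred")
    except OSError as e:
        if e.errno == errno.EXDEV:
            print("cross-filesystem hard link refused (EXDEV)")
        else:
            raise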
In practice most people probably don't care about accurate link counts for files in a snapshot and this is likely just a nit. But I can't help but come up with these odd corner cases when I think about things like snapshotting arbitrary directories.
(I suspect this hard link issue is one reason that ZFS doesn't allow you to turn an existing directory hierarchy into a sub-filesystem, a restriction that sometimes annoys me.)
2014-04-19
Cross-system NFS locking and unlocking is not necessarily fast
If you're faced with a problem of coordinating reads and writes on an NFS filesystem between several machines, you may be tempted to use NFS locking to communicate between process A (on machine 1) and process B (on machine 2). The attraction of this is that all they have to do is contend for a write lock on a particular file; you don't have to write network communication code and then configure A and B to find each other.
The good news is that this works, in that cross-system NFS locking and unlocking actually works right (at least most of the time). The bad news is that it doesn't necessarily work fast. In practice, it can take a fairly significant amount of time for process B on machine 2 to find out that process A on machine 1 has unlocked the coordination file, time that can be measured in tens of seconds. In short, NFS locking works but it can require patience, which makes it not necessarily the best option in cases like this.
(The corollary of this is that when you're testing this part of NFS locking to see if it actually works you need to wait for quite a while before declaring things a failure. Based on my experiences I'd wait at least a minute before declaring an NFS lock to be 'stuck'. Implications for impatient programs with lock timeouts are left as an exercise for the reader.)
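For illustration, here is a rough Python sketch of the 'contend for a write lock on a particular file' pattern, using POSIX fcntl locks (which is what NFS locking is about); the lock file path and the 60 second patience window are made up for the example:

    import fcntl, time

    LOCKFILE = "/nfs/shared/coordination.lock"   # hypothetical path
    PATIENCE = 60          # seconds; cross-client NFS unlocks can be slow

    def acquire_write_lock(path, patience):
        fp = open(path, "a+")
        deadline = time.time() + patience
        while True:
            try:
                # Non-blocking attempt, so we can give up after 'patience' seconds.
                fcntl.lockf(fp, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fp
            except OSError:
                if time.time() > deadline:
                    fp.close()
                    raise RuntimeError("lock not acquired after %d seconds" % patience)
                time.sleep(1)

    fp = acquire_write_lock(LOCKFILE, PATIENCE)
    try:
        pass    # do the coordinated reads or writes here
    finally:
        fcntl.lockf(fp, fcntl.LOCK_UN)
        fp.close()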
I don't know if acquiring an NFS lock on a file after a delay normally causes your machine's kernel to flush cached information about the file. In an ideal world it would, but NFS implementations are often not ideal worlds and the NFS locking protocol is a sidecar thing that's not necessarily closely integrated with the NFS client. Certainly I wouldn't count on NFS locking to flush cached information on, say, the directory that the locked file is in.
In short: you want to test this stuff if you need it.
PS: Possibly this is obvious but when I started testing NFS locking to make sure it worked in our environment I was a little bit surprised by how slow it could be in cross-client cases.
2014-04-17
Partly getting around NFS's concurrent write problem
In a comment on my entry about NFS's problem with concurrent writes, a commentator asked this very good question:
So if A writes a file to an NFS directory and B needs to read it "immediately" as the file appears, is the only workaround to use low values of actimeo? Or should A and B be communicating directly with some simple mechanism instead of setting, say, actimeo=1?
(Let's assume that we've got 'close to open' consistency to start with, where A fully writes the file before B processes it.)
If I was faced with this problem and I had a free hand with A and
B, I would make A create the file with some non-repeating name and
then send an explicit message to B with 'look at file <X>' (using eg
a TCP connection between the two). A should probably fsync() the
file before it sends this message to make sure that the file's on the
server. The goal of this approach is to avoid B's kernel having any
cached information about whether or not file <X> might exist (or what
the contents of the directory are). With no cached information, B's
kernel must go ask the NFS fileserver and thus get accurate information
back. I'd want to test this with my actual NFS server and client just
to be sure (actual NFS implementations can be endlessly crazy) but I'd
expect it to work reliably.
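Here is a rough sketch of A's side of this scheme in Python; the shared directory, B's hostname, and the port are all placeholders for whatever you'd actually use:

    import os, socket, time

    NFS_DIR = "/nfs/shared/incoming"      # hypothetical shared directory
    B_ADDR = ("b.example.com", 5000)      # hypothetical host and port for B

    def publish(data):
        # A non-repeating name; pid plus a nanosecond timestamp is one simple scheme.
        name = "msg-%d-%d" % (os.getpid(), time.time_ns())
        path = os.path.join(NFS_DIR, name)
        # O_EXCL guarantees we never silently reuse an existing name.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)                  # make sure the data is on the NFS server
        finally:
            os.close(fd)
        # Only now tell B which file to look at.
        with socket.create_connection(B_ADDR) as sock:
            sock.sendall(("look at file %s\n" % name).encode())

    publish(b"hello from A\n")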
Note that it's important to not reuse filenames. If A ever reuses a filename, B's kernel may have stale information about the old version of the file cached; at best this will get B a stale filehandle error and at worst B will read old information from the old version of the file.
If you can't communicate between A and B directly and B operates by scanning the directory to look for new files, you have a moderate caching problem. B's kernel will normally cache information about the contents of the directory for a while, and this caching can delay B noticing that there is a new file there. Your only option is to force B's kernel to cache as little as possible. Note that if B is scanning, it will presumably only be scanning, say, once a second, so there will always be at least a little processing lag (and this lag would happen even if A and B were on the same machine); if you really want 'immediately', you need A to explicitly poke B in some way no matter what.
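For what it's worth, B's scanning loop might look something like this sketch (the directory is again a placeholder); even with no NFS caching in the picture, the poll interval alone adds lag:

    import os, time

    NFS_DIR = "/nfs/shared/incoming"    # hypothetical shared directory
    seen = set()

    while True:
        for name in sorted(os.listdir(NFS_DIR)):
            if name not in seen:
                seen.add(name)
                print("new file:", name)    # process it here
        # Polling once a second means up to a second of lag even before
        # any NFS attribute caching gets involved.
        time.sleep(1)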
(I don't think it matters what A's kernel caches about the directory, unless there's communication that runs the other way such as B removing files when it's done with them and A needing to know about this.)
Disclaimer: this is partly theoretical because I've never been trapped in this situation myself. The closest I've come is safely updating files that are read over NFS. See also.