2024-06-12
The Linux kernel NFS server and reconnecting client NFS filehandles
Unlike some other Unix NFS servers, the Linux kernel NFS server attempts to solve the NFS server 'subtree' export problem, along with a related permissions problem that is covered in the exports(5) manual page section on no_subtree_check. To quote the manual page on this additional check:
subtree checking is also used to make sure that files inside directories to which only root has access can only be accessed if the filesystem is exported with no_root_squash (see below), even if the file itself allows more general access.
In general, both of these checks require finding a path that leads to the file obtained from an NFS filehandle. NFS filehandles don't contain paths; they normally contain roughly just the inode number, which is a flat, filesystem-wide reference to the file. The NFS server calls finding this path 'reconnection', and it is somewhat complex and counterintuitive. It also differs for NFS filehandles of directories and files.
(All of this is as of kernel 6.10-rc3, although this area doesn't seem to change often.)
For directories, the kernel first gets the directory's dentry from the dentry cache (dcache); this dentry can be 'disconnected' (which mostly means it was newly created due to this lookup) or already connected (in general, already set up in the dcache). If the dentry is disconnected, the kernel immediately reconnects it. Reconnecting a specific directory dentry works like this:
- obtain the dentry's parent directory through a filesystem specific method (which may more or less look up what '..' is in the directory).
- search the parent directory to find the name of the directory entry that matches the inode number of the dentry you're trying to reconnect. (A few filesystems have special code to do this more efficiently.)
- using the dcache, look up that name in the parent directory to get the name's dentry.
- verify that this new dentry and your original dentry are the same (which guards against certain sorts of rename races).
It's possible to have multiple disconnected dentries on the way to the filesystem's mount point; if so, each level follows this process. The obvious happy path is that the dcache already has a fully connected dentry for the directory the NFS client is working on, in which case all of this can be skipped. This is frequently going to be the case if clients are repeatedly working on the same directories.
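To make the directory case concrete, here is a minimal toy model of the four steps in Python. It is only a sketch of the logic as described above, not the kernel's code; the names Dir, ToyDcache, find_name and reconnect are made up for illustration, and the 'dcache' here is just a set of inode numbers that are considered connected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Dir:
    """A toy on-disk directory: an inode number, a parent, and named subdirectories."""
    ino: int
    parent: Optional["Dir"] = None          # None only for the filesystem root
    entries: Dict[str, "Dir"] = field(default_factory=dict)

    def mkdir(self, name: str, ino: int) -> "Dir":
        child = Dir(ino, parent=self)
        self.entries[name] = child
        return child


class ToyDcache:
    """Hypothetical stand-in for the dcache: tracks connected directories and counts scans."""

    def __init__(self, root: Dir):
        self.connected = {root.ino}          # the root is always connected
        self.scans = 0                       # number of parent directory scans done

    def find_name(self, parent: Dir, ino: int) -> Optional[str]:
        # Step 2: scan the parent directory for the entry with this inode number.
        self.scans += 1
        for name, child in parent.entries.items():
            if child.ino == ino:
                return name
        return None

    def reconnect(self, d: Dir) -> int:
        """Connect d and any disconnected ancestors; return how many levels needed work."""
        pending: List[Dir] = []
        cur = d
        while cur.ino not in self.connected:
            parent = cur.parent                       # step 1: find the parent ('..')
            name = self.find_name(parent, cur.ino)    # step 2: scan parent for our name
            if name is None:
                raise LookupError("no name found in parent directory")
            found = parent.entries.get(name)          # step 3: look that name up
            if found is not cur:                      # step 4: guard against rename races
                raise LookupError("raced with a rename")
            pending.append(cur)
            cur = parent                              # repeat one level up if needed
        # Only now is everything on the path fully connected.
        self.connected.update(p.ino for p in pending)
        return len(pending)
```

With a root containing a/b/c and nothing but the root connected, reconnect() on c does three directory scans; calling it a second time does none, which is the happy path.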
Once the directory's dentry is fully connected (ie, all of its parents are connected), the kernel NFS server code will check if it is 'acceptable'. If the export uses no_subtree_check (which is now the default), this acceptability check always answers 'yes'.
For files, things are more complicated. First, the kernel checks to see if the initial dentry for the file (and any aliases it may have) is 'acceptable'; if the export uses no_subtree_check the answer is always 'yes', and things stop. Otherwise, the kernel uses a filesystem specific method to obtain the (or a) directory the file is in, reconnects the directory using the same code as above, then does the second through fourth steps of the 'directory reconnection' process for the file and its parent directory in order to check against renames (which will involve at least one scan of the parent directory to discover the file's name). Finally, with all of this done and a verified, fully connected dentry for the file, the kernel does the acceptability check again and returns the result.
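The order of operations for a file's filehandle can be sketched as a small Python function. This is only my illustration of the flow described above; decode_file_handle, its arguments, and the idea of passing the acceptability check a 'connected' flag are all hypothetical simplifications, and the trace list exists purely so you can see how much work each case does.

```python
from typing import Callable, List


def decode_file_handle(no_subtree_check: bool,
                       acceptable: Callable[[bool], bool],
                       trace: List[str]) -> bool:
    """Toy model of handling a file's NFS filehandle; records each step in trace."""

    def check(connected: bool) -> bool:
        trace.append("acceptability check")
        # With no_subtree_check the answer is always 'yes'.
        return True if no_subtree_check else acceptable(connected)

    # First check: the initial (possibly disconnected) dentry and its aliases.
    if check(connected=False):
        return True

    # Otherwise do the expensive part.
    trace.append("find a directory the file is in (filesystem specific)")
    trace.append("reconnect that directory (as for directory filehandles)")
    trace.append("scan the directory for the file's name")
    trace.append("look the name up and verify it is the same dentry")

    # Second check, now with a verified, fully connected dentry.
    return check(connected=True)
```

For example, decode_file_handle(True, lambda connected: False, trace) returns True with a single entry in trace, while a subtree-checked export whose check only passes for connected dentries records six entries before it answers.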
Because the kernel immediately reconnects the dentries of directory NFS filehandles before looking at the status of subtree checks, you really want those directories to have dentries that are already in the dcache (and fully connected). Every directory NFS filehandle with a dentry that has to be freshly created in disconnected state means at least one scan of a possibly large parent directory, and more scans of more directories if the parent directory itself isn't in the dcache too.
I'm not sure how the dcache shrinks, and especially whether filesystems can trigger removing dcache entries because the filesystem itself wants to remove the inode entry. The general kernel code that shrinks a filesystem's associated dcache and inodes triggers dcache shrinking first and inode shrinking second, with the comment that the inode cache is pinned by the dcache.
Sidebar: Monitoring NFS filehandle reconnections
If you want to see how much reconnection is happening, you'll need to use bpftrace (or some equivalent). The total number of NFS filehandles being looked at is found by counting calls to exportfs_decode_fh_raw(). If you want to know how many reconnections are needed, you want to count calls to reconnect_path(); if you want to count how many path components had to be reconnected, you want to (also) count calls to reconnect_one(). All of these are in fs/exportfs/expfs.c. The exportfs_get_name() call searches for the name for a given inode in a directory, and then the lookup_one_unlocked() call does the name to dentry lookup needed for revalidation, and I think it will probably fall through to a filesystem directory lookup.
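As a starting point, here is a small Python helper that generates a bpftrace program counting calls to the three functions named above. The helper name and the @calls map name are my own inventions. It also rests on an untested assumption that all three functions can be kprobe'd on your kernel; reconnect_path() and reconnect_one() are static functions, so the compiler may have inlined them, in which case there is nothing to attach to (something like bpftrace -l 'kprobe:reconnect_*' will tell you what is available).

```python
from typing import Iterable

# The functions discussed above, all in fs/exportfs/expfs.c.
NFS_FH_FUNCTIONS = [
    "exportfs_decode_fh_raw",   # every NFS filehandle that gets looked at
    "reconnect_path",           # filehandles that needed reconnection
    "reconnect_one",            # individual path components reconnected
]


def bpftrace_count_program(functions: Iterable[str]) -> str:
    """Build a bpftrace program that counts calls to each named kernel function.

    Assumes each function is visible as a kprobe; this is not checked here.
    """
    lines = []
    for fn in functions:
        # One probe per function, all counting into a single map keyed by name.
        lines.append('kprobe:%s { @calls["%s"] = count(); }' % (fn, fn))
    return "\n".join(lines)
```

Printing bpftrace_count_program(NFS_FH_FUNCTIONS) to a file and running bpftrace on that file as root should print the @calls map when you interrupt it with Ctrl-C.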
(You can also look at general dcache stats, as covered in my entry on getting some dcache information, but I don't think this dcache lookup information covers all of the things you want to know here. I don't know how to track dentries being dropped and freed up, although prune_dcache_sb() is part of the puzzle and apparently returns a count of how many dentries were freed up for a particular filesystem superblock.)