The Linux kernel's internals showing through in the specifics of an NFS bug
On Mastodon, I said:
What's fascinating about this particular kernel bug to me is how clearly I can see the kernel's implementation poking through in what the bug is and what's required to reproduce it. The more I refine things, the more I can guess where the problem probably is.
Let me give you the details on this, so you can see how the kernel's implementation is poking through. I'll start with the bug and its reproduction.
We recently ran into a deadly problem with Alpine on Ubuntu 18.04 that is actually a general kernel NFS client problem. After I refined my test program down far enough, here is what is required to manifest the bug:
- On an NFS client, open a file read-write and read all the way to the end of the file. It's probably sufficient to just read the last N bytes or KB of the file, for some value of N (it might even be enough to read the last byte of the file).
- In your program, keep the file open read-write and wait for it to grow in size.
- On another machine (either another NFS client or the fileserver), append data to the end of the file.
- In your program, attempt to read the new data after the old end of file. The new data from immediately after the old end of file up to the next 4 KB boundary will read as zero bytes; after that boundary, you get the file's real contents.
You must hold the file open in read-write mode while you wait in order for this bug to manifest; if you close the file or hold it open read-only, this doesn't happen (even if you open it read-write again after you detect the size change). This happens with both NFSv3 and NFSv4, and the OS of the NFS fileserver doesn't matter.
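As a rough illustration of the steps above, here is a minimal Python sketch of the client side. It is not the test program mentioned at the end of this entry; the function names (`read_to_end`, `wait_for_growth`, `read_exactly`, `leading_zero_bytes`, `check_file`) and the polling approach are my own assumptions for illustration. On a local file it should always report zero leading zero bytes.

```python
import os
import time


def read_to_end(f):
    """Read from the current position to end of file; return the EOF offset."""
    while f.read(65536):
        pass
    return f.tell()


def wait_for_growth(f, old_size, poll=1.0, timeout=60.0):
    """Poll fstat() until the file is larger than old_size.

    Returns the new size, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        size = os.fstat(f.fileno()).st_size
        if size > old_size:
            return size
        time.sleep(poll)
    return None


def read_exactly(f, count):
    """Read up to count bytes, looping over short reads; stop early at EOF."""
    chunks = []
    while count > 0:
        chunk = f.read(count)
        if not chunk:
            break
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def leading_zero_bytes(data):
    """Count how many zero bytes the data starts with."""
    return len(data) - len(data.lstrip(b"\0"))


def check_file(path, mode="r+b", poll=1.0, timeout=60.0):
    """Hold path open, wait for it to grow, and inspect the new data.

    Returns (old_size, new_size, zero_count), or None on timeout.
    mode="r+b" is the read-write case from the steps above; "rb" is
    the read-only case for comparison.
    """
    # buffering=0 so that every read goes straight to the kernel.
    with open(path, mode, buffering=0) as f:
        old_size = read_to_end(f)
        new_size = wait_for_growth(f, old_size, poll, timeout)
        if new_size is None:
            return None
        f.seek(old_size)
        new_data = read_exactly(f, new_size - old_size)
        return old_size, new_size, leading_zero_bytes(new_data)
```

To use it, you would call `check_file()` on a file on an NFS mount, append non-zero data to the file from another machine, and look at the third element of the result. If the appended data contains no zero bytes of its own, any non-zero count there is the symptom described above.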
So now let's talk about how this shows the bones of the kernel in action (assuming that I'm correct about what's going on inside the kernel).
Like pretty much everyone these days, the Linux kernel caches file data in memory, in the page cache. As you might suspect from the name, the page cache stores things in units of pages, which are almost always 4 KB (at least on x86 machines). However, files are not always even multiples of 4 KB in size, which means that the very end of a file, when cached in memory, will not take up all of a page; what you have is a partial page, where some amount of the front of the page is valid but the rest is not. It seems both plausible and likely that the kernel zeroes page cache pages (at least partial ones) before trying to put data in them, rather than leaving random stale bytes sitting around in them (not zeroing them would be a great way to accidentally leak kernel memory).
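To make the partial page arithmetic concrete, here is a small Python sketch that works out which byte range of a file's last page is not covered by file data. The function name `partial_page_tail` is made up for this illustration, and the 4 KB page size is an assumption that won't hold on every architecture.

```python
PAGE_SIZE = 4096  # assumed page size; usual on x86, not universal


def partial_page_tail(file_size, page_size=PAGE_SIZE):
    """Return (start, end) offsets of the unused remainder of the last page.

    start is the file's current end of file, and end is the next page
    boundary. If the file ends exactly on a page boundary there is no
    partial page and start == end.
    """
    used = file_size % page_size
    if used == 0:
        return file_size, file_size
    return file_size, file_size - used + page_size
```

For a 10,000 byte file, `partial_page_tail(10000)` gives `(10000, 12288)`: the last page holds 1,808 bytes of file data, and the remaining 2,288 bytes of that page lie beyond the end of the file.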
In NFS, file data can change behind the client kernel's back, and in particular a file can be extended. When the NFS client code has a partial page from the end of the file in the page cache and the file's size grows, it has to remember that the rest of the file's data is not in the page but must be filled in from the server. When you don't hold the file open read-write, this process of filling in clearly works correctly. When you hold the file open read-write, for some reason the kernel appears to lose track of the fact that it has to fill in the rest of the partial page from the server; instead it believes that it has a full page and so it gives you whatever data is in the remainder of the page. This data is, fortunately, all zero bytes.
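Here is a toy Python model of this guess. It is emphatically not kernel code and I'm not claiming the kernel is structured this way. It only illustrates how a page that is zeroed before use, combined with losing track of how much of the page is valid, would produce exactly the zero bytes described above. All of the names here are invented for the illustration.

```python
PAGE_SIZE = 4096  # assumed page size


def cache_last_page(server_file):
    """Cache the last (partial) page of the file as it currently is.

    Returns (start, page, valid): the page's file offset, a zeroed page
    with only its front filled in, and how many bytes of it are valid.
    """
    start = (len(server_file) // PAGE_SIZE) * PAGE_SIZE
    page = bytearray(PAGE_SIZE)        # zeroed before any data goes in
    valid = len(server_file) - start
    page[:valid] = server_file[start:]
    return start, page, valid


def read_page(server_file, start, page, valid, track_valid):
    """Read the cached page after the file has grown on the server.

    track_valid=True models the working case: the bytes past `valid`
    are fetched from the server before the page is used.
    track_valid=False models the suspected bug: the page is trusted all
    the way up to the new file size, so the stale zeros are returned.
    """
    want = min(PAGE_SIZE, len(server_file) - start)
    if track_valid and want > valid:
        page[valid:want] = server_file[start + valid:start + want]
    return bytes(page[:want])
```

If the file is 5,000 bytes when the page is cached and then grows by several KB on the server, the `track_valid=False` case returns the 904 valid bytes followed by 3,192 zero bytes, which takes you from the old end of file up to the 8,192 byte boundary. The `track_valid=True` case returns the real appended data instead.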
(I say fortunately because this means that it's both obvious and not a kernel memory data leak. If the kernel gave you whatever random bytes were in the physical page of RAM from its previous use, this could be very bad.)
This doesn't happen for local files (at least normally) because local files are coherent; all writes go through the page cache, so when you extend a file the new data fills in the existing partial page in the page cache. I suspect that this local coherence is part of how the bug happens, and perhaps there is a bit of the general kernel code that assumes that this incomplete partial page situation just can't happen for files open read-write; if the file says its length is X, and that X fully covers a page in the page cache, all the contents of that page are always valid.
PS: Interested parties can find a program to demonstrate this here. It takes various arguments so you can play around with some things to reproduce or not reproduce the bug. I have deliberately resisted my natural temptation to provide and explore all possible permutations of what the test program does, because I don't think the permutations matter. I put in some things because I wanted to test them (and doing so was useful, because it discovered that keeping the file in read-write mode was a crucial element), and other things because I wanted to demonstrate that they don't matter.
(This entry is partly a dry run for sending a bug report to the Linux NFS mailing list; I wanted to make sure I could explain it reasonably coherently and that I had things straight in my head.)