Wandering Thoughts archives

2019-03-13

Peculiarities about Unix's statfs() or statvfs() API

On modern Unixes, the official interface to get information about a filesystem is statvfs(); it's sufficiently official to be in the Single Unix Specification as seen here. On Illumos it's an actual system call, statvfs(2). On many other Unixes (at least Linux, FreeBSD, and OpenBSD), it's a library API on top of a statfs(2) system call (Linux, FreeBSD, OpenBSD). However you call it and however it's implemented, the underlying API of the information that gets returned is a little bit, well, peculiar, as I mentioned yesterday.

(In reality the API is more showing its age than peculiar, because it dates from the days when filesystems were simpler things.)

The first annoyance is that statfs() doesn't return the number of 'files' (inodes) in use on a filesystem. Instead it returns only the total number of inodes in the filesystem and the number of inodes that are free. On the surface this looks okay, and it probably was back in the mists of time when this was introduced. Then we got more advanced filesystems that didn't have a fixed number of inodes; instead, they'd make as many inodes as you needed, provided that you had the disk space. One example of such a filesystem is ZFS, and since we have ZFS fileservers, I've had a certain amount of experience with the results.

ZFS has to answer statfs()'s demands somehow (well, statvfs(), since it originated on Solaris), so it basically makes up a number for the total inodes. This number is based on the amount of (free) space in your ZFS pool or filesystem, so it has some resemblance to reality, but it is not very meaningful and it's almost always very large. Then you can have ZFS filesystems that are completely full and, well, let me show you what happens there:

cks@sanjuan-fs3:~$ df -i /w/220
Filesystem      Inodes IUsed IFree IUse% Mounted on
<...>/w/220        144   144     0  100% /w/220

I suggest that you not try to graph 'free inodes over time' on a ZFS filesystem that is getting full, because it's going to be an alarming looking graph that contains no useful additional information.
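
For the record, the 'in use' number that df -i prints has to be derived by subtraction, because the API only hands you totals and free counts. Here's a minimal C sketch of that, using '/' purely as an example path:

#include <stdio.h>
#include <sys/statvfs.h>

int main(void) {
    struct statvfs sv;
    if (statvfs("/", &sv) == -1) {
        perror("statvfs");
        return 1;
    }
    /* f_files is the total number of inodes and f_ffree the free ones;
       'inodes in use' is not reported and has to be computed. */
    unsigned long long total = sv.f_files;
    unsigned long long ifree = sv.f_ffree;
    printf("inodes: %llu total, %llu free, %llu in use\n",
           total, ifree, total - ifree);
    return 0;
}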

The next piece of fun in the statvfs() API is how free and used disk space is reported. The 'struct statvfs' has, well, let me quote the Single Unix Specification:

f_bsize    File system block size. 
f_frsize   Fundamental file system block size. 

f_blocks   Total number of blocks on file system
           in units of f_frsize. 

f_bfree    Total number of free blocks. 
f_bavail   Number of free blocks available to 
           non-privileged process. 

When I was an innocent person and first writing code that interacted with statvfs(), I said 'surely f_frsize is always going to be something sensible, like 1 KB or maybe 4 KB'. Silly me. As you can find out using a program like GNU Coreutils stat(1), the actual 'fundamental filesystem block size' can vary significantly among different sorts of filesystems. In particular, ZFS advertises a 'fundamental block size' of 1 MByte, which means that all space usage information in statvfs() for ZFS filesystems has a 1 MByte granularity.

(On our Linux systems, statvfs() reports regular extN filesystems as having a 4 KB fundamental filesystem block size. On a FreeBSD machine I have access to, statvfs() mostly reports 4 KB but also has some filesystems that report 512 bytes. Don't even ask about the 'filesystem block size', it's all over the map.)

Also, notice that once again we have the issue where the amount of space in use must be reported indirectly, since we only get total, free, and available block counts. This is probably less important for total disk space, because that's less subject to variation than the total number of inodes possible.
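
To make the granularity issue concrete, here's a minimal C sketch that turns the statvfs() block counts into bytes. All of the space figures are in units of f_frsize, whatever the filesystem advertises (1 MByte for ZFS, for example), and '/' is again just an example path:

#include <stdio.h>
#include <sys/statvfs.h>

int main(void) {
    struct statvfs sv;
    if (statvfs("/", &sv) == -1) {
        perror("statvfs");
        return 1;
    }
    unsigned long long frsize = sv.f_frsize;
    unsigned long long total  = sv.f_blocks * frsize;
    unsigned long long avail  = sv.f_bavail * frsize;  /* space for unprivileged processes */
    /* Used space isn't reported directly either; it's total minus free. */
    unsigned long long used   = (sv.f_blocks - sv.f_bfree) * frsize;
    printf("f_frsize %llu: total %llu, used %llu, available %llu bytes\n",
           frsize, total, used, avail);
    return 0;
}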

StatfsPeculiarities written at 23:46:13

2019-03-07

Exploring the mild oddity that Unix pipes are buffered

One of the things that blogging is good for is teaching me that what I think is common knowledge actually isn't. Specifically, when I wrote about a surprisingly arcane little Unix shell pipeline example, I assumed that it was common knowledge that Unix pipes are buffered by the kernel, in addition to any buffering that programs writing to pipes may do. In fact the buffering is somewhat interesting, and in a way it's interesting that pipes are buffered at all.

How much kernel buffering there is varies from Unix to Unix. 4 KB used to be the traditional size (it was the size on V7, for example, per the V7 pipe(2) manpage), but modern Unixes often have much bigger limits, and if I'm reading it right POSIX only requires a minimum of 512 bytes. But this isn't just a simple buffer, because the kernel also guarantees that if you write PIPE_BUF bytes or less to a pipe, your write is atomic and will never be interleaved with other writes from other processes.

(The normal situation on modern Linux is a 64 KB buffer; see the discussion in the Linux pipe(7) manpage. The atomicity of pipe writes goes back to early Unix and is required by POSIX, and I think POSIX also requires that there be an actual kernel buffer if you read the write() specification very carefully.)
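
If you want to see the numbers for your own system, here's a minimal C sketch. fpathconf() with _PC_PIPE_BUF is portable and reports the atomic write limit; F_GETPIPE_SZ, which reports the total capacity of the pipe's kernel buffer, is Linux-specific (it's covered in pipe(7)):

#define _GNU_SOURCE     /* for F_GETPIPE_SZ on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }
    /* The largest write that is guaranteed to be atomic. */
    printf("PIPE_BUF: %ld bytes\n", fpathconf(fds[1], _PC_PIPE_BUF));
#ifdef F_GETPIPE_SZ
    /* Linux only: the total size of the pipe's kernel buffer. */
    printf("pipe buffer: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
#endif
    return 0;
}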

On the one hand this kernel buffering and the buffering behavior makes perfect sense and it's definitely useful. On the other hand it's also at least a little bit unusual. Pipes are a unidirectional communication channel and it's pretty common to have unbuffered channels where a writer blocks until there's a reader (Go channels work this way by default, for example). In addition, having pipes buffered in the kernel commits the kernel to providing a certain amount of kernel memory once a pipe is created, even if it's never read from. As long as the read end of the pipe is open, the kernel has to hold on to anything it allowed to be written into the pipe buffer.

(However, if you write() more than PIPE_BUF bytes to a pipe at once, I believe that the kernel is free to pause your process without accepting any data into its internal buffer at all, as opposed to having to copy PIPE_BUF worth of it in. Note that blocking large pipe writes by default is a sensible decision.)
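
One way to see how much the kernel is willing to commit is to stuff data into a pipe that nothing is reading from. Here's a minimal C sketch that does this with non-blocking writes so it stops at 'buffer full' instead of hanging; on a typical modern Linux it reports 64 KB:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    char chunk[4096];
    long total = 0;

    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }
    /* Non-blocking writes fail once the buffer is full, instead of
       suspending us forever (nothing will ever read this pipe). */
    fcntl(fds[1], F_SETFL, O_NONBLOCK);
    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));
        if (n == -1)
            break;      /* the kernel buffer is full */
        total += n;
    }
    printf("the pipe accepted %ld bytes with no reader draining it\n", total);
    return 0;
}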

Part of pipes being buffered is likely to be due to how Unix evolved and what early Unix machines looked like. Specifically, V7 and earlier Unixes ran on single processor machines with relatively little memory and without complex and capable MMUs (Unix support for paged virtual memory post-dates V7, and I think wasn't really available on the PDP-11 line anyway). On top of making the implementation simpler, using a kernel buffer and allowing processes to write to it before there is a reader means that a process that only needs to write a small amount of data to a pipe may be able to exit entirely before the next process runs, freeing up system RAM. If writer processes always blocked until someone did a read(), you'd have to keep them around until that happened.

(In fact, a waiting process might use more than 4 KB of kernel memory just for various data structures associated with it. Just from a kernel memory perspective you're better off accepting a small write buffer and letting the process go on to exit.)

PS: This may be a bit of a just-so story. I haven't inspected the V7 kernel scheduler to see if it actually let processes that did a write() into a pipe with a waiting reader go on to potentially exit, or if it immediately suspended them to switch to the reader (or just to another ready to run process, if any).

BufferedPipes written at 22:43:42

2019-03-04

A surprisingly arcane little Unix shell pipeline example

In The output of Linux pipes can be indeterministic (via), Marek Gibney noticed that the following shell command has indeterminate output:

(echo red; echo green 1>&2) | echo blue

This can output any of "blue green" (with a newline between them), "green blue", or "blue"; the usual case is "blue green". Fully explaining this requires surprisingly arcane Unix knowledge.

The "blue green" and "green blue" outputs are simply a scheduling race. The 'echo green' and 'echo blue' are being run in separate processes, and which one of them gets executed first is up to the whims of the Unix scheduler. Because the left side of the pipeline has two things to do instead of one, often it will be the 'echo blue' process that wins the race.

The mysterious case is when the output is "blue" alone, and to explain this we need to know two pieces of Unix arcana. The first is our old friend SIGPIPE, where if a process writes to a closed pipe it normally receives a SIGPIPE signal and dies. The second is that 'echo' is a builtin command in shells today, and so the left side's 'echo red; echo green 1>&2' is actually all being handled by one process instead of the 'echo red' being its own separate process.

We get "blue" as the sole output when the 'echo blue' runs so soon that it exits, closing the pipeline, before the right left side can finish 'echo red'. When this happens the right left side gets a SIGPIPE and exits without running 'echo green' at all. This wouldn't happen if echo wasn't a specially handled builtin; if it was a separate command (or even if the shell forked to execute it internally), only the 'echo red' process would die from the SIGPIPE instead of the entire left side of the pipeline.

So we have three orders of execution:

  1. The shell on the left side gets through both of its echos before the 'echo blue' runs at all. The output is "green blue".

  2. The 'echo red' happens before 'echo blue' exits, so the left side doesn't get SIGPIPE, but 'echo green' happens afterwards. The output is "blue green".

  3. The 'echo blue' runs and exits, closing the pipe, before the 'echo red' finishes. The shell on the left side of the pipeline writes output into a closed pipe, gets SIGPIPE, and exits without going on to do the 'echo green'. The output is "blue".

The second order seems to be the most frequent in practice, although I'm sure it depends on a lot of things (including whether or not you're on an SMP system). One thing that may contribute to this is that I believe many shells start pipelines left to right, ie if you have a pipeline that looks like 'a | b | c | d', the main shell will fork the a process first, then the b process, and so on. All else being equal, this will give a an edge in running before d.

(This entry is adapted from my comment on lobste.rs, because why not.)

ShellPipelineIndeterminate written at 23:55:34

