2018-06-29
What 'PID rollover' is on Unix systems
On Unix, everything is a process (generally including the threads
inside processes, because that makes life simpler), and all processes
have a PID (Process ID). In theory, the only special PID is PID 1,
which is init
, which has various jobs and
which often causes your system to reboot if it dies (which isn't
required, even though most Unixes do it). Some
Unixes also have a special 'PID 0', which is a master process in
the kernel (on Illumos PID 0 is sched
, and on FreeBSD it's called
[kernel]
). PIDs run from PID 1 upward to some maximum PID value
and traditionally they're used strictly sequentially, so PID X is
followed by PID X+1 and PID X+2 (even if some of the processes may
be very short-lived).
(OpenBSD uses randomized PIDs by default; FreeBSD can turn them on
by setting the kern.randompid
sysctl, at least according to
Internet searches. Normal Linux and Illumos are always sequential.)
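If you're curious how your own system behaves, a low-tech way to check
is to fork a few processes in a row and print the PIDs you get. Here is
a minimal C sketch of that (the numbers may not be perfectly consecutive
if something else on the machine is starting processes at the same time):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Fork a handful of children and print the PIDs they get. On
       Linux or Illumos the numbers should be (close to) sequential;
       on OpenBSD they will jump around randomly. */
    for (int i = 0; i < 5; i++) {
        pid_t pid = fork();

        if (pid == -1) {
            perror("fork");
            exit(1);
        }
        if (pid == 0)
            _exit(0);               /* the child has nothing to do */
        printf("child %d: PID %ld\n", i, (long)pid);
        waitpid(pid, NULL, 0);      /* reap it before the next fork */
    }
    return 0;
}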
Once, a very long time ago, Unix was a small thing and it ran on
small, slow machines that liked to use 16-bit integers, ie the DEC
PDP-11 series that was the home of Research Unix up through V7. In
V7, PIDs were C short
s, which meant that they had a natural maximum
value of 32767, and the kernel further constrained their maximum value
to be 29,999. What happened when you hit that point? Well, let's just
quote from newproc()
in slp.c:
        /*
         * First, just locate a slot for a process
         * and copy the useful info from this process into it.
         * The panic "cannot happen" because fork has already
         * checked for the existence of a slot.
         */
retry:
        mpid++;
        if(mpid >= 30000) {
                mpid = 0;
                goto retry;
        }
(The V7 kernel had a lot of goto
s.)
This is PID rollover, or rather the code for it.
The magical mpid
is a kernel global variable that holds the last
PID that was used. When it hits 30,000, it rolls back over to 0,
gets incremented to be 1, and then we'll find that PID 1 is in use
already and try again (there's another loop for that). Since V7 ran
on small systems, there was no chance that you could have 30,000
processes in existence at once; in fact the kernel had a much
smaller hardcoded limit called NPROC
, which was
usually 150 (see param.h).
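Although it's only a few lines of kernel code, the overall allocation
pattern is easy to model in isolation. Here's a user-space sketch of the
scheme (my own toy model, not the actual V7 code): a table of 'live'
PIDs standing in for the process table, the mpid counter, and a retry
loop that rolls over at the limit and skips any PID that's still in use.

#include <stdio.h>

#define NPROC  150      /* V7's usual process table size */
#define MAXPID 30000    /* roll over at this point */

static int proctab[NPROC];   /* PIDs of 'live' processes; 0 = free slot */
static int mpid;             /* the last PID handed out */

/* Allocate the next PID, V7 style: bump the counter, wrap it at
   MAXPID, and retry if some live process already has that number. */
static int nextpid(void)
{
retry:
    mpid++;
    if (mpid >= MAXPID) {
        mpid = 0;            /* PID rollover: back to the bottom */
        goto retry;
    }
    for (int i = 0; i < NPROC; i++) {
        if (proctab[i] == mpid)
            goto retry;      /* still in use, try the next number */
    }
    return mpid;
}

int main(void)
{
    proctab[0] = 1;          /* pretend init is alive, holding PID 1 */
    mpid = 29998;            /* start just below the limit to show rollover */
    for (int i = 0; i < 4; i++)
        printf("allocated PID %d\n", nextpid());
    return 0;
}

Run it and you get 29999 followed by 2, 3, and 4; the allocator wraps
around and quietly steps over PID 1 because it's still 'in use'.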
Ever since V7, most Unix systems have kept the core of this behavior. PIDs have a maximum value, often still 30,000 or so by default, and when your sequential PID reaches that point you go back to starting from 1 or a low number again. This reset is what we mean by PID rollover; like an odometer rolling over, the next PID rolls over from a high value to a low value.
(I believe that it's common for modern Unixes to reset PIDs to something above 1, so that the very low numbered PIDs can't be reused even if there's no process there any more. On Linux, this low point is a hardcoded value of 300.)
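On Linux, the current ceiling is exposed (and tunable) as the
kernel.pid_max sysctl. If you want to see what your machine is using,
one trivial way from C is to read it out of /proc (this is
Linux-specific; other Unixes expose or hardcode their limit differently):

#include <stdio.h>

/* Print Linux's maximum PID value, which is where sequential PID
   allocation rolls over. Linux-only: it reads the kernel.pid_max
   sysctl through /proc. */
int main(void)
{
    FILE *fp = fopen("/proc/sys/kernel/pid_max", "r");
    long max;

    if (fp == NULL) {
        perror("/proc/sys/kernel/pid_max");
        return 1;
    }
    if (fscanf(fp, "%ld", &max) == 1)
        printf("pid_max: %ld\n", max);
    fclose(fp);
    return 0;
}

(Of course 'cat /proc/sys/kernel/pid_max' does the same job; the point
is just that the limit is an ordinary, inspectable setting rather than
anything magical.)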
Since Unix is no longer running on hardware where you really want to use 16-bit integers, we could have a much larger maximum PID value if we wanted to. In fact I believe that all current Unixes use a C type for PIDs that's at least 32 bits, and perhaps 64 (both in the kernel and in user space). Sticking to signed 32 bit integers but using the full 2^31-1 range would give us enough PIDs that even if we used a new PID every 500 microseconds (2,000 a second), it would take more than 12 days before we had a PID rollover. However, Unixes are startlingly conservative, so no one goes this high by default, although people have tinkered with the specific numbers.
(FreeBSD PIDs are officially 0 to 99999, per intro(2)
.
For other Unixes, see this SE question and its answers.)
To be fair, one reason to keep PIDs small is that it makes output
that includes PIDs shorter and more readable (and it makes it easier
to tell PIDs apart). This is both command output, for things like ps
and top
, and also your logs when they include PIDs (such as syslog).
Very few systems can have enough active or zombie processes that they'll
have 30,000 or more PIDs in use at the same time, and for the rest of us,
having a low maximum PID makes life slightly more friendly. Of course,
we don't have to have PID rollover to have low maximum PIDs; we can just
have PID randomization. But in theory PID rollover is just as good
and it's what Unix has always done (for a certain value of 'Unix' and
'always', given OpenBSD and so on).
In the grand Unix tradition, people say that PID rollover doesn't have issues; it just exposes issues in other code that isn't fully correct. Such code includes anything that uses daemon PID files, code that assumes that PID numbers will always be ascending or that if process B is a descendant of process A, it will have a higher PID, and code that is vulnerable if you can successfully predict the PID of a to-be-created process and grab some resource with that number in it. Concerns like these are at least part of why OpenBSD likes PID randomization.
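To make the PID file case concrete, here's a hedged sketch of the
classic check (my own illustration with a made-up path, not any
particular daemon's code). It trusts that whatever process currently
holds the recorded PID is still the daemon, which stops being true once
the PID has rolled over and been handed to something unrelated:

#include <stdio.h>
#include <signal.h>
#include <sys/types.h>

/* Naive 'is my daemon still running?' check based on a PID file.
   After PID rollover and reuse, the recorded PID may belong to a
   completely unrelated process; kill(pid, 0) will still report it
   as alive, and a later kill(pid, SIGTERM) would hit the wrong
   process entirely. */
static int daemon_seems_alive(const char *pidfile)
{
    FILE *fp = fopen(pidfile, "r");
    long pid = -1;

    if (fp == NULL)
        return 0;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = -1;
    fclose(fp);
    if (pid <= 1)
        return 0;
    /* Signal 0 sends nothing; it only tests whether the PID exists. */
    return kill((pid_t)pid, 0) == 0;
}

int main(void)
{
    /* The path here is purely illustrative. */
    if (daemon_seems_alive("/var/run/mydaemon.pid"))
        printf("daemon appears to be running\n");
    else
        printf("daemon does not appear to be running\n");
    return 0;
}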
How ZFS makes things like 'zfs diff' report filenames efficiently
As a copy on write (file)system, ZFS can use the transaction group (txg) numbers that are embedded in ZFS block pointers to efficiently find the differences between two txgs; this is used in, for example, ZFS bookmarks. However, as I noted at the end of my entry on block pointers, this doesn't give us a filesystem level difference; instead, it essentially gives us a list of inodes (okay, dnodes) that changed.
In theory, turning an inode or dnode number into the path to a file
is an expensive operation; you basically have to search the entire
filesystem until you find it. In practice, if you've ever run 'zfs
diff
', you've likely noticed that it runs pretty fast. Nor is
this the only place that ZFS quickly turns dnode numbers into full
paths, as it comes up in 'zpool status
' reports about permanent
errors. At one level, zfs diff
and
zpool status
do this so rapidly because they ask the ZFS code in
the kernel to do it for them. At another level, the question is how
the kernel's ZFS code can be so fast.
The interesting and surprising answer is that ZFS cheats, in a way
that makes things very fast when it works and almost always works
in normal filesystems and with normal usage patterns. The cheat is
that ZFS dnodes record their parent's object number. Here, let's
show this in zdb
:
# zdb -vvv -bbbb -O ssddata/homes cks/tmp/a/b
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   1285414    1   128K    512      0     512    512    0.00  ZFS plain file
[...]
        parent  1284472
[...]

# zdb -vvv -bbbb -O ssddata/homes cks/tmp/a
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   1284472    1   128K    512      0     512    512  100.00  ZFS directory
[...]
        parent  52906
[...]
        microzap: 512 bytes, 1 entries

                b = 1285414 (type: Regular File)
The b
file has a parent
field that points to cks/tmp/a
, the
directory it's in, and the a
directory has a parent
field that
points to cks/tmp
, and so on. When the kernel wants to get the
name for a given object number, it can just fetch the object, look
at parent
, and start going back up the filesystem.
(If you want to see this sausage being made, look at zfs_obj_to_path
and zfs_obj_to_pobj
in zfs_znode.c.
The parent
field is a ZFS dnode system attribute,
specifically ZPL_PARENT
.)
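In outline, the walk looks something like the following toy model. This
is not the real ZFS code (see zfs_obj_to_path for that); the lookup
table below merely stands in for the ZPL_PARENT attribute and the
directory entries, with the top two rows echoing the zdb output above
and the rest of the chain invented for illustration.

#include <stdio.h>

/* A toy model of turning an object number into a path by following
   each object's recorded parent. */
struct node {
    unsigned long obj;
    unsigned long parent;
    const char   *name;
};

static const struct node nodes[] = {
    { 1285414, 1284472, "b"   },
    { 1284472, 52906,   "a"   },
    { 52906,   777,     "tmp" },   /* parent object number invented */
    { 777,     4,       "cks" },   /* ditto */
    { 4,       4,       ""    },   /* the root directory is its own parent */
};

static const struct node *find(unsigned long obj)
{
    for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++)
        if (nodes[i].obj == obj)
            return &nodes[i];
    return NULL;
}

/* Walk up the parent chain, prepending each name as we go. */
static void obj_to_path(unsigned long obj, char *path, size_t len)
{
    char tmp[4096];

    path[0] = '\0';
    for (const struct node *n = find(obj); n != NULL && n->obj != n->parent;
         n = find(n->parent)) {
        snprintf(tmp, sizeof(tmp), "/%s%s", n->name, path);
        snprintf(path, len, "%s", tmp);
    }
}

int main(void)
{
    char path[4096];

    obj_to_path(1285414, path, sizeof(path));
    printf("object 1285414 is %s\n", path);   /* prints /cks/tmp/a/b */
    return 0;
}

The important property is that each step is a cheap, direct lookup;
nothing ever has to search the whole filesystem.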
If you're familiar with the twists and turns of Unix filesystems,
you're now wondering how ZFS deals with hardlinks, which can cause
a file to be in several directories at once and so have several
parents (and then it can be removed from some of the directories).
The answer is that ZFS doesn't; a dnode only ever tracks a single
parent, and ZFS accepts that this parent information can be
inaccurate. I'll quote the comment in zfs_obj_to_pobj
:
When a link is removed [the file's] parent pointer is not changed and will be invalid. There are two cases where a link is removed but the file stays around, when it goes to the delete queue and when there are additional links.
Before I get into the details, I want to say that I appreciate the
brute force elegance of this cheat. The practical reality is that
most Unix files today don't have extra hardlinks, and when they do
most hardlinks are done in ways that won't break ZFS's parent
stuff. The result is that ZFS has picked an efficient implementation
that works almost all of the time; in my opinion, the great benefit
we get from having it around is more than worth the infrequent
cases where it fails or malfunctions. Both zfs diff
and having
filenames show up in zpool status
permanent error reports are
very useful (and there may be other cases where this gets used).
The current details are that any time you hardlink a file to somewhere
or rename it, ZFS updates the file's parent
to point to the new
directory. Often this will wind up with a correct parent
even
after all of the dust settles; for example, a common pattern is to
write a file to an initial location, hardlink it to its final
destination, and then remove the initial location version. In this
case, the parent
will be correct and you'll get the right name.
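In C, that pattern looks something like this sketch (the paths are made
up); because the link() into the final directory is the last thing that
touches parent before the staging name goes away, the recorded parent
ends up pointing at the right place:

#include <stdio.h>
#include <unistd.h>

/* Sketch of the 'stage the file, hardlink it into place, remove the
   staging name' pattern, with made-up paths. Per the behavior described
   above, the link() updates the file's recorded parent to the final
   directory, and the following unlink() of the staging name leaves that
   parent correct. */
int main(void)
{
    const char *staging = "/data/tmp/article.new";
    const char *final   = "/data/articles/article";
    FILE *fp = fopen(staging, "w");

    if (fp == NULL) {
        perror(staging);
        return 1;
    }
    fputs("some content\n", fp);
    fclose(fp);

    if (link(staging, final) == -1) {   /* parent now points at /data/articles */
        perror("link");
        return 1;
    }
    if (unlink(staging) == -1) {        /* drop the staging name */
        perror("unlink");
        return 1;
    }
    return 0;
}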
The time when you get an incorrect parent
is this sequence:
; mkdir a b
; touch a/demo
; ln a/demo b/
; rm b/demo
Here a/demo
is the remaining path, but demo
's dnode will claim
that its parent is b
. I believe that zfs diff
will even report
this as the path, because the kernel doesn't do the extra work
to scan the b
directory to verify that demo
is present in it.
(This behavior is undocumented and thus is subject to change at the convenience of the ZFS people.)