My interesting experience with rapid repeated PID rollover on Linux

June 30, 2018

In the normal course of events, PID rollover is one of those things that I know about but don't really expect to ever actively observe, especially on a workstation-style machine. Linux may roll over PIDs after reaching PID 32767 (by default), but to do this in even four days requires a consistent process creation and recycling rate of over five a minute for those four days, and my machines aren't that busy, at least not unless there's something very unusual going on. So I was kind of surprised when one day I noticed in passing that my office machine had reached PID 4,097,894 despite it only having been up for a relatively short time. I felt that clearly something was wrong, probably in the kernel, and so at first my suspicions fell on ZFS on Linux kernel threads, because that's what stood out. Since I use the ZoL development git tip, it also matched my initial belief that the issue had only appeared recently. Both parts of this turned out to be wrong.
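(To make the "over five a minute" figure concrete, here is a minimal sketch of the arithmetic. The function name is made up for illustration, and it assumes the default limit of 32768 PIDs with every PID used exactly once before rollover, which ignores details such as low PIDs the kernel may skip.)

```python
def min_pid_rate_per_minute(pid_limit=32768, days=4):
    """PIDs that must be consumed per minute to roll over within `days`.

    Simplification: assumes every PID below `pid_limit` is handed out
    exactly once before the counter wraps around.
    """
    minutes = days * 24 * 60
    return pid_limit / minutes


if __name__ == "__main__":
    rate = min_pid_rate_per_minute()
    print(f"needed rate: {rate:.2f} PIDs per minute")
```

This comes out somewhere between five and six PIDs a minute, sustained around the clock for the whole four days.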

To cut a long story short, it turned out that building Go from source routinely churns through 200,000 or more PIDs (mostly from running its self-tests). Go's build process has probably been doing this for some time, but I didn't notice until a package set the kernel.pid_max sysctl and so preserved the high PIDs that resulted from me routinely rebuilding Go. Before kernel.pid_max was raised from Linux's default of 32K, the Go build process had been silently causing PID rollovers at a rate of about once a minute during tests. I also do full builds of Firefox from source every so often, and while they only churn through about 20,000 PIDs that's still enough to probably provoke a PID rollover on a reasonably common basis.
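(As a rough sketch of what these numbers mean, the following reads the current limit from /proc/sys/kernel/pid_max, the file behind the kernel.pid_max sysctl, and estimates how many times a given amount of PID churn would wrap the PID space. The helper names are hypothetical, the PID counts are the approximate ones from above, and the estimate is a simple division that ignores any reserved low PIDs.)

```python
from pathlib import Path

DEFAULT_PID_MAX = 32768  # Linux's default "32K" limit


def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Return the current kernel.pid_max, or the default if it can't be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return DEFAULT_PID_MAX


def estimated_rollovers(pids_consumed, pid_max=DEFAULT_PID_MAX):
    """Estimate how many times `pids_consumed` PIDs wrap a space of `pid_max`."""
    return pids_consumed / pid_max


if __name__ == "__main__":
    print("current pid_max:", read_pid_max())
    for name, churn in (("Go build and test", 200_000), ("Firefox build", 20_000)):
        print(f"{name}: about {estimated_rollovers(churn):.1f} rollovers at 32K")
```

At the default limit, the Go figure works out to roughly six rollovers per build, while a Firefox build uses up a bit more than half of the PID space, so whether it rolls over depends on where the PID counter happened to be when the build started.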

What's striking to me about this experience is that I didn't notice anything. Until I saw the chronyd log entry with its startlingly high PID, I had no idea this was happening, and it had been happening for months and months on a routine basis. I didn't notice during the PID rollovers and I didn't notice afterward, which means that nothing broke in any visible way from rapid, frequent PID rollovers. As someone who's been using Unix for some time, this is both pleasant and a bit startling. It's certainly how things are supposed to work, but it hasn't necessarily been how they actually worked in practice. I guess programs have just evolved to the point where they're basically fine with both PID rollover and use of high PIDs.

Sidebar: What's behind this PID rollover

The difference in PID churn between Firefox's build process and Go's build and test process is instructive. Firefox churns through 20,000 PIDs, apparently mostly by running tons of compiles; at a rough ballpark estimate it's using slightly under one PID per CPU per second over its entire build process (although some of it is irritatingly serialized). This is more or less the traditional highly parallel build experience.

Go builds about three to four times faster and doesn't seem to be starting lots of processes while it does so, despite churning through ten times as many PIDs. Instead, it appears to use a great deal of internal parallelism during both compilation and especially testing. Go's test phase not only saturates your CPUs, it also appears to go through (real) threads at a relatively ferocious rate. Since creating and destroying threads within a process is much faster than starting new processes, Go can churn through PIDs here at a rate that building Firefox can't even come close to.
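(Threads churn through PIDs because on Linux every thread gets its own kernel thread ID, and those IDs are allocated from the same number space as process IDs. Here is a minimal sketch that makes this visible. It is in Python rather than Go and says nothing about how the Go runtime manages its threads; the function name is made up for illustration, and threading.get_native_id() needs Python 3.8 or later.)

```python
import os
import threading


def native_ids_of_short_lived_threads(count=5):
    """Start `count` threads one after another; return each one's kernel thread ID."""
    ids = []

    def record():
        # Runs inside the new thread, so this is that thread's own ID.
        ids.append(threading.get_native_id())

    for _ in range(count):
        t = threading.Thread(target=record)
        t.start()
        t.join()  # the thread is gone before the next one starts
    return ids


if __name__ == "__main__":
    print("process PID:", os.getpid())
    print("thread IDs: ", native_ids_of_short_lived_threads())
```

On a Linux machine the thread IDs printed here normally keep climbing, even though only one extra thread exists at any given time. Each short-lived thread uses up a fresh ID, the same way a short-lived process would.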

(This is a little bit surprising to me. Go code makes heavy use of goroutines, but they're normally multiplexed onto a much smaller number of real threads and my impression was that real threads are generally preserved, not aggressively created and discarded. It's possible that various Go tests cause large sudden surges in the need for real threads and then after such a test finishes the Go runtime decides it has too many threads for its current needs (with a new set of less demanding concurrent tests) and throws most of them away.)

Written on 30 June 2018.