2021-01-29
Illustrating the importance of fully multi-core program building today
I have an enduring interest in comparing the from-scratch Firefox build times on my office AMD machine and my home Intel machine, which are from the same era and have very similar configurations but build Firefox in drastically different amounts of time. One of the things that I have noticed about the difference, and about building Firefox in general, is, well, I will quote my tweet:
One major area where the Firefox Nightly build takes longer on my AMD machine than on my Intel one is the end stage of building the Rust webrender, webrender_bindings, and especially gkrust. This seems to be single-core in Rust, so the lower single-core performance really hurts.
As it happens, I'm sort of wrong here, at least in general (writing Wandering Thoughts gives me plenty of opportunities to be wrong once I research things more). I don't know much about Firefox's overall build process, but it definitely builds its Rust components using Cargo, the standard Rust tool for this. Cargo itself will do building in parallel by default and Firefox doesn't turn that off; as a result, there's a cargo process running from very early on in the Firefox build process with multiple concurrent rustc processes beneath it.
(I don't know how the Firefox build process balances the Cargo concurrency with the simultaneous C++ compilation concurrency it's getting. It doesn't seem to invoke cargo with any special flags to limit concurrency, but it also doesn't flood my machine.)
However, toward the end of the Firefox build process, my AMD machine will spend a significant portion of the build time (multiple minutes) with rustc running alone on a single core, apparently primarily building gkrust itself. This single-core build time is a clear bottleneck in building Firefox on my AMD machine (and is visible to some extent on my Intel one). Since rustc's memory usage keeps climbing, this may be some final step of assembling the gkrust crate together instead of actually compiling things, but it's still a clear single-core bottleneck. Depending on how long the whole process takes, this single-core Rust time can be a quarter of my entire Firefox build time on my AMD machine.
I'm not picking on Rust here; it's just that Firefox and Rust's role in building it make a handy example. Building things concurrently is hard in general, and if the single-core bottleneck is the linking stage, that's even harder; linking has historically been challenging to make into a multi-threaded activity. But at the same time it's increasingly important to do as much as you can here, in both the language and the build system. Any single-threaded build stage in a large program can kill build speeds.
(This is kind of an inverted version of Amdahl's law. Although I suppose if the final rustc is churning through a lot of memory, that might not help, especially if it's relatively random memory accesses; RAM latency remains comparatively terrible, and my office AMD machine doesn't have the fastest memory.)
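As a concrete illustration of the Amdahl's law effect, here is a little back of the envelope calculation (with entirely made up numbers, not measurements from my machines) of how a fixed single-core stage comes to dominate the build as you add cores:

  # Back of the envelope Amdahl's law arithmetic with made up numbers.
  # The parallel part of the build is a fixed amount of CPU time spread
  # over however many cores you have; the serial rustc stage stays the
  # same no matter what.

  def build_time(serial_min, parallel_cpu_min, cores):
      """Wall clock build time in minutes for a given core count."""
      return serial_min + parallel_cpu_min / cores

  serial_min = 5.0          # hypothetical single-core rustc stage
  parallel_cpu_min = 300.0  # hypothetical CPU time of the parallel part

  for cores in (8, 16, 32, 64):
      total = build_time(serial_min, parallel_cpu_min, cores)
      print(f"{cores:2d} cores: {total:5.1f} minutes, serial stage is "
            f"{serial_min / total:.0%} of the build")

No matter how many cores you add, the build never drops below the serial stage's five minutes, and that stage's share of the total only grows as everything else gets faster.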
Forecasting drive failures is not always as useful as it sounds
Recently, I said that we've found a SMART attribute that can predict SSD failures in our environment (and it later did predict the failure of one more SSD). This sounds great, but in practice it's turned out to be less useful than it might seem. The reason for this is pretty simple: suppose that we have a good indication that a drive is going to fail at some time in the future, but not when. What should we do about it?
(I'm going to assume the drive doesn't have any actual problems, like periodic read errors, just some SMART attributes that make you expect it will fail in the future.)
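For concreteness, the sort of check involved looks roughly like this sketch (the device, attribute, and threshold here are made up for illustration; the real attribute and the value that worries us are specific to our drives):

  # Rough sketch of warning on one SMART attribute via smartctl.  The
  # device, attribute, and threshold are made up for illustration, and
  # the exact 'smartctl -A' output layout varies somewhat between drives.
  import subprocess

  DEVICE = "/dev/sda"
  ATTRIBUTE = "Reallocated_Sector_Ct"   # hypothetical attribute choice
  THRESHOLD = 0                         # warn on any raw value above this

  def smart_raw_value(device, attribute):
      """Return the raw value of a SMART attribute, or None if absent."""
      out = subprocess.run(["smartctl", "-A", device],
                           capture_output=True, text=True).stdout
      for line in out.splitlines():
          fields = line.split()
          if len(fields) >= 10 and fields[1] == attribute:
              try:
                  return int(fields[9])
              except ValueError:
                  return None
      return None

  value = smart_raw_value(DEVICE, ATTRIBUTE)
  if value is not None and value > THRESHOLD:
      print(f"warning: {DEVICE} {ATTRIBUTE} raw value is now {value}")

Whether something like this runs as a one-off script or inside your regular monitoring system, the hard part is not the check itself; it's deciding what to actually do once the warning fires.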
The core issue is that in a well run server environment, preemptively replacing a probably failing disk is a tradeoff. It effectively moves a 'failure' forward in time and puts it in a situation where you (probably) control things. If you can't forecast when the disk is likely to fail (just that it's probably going to at some indefinite point in the future), you don't know how much you've moved the failure forward; you might have moved it a lot, losing a lot of useful life from the disk.
(In many cases you won't be able to return the disk for a warranty replacement merely because you don't like some SMART attributes. You may be able to put the disk in a test or scratch system and let it run to failure, then get it replaced under warranty.)
I say 'in a well run server environment' because in such an environment, the failure of a single disk should never cause serious problems like the loss of important data or having a vital system down. As has been observed many times, plenty of disks fail with no changes to SMART data at all and many more fail with no definite signs of problems; you cannot count on SMART to let you fix a failing disk before it causes an explosion.
(I'll admit that we're not perfect here.)
On the flip side of this, the less certain it is that the drive is going to fail in the reasonably near future, the higher the costs of a preemptive replacement are. We might even replace a perfectly serviceable disk, unnecessarily costing ourselves some time and money for nothing (and perhaps some disruption too, since not all of our systems have hot swap disks). And the easier it is to preemptively replace a probably failing disk without disruption or much cost, the easier it generally is to do the same thing when the disk actually does fail.
In concrete terms, the SSD that failed recently had SMART signs we recognized at the end of November, and it started triggering a warning at the start of January. We still let it run into the ground on its own. It simply didn't seem worth it to act earlier, and we weren't completely sure that the SSD would actually fail (and we certainly couldn't have forecast when, for various reasons).
(If we had lots of money, sure, we would replace drives preemptively in this sort of situation and eat the cost. But system administration is partly about prioritization, about getting the most value for your limited resources.)