2017-07-02
Re-applying CPU thermal paste fixed my CPU throttling issues
Back at the start of May, my office workstation started reporting thermal throttling problems when I had all four cores fully busy:
  kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
  kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
  kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
  kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
  kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
As usual, I stuck these messages into some Internet searches, and the general advice I found was to re-apply thermal paste to the CPU, because apparently inexpensive thermal paste can dry out and get less effective over time (this assumes that you've exhausted obvious measures like blowing out all of the dust and making sure your case fans are working). I put this off for various reasons, including that I was going on vacation and the procedure seemed kind of scary, but eventually things pushed me into doing it.
The short version: it worked. I didn't destroy my office machine's CPU, it was not too annoying to get the standard Intel CPU fan off and then on again, and after my re-application my office machine's CPU doesn't thermally throttle any more and runs reasonably cool. As measured by the CPU itself, when I build Firefox using all four cores the temperature now maxes out around 71 C, and this was what previously ran headlong into those thermal throttling issues (which I believe happen when the CPU reaches 90 C).
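(As an aside, the temperatures I'm quoting here are as reported by the CPU's own sensors, which on Linux show up through the hwmon interface in sysfs; that's also where tools like lm-sensors get them. If you want to watch them during a build, a minimal Python sketch along the following lines should do it. It assumes a 'coretemp' hwmon device with the usual temp*_input and temp*_label files, which is what I'd expect on an Intel machine but isn't guaranteed everywhere.)

  #!/usr/bin/env python3
  # Minimal sketch: print temperatures from a 'coretemp' hwmon device.
  # Assumes Linux's /sys/class/hwmon layout; exact labels vary by machine.
  import glob
  import os
  import time

  def coretemp_readings():
      readings = {}
      for hwmon in glob.glob("/sys/class/hwmon/hwmon*"):
          try:
              with open(os.path.join(hwmon, "name")) as f:
                  if f.read().strip() != "coretemp":
                      continue
          except OSError:
              continue
          for tfile in glob.glob(os.path.join(hwmon, "temp*_input")):
              lfile = tfile.replace("_input", "_label")
              try:
                  with open(lfile) as f:
                      label = f.read().strip()
                  with open(tfile) as f:
                      # hwmon reports millidegrees Celsius.
                      readings[label] = int(f.read().strip()) / 1000.0
              except OSError:
                  continue
      return readings

  while True:
      temps = coretemp_readings()
      print("  ".join("%s: %.1f C" % (l, t) for l, t in sorted(temps.items())))
      time.sleep(5)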
(Note that this is with an i5-2500 CPU, which has a 95 W TDP, and the stock Intel cooler. I could probably have gotten the temperature lower by also getting a better aftermarket cooler, but I didn't feel like trying to talk work into spending the extra money for that. Especially when I want to replace the machine anyway.)
In fact my office machine's CPU is now clearly cooler than my identical home machine's CPU while doing the same Firefox build. The situation is not completely comparable (my home machine has probably been exposed to more dust than my work machine, although I try to keep it cleaned out), but this suggests that maybe my home machine would also benefit from me redoing its CPU thermal paste. Alternatively I could get around to replacing it with a new home machine, which would hopefully render the issue mostly moot (although if I wind up assembling said new home machine myself, I'll get to apply CPU thermal paste to it).
(It wouldn't be entirely moot, because I'd like to have my current home machine be a functioning backup for any new machine since I don't have a laptop or other additional spare machine lying around.)
PS: I used ArctiClean to clean off the CPU's old thermal paste and Arctic Silver 5 as the new thermal paste. I didn't do any particular research on this, I just picked products that I'd heard of and people seem to talk about favorably.
(This sort of follows up my earlier mention of this.)
Moving to smaller fileservers for us probably means no more iSCSI SAN
In our environment, one major thing that drives us towards relatively big fileservers is aggregating and lowering the overhead of servers. Regardless of how big or small it is, any server has a certain minimum overhead cost due to needing things like a case and power supply, a motherboard, and a CPU. The result of this per-server overhead is economies of scale; a single server with 16 disk bays almost certainly costs less than two servers with 8 disk bays each.
We have a long history of liking to split our fileservers from our disk storage. Our current fileservers and our past generation of fileservers have both used iSCSI to talk to backend disk enclosures, and the generation of fileservers before them used Fibre Channel to talk to FC hardware RAID boxes. Splitting up the storage from the fileservice this way requires buying extra machines, which costs more; what has made this affordable is aggregating a fairly decent amount of disks in each box, so we don't have to buy too many extra ones.
If we're going to have smaller fileservers, as I've come to strongly believe we want, we're going to need more of them. If we're going to keep a similar design to our current setup, we would need more iSCSI backends to go with them. All of this means more machines and more costs. In theory we could lower costs by continuing to use 16-disk backends and share them between (smaller) fileservers (so two new fileservers would share a pair of backends), but in practice this would make our multi-tenancy issues worse and we would likely resist the idea fairly strongly. And we'd still be buying more fileservers.
If we want to buy a similar number of machines in total for our next generation fileservers but shrink the number of disks and the amount of storage space supported by each fileserver, the obvious conclusion is that we must get rid of the iSCSI backends. Hosting disks on the fileservers themselves has some downsides (per my entry about our SAN tradeoffs), but at a stroke it cuts the number of machines per fileserver from three to one. We could double the number of fileservers and still come out ahead on raw machine count. In a 10G environment, it also eliminates the need for two expensive 10G switches for the iSCSI networks themselves (and we'd want to go to 10G for the next generation of fileservers).
If we want to reduce the size of our fileservers but keep an iSCSI environment, we're almost certainly going to be faced with unappetizing tradeoffs. Considering the cost of 10G switch ports as well as everything else, our most likely choice would be to stop using two backends per fileserver; instead each fileserver would talk to a single 16-disk iSCSI backend (still using mirrored pairs of disks). This would increase the overall number of servers, but not hugely (we would go from 9 servers total for our HD-based production fileservers to 12 servers; the three fileservers would become six, and then we'd need six backends to go with them).
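(To make the machine count arithmetic concrete, here is the back of the envelope version for our HD-based production fileservers; the numbers are just the ones above, written out.)

  # Back of the envelope machine counts for the HD-based production fileservers.
  fileservers = 3

  # Current design: each fileserver plus two 16-disk iSCSI backends.
  current = fileservers * (1 + 2)            # 9 machines

  # Smaller fileservers with local disks: twice as many fileservers, no backends.
  local_disks = (fileservers * 2) * 1        # 6 machines

  # Smaller fileservers that keep iSCSI, with one 16-disk backend each.
  one_backend = (fileservers * 2) * (1 + 1)  # 12 machines

  print(current, local_disks, one_backend)   # 9 6 12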
(It turns out that I also wrote about this issue a couple of years ago. At the time we weren't as firmly convinced that our current fileservers were too big as designed, although we were certainly thinking about it, and I was less pessimistic about the added cost of extra servers if we shrank how big each fileserver is and so needed more of them. (Or maybe the extra costs just hadn't struck me yet.))