Sometimes, the problem is in a system's BIOS

June 17, 2019

We have quite a number of donated Dell C6220 blade servers, each of which is a dual socket machine with Xeon E5-2680s. Each E5-2680 is an 8-core CPU with HyperThreads, so before we turned SMT off the machines reported as having 32 CPUs. That drops to 16 if you either turn SMT off or have to disable one socket (and if you have to do both, you're down to 8 CPUs). These days, most of these machines have been put in a SLURM-based scheduling system that provides people with exclusive access to compute servers.
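
(As a small sketch of the arithmetic here, the reported CPU count is just sockets times cores per socket times threads per core. The function name is made up for illustration.)

```python
def logical_cpus(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Number of CPUs the OS reports for a given topology."""
    return sockets * cores_per_socket * threads_per_core


# Dual-socket machine with 8-core E5-2680s:
full = logical_cpus(2, 8, 2)        # both sockets, SMT on
smt_off = logical_cpus(2, 8, 1)     # both sockets, SMT off
one_socket = logical_cpus(1, 8, 2)  # one socket disabled, SMT on
both = logical_cpus(1, 8, 1)        # one socket disabled, SMT off

print(full, smt_off, one_socket, both)
```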

Recently, we noticed that the central SLURM scheduler was reporting that one of these machines had two CPUs instead of the 32 it had at the time. When we investigated, this turned out not to be some glitch in the SLURM daemons or a configuration mistake, but actually what the Linux kernel was seeing. Specifically, as far as the kernel could see, the system was a dual socket system with each socket having exactly one CPU (still Xeon E5-2680s, though). Although we don't know exactly how it happened, this was ultimately due to BIOS settings; when my co-worker went into the BIOS to check things, he found that the BIOS was set to turn off both SMT and all extra cores on each socket. Turning on the relevant BIOS options restored the system to its full expected 32-CPU count.
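
(One way to see what the kernel sees is to summarize /proc/cpuinfo. What follows is a minimal sketch, not the check we actually ran; it assumes the x86 Linux format with 'processor', 'physical id' and 'core id' fields. The function names and the trimmed sample text are invented for illustration, with the sample mimicking a machine in the state described above.)

```python
def summarize_cpuinfo(text: str) -> dict:
    """Count sockets, physical cores and logical CPUs in /proc/cpuinfo text."""
    sockets = set()
    cores = set()
    logical = 0
    # Each logical CPU is one block of "key : value" lines; blocks are
    # separated by blank lines.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        socket = fields.get("physical id", "0")
        sockets.add(socket)
        # A core is identified by its (socket, core id) pair; SMT siblings
        # share the same pair.
        cores.add((socket, fields.get("core id", fields["processor"])))
    return {"sockets": len(sockets), "cores": len(cores), "logical_cpus": logical}


def read_local_cpuinfo(path: str = "/proc/cpuinfo") -> dict:
    """Summarize the running Linux machine (not called here)."""
    with open(path) as f:
        return summarize_cpuinfo(f.read())


# Hypothetical, heavily trimmed sample: two sockets, one core each, no SMT.
SAMPLE = (
    "processor\t: 0\nphysical id\t: 0\ncore id\t: 0\n"
    "\n"
    "processor\t: 1\nphysical id\t: 1\ncore id\t: 0\n"
)

print(summarize_cpuinfo(SAMPLE))
```

(On a real machine you would call read_local_cpuinfo() and compare the result against what the hardware is supposed to have.)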

(Based on information from our Prometheus metrics system, the problem started immediately after our power failure; we just didn't notice for about a month and a half. Apparently the BIOS default is to enable everything, so this is not as simple as a reversion to defaults.)
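
(Finding when a CPU count dropped comes down to scanning the historical samples for the first one below the expected value. This is only a sketch of the idea; it assumes you have already pulled (timestamp, CPU count) pairs out of your metrics system by some means, and the function name and sample values are made up.)

```python
from typing import Iterable, Optional, Tuple


def first_drop(samples: Iterable[Tuple[int, int]], expected: int) -> Optional[int]:
    """Return the timestamp of the first sample below the expected CPU count.

    samples must be (timestamp, cpu_count) pairs in time order.
    Returns None if the count never drops.
    """
    for timestamp, count in samples:
        if count < expected:
            return timestamp
    return None


# Hypothetical samples: the count falls from 32 to 2 at timestamp 300.
history = [(100, 32), (200, 32), (300, 2), (400, 2)]
print(first_drop(history, expected=32))
```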

If nothing else, this is a useful reminder to me that BIOSes can do weird things and can be set in weird ways. When no other explanation makes sense, well, the cause might be in the BIOS. It's worth checking, at least.

(We already knew this about Dell BIOSes, of course, because our Dell R210s and R310s came set with the BIOS disabling all but the first drive. When you normally use mirrored system disks, this is first mysterious and then rather irritating.)

Last modified: Mon Jun 17 23:16:37 2019