Compute GPUs can have odd failures under Linux (still)
Back in the early days of GPU computation, the hardware, drivers, and software were untrustworthy enough that our early GPU machines had to be specifically reserved by people, and that reservation gave them the ability to remotely power cycle the machine to recover it (this was in the days before our SLURM cluster). Things have gotten much better since then, including hardware and driver changes so that buggy programs can't hard-lock the GPU hardware. But every so often we run into odd failures where something funny is going on that we don't understand.
We have one particular SLURM GPU node that has been flaky for a while, with the specific issue being that every so often the NVIDIA GPU would throw up its hands and drop off the PCIe bus until we rebooted the system. This didn't happen every time it was used, or with any consistent pattern, although some people's jobs seemed to regularly trigger this behavior. Recently I dug up a simple-to-use GPU stress test program, and when this machine's GPU did its disappearing act this Saturday, I grabbed the machine, rebooted it, ran the stress test program, and promptly had the GPU disappear again. Success, I thought, and since it was Saturday, I stopped there, planning to repeat this process today (Monday) at work while doing various monitoring things.
Since I'm writing a Wandering Thoughts entry about it, you can probably guess the punchline. Nothing has changed on this machine since Saturday, but all day today the GPU stress test program could not make the GPU disappear. Not with the same basic usage I'd used on Saturday, and not with a different usage that took the GPU to full power draw and a reported temperature of 80C (a higher temperature and power draw than the GPU had been at when it disappeared, based on our Prometheus metrics). If I'd been unable to reproduce the failure at all with the GPU stress program, that would have been one thing, but reproducing it once and then never again is just irritating.
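Short of full Prometheus metrics, the quick and dirty way to keep an eye on a GPU during this sort of stress run is to poll nvidia-smi. Here's a minimal sketch in Python; it assumes nvidia-smi is on your $PATH, and the five second interval is arbitrary:

    #!/usr/bin/env python3
    # Minimal sketch: poll GPU temperature and power draw via nvidia-smi.
    import subprocess
    import time

    QUERY = ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,power.draw",
             "--format=csv,noheader"]

    while True:
        try:
            out = subprocess.run(QUERY, capture_output=True, text=True,
                                 timeout=10, check=True)
            print(out.stdout.strip(), flush=True)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            # A GPU that has dropped off the PCIe bus tends to make
            # nvidia-smi fail or hang, which is itself a useful signal.
            print("nvidia-smi failed; the GPU may have disappeared", flush=True)
        time.sleep(5)

Conveniently, nvidia-smi failing or hanging partway through a run is itself a pretty good sign that the GPU has done its disappearing act.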
(The machine is one we assembled from parts, with an RTX 4090 and a Ryzen Threadripper 1950X on an X399 Taichi motherboard that is probably not even vaguely running the latest BIOS, seeing as the base hardware was built many years ago, although the GPU has been swapped around since then. Everything is in a pretty roomy 4U case, but if the failure were consistent we'd have assumed cooling issues.)
I don't really have any theories for what could be going on, but I suppose I should try to find a GPU stress test program that exercises every last corner of the GPU's capabilities at full power rather than using only one or two parts at a time. On CPUs, different loads light up different functional units, and I assume the same is true on GPUs, so perhaps the problem is in one specific functional unit or a combination of them.
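To illustrate what I mean, here's a sketch of two deliberately different GPU loads. This assumes you have PyTorch with CUDA available and is not the stress test program we actually use; the sizes and durations are arbitrary. A loop of big matrix multiplies mostly exercises the GPU's arithmetic units, while a loop of big device-to-device copies mostly exercises its memory subsystem:

    #!/usr/bin/env python3
    # Sketch of two different GPU loads (assumes PyTorch with CUDA).
    # This is not our actual stress test; sizes and durations are arbitrary.
    import time
    import torch

    DEV = torch.device("cuda")

    def compute_heavy(seconds=60):
        # Large FP32 matrix multiplies: mostly the arithmetic units.
        a = torch.randn(8192, 8192, device=DEV)
        b = torch.randn(8192, 8192, device=DEV)
        c = torch.empty(8192, 8192, device=DEV)
        end = time.time() + seconds
        while time.time() < end:
            torch.mm(a, b, out=c)
            torch.cuda.synchronize()

    def memory_heavy(seconds=60):
        # Big device-to-device copies: mostly the memory subsystem.
        src = torch.randn(1 << 28, device=DEV)   # about 1 GiB of float32
        dst = torch.empty_like(src)
        end = time.time() + seconds
        while time.time() < end:
            dst.copy_(src)
            torch.cuda.synchronize()

    if __name__ == "__main__":
        compute_heavy()
        memory_heavy()

A GPU with a marginal problem in one of those areas could plausibly survive one load and fall over under the other, which is more or less what an 'exercise everything at once' stress test would have to cover.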
(Although this doesn't explain why the GPU stress test program was able to cause the problem on Saturday but not today, unless a full reboot didn't completely clear out the GPU's state. Possibly we should physically power this machine off entirely for long enough to dissipate any lingering state.)