Rasdaemon is what you want on Linux if you're seeing kernel MCE messages

August 18, 2022

Suppose, not hypothetically, that you're seeing messages like this in your kernel logs:

mce: [Hardware Error]: Machine check events logged

As explained in the Arch wiki entry on "Machine check exceptions", an MCE is generated by your CPU when the CPU detects that some sort of hardware error or problem has happened.

By itself, the kernel doesn't do anything more than log these very non-specific messages. If you want to know exactly which machine check exceptions happened, you need something that pulls additional information out of the kernel and the hardware. The program that the Arch wiki will refer you to, and that seems to mostly work for us, is rasdaemon, which replaces the earlier mcelog. On Ubuntu, just installing the 'rasdaemon' package will do everything necessary.

On our AMD Zen based machines, all of the rasdaemon reports that we've seen create log messages that look like this:

<...>-2676499 [000]     0.682548: mce_record:           2022-08-18 12:21:31 -0400 Unified Memory Controller (bank=16), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=0, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01a000101000000, addr= 4dd42ac0, synd= 89010a400200, ipid= 9600150f00, mcgstatus=0, mcgcap= 117, apicid= 0
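If you want to pull individual fields out of a report like this (say, to count errors per memory channel), a small parser is enough. The following is a minimal sketch that assumes the report keeps the comma-separated "key= value" layout shown above; the names `FIELD_RE`, `SAMPLE` and `parse_mce_fields` are made up for illustration and aren't part of rasdaemon.

```python
import re

# Matches "key=value" or "key= value", up to the next comma or end of line.
FIELD_RE = re.compile(r"(\w+)=[ \t]*([^,\n]*)")

# The example report from above, kept as two lines.
SAMPLE = (
    "<...>-2676499 [000]     0.682548: mce_record:           "
    "2022-08-18 12:21:31 -0400 Unified Memory Controller (bank=16), "
    "status= 9c2040000000011b, Corrected error, no action required., "
    "mci=CECC, mca= DRAM ECC error.\n"
    "Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', "
    "memory_channel=1,csrow=0, cpu_type= AMD Scalable MCA, cpu= 0, "
    "socketid= 0, misc= d01a000101000000, addr= 4dd42ac0, "
    "synd= 89010a400200, ipid= 9600150f00, mcgstatus=0, mcgcap= 117, "
    "apicid= 0"
)


def parse_mce_fields(text):
    """Return a dict of the key=value fields found in a rasdaemon report."""
    fields = {}
    for key, value in FIELD_RE.findall(text):
        # "(bank=16)" leaves a trailing ")" on the value; drop it.
        fields[key] = value.strip().rstrip(")")
    return fields


if __name__ == "__main__":
    for key, value in parse_mce_fields(SAMPLE).items():
        print(f"{key:15} {value}")
```

Free-text parts of the report that have no "key=" in front of them, such as "Corrected error, no action required.", are ignored by this sketch.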

If you missed these messages in the logs, you can (on Ubuntu) also see them with 'ras-mc-ctl --errors':

1 2022-08-18 12:21:31 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=16), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x00000117, status=0x9c2040000000011b, addr=0x4dd42ac0, misc=0xd01a000101000000, walltime=0x62fe670a, cpuid=0x00800f82, bank=0x00000010
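The 'walltime' field in this output appears to be a Unix timestamp written in hexadecimal. That is an assumption based on this one record rather than on rasdaemon's documentation, but decoding it lands within a second of the time printed at the start of the line. A minimal sketch (the function name `walltime_to_datetime` is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def walltime_to_datetime(hex_walltime, utc_offset_hours=0):
    """Interpret a hex 'walltime' value as seconds since the Unix epoch.

    Assumption: walltime is a plain Unix timestamp; this matches the
    sample record but is not taken from rasdaemon's documentation.
    """
    seconds = int(hex_walltime, 16)  # int() accepts the "0x" prefix with base 16
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(seconds, tz)


if __name__ == "__main__":
    # The record above was logged in a -0400 timezone.
    print(walltime_to_datetime("0x62fe670a", utc_offset_hours=-4).isoformat())
```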

There's probably a way to work out from this message that it's a (corrected) DRAM ECC error, but it's not as obvious as what rasdaemon puts in the system logs. As a result, I prefer to look at the system logs and, at the moment, I consider the ras-mc-ctl database to be just a backup. However, according to "Monitoring ECC memory on Linux with rasdaemon", ras-mc-ctl can be used to see if a particular DIMM is having problems. That article also discusses how to map which DIMM is which and give them nice labels.
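One way to get part of the way there is to decode the 'status' value by hand. The sketch below assumes the usual x86 machine check status layout (the VAL, OVER, UC, EN, MISCV, ADDRV and PCC flags in bits 63 down to 57, and the MCA error code in the low 16 bits) and assumes that on AMD Scalable MCA systems bit 46 is the 'CECC' (corrected ECC) flag. Check these bit positions against your CPU vendor's documentation before relying on them; the function name `decode_status` is made up for illustration.

```python
# Architectural flag bits in an x86 machine check status register (assumed layout).
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}
AMD_CECC_BIT = 46  # assumption: AMD-specific "corrected ECC error" flag

# Request types (RRRR) for "memory hierarchy" error codes of the
# form 0000 0001 RRRR TTLL.
REQUESTS = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}


def decode_status(status):
    """Split a 64-bit MCE status value into flags and the MCA error code."""
    result = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    result["CECC"] = bool((status >> AMD_CECC_BIT) & 1)
    code = status & 0xFFFF
    result["error_code"] = code
    # Only decode the request type for memory hierarchy error codes.
    if (code & 0xEF00) == 0x0100:
        result["request"] = REQUESTS.get((code >> 4) & 0xF, "unknown")
    return result


if __name__ == "__main__":
    for name, value in decode_status(0x9C2040000000011B).items():
        print(name, value)
```

For the status value above, UC (uncorrected) is clear and CECC is set, which agrees with rasdaemon's "Corrected error" and "mci=CECC". The request type decodes as "generic read", which agrees with its "mem-tx: generic read". None of this tells you that the error came from DRAM; rasdaemon works that out from the bank.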

On our Ubuntu servers (which are a mixture of AMD and Intel CPUs), it appears harmless to install rasdaemon on machines that aren't experiencing memory errors, both on servers and on more desktop-focused motherboards and systems. Unfortunately this hasn't been my experience on my office Fedora desktop, where running rasdaemon seems to produce a stream of unclear complaints from both rasdaemon and abrt-server.

So far we've only captured MCEs from AMD Zen CPUs with rasdaemon (and only for DRAM ECC errors). We had one Intel-based machine with an apparently bad DIMM that would produce complaints, but we swapped out the DIMM before we got around to installing rasdaemon.

(There doesn't seem to be a great overview of MCE errors under Linux, with the kind of information that would let me understand these rasdaemon messages and some of the configuration it would like. For what there is, see e.g. the mcelog glossary page and some Linux kernel documentation.)

Written on 18 August 2022.

Last modified: Thu Aug 18 21:13:01 2022