Wandering Thoughts archives

2024-02-12

Linux kernel boot messages and seeing if your AMD system has ECC

Consumer x86 desktops have generally not supported ECC memory, at least not if you wanted the 'ECC' bit to actually do anything. With Intel this seems to have been an issue of market segmentation, but things with AMD were more confusing. The initial AMD Ryzen series seemed to generally support ECC in the CPU, but the motherboard support was questionable, and even if your motherboard accepted ECC DIMMs there was an open question of whether the ECC was doing anything on any particular motherboard (cf). Later Ryzens have apparently had an even more confusing ECC support story, but I'm out of touch on that.

When we put together my work desktop we got ECC DIMMs for it and I thought that theoretically the motherboard supported ECC, but I've long wondered if it was actually doing anything. Recently I was looking into this a bit for reasons and ran across Rain's 'ECC RAM on AMD Ryzen 7000 desktop CPUs', which contained some extremely useful information about how to tell from your boot messages on AMD systems whether ECC is active. I'm going to summarize this and add some extra information I've dug out of things.

Modern desktop CPUs talk to memory themselves, but not quite directly from the main CPU; instead, they have a separate on-die memory controller. On AMD Zen series CPUs, this is the AMD Unified Memory Controller, and there are special interfaces to talk to it. As I understand things, ECC is handled (or not) in the UMC, where it receives the raw bits from your DIMMs (if your DIMMs are wide enough, which you may or may not be able to tell). Therefore, to have ECC support active, you need ECC DIMMs and for ECC to be enabled in your UMC (which I believe is typically controlled by the BIOS, assuming the UMC supports ECC, which depends on the CPU).

In Linux, reporting and managing ECC is handled through a general subsystem called EDAC, with specific hardware drivers. The normal AMD EDAC driver is amd64_edac, and as covered by Rain, it registers for memory channels only if the memory channel has ECC on in the on-die UMC. When this happens, you will see a kernel message to the effect of:

EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)

It follows that if you do see this kernel message during boot, you almost certainly have fully supported ECC on your system. It's very likely that your DIMMs are ECC DIMMs, your motherboard supports ECC in the hardware and in its BIOS (and has it enabled in the BIOS if necessary and applicable), and your CPU is willing to do ECC with all of this. Since the above kernel message comes from my office desktop, it seems almost certain that it does indeed fully support ECC, although I don't think I've ever seen any kernel messages about detecting and correcting ECC issues.

You can see more memory channels in larger systems and they're not necessarily sequential; one of our large AMD machines has 'MC0' and 'MC2'. You may also see a message about 'EDAC PCI0: Giving out device to [...]', which is about a different thing.
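If you want to check for these messages mechanically, here is a minimal Python sketch that picks the 'Giving out device' lines out of kernel log text. The regular expression is only inferred from the example messages quoted in this entry, and the function name is made up for illustration, so treat it as an assumption rather than a definitive description of the kernel's message format.

```python
import re

# Pattern inferred from the messages quoted in this entry, e.g.:
#   EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
# Other kernel versions or drivers may format this differently.
EDAC_MC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): Giving out device to module (?P<module>\S+) "
    r"controller (?P<controller>.+?): DEV (?P<dev>\S+) \((?P<mode>\w+)\)"
)


def find_edac_controllers(log_text):
    """Return one dict per 'EDAC MC<n>: Giving out device' line found."""
    found = []
    for line in log_text.splitlines():
        # search() rather than match(), so a leading dmesg timestamp is fine.
        m = EDAC_MC_RE.search(line)
        if m:
            info = m.groupdict()
            info["mc"] = int(info["mc"])
            found.append(info)
    return found


if __name__ == "__main__":
    # Sample input is the AMD message quoted above.
    sample = (
        "EDAC MC0: Giving out device to module amd64_edac "
        "controller F17h: DEV 0000:00:18.3 (INTERRUPT)"
    )
    for ctrl in find_edac_controllers(sample):
        print(ctrl["mc"], ctrl["module"], ctrl["controller"], ctrl["mode"])
```

To use it on a real machine you would feed it the text of your kernel log (for example, saved output from dmesg). An empty result means no such line was seen in that text, which is not quite the same thing as proof that there is no ECC.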

In the normal Linux kernel way, various EDAC memory controller information can be found in sysfs under /sys/devices/system/edac/mc (assuming that you have anything registered, which you may not on a non-ECC system). This appears to include counts of corrected errors and uncorrected errors both at the high level of an entire memory controller and at the level of 'rows', 'ranks', and/or 'dimms' depending on the system and the kernel version. You can also see things like the memory EDAC mode, which could be 'SECDED' (what my office desktop reports) or 'S8ECD8ED' (what a large AMD server reports).

(The 'MC<n>' number reported by the kernel at boot time doesn't necessarily match the /sys/devices/system/edac/mc<n> number. We have systems which report 'MC0' and 'MC2' at boot, but have 'mc0' and 'mc1' in sysfs.)
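As a rough illustration of reading these counts, the following Python sketch walks the 'mc<n>' directories and reads the per-controller corrected and uncorrected error counts. It assumes the counts live in files called 'ce_count' and 'ue_count' directly under each 'mc<n>' directory, which is my understanding of the EDAC sysfs layout but may differ on your kernel; the function name is hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": N, "ue_count": M}, ...}.

    Assumption: each mc<n> directory has 'ce_count' and 'ue_count' files.
    A count that is missing or unreadable is reported as None.
    """
    results = {}
    base = Path(root)
    if not base.is_dir():
        return results  # no EDAC sysfs tree at all
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        counts = {}
        for name in ("ce_count", "ue_count"):
            try:
                counts[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                counts[name] = None
        results[mc_dir.name] = counts
    return results


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers registered")
    for mc, c in counts.items():
        print(mc, "corrected:", c["ce_count"], "uncorrected:", c["ue_count"])
```

An empty result corresponds to the case of having no 'mc<n>' subdirectories, which is what a system without active ECC would be expected to show.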

The Prometheus host agent exposes this EDAC information as metrics, primarily in node_edac_correctable_errors_total and node_edac_uncorrectable_errors_total. We have seen a few corrected errors over time on one particular system.
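If you have the host agent's metrics output as text, a small Python sketch like the following can total up these two metrics across all memory controllers. It assumes the usual Prometheus text format of 'name{labels} value' with no trailing timestamp. The 'controller' label and the numbers in the sample are purely illustrative, not real data from our systems; the parser ignores labels entirely.

```python
EDAC_METRICS = (
    "node_edac_correctable_errors_total",
    "node_edac_uncorrectable_errors_total",
)


def sum_edac_metrics(metrics_text):
    """Sum the two EDAC error metrics over all label sets."""
    totals = {name: 0.0 for name in EDAC_METRICS}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes 'name{labels} value' with the value as the last field.
        name_part, _, value = line.rpartition(" ")
        metric = name_part.split("{", 1)[0]
        if metric in totals:
            totals[metric] += float(value)
    return totals


if __name__ == "__main__":
    # Illustrative sample only; label name and values are made up.
    sample = """\
# TYPE node_edac_correctable_errors_total counter
node_edac_correctable_errors_total{controller="0"} 2
node_edac_correctable_errors_total{controller="1"} 0
node_edac_uncorrectable_errors_total{controller="0"} 0
"""
    print(sum_edac_metrics(sample))
```

In a real Prometheus setup you would more likely query or alert on these metrics directly; this is only meant to show what the data looks like.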

Sidebar: EDAC on Intel hardware

While there's an Intel memory controller EDAC driver, I don't know if it can get registered even if you don't have ECC support. If it is registered with identified memory controllers, and you can see eg 'SECDED' as the EDAC mode in /sys/devices/system/edac/mc/mcN, then I think you can be relatively confident that you have ECC active on that system. On my home desktop, which definitely doesn't support ECC, what I see on boot for EDAC (with Fedora 38's kernel 6.7.4) is:

EDAC MC: Ver: 3.0.0
EDAC ie31200: No ECC support
EDAC ie31200: No ECC support

As expected there are no 'mcN' subdirectories in /sys/devices/system/edac/mc.

Two Intel servers where I'm pretty certain we have ECC support report, respectively:

EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:64:0a.0 (INTERRUPT)

and

EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)

As we can see here, Intel CPUs have more than one EDAC driver, depending on CPU generation and so on. The first EDAC message comes from a system with a Xeon Silver 4108, the second from a system with a Xeon E3-1230 v5.

AMDWithECCKernelMessages written at 22:37:18

