What it means to support ECC RAM (especially for AMD Ryzen)

November 8, 2017

Ever since the AMD Ryzen series of CPUs was introduced, there's been a lot of confusion about whether they support ECC RAM and to what degree. One of the sources of confusion and imprecision is that there are a number of different possible meanings of 'supporting ECC RAM'. So let's run down the hierarchy:

  1. The system will power up and run with ECC RAM modules installed.
  2. Single-bit errors will (always) be corrected.
  3. Corrected single-bit errors will be reported and logged, so you can know that you have a problem.
  4. Double-bit errors will be detected, reported and logged, so you at least know when they've happened even though ECC can't fix them.
  5. Double-bit errors will fault and panic the system, rather than it continuing on with known memory errors.
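(To make the difference between #2 and #4 concrete, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code in Python. It is an illustration only: it protects 4 data bits with an extended Hamming(8,4) code, whereas real ECC DIMMs typically protect much wider words (commonly 64 data bits plus 8 check bits) in hardware. The function names `encode` and `decode` are mine, not anything from a real memory controller.)

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Positions 1..7 form a Hamming(7,4) code (parity bits at 1, 2, 4);
    position 0 holds an overall parity bit over the other seven.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits;
    # it is zero for any valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 when the overall parity still checks out

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # An odd number of bits flipped; assume exactly one. The syndrome
        # names the bad position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but consistent overall parity: two bits
        # flipped. We can tell something is wrong but not what to fix.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble
```

(Flipping any one bit of `encode(n)` decodes back to `n` with status 'corrected', which is level #2. Flipping any two bits yields 'uncorrectable', which is the detection that levels #4 and #5 depend on. What the system then does with that status, whether silently continuing, logging, or panicking, is exactly what the hierarchy above is about.)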

When server-class systems are said to 'support ECC RAM', people mean that they go all the way up to at least #4 and often #5. People who buy servers would be very unhappy if you sold them one that was claimed to support ECC but you merely meant 'works with ECC RAM' or 'silently corrects single-bit errors'; this is not what they expect and want, even if 'silently corrects single-bit errors' means that ECC is doing something to help system reliability.

(With that said, correcting single-bit errors is not nothing, since single-bit errors are expected to be the majority of RAM errors. And if you believe that your RAM is good in general and it's just being hit by stray cosmic rays and other random things, not having reports is not a big issue because they probably wouldn't be telling you anything actionable. But server people really don't like to make those assumptions; they want reports so that if errors are frequent or not random, they can see.)
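(On Linux, one place where reports at levels #3 and #4 can show up is the kernel's EDAC subsystem, which exposes per-memory-controller counters in sysfs. The following is a minimal sketch that reads them, assuming an EDAC driver for your memory controller is loaded; the helper name `read_edac_counts` is hypothetical. If the directory doesn't exist or contains no controllers, that tells you nothing about whether errors are being corrected, only that nothing is reporting them through this interface.)

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {'corrected': n, 'uncorrected': n}} from EDAC sysfs.

    Returns an empty dict if no EDAC memory controllers are visible.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers visible; no error reports available")
    for name, c in found.items():
        print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
```

(A corrected count that keeps climbing is the sort of frequent or non-random pattern that server people want to be able to see. A count that stays at zero is ambiguous by itself, since it could mean either no errors or no reporting.)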

I think it's safe to say that people who specifically want ECC on non-server systems consider #2 to be the bare minimum. If the system lacks it and only 'supports' ECC in the sense of running with ECC RAM modules, you're basically paying extra for your RAM for nothing. A fair number of people would probably be reluctantly satisfied with #2 alone, but I believe most people want at least #4 (where all errors are logged and correctable errors are fixed). Whether you want your desktop to reboot out from underneath you on an uncorrectable ECC error is likely something that opinions vary on.

In the absence of clear statements about what 'supporting ECC RAM' means in a non-server context (and perhaps even in a server one), people who want more than just the first level of (nominal) support are left with a great deal of uncertainty. As far as I know, this has been and continues to be the situation with AMD Ryzens and ECC RAM support; no one is prepared to officially make a clear statement about it, and without official statements we don't know what's guaranteed and what's not. For example, it's possible that there are microcode or chipset issues which mean that ECC error detection and correction isn't reliable.

(Some people have done testing with Ryzens, but that just shows what happens some of the time, under some test situations. For example, the fact that some single-bit errors are detected, corrected, and logged doesn't mean that all of them are.)


Comments on this page:

Interesting. I never want #5, because if the error happened on an access from userspace (or from a VM), I want the kernel/hypervisor to kill only that process/VM, and mark that page as bad, and then log loudly.

Last modified: Wed Nov 8 00:44:25 2017