2017-12-21
Checking RAM DIMM information from inside Linux
Suppose, not entirely hypothetically, that you have some machines where you don't know exactly what their DIMMs are and how they're set up, and you'd like to. Obviously you can find out all of this information if you take the machine down, open it up, and inventory the DIMMs, but fortunately for your uptime you can extract a surprisingly large amount of information from within Linux without having to go that far.
If you have a NUMA machine, you can get a bunch of information about the NUMA memory hierarchy. This doesn't directly give you DIMM-level information, but may be necessary in order to figure out how your DIMMs are split up among sockets, NUMA zones, and so on.
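As a rough illustration of the NUMA side, here is a minimal Python sketch that reads how much memory the kernel attributes to each NUMA node. It assumes the usual sysfs layout of /sys/devices/system/node/nodeN/meminfo with lines of the form 'Node 0 MemTotal: 32836524 kB'; the function names are made up for this example. It tells you how memory is split across nodes, not which DIMMs are behind each node.

```python
import glob
import os
import re

# Per-node meminfo lines are assumed to look like:
#   "Node 0 MemTotal:       32836524 kB"
MEMTOTAL_RE = re.compile(r"^Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB", re.MULTILINE)


def parse_node_memtotal(text):
    """Return (node number, MemTotal in kB) from one node's meminfo text, or None."""
    m = MEMTOTAL_RE.search(text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def numa_memory(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node number -> MemTotal in kB.

    Returns an empty dict if the sysfs tree is missing (for example on a
    non-Linux system or a kernel built without NUMA support).
    """
    totals = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "meminfo")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            parsed = parse_node_memtotal(f.read())
        if parsed is not None:
            node, kb = parsed
            totals[node] = kb
    return totals
```

Calling numa_memory() on a two-socket server should give you one entry per node, which you can then compare against the per-socket DIMM sizes that the tools below report.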
The default first stop is often dmidecode, which interprets DMI/SMBIOS information that's set up by the BIOS before Linux is booted. The BIOS pulls this information from magical sources, but it's usually accurate. You get the DIMM information with 'dmidecode --type memory', but what information and fields you get can vary a lot from system to system. The actual DIMMs are 'Memory Device' entries, and may come out like this:
    Handle 0x001E, DMI type 17, 34 bytes
    Memory Device
        Array Handle: 0x001D
        Error Information Handle: 0x002E
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_A1
        Bank Locator: CPU1
        Type: DDR3
        Type Detail: Synchronous Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: CE00B304CE00
        Serial Number: 34BD54B9
        Asset Tag: 02411221
        Part Number: M393B1K70DH0-CK0
        Rank: 2
        Configured Clock Speed: 1600 MHz
Now, here is a tricky question: is this an ECC DIMM, and is it being used in an ECC capable system? Your inclination may be to say it clearly isn't, since the total width is only 64 bits. Unfortunately, if you search for the part number on the Internet, you'll discover that it's Samsung ECC DDR3 memory, and the server is a Dell C6220 blade that is definitely ECC capable. Perhaps something has gone wrong, but my default assumption is that there's ECC, it's just that the SMBIOS information isn't reporting it for some reason.
(dmidecode will report a section on 'Physical Memory Array' that includes an 'Error Correction Type' field, but it's apparently not clear whether this represents the maximum capabilities or the current realities. PC vendors being PC vendors, it probably varies, especially on desktop systems.)
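Since both the 'Memory Device' and the 'Physical Memory Array' sections come out in the same 'Field: value' shape, they're easy to turn into something you can process across many machines. The following is a minimal Python sketch, not a complete dmidecode parser: it assumes each record starts with a 'Handle ...' line, is followed by a section title, and ends at a blank line. The function names and the '_handle' and '_section' keys are invented for this example, and list-valued fields are simply skipped.

```python
import subprocess


def run_dmidecode_memory():
    """Run 'dmidecode --type memory' and return its text output.

    This normally needs root; it raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["dmidecode", "--type", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_dmidecode(text):
    """Split dmidecode text output into a list of dicts, one per record."""
    records = []
    current = None
    expect_title = False
    for line in text.splitlines():
        if line.startswith("Handle "):
            # e.g. "Handle 0x001E, DMI type 17, 34 bytes"
            current = {"_handle": line.split(",")[0].split()[1]}
            records.append(current)
            expect_title = True
        elif not line.strip():
            # A blank line ends the current record.
            current = None
            expect_title = False
        elif current is not None and expect_title:
            # First line after the handle is the section title.
            current["_section"] = line.strip()
            expect_title = False
        elif current is not None and line[:1].isspace():
            key, sep, value = line.strip().partition(":")
            # Skip list headers ("Foo:") and list items (no colon).
            if sep and value.strip():
                current[key.strip()] = value.strip()
    return records


def memory_devices(records):
    """Keep only the 'Memory Device' records, i.e. the DIMM slots."""
    return [r for r in records if r.get("_section") == "Memory Device"]
```

With root access, memory_devices(parse_dmidecode(run_dmidecode_memory())) should give one dict per DIMM slot, with keys such as 'Locator', 'Size', and 'Part Number' whenever the BIOS supplies them.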
Having looked at a number of our servers, my conclusion is that if dmidecode reports a 'Total Width' larger than the 'Data Width' (typically 72 and 64), you can definitely conclude that you have ECC DIMMs. If it also reports that ECC is enabled in the 'Physical Memory Array' section, ECC is almost certainly on. Otherwise, who knows, short of the kernel complaining about ECC problems.
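That rule of thumb is simple enough to write down as code. Here is a small Python sketch of it, assuming you already have the dmidecode fields for one 'Memory Device' and (optionally) its 'Physical Memory Array' as plain dicts of strings. The function names and the wording of the verdicts are mine, and the check for 'ECC' in the 'Error Correction Type' value assumes dmidecode's usual strings such as 'Single-bit ECC' and 'Multi-bit ECC'.

```python
def width_bits(value):
    """Turn a dmidecode width such as '72 bits' into an int, or None if unknown."""
    parts = value.split()
    if len(parts) == 2 and parts[0].isdigit() and parts[1] == "bits":
        return int(parts[0])
    return None


def ecc_verdict(device, array=None):
    """Apply the width heuristic to one Memory Device record.

    device: dict with 'Total Width' and 'Data Width' strings.
    array:  optional dict with an 'Error Correction Type' string.
    """
    total = width_bits(device.get("Total Width", ""))
    data = width_bits(device.get("Data Width", ""))
    dimm_is_ecc = total is not None and data is not None and total > data

    correction = (array or {}).get("Error Correction Type", "")
    array_says_ecc = "ECC" in correction

    if dimm_is_ecc and array_says_ecc:
        return "ECC DIMM, ECC almost certainly on"
    if dimm_is_ecc:
        return "ECC DIMM, ECC state unknown"
    # Equal widths prove nothing: the BIOS may just not be reporting ECC.
    return "unknown"
```

Note that the sketch deliberately returns 'unknown' rather than 'not ECC' when the widths are equal, because a machine can report 64 and 64 and still have ECC memory.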
The same information can be obtained through 'lshw -C memory'. This is somewhat more compact and can sometimes decode more things. For example, for the same DIMM, it reports:
    *-bank:0
         description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
         product: M393B1K70DH0-CK0
         vendor: Samsung
         physical id: 0
         serial: 34BD54B9
         slot: DIMM_A1
         size: 8GiB
         width: 64 bits
         clock: 1600MHz (0.6ns)
Here the vendor is correctly reported as Samsung, but we've lost the 'bank locator' that in this case tells us which CPU socket the DIMM is attached to. Lshw gets its DIMM information from the DMI/SMBIOS data; it just prints it differently than dmidecode does.
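If you would rather process lshw's view than read it, lshw can also emit JSON with its '-json' option. The Python sketch below walks such output and picks out the DIMM banks. It rests on several assumptions that you should check against your own lshw version: that nodes are nested under a 'children' key, that DIMM nodes have an 'id' starting with 'bank', and that fields such as 'slot', 'vendor', 'product', 'serial', and 'size' use those names. The top level has been either a single object or a list depending on the version, so the sketch accepts both. The function names are invented for this example.

```python
import json


def load_lshw_json(text):
    """Parse the text produced by something like 'lshw -json -C memory'."""
    return json.loads(text)


def iter_nodes(node):
    """Yield every dict in the tree, following 'children' lists."""
    if isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)
    elif isinstance(node, dict):
        yield node
        yield from iter_nodes(node.get("children", []))


def lshw_banks(parsed):
    """Return one small dict per node whose id looks like 'bank:N'."""
    wanted = ("id", "slot", "vendor", "product", "serial", "size")
    banks = []
    for node in iter_nodes(parsed):
        if str(node.get("id", "")).startswith("bank"):
            banks.append({key: node.get(key) for key in wanted})
    return banks
```

Missing fields come back as None instead of raising an error, which seems like the right default given how much this data varies from system to system.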
On some servers, the IPMI system may provide some degree of access to DIMM information, generally under some sort of 'asset management' tag. It's possible that you can get at this information with things like ipmitool, but you may need to talk to the BMC in another way, for example through a web browser to the BMC's web interface (if it has one).
I wish I had better news to report, but as far as I know that's it for finding out DIMM information. You can at least get basic information, which is good enough to answer questions like 'are all the DIMM slots filled on this server' or 'where are all our 8 GB DIMMs', and I think things like the speed information and the part numbers are broadly trustworthy (the speed information probably somewhat more than the part numbers, because less probably breaks if the DIMMs report crazy part numbers to the BIOS and the usual aphorism applies).
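As an illustration of answering those two questions, here is a small Python sketch that works on a list of 'Memory Device' records represented as dicts of dmidecode fields. It assumes that an empty slot reports a 'Size' that doesn't start with a digit (dmidecode commonly prints 'No Module Installed'), and that you match sizes using the exact string your dmidecode version prints, which may be '8192 MB' on one machine and '8 GB' on another. The function names are hypothetical.

```python
def slot_summary(devices):
    """Split Memory Device records into (filled, empty) lists of locators."""
    filled, empty = [], []
    for dev in devices:
        size = dev.get("Size", "")
        locator = dev.get("Locator", "?")
        # A populated slot reports a size like "8192 MB"; an empty one
        # reports text such as "No Module Installed".
        if size[:1].isdigit():
            filled.append(locator)
        else:
            empty.append(locator)
    return filled, empty


def dimms_of_size(devices, size):
    """Return the locators of all DIMMs whose 'Size' string equals size."""
    return [dev.get("Locator", "?") for dev in devices if dev.get("Size") == size]
```

If slot_summary() returns an empty second list, all the DIMM slots the BIOS knows about are filled.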