Getting some information about the NUMA memory hierarchy of your server

November 19, 2017

If you have more than one CPU socket in a server, it almost certainly has non-uniform memory access, where some memory is 'closer' (faster to access) to some CPUs than others. You can also have NUMA even in single socket machines, depending on how things are implemented internally. This raises the question of how you can find out information about the NUMA memory hierarchy of your machines, because sometimes it matters.

The simplest way to find out how many NUMA zones you have is probably lscpu, in its 'NUMA nodeN CPU(s)' lines; these also tell you which logical CPUs are in which NUMA zone. Typical output from a machine with many zones is:

NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63

CPU numbers need not be contiguous. Another one of our machines reports:

NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

This generally means that you have some hyperthreading in action. You can check this by looking at 'lscpu -e' output, which here reports that CPU 0 and CPU 16 are on the same node, socket, and core.
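If you want this mapping in a script instead of by eye, the 'NUMA nodeN CPU(s)' lines are easy to pick apart. What follows is a minimal sketch, not anything lscpu itself provides; the function names parse_cpu_list and parse_lscpu_numa are made up for illustration, and it assumes the output looks like the examples above.

```python
def parse_cpu_list(text):
    """Expand a CPU list such as '0-7,16-23' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def parse_lscpu_numa(output):
    """Map NUMA node number -> list of logical CPU numbers (hypothetical helper)."""
    nodes = {}
    for line in output.splitlines():
        # Lines of interest look like 'NUMA node0 CPU(s):     0-7,16-23'.
        # The 'NUMA node(s):' summary line has no 'CPU(s):' and is skipped.
        if line.startswith("NUMA node") and "CPU(s):" in line:
            label, cpulist = line.split("CPU(s):", 1)
            node = int(label.split()[1][len("node"):])
            nodes[node] = parse_cpu_list(cpulist)
    return nodes
```

You would feed it the text of lscpu's output, for example from subprocess.run(["lscpu"], capture_output=True, text=True).stdout.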

Another way to get this information turns out to be 'numactl -H'. This not only reports nodes and the CPUs attached to them, it also reports the total memory attached to each node, the free memory for each node, and the big piece of information, 'node distances', which tell you how relatively costly it is to get to one node's memory from another NUMA node. This comes out in a nice table form, so let me show you:

node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  14  23  23  27  27  27  27 
  1:  14  10  23  23  27  27  27  27 
  2:  23  23  10  14  27  27  27  27 
  3:  23  23  14  10  27  27  27  27 
  4:  27  27  27  27  10  14  23  23 
  5:  27  27  27  27  14  10  23  23 
  6:  27  27  27  27  23  23  10  14 
  7:  27  27  27  27  23  23  14  10 

And here's the same information for the server with only two NUMA zones:

node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

The second server has a simple setup that creates a simple NUMA hierarchy; it's a two-socket server using Intel Xeon E5-2680 CPUs. The first server has eight Xeon X6550 CPUs (apparently we turned hyperthreading off on it), organized in two physically separate blocks of four CPUs. Within a block, each CPU has one close sibling (relative cost 14) and two more distant CPUs (cost 23). All cross-block access is fairly costly but uniformly so, with a relative cost of 27 for access to each NUMA node's memory.
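The 'close sibling, further siblings, other block' structure can also be pulled out of the table mechanically. This is a rough sketch under the assumption that the 'node distances:' section of 'numactl -H' looks exactly like the tables shown here; parse_node_distances and group_by_distance are illustrative names of my own, not part of any numactl interface.

```python
def parse_node_distances(output):
    """Return {node: {other_node: distance}} from 'numactl -H' style text."""
    lines = output.splitlines()
    # Find the 'node distances:' marker; the header row comes right after it.
    # (Raises StopIteration if the marker is missing.)
    start = next(i for i, l in enumerate(lines) if l.strip() == "node distances:")
    header = [int(n) for n in lines[start + 1].split()[1:]]
    table = {}
    for line in lines[start + 2:]:
        if ":" not in line:
            break
        label, values = line.split(":", 1)
        table[int(label)] = dict(zip(header, (int(v) for v in values.split())))
    return table


def group_by_distance(table, node):
    """Group the other nodes by their distance from 'node', nearest first."""
    groups = {}
    for other, dist in table[node].items():
        if other != node:
            groups.setdefault(dist, []).append(other)
    return dict(sorted(groups.items()))
```

On the eight-node table above, grouping from node 0 should give one node at distance 14, two at 23, and four at 27, which matches the two-blocks-of-four layout.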

(Note that you can have multiple NUMA zones within the same socket, and reported relative costs that aren't socket dependent. We have one server with two Opteron CPUs and four NUMA nodes, two for each socket. The reported cross-node relative cost is a uniform 20.)

The master source for this information appears to be in /sys, specifically under /sys/devices/system/node. Each nodeN/distance file there gives essentially one row of the node distance table, while nodeN/meminfo has per-node memory usage information that's basically a per-node version of /proc/meminfo. There's also nodeN/vmstat, which has per-node VM system statistics.
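Since each nodeN/distance file is one row, you can rebuild the whole table without numactl. Here is a minimal sketch; read_sysfs_distances is a name I made up, and it assumes each distance file is a single line of whitespace-separated integers (the root directory is a parameter so you can point it at a copy of the tree).

```python
import os
import re


def read_sysfs_distances(root="/sys/devices/system/node"):
    """Return {node: [distances...]} built from each nodeN/distance file."""
    matrix = {}
    for entry in os.listdir(root):
        # Only directories named node0, node1, ... are NUMA nodes; the
        # directory also holds other files that we skip.
        m = re.fullmatch(r"node(\d+)", entry)
        if not m:
            continue
        with open(os.path.join(root, entry, "distance")) as f:
            matrix[int(m.group(1))] = [int(v) for v in f.read().split()]
    return dict(sorted(matrix.items()))
```

On the two-zone server I would expect this to come back as {0: [10, 21], 1: [21, 10]}, going by the numactl output above.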

For a given process, you can see some information about which nodes it has allocated memory on by looking at /proc/<pid>/numa_maps. Part of the information will be reported as 'N0=65 N1=28', which means that this process has 65 pages from node 0 and 28 from node 1.
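To get a per-node total for a whole process, you can add up the 'N<node>=<pages>' fields across all lines of numa_maps. This is only a sketch: pages_per_node is a hypothetical helper, and it counts pages without accounting for mappings that use different page sizes (such as huge pages), so treat the totals as approximate.

```python
import re
from collections import Counter

# Matches fields such as 'N0=65' and captures the node number and page count.
_NODE_PAGES = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over every mapping in numa_maps text."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for field in line.split():
            m = _NODE_PAGES.match(field)
            if m:
                totals[int(m.group(1))] += int(m.group(2))
    return dict(totals)
```

You would call it on the contents of /proc/<pid>/numa_maps, which you generally need to own the process or be root to read.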

A massive amount of global memory state information is available in /proc/zoneinfo, and a breakdown of free page information is in /proc/buddyinfo; for more discussion of what that means, see my entry on how the Linux kernel divides up your RAM. There's also /proc/pagetypeinfo for yet more NUMA node related information.
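As one small example of using these files, /proc/buddyinfo can be boiled down to free pages per NUMA node. The sketch below assumes the usual line layout of 'Node N, zone NAME' followed by one count per allocation order, where the column for order i counts free blocks of 2**i pages; free_pages_per_node is an illustrative name, not an existing tool.

```python
def free_pages_per_node(buddyinfo_text):
    """Return {node: free pages}, summed over all zones of each node."""
    totals = {}
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        counts = [int(c) for c in fields[4:]]
        # Column i counts free blocks of 2**i contiguous pages.
        pages = sum(count << order for order, count in enumerate(counts))
        totals[node] = totals.get(node, 0) + pages
    return totals
```

The result is in pages, not bytes; multiply by the system page size if you want to compare it with the per-node free memory that 'numactl -H' reports.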

(As far as I know, the 'node distances' are only meaningful as relative numbers and don't mean anything in absolute terms. As such I interpret the '10' that's used for a node's own memory as basically '1.0 multiplied by ten'. Presumably it's not 100 because you don't need that much precision in differences.)

Written on 19 November 2017.

Last modified: Sun Nov 19 02:21:09 2017