Getting some information about the NUMA memory hierarchy of your server
If you have more than one CPU socket in a server, it almost certainly has non-uniform memory access, where some memory is 'closer' (faster to access) to some CPUs than others. You can also have NUMA even in single socket machines, depending on how things are implemented internally. This raises the question of how you can find out information about the NUMA memory hierarchy of your machines, because sometimes it matters.
The simple way of finding out how many NUMA zones you have is probably lscpu, in the 'NUMA nodeN ..' section; this will also tell you what logical CPUs are in what NUMA zones. A typical output from a machine with many NUMA zones is:
    NUMA node0 CPU(s): 0-7
    NUMA node1 CPU(s): 8-15
    NUMA node2 CPU(s): 16-23
    NUMA node3 CPU(s): 24-31
    NUMA node4 CPU(s): 32-39
    NUMA node5 CPU(s): 40-47
    NUMA node6 CPU(s): 48-55
    NUMA node7 CPU(s): 56-63
CPU numbers need not be contiguous. Another one of our machines reports:
    NUMA node0 CPU(s): 0-7,16-23
    NUMA node1 CPU(s): 8-15,24-31
This generally means that you have some hyperthreading in action.
You can check this by looking at 'lscpu -e' output, which here reports that CPU 0 and CPU 16 are on the same node, socket, and core.
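The CPU lists in those 'NUMA nodeN CPU(s)' lines use a compact range notation, which is awkward if you want to ask something like "which node is CPU 16 on?" in a script. Here is a minimal sketch in Python that expands such a list; parse_cpu_list is a name made up for this illustration, and it assumes the list only ever contains plain numbers and 'first-last' ranges separated by commas, as in the output above.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as '0-7,16-23' into individual CPU numbers."""
    cpus = []
    for part in spec.strip().split(","):
        if not part:
            continue  # tolerate an empty list or a stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


# The two CPU lists reported by the second machine above.
nodes = {
    0: parse_cpu_list("0-7,16-23"),
    1: parse_cpu_list("8-15,24-31"),
}
for node, cpus in nodes.items():
    print("node", node, "has", len(cpus), "logical CPUs:", cpus)
```

With the lists expanded, checking that CPU 0 and CPU 16 are both on node 0 is a simple membership test.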
Another way to get this information turns out to be 'numactl -H'. This not only reports nodes and the CPUs attached to them, it also reports the total memory attached to each node, the free memory for each node, and the big piece of information, 'node distances', which tell you how relatively costly it is to get to one node's memory from another NUMA node. This comes out in a nice table form, so let me show you:
    node distances:
    node   0   1   2   3   4   5   6   7
      0:  10  14  23  23  27  27  27  27
      1:  14  10  23  23  27  27  27  27
      2:  23  23  10  14  27  27  27  27
      3:  23  23  14  10  27  27  27  27
      4:  27  27  27  27  10  14  23  23
      5:  27  27  27  27  14  10  23  23
      6:  27  27  27  27  23  23  10  14
      7:  27  27  27  27  23  23  14  10
And here's the same information for the server with only two NUMA zones:
    node distances:
    node   0   1
      0:  10  21
      1:  21  10
The second server has a simple setup that creates a simple NUMA hierarchy; it's a two-socket server using Intel Xeon E5-2680 CPUs. The first server is eight Xeon X6550 CPUs (apparently we turned hyperthreading off on it), organized in two physically separate blocks of four CPUs. Within the same block, a CPU has one close sibling (relative cost 14) and two further away CPUs (cost 23). All cross-block access is fairly costly but uniformly so, with a relative cost of 27 for access to each NUMA node's memory.
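If you want to work with the distance table in a script instead of reading it by eye, it is easy enough to pull out of saved 'numactl -H' output. The following is a rough Python sketch, not anything numactl itself provides; parse_node_distances is a made-up name, and it assumes the table has the layout of a 'node distances:' line, then a header row of node numbers, then one 'N:' row per node.

```python
def parse_node_distances(text):
    """Return {from_node: {to_node: distance}} from 'numactl -H' style text."""
    lines = text.splitlines()
    # Find the 'node distances:' line; the table follows it.
    start = next(i for i, line in enumerate(lines)
                 if line.strip().startswith("node distances"))
    header = lines[start + 1].split()          # e.g. ['node', '0', '1']
    columns = [int(n) for n in header[1:]]
    table = {}
    for line in lines[start + 2:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            break                              # end of the table
        row = int(fields[0].rstrip(":"))
        table[row] = dict(zip(columns, (int(d) for d in fields[1:])))
    return table


# The table from the two-socket server.
SAMPLE = """\
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""

distances = parse_node_distances(SAMPLE)
for src, row in distances.items():
    remote = {dst: d for dst, d in row.items() if dst != src}
    print("node", src, "local:", row[src], "remote:", remote)
```

Fed the eight-node table instead, the same function makes it simple to group nodes by distance and see the 14, 23, and 27 tiers described above.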
(Note that you can have multiple NUMA zones within the same socket, and reported relative costs that aren't socket dependent. We have one server with two Opteron CPUs and four NUMA nodes, two for each socket. The reported cross-node relative cost is a uniform 20.)
The master source for this information appears to be in /sys, specifically under /sys/devices/system/node. The nodeN/distance file there gives essentially one row of the node distances, while nodeN/meminfo has per-node memory usage information that's basically a per-node version of /proc/meminfo. There's also nodeN/vmstat, which is per-node VM system statistics.
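Since the nodeN/distance files are just whitespace-separated numbers, you can rebuild the distance table directly from /sys without numactl. This is a small Python sketch under a couple of assumptions: read_sysfs_distances is a name invented here, the base directory is a parameter purely so the function can be pointed at a copy of the tree, and I'm assuming each row lists distances in node-number order, which may not hold if node numbering has gaps.

```python
import glob
import os
import re


def read_sysfs_distances(base="/sys/devices/system/node"):
    """Return {node: [distances...]} read from the nodeN/distance files."""
    result = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue  # not a nodeN directory
        with open(os.path.join(path, "distance")) as f:
            result[int(match.group(1))] = [int(d) for d in f.read().split()]
    return dict(sorted(result.items()))


if __name__ == "__main__":
    # Prints nothing on a machine without the sysfs node directory.
    for node, row in read_sysfs_distances().items():
        print("node", node, "distances:", row)
```

On a machine without /sys/devices/system/node the function simply returns an empty dictionary.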
For a given process, you can see some information about which nodes it has allocated memory on by looking at /proc/<pid>/numa_maps. Part of the information will be reported as 'N0=65 N1=28', which means that this process has 65 pages from node 0 and 28 from node 1.
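Each mapping in numa_maps gets its own line with its own 'N<node>=<pages>' fields, so to get a per-process picture you need to add them up. Here is a hedged Python sketch of doing that; pages_per_node is a made-up name, the sample lines are invented for illustration apart from the 'N0=65 N1=28' fields quoted above, and the totals simply count pages without accounting for mappings that use different page sizes.

```python
import re
from collections import Counter

NODE_FIELD = re.compile(r"N(\d+)=(\d+)")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over every mapping line."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = NODE_FIELD.fullmatch(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


# Invented sample lines; only the N0=65 N1=28 fields come from the text above.
SAMPLE = """\
00400000 default file=/usr/bin/example mapped=93 N0=65 N1=28
7ffd1000 default stack anon=4 dirty=4 N0=4
"""

print(pages_per_node(SAMPLE))

# For a real process (needs permission to read its /proc entry):
#   with open("/proc/self/numa_maps") as f:
#       print(pages_per_node(f.read()))
```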
A massive amount of global memory state information is available in /proc/zoneinfo, and a breakdown of free page information is in /proc/buddyinfo; for more discussion of what that means, see my entry on how the Linux kernel divides up your RAM. There's also /proc/pagetypeinfo for yet more NUMA node related information.
(As far as I know, the 'node distances' are only meaningful as relative numbers and don't mean anything in absolute terms. As such I interpret the '10' that's used for a node's own memory as basically '1.0 multiplied by ten'. Presumably it's not 100 because you don't need that much precision in differences.)
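Under that reading, turning a row of node distances into relative costs is just a division by the node's own distance. A tiny Python sketch, where relative_costs is a made-up name and the interpretation as a cost multiplier is my assumption rather than anything the kernel promises:

```python
def relative_costs(row, node):
    """Scale one row of node distances so the node's own memory is 1.0."""
    local = row[node]
    return [distance / local for distance in row]


# Row 0 of the two-node table and row 0 of the eight-node table above.
print(relative_costs([10, 21], 0))
print(relative_costs([10, 14, 23, 23, 27, 27, 27, 27], 0))
```

So on the two-socket server, remote memory is reported as 2.1 times the cost of local memory, for whatever that ratio is actually worth.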