The cost of memory access across a NUMA machine can (probably) matter

November 30, 2017

We recently had an interesting performance issue reported to us by a researcher here. We have a number of compute machines, none of them terribly recent; some of them are general access and some of them can be booked for exclusive usage. The researcher had a single-core job (I believe using R) that used 50 GB or more of RAM. They first did some computing on a general-access compute server with Xeon E5-2680s and 96 GB of RAM, then booked one of our other servers with Xeon X6550s and 256 GB of RAM to do more work on (possibly work that consumed significantly more RAM). Unfortunately they discovered that the server they'd booked was massively slower for their job, despite having much more memory.

We don't know for sure what was going on, but our leading theory is NUMA memory access effects, because the two servers have significantly different NUMA memory hierarchies. In fact they are the two example servers from my entry on getting NUMA information from Linux. The general-access server had two sockets, for 48 GB of RAM per socket, while the bookable compute server with 256 GB of RAM had eight sockets and so only 32 GB of RAM per socket. To add to the pain, the high-memory server also appears to have a higher relative cost for access to the memory of almost all of the other sockets. So on the 256 GB machine, memory access was likely going to other NUMA nodes significantly more frequently, and each of those accesses was slower to boot.
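To put rough numbers on this, here is a back-of-the-envelope sketch of the best case for a 50 GB single-core job on each machine. It assumes (optimistically) that the job gets all of one node's RAM to itself and that everything beyond that has to live on other nodes; the function name and the 50 GB figure are just illustrative, and real placement depends on what else is using the machine.

```python
def min_remote_fraction(job_gb: float, node_gb: float) -> float:
    """Smallest possible fraction of a job's memory that must sit on
    remote NUMA nodes, assuming the local node is otherwise empty."""
    if job_gb <= 0:
        raise ValueError("job_gb must be positive")
    remote_gb = max(0.0, job_gb - node_gb)
    return remote_gb / job_gb


# Per-node RAM, from total RAM divided by socket count.
machines = {
    "2-socket, 96 GB": 96 / 2,
    "8-socket, 256 GB": 256 / 8,
}

for name, node_gb in machines.items():
    frac = min_remote_fraction(50, node_gb)
    print(f"{name}: {node_gb:.0f} GB per node, "
          f"at least {frac:.0%} of a 50 GB job is remote")
```

Under these assumptions, at least 2 GB of 50 GB (4%) is remote on the two-socket machine and at least 18 GB (36%) on the eight-socket one. This says nothing about which pages are hot, so it is only a lower bound on how much remote memory there is, not a prediction of the slowdown.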

Having said that, I just investigated and there's another difference; the 96 GB machine has DDR3 RAM at 1600 MHz, while the 256 GB machine has DDR3 RAM at 1333 MHz (yes, they're old machines). This may well have contributed to any RAM-related slowdown and makes me glad that I checked; I don't usually even consider RAM module speeds, but if we think there's a RAM-related performance issue it's another thing to consider.
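For a sense of scale, here is a small sketch of the theoretical peak bandwidth of one memory channel at each speed. It assumes the reported speeds are transfer rates in millions of transfers per second and that each channel is the standard 64 bits (8 bytes) wide; it ignores the number of channels, latency, and everything else that affects real throughput.

```python
def peak_channel_bandwidth_gbs(mega_transfers_per_s: float,
                               bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth of one memory channel in GB/s
    (decimal gigabytes), given its transfer rate in MT/s."""
    return mega_transfers_per_s * bus_bytes / 1000


fast = peak_channel_bandwidth_gbs(1600)  # the 96 GB machine
slow = peak_channel_bandwidth_gbs(1333)  # the 256 GB machine

print(f"1600: {fast:.1f} GB/s per channel")
print(f"1333: {slow:.1f} GB/s per channel")
print(f"ratio: {fast / slow:.2f}x")
```

That works out to about a 1.2x difference in peak per-channel bandwidth, which is real but much smaller than 'massively slower', so RAM speed alone seems unlikely to be the whole story.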

I found the whole experience to be interesting because it pointed out a blind spot in my usual thinking. Before the issue came up, I just assumed that a machine with more memory and more CPUs would be better, and if it wasn't better it would be because of CPU issues (here they're apparently generally comparable). That NUMA layout (and perhaps RAM speed) made the 'big' machine substantially worse was a surprise. I'm going to have to remember this for the future.

PS: The good news is that we had another two-socket E5-2680 machine with 256 GB that the researcher could use, and I believe they're happy with its performance. And with 128 GB of RAM per socket, they can fit even quite large R processes into a single socket's memory.
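If you want to check in advance whether a job of a given size could fit in one node on a particular Linux machine, a minimal sketch is to read the per-node `MemTotal` from `/sys/devices/system/node/node*/meminfo`. The helper names here are made up for illustration, and `MemTotal` is the node's total RAM, not what is currently free, so treat the answer as an upper bound.

```python
import glob
import re
from typing import Dict, List, Tuple

# Per-node meminfo lines look like: "Node 0 MemTotal:       49143212 kB"
_MEMTOTAL = re.compile(r"Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB")


def parse_node_memtotal_kb(meminfo_text: str) -> Tuple[int, int]:
    """Return (node number, MemTotal in kB) from one node's meminfo text."""
    m = _MEMTOTAL.search(meminfo_text)
    if m is None:
        raise ValueError("no MemTotal line found")
    return int(m.group(1)), int(m.group(2))


def read_node_sizes_kb(
    pattern: str = "/sys/devices/system/node/node[0-9]*/meminfo",
) -> Dict[int, int]:
    """Map each NUMA node to its total RAM in kB (empty if sysfs is absent)."""
    sizes: Dict[int, int] = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            node, kb = parse_node_memtotal_kb(f.read())
        sizes[node] = kb
    return sizes


def nodes_that_fit(sizes_kb: Dict[int, int], job_gb: float) -> List[int]:
    """Nodes whose total RAM is at least job_gb (counted in GiB)."""
    need_kb = job_gb * 1024 * 1024
    return sorted(n for n, kb in sizes_kb.items() if kb >= need_kb)


if __name__ == "__main__":
    sizes = read_node_sizes_kb()
    for node in sorted(sizes):
        print(f"node {node}: {sizes[node] / (1024 * 1024):.1f} GiB")
    print("nodes that could hold a 50 GB job:", nodes_that_fit(sizes, 50))
```

Even when a node is big enough, the job only benefits if it actually runs on that node's CPUs and its memory gets allocated there, which is a separate matter from this size check.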

Comments on this page:

Have you looked at the Intel memory latency checking tool to observe latency when the system is idle and busy?

I think Brendan Gregg also did a write up on using PMCs to troubleshoot memory-related issues.

By cks at 2017-12-14 11:07:07:

I wasn't aware of the Intel tool before now, but it looks interesting. When various researchers aren't using all of the interesting machines, I may run it on the various machines involved here and see what it says about their characteristics.

(And thanks for mentioning it.)

Last modified: Thu Nov 30 00:07:52 2017