Something I don't know: How server core count interacts with RAM latency

March 2, 2024

When I wrote about how the speed of improvement in servers may have slowed down, I didn't address CPU core counts, which is one area where the numbers have been going up significantly. Of course you have to keep those cores busy, but if you have a bunch of CPU-bound workloads, the increased core count is good for you. Well, it's good for you if your workload is genuinely CPU bound, which generally means it fits within per-core caches. One of the areas I don't know much about is how the increasing CPU core counts interact with RAM latency.

RAM latency (for random requests) has been relatively flat for a while (it's been flat in time, which means it's been going up in cycles as CPUs got faster). Total memory access latency has apparently been 90 to 100 nanoseconds for several memory generations (although the access time of the DDR5 memory module itself is apparently only part of this). Memory bandwidth has been going up steadily between DDR generations, so per-core bandwidth has gone up nicely, but this is only nice if you have the kind of sequential workloads that can take advantage of it. As far as I know, the kind of random access you get from things like pointer chasing depends purely on latency.
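
(As a quick illustration of 'flat in time, rising in cycles', here's a trivial sketch; the ~95 ns figure is just an assumed round number for illustration, not a measurement of anything in particular.)

    /* Illustrative arithmetic only: an assumed flat ~95 ns random access
       latency expressed in CPU cycles at various (hypothetical) clock rates. */
    #include <stdio.h>

    int main(void) {
        const double latency_ns = 95.0;               /* assumed, not measured */
        const double ghz[] = {2.0, 3.0, 4.0, 5.0};
        for (int i = 0; i < 4; i++) {
            /* GHz is cycles per nanosecond, so cycles = ns * GHz */
            printf("at %.1f GHz: %.0f ns is about %.0f cycles\n",
                   ghz[i], latency_ns, latency_ns * ghz[i]);
        }
        return 0;
    }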

(If total latency has been basically flat, this seems to imply that bandwidth improvements don't help random access much. Presumably they help for successive non-random reads, and my vague impression is that reading data from successive addresses in RAM is faster than reading from random addresses (and not just because RAM typically transfers an entire cache line to the CPU at once).)
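
(A back of the envelope sketch of why a random access is dominated by latency rather than transfer time; the DDR5-4800 transfer rate and the ~95 ns latency are assumptions for illustration, not measurements.)

    /* Back of the envelope only: how little of a random access's cost is
       the data transfer itself. DDR5-4800 and ~95 ns are assumptions. */
    #include <stdio.h>

    int main(void) {
        const double channel_gbs = 4800e6 * 8 / 1e9;    /* ~38.4 GB/s per channel */
        const double line_bytes  = 64.0;                /* one cache line */
        const double latency_ns  = 95.0;                /* assumed random access latency */
        double transfer_ns = line_bytes / channel_gbs;  /* GB/s is bytes per ns */
        printf("transferring a %.0f-byte line: ~%.1f ns\n", line_bytes, transfer_ns);
        printf("waiting for a random access:  ~%.0f ns\n", latency_ns);
        return 0;
    }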

So now we get to the big question: how many memory reads can you have in flight at once with modern DDR4 or DDR5 memory, especially on servers? Where the limit is presumably matters, because if you have a bunch of pointer-chasing workloads that are limited by 'memory latency' and you run them on a high core count system, at some point it seems they'll run out of simultaneous RAM read capacity. I've tried to do some reading and have come away confused, which may be partly because modern DRAM is a pretty complex thing.
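
One way I can imagine probing this (a sketch only, with all of the sizes being assumptions) is to walk several independent pointer chains at once and watch how the cost per load changes as you add chains; when adding chains stops being nearly free, you've run into some limit on how many reads can be in flight.

    /* A sketch of a memory level parallelism probe: walk K independent
       pointer chains in an interleaved loop. Each chain is a random cycle,
       so every load in a chain depends on the previous one, but loads from
       different chains can overlap. If going from 1 to 2 to 4 chains barely
       changes the time per load, the memory system is servicing that many
       reads in flight; when it stops scaling, you've hit some limit.
       All sizes here are assumptions; a real benchmark would pin the CPU,
       repeat runs, and so on. Compile with optimization. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    #define ELEMS (16 * 1024 * 1024)   /* per-chain working set, well past the caches */
    #define STEPS (1 * 1024 * 1024)    /* dependent loads per chain per run */
    #define MAXCHAINS 8

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Sattolo's algorithm: a random permutation that is a single cycle,
       so chasing it never falls into a short (cacheable) loop. */
    static uint32_t *make_chain(void) {
        uint32_t *c = malloc(ELEMS * sizeof *c);
        for (uint32_t i = 0; i < ELEMS; i++) c[i] = i;
        for (uint32_t i = ELEMS - 1; i > 0; i--) {
            uint32_t j = rand() % i;
            uint32_t t = c[i]; c[i] = c[j]; c[j] = t;
        }
        return c;
    }

    int main(void) {
        srand(1);
        uint32_t *chains[MAXCHAINS];
        for (int k = 0; k < MAXCHAINS; k++) chains[k] = make_chain();

        for (int nchains = 1; nchains <= MAXCHAINS; nchains *= 2) {
            uint32_t pos[MAXCHAINS] = {0};
            double t0 = now_sec();
            for (int s = 0; s < STEPS; s++)
                for (int k = 0; k < nchains; k++)
                    pos[k] = chains[k][pos[k]];   /* independent chains of dependent loads */
            double t1 = now_sec();

            uint32_t acc = 0;
            for (int k = 0; k < nchains; k++) acc += pos[k];
            volatile uint32_t sink = acc; (void)sink;   /* keep the loads live */

            double ns_per_load = (t1 - t0) * 1e9 / ((double)STEPS * nchains);
            printf("%d chain(s): %.1f ns per load\n", nchains, ns_per_load);
        }
        return 0;
    }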

(I believe that individual processors and multi-socket systems have some number of memory channels, each of which can be active simultaneously, and then there are memory ranks and memory banks. How many memory channels you have depends partly on the processor you're using (well, its memory controller) and partly on the motherboard design. For example, 4th generation AMD Epyc processors apparently support 12 memory channels, although not all of them may be populated in a given memory configuration. I think you need at least N (or maybe 2N) DIMMs to use N channels. And here's a look at AMD Zen4 memory stuff, which doesn't seem to say much about multi-core random access latency.)
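
(For scale, here's the usual back of the envelope arithmetic for what 12 channels could theoretically deliver; the DDR5-4800 speed and the 8-byte per-channel data path are assumptions for illustration.)

    /* Back of the envelope only: theoretical peak bandwidth from channel
       count. DDR5-4800, an 8-byte data path per channel, and 12 channels
       are all assumptions for illustration. */
    #include <stdio.h>

    int main(void) {
        const double transfers_per_s    = 4800e6;  /* DDR5-4800 */
        const double bytes_per_transfer = 8.0;
        const int channels = 12;
        double per_channel_gbs = transfers_per_s * bytes_per_transfer / 1e9;
        printf("per channel: %.1f GB/s; %d channels: %.1f GB/s peak\n",
               per_channel_gbs, channels, per_channel_gbs * channels);
        return 0;
    }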


Comments on this page:

NUMA has more of an impact. In my experience, PostgreSQL performance is most correlated with the STREAM benchmark, and Amazon AWS's biggest instances underperform some cheaper ones due to that NUMA penalty.

By Twirrim at 2024-03-03 21:26:14:

NUMA is your biggest concern when it comes to RAM latency, and with increasing core counts, it's only going to get worse. NUMA has a lot of quirks to it that can dramatically influence performance.

Without going too deep into the subject: the cores in your system are grouped together into NUMA nodes. Each node is directly attached to a particular subset of memory and indirectly attached to the rest via the other nodes, paying the penalty of an extra hop between it and that memory. That extra hop adds noticeable latency to every request.
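
A minimal sketch of measuring that extra hop, assuming libnuma is available and the machine has at least two NUMA nodes (the node numbers and sizes here are illustrative assumptions):

    /* A sketch of the local versus remote penalty using libnuma (link with
       -lnuma). It runs on node 0 and pointer chases through memory placed
       on node 0 and then on node 1. Node numbers, sizes, and the presence
       of at least two nodes are assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>
    #include <numa.h>

    #define ELEMS (16 * 1024 * 1024)
    #define STEPS (8 * 1024 * 1024)

    /* Build a single random cycle (Sattolo's algorithm) in the buffer and
       time a dependent pointer chase through it, in ns per load. */
    static double chase_ns(uint32_t *chain) {
        for (uint32_t i = 0; i < ELEMS; i++) chain[i] = i;
        for (uint32_t i = ELEMS - 1; i > 0; i--) {
            uint32_t j = rand() % i;
            uint32_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
        }
        struct timespec a, b;
        uint32_t pos = 0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int s = 0; s < STEPS; s++) pos = chain[pos];
        clock_gettime(CLOCK_MONOTONIC, &b);
        volatile uint32_t sink = pos; (void)sink;
        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        return secs * 1e9 / STEPS;
    }

    int main(void) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
        numa_run_on_node(0);   /* stay on a core in node 0 */
        uint32_t *local  = numa_alloc_onnode(ELEMS * sizeof(uint32_t), 0);
        uint32_t *remote = numa_alloc_onnode(ELEMS * sizeof(uint32_t), 1);
        if (!local || !remote) { fprintf(stderr, "allocation failed\n"); return 1; }
        srand(1);
        printf("node 0 (local):  %.1f ns/load\n", chase_ns(local));
        printf("node 1 (remote): %.1f ns/load\n", chase_ns(remote));
        numa_free(local,  ELEMS * sizeof(uint32_t));
        numa_free(remote, ELEMS * sizeof(uint32_t));
        return 0;
    }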

It can have a really significant impact. For example, Oracle has been exploring having kernel text replicated into each NUMA domain on arm64 (which in server-class chips tends to be even more "NUMA"-ish), https://lwn.net/Articles/956900/: "[the patches] show a gain of between 6% and 17% for database-centric like workloads. When combined with userspace awareness of NUMA, this can result in a gain of over 50%." Having to reach across to the other NUMA node to get at the kernel's executable code turns out to be an expensive and common operation.

It only gets worse from there; for example, the Linux page cache isn't fully NUMA aware. I know of someone who tripped over this while benchmarking NUMA nodes. They thought the system they were benchmarking had two NUMA nodes with very different performance. In reality, it turned out the MySQL client library had been cached in one NUMA node's memory during the previous benchmark run, so calls to the functions exposed by the client library were having to go cross-NUMA!
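
(One hedged sketch of how you could check where a library's code pages actually ended up, using move_pages(2) in its query-only mode; the library and symbol names here are just examples.)

    /* A sketch: ask the kernel which NUMA node holds the page behind a
       function in a dynamically loaded library, via move_pages(2) with
       nodes == NULL (query-only). Needs <numaif.h> and -lnuma (and
       possibly -ldl); the library and symbol names are just examples. */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <dlfcn.h>
    #include <numaif.h>

    int main(void) {
        void *lib = dlopen("libz.so.1", RTLD_NOW);   /* example library */
        if (!lib) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }
        void *sym = dlsym(lib, "zlibVersion");       /* example symbol */
        if (!sym) { fprintf(stderr, "dlsym failed\n"); return 1; }

        /* Touch the code page so it is actually resident before we ask. */
        volatile unsigned char touch = *(volatile unsigned char *)sym; (void)touch;

        long pagesize = sysconf(_SC_PAGESIZE);
        void *page = (void *)((uintptr_t)sym & ~(uintptr_t)(pagesize - 1));
        int status = -1;
        if (move_pages(0, 1, &page, NULL, &status, 0) != 0) {
            perror("move_pages");
            return 1;
        }
        if (status >= 0)
            printf("code page %p is on NUMA node %d\n", page, status);
        else
            printf("page not resident (status %d)\n", status);  /* negative errno */
        return 0;
    }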

CXL and the other technologies in the pipeline will also make these kinds of concerns increasingly important, since CXL memory is talked about as costing roughly an extra NUMA node hop or two.
