2024-03-02
Something I don't know: How server core count interacts with RAM latency
When I wrote about how the speed of improvement in servers may have slowed down, I didn't address CPU core counts, which is one area where the numbers have been going up significantly. Of course you have to keep those cores busy, but if you have a bunch of CPU-bound workloads, the increased core count is good for you. Well, it's good for you if your workload is genuinely CPU bound, which generally means its working set fits within per-core caches. One of the areas I don't know much about is how the increasing CPU core counts interact with RAM latency.
RAM latency (for random requests) has been relatively flat for a while (it's been flat in time, which means that it's been going up in cycles as CPUs got faster). Total memory access latency has apparently been 90 to 100 nanoseconds for several memory generations (although the DDR5 memory module's own access time is apparently only one part of that total). Memory bandwidth has been going up steadily between the DDR generations, so per-core bandwidth has gone up nicely, but that only helps if you have the kind of sequential workloads that can take advantage of it. As far as I know, the kind of random access that you get from things like pointer chasing depends entirely on latency.
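(To make the latency point concrete, here's a minimal sketch of the classic pointer-chasing measurement, where every load depends on the previous one so the CPU can't overlap them. The array size, step count, and use of rand() are arbitrary choices for illustration, not a tuned benchmark.)

    /* Sketch: a single dependent chain of random loads.  Each load must
       finish before the next address is known, so once the array is much
       bigger than the caches, the time per step is roughly memory latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1u << 23)     /* 8M entries of size_t, 64 MB */
    #define STEPS (1u << 24)

    int main(void) {
        size_t *next = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++)
            next[i] = i;
        /* Sattolo's algorithm: shuffle into one big random cycle so the
           chase visits everything instead of getting stuck in a short loop. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t i = 0; i < STEPS; i++)
            p = next[p];             /* the dependent load */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per dependent load (p = %zu)\n", ns / STEPS, p);
        free(next);
        return 0;
    }

(My understanding is that with a big enough array this reports something in the same general range as the quoted 90 to 100 nanoseconds, plus TLB miss overhead that a more careful version would control for with huge pages.)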
(If the total latency has been basically flat, this seems to imply that bandwidth improvements don't help this kind of access very much. Presumably they help for successive non-random reads, and my vague impression is that reading data from successive addresses in RAM is faster than reading from random addresses (and not just because RAM typically transfers an entire cache line to the CPU at once).)
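(One way to see why flat latency blunts the bandwidth improvements here is the Little's law arithmetic: to actually use a channel's bandwidth you need roughly bandwidth times latency bytes in flight at once. The per-channel figures below are theoretical peak numbers for a few common speeds, just to illustrate the scale.)

    /* Back of the envelope: bytes (and 64-byte cache lines) that have to be
       in flight to keep one memory channel busy at ~100 ns latency.
       Bandwidth figures are theoretical per-channel peaks, not measurements. */
    #include <stdio.h>

    int main(void) {
        double latency_ns = 100.0;
        double bw_gb_s[] = { 25.6, 38.4, 51.2 };  /* DDR4-3200, DDR5-4800, DDR5-6400 */
        for (int i = 0; i < 3; i++) {
            double bytes = bw_gb_s[i] * latency_ns;   /* GB/s * ns = bytes */
            printf("%.1f GB/s: ~%4.0f bytes, ~%2.0f cache lines in flight\n",
                   bw_gb_s[i], bytes, bytes / 64);
        }
        return 0;
    }

(A single latency-bound pointer chase has at most one 64-byte line on the way at any time, so it sees essentially none of that bandwidth; the extra bandwidth only pays off for streaming reads or for many independent accesses happening at once.)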
So now we get to the big question: how many memory reads can you have in flight at once with modern DDR4 or DDR5 memory, especially on servers? Where the limit is presumably matters, because if you have a bunch of pointer-chasing workloads that are limited by 'memory latency' and you run them on a high core count system, at some point it seems that they'll run out of simultaneous RAM read capacity. I've tried to do some reading and have come away confused, which may be partly because modern DRAM is a pretty complex thing.
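(One way I'd try to get at least a per-core answer is to extend the pointer chase to several independent chains, so the out-of-order machinery can have multiple loads outstanding, and watch where the scaling flattens out. This is a sketch under the same simplifying assumptions as before, and it only probes one core's memory-level parallelism (its miss buffers and so on), not what the whole DRAM subsystem can sustain; for that you'd run copies on many cores at once.)

    /* Sketch: chase K independent random chains in one loop.  Loads in
       different chains don't depend on each other, so the core can overlap
       them; per-load time should drop as K grows, until something limits it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1u << 23)     /* 8M entries, 64 MB */
    #define STEPS (1u << 22)     /* loads per chain per run */
    #define MAXK  16

    int main(void) {
        size_t *next = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++)
            next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {   /* Sattolo again: one big cycle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        for (int k = 1; k <= MAXK; k *= 2) {
            size_t p[MAXK];
            for (int c = 0; c < k; c++)
                p[c] = (size_t)c * (N / MAXK);  /* start the chains far apart */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < STEPS; i++)
                for (int c = 0; c < k; c++)
                    p[c] = next[p[c]];          /* k mutually independent loads */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            size_t sink = 0;
            for (int c = 0; c < k; c++)
                sink ^= p[c];
            printf("%2d chains: %.1f ns per load (sink %zx)\n",
                   k, ns / ((double)STEPS * k), sink);
        }
        free(next);
        return 0;
    }

(My possibly wrong expectation is that the per-load time drops steeply for the first few chains and then levels off; where it levels off on one core, versus on all cores running this at once, is exactly the thing I don't know.)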
(I believe that individual processors and multi-socket systems have some number of memory channels, each of which can be active simultaneously, and then there are memory ranks and memory banks below that. How many memory channels you have depends partly on the processor you're using (well, its memory controller) and partly on the motherboard design. For example, 4th generation AMD Epyc processors apparently support 12 memory channels, although not all of them may be populated in a given memory configuration. I think you need at least N (or maybe 2N) DIMMs to use N channels. And here's a look at AMD Zen4 memory handling, which doesn't seem to say much about multi-core random access latency.)
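(Extending the earlier per-channel arithmetic to a whole socket, with theoretical numbers for a hypothetical 12-channel DDR5-4800 configuration, gives a feel for how many concurrent reads the channels, ranks, and banks would collectively have to be servicing.)

    /* Socket-level version of the same back-of-the-envelope math, using
       theoretical peaks for a hypothetical 12-channel DDR5-4800 setup. */
    #include <stdio.h>

    int main(void) {
        double per_channel_gb_s = 38.4;              /* DDR5-4800, 8 bytes/transfer */
        double socket_gb_s = 12 * per_channel_gb_s;  /* ~460.8 GB/s */
        double latency_ns = 100.0;
        double lines = socket_gb_s * latency_ns / 64;  /* 64-byte lines in flight */
        printf("~%.0f GB/s per socket, ~%.0f cache lines in flight to saturate\n",
               socket_gb_s, lines);
        return 0;
    }

(That works out to on the order of hundreds of outstanding 64-byte reads, which is presumably part of why it takes a lot of cores, or very aggressive prefetching, to get anywhere near the headline bandwidth numbers.)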