Two stories of how and why simultaneous multithreading works

November 27, 2021

Simultaneous multithreading (SMT) has been controversial ever since Intel introduced it as hyper-threading. Some people felt they were getting a nice bonus; others thought they were getting a net negative (back in 2011 I sort of had this view). I don't have any well-informed views on whether or not SMT is useful for me (or for us), but what I do have is two stories that I've absorbed over the years about how it works and how it could benefit you.

The first story is that SMT works by covering up memory latency. When a core would otherwise have to stall the currently executing thread to wait for a memory fetch (or a memory store), it can instead instantly switch to another thread and perhaps get some additional work done. This switching can't be handled by the OS for various reasons; instead, the other thread of execution must be ready for the core's hardware to start executing its instructions. The simplest way to do this is to present the extra hardware thread context as an additional CPU. This then generally forces the core to actually alternate between both threads, rather than starving the secondary thread unless and until the primary thread stalls.

(The memory fetch might be either for data or for the program's code, as with branches and calls. Of course, as we famously know from Spectre and Meltdown, a core may continue on with speculative execution of a thread even when it's nominally stalled on a memory fetch.)

The second and more recent story is that SMT works partly by increasing the utilization of a core's execution units (EUs). Modern x86 processor cores have a number of execution units of each type in order to extract as much instruction level parallelism as possible (a superscalar processor). However, not all code can use all of those execution units at once, and some code can leave entire types of execution units totally idle (for example, integer-only code leaves floating point EUs idle). If a core executes more than one thread at once, your EU utilization is what both threads together can use, not just what one thread can use. This intrinsically requires both threads to be executing simultaneously; otherwise they won't both be using EUs at the same time.

These two stories are probably both true these days, but to what extent they're each true will depend partly on what your code actually does. The worst case for SMT is probably dense, highly optimized code that makes minimal memory fetches and tries to use all of the available execution units.
