2021-12-13
A bit on compilation's changing number of stages (and assembly)
Recently I was skimming Everything You Never Wanted To Know About Linker Script (via). One of the things it mentions is the traditional three stage model for transforming your source code into an executable program, where the compiler produces assembly code that it hands to the assembler, the assembler produces object code, and then the linker turns the object code into an executable. This is the historical practice and provides a nice simple mental model, but it's not necessarily how it works today. Various modern compilation environments skip right over the assembly language stage and have the compiler generate object code.
I don't know where the traditional three stage model originates, but one place I definitely encountered it was in Unix, where this was the normal way that classical Unix C compilers worked. The C compiler wrote out assembler files, then the cc frontend called the as assembler on them, then ran the ld linker to produce your final program. Although I don't know the reasons that compilers traditionally worked this way, it has a variety of appealing properties. You need an assembler anyway, and producing binary object code is generally more complicated and annoying than producing textual assembly, so the compiler might as well delegate that work to a tool that's already there.
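To make the three stages concrete, here is one way to drive them by hand on a modern Linux system (a sketch assuming GNU gcc and as, with a hypothetical hello.c; the final link goes through the gcc frontend, since invoking ld directly would require spelling out the C runtime startup files yourself):

    gcc -S hello.c          # compiler: emit textual assembly as hello.s
    as hello.s -o hello.o   # assembler: turn it into object code
    gcc hello.o -o hello    # frontend runs ld (plus the CRT files) to link
    ./hello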
(Early C compilers had a fourth stage, the preprocessor, that ran before compilation. For various reasons, C compilers increasingly moved this into the main C compiler instead of having it as a separate program.)
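You can still ask the frontend to stop after preprocessing, which makes this stage visible (assuming gcc and the same hypothetical hello.c):

    gcc -E hello.c > hello.i   # preprocess only; hello.i is the macro-expanded source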
The Plan 9 compilers were probably where I first heard of doing away with the assembler, although they definitely weren't the first to do it (I don't think Turbo Pascal had anything like a separate assembler stage, for example). As the story I heard goes, the Bell Labs people noticed that their C compiler was spending a surprising amount of time turning an internal binary representation of assembly code into a textual assembler source file, only to have the assembler spend a bunch of time turning the textual assembler source file back into another, closely related internal binary representation. So the Bell Labs people decided to cut out this work, having their compilers normally write out a binary format that the linker could directly consume.
The speed increase from avoiding this round trip of encoding to text and decoding from text again is the big reason that modern compilers may skip having a separate assembler stage. In some environments with captive intermediate object formats, the compiler may also be in a position to give the linker more freedom in final instruction selection, by passing it more information about what the compiler wants.
(Traditionally the linker had to assume that everything the assembler handed it was sacred, and the assembler itself was deliberately quite literal.)
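Clang is a handy modern illustration here: by default it assembles internally, but you can ask it to run a separate external assembler instead (a sketch assuming a reasonably recent clang):

    clang -c hello.c                     # integrated assembler; no separate 'as' process
    clang -fno-integrated-as -c hello.c  # force running an external assembler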
PS: Even when there were separate stages, they didn't necessarily write their output to disk. Gcc has had the -pipe option for a long time, and back when disks were very slow it used to be quite popular (if you had enough memory).
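A minimal example of the option, assuming gcc and the hypothetical hello.c from before:

    gcc -pipe -O2 -c hello.c   # stages talk over pipes instead of temporary files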
PPS: 'gcc -v' gives you really quite verbose output, in which it's hard to see what commands the compiler frontend is actually running. Clang is somewhat less verbose and easier to follow. Modern gcc (on Linux) still seems to run an assembler, but clang doesn't seem to.
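One lower-noise way to check is to ask each driver to print the commands it would run without actually running them; both gcc and clang accept -### for this (again assuming a hypothetical hello.c):

    gcc -### -c hello.c     # output includes a cc1 command and then an as command
    clang -### -c hello.c   # typically shows a single 'clang -cc1' job and no as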
Sidebar: The question of what is a stage of compilation
What constitutes a distinct and separate 'stage' of turning source code into an executable is a somewhat unclear thing, with no useful formal definition. I was going to say that using a separate executable makes a difference, but then when I looked I discovered that the V7 C compiler actually was three executables (and see also the makefile in c/), not even counting the separate C preprocessor. I don't think many people would consider compiling C programs on V7 to have that many stages.