A bit on compilation's changing number of stages (and assembly)

December 13, 2021

Recently I was skimming Everything You Never Wanted To Know About Linker Script (via). One of the things it mentions is the traditional three stage model for transforming your source code into an executable program, where the compiler produces assembly code that it hands to the assembler, the assembler produces object code, and then the linker turns the object code into an executable. This is the historical practice and provides a nice simple mental model, but it's not necessarily how it works today. Various modern compilation environments skip right over the assembly language stage and have the compiler generate object code.

I don't know where the traditional three stage model originates, but one place I definitely encountered it was in Unix, where this was the normal way that Unix (classical) C compilers worked. The C compiler wrote out assembler files, then the cc frontend called the as assembler with them, then ran the ld linker to produce your final program. Although I don't know the reasons that compilers traditionally worked this way, it has a variety of appealing properties. You need an assembler anyway and producing binary object code is generally more complicated and annoying than producing textual assembly, so the compiler might as well delegate it to a tool that's already there.
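As an illustration, on a typical modern Linux system you can still run the classic stages by hand. This is only a sketch ('hello.c' is a stand-in name, and 'cc' these days is probably gcc or clang rather than a classical compiler), but the division of labour is the same:

    cc -S hello.c             # compile: write assembly out to hello.s
    as -o hello.o hello.s     # assemble: turn the assembly into object code
    cc -o hello hello.o       # link: let the frontend run ld with the right startup files and libraries

(You can run ld directly for the last step, but then you get to supply the C runtime startup objects and libraries yourself, which is exactly the sort of detail the cc frontend exists to hide.)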

(Early C compilers had a fourth stage, the preprocessor, that ran before compilation. For various reasons, C compilers increasingly moved this into the main C compiler instead of having it as a separate program.)
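You can still see that stage on its own today; gcc and clang style drivers will stop after preprocessing if you ask, and will accept the result back as an already preprocessed '.i' file. A small sketch (again with 'hello.c' as a stand-in name):

    cc -E hello.c > hello.i   # run only the preprocessor
    cc -c hello.i             # compile the already preprocessed source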

The Plan 9 compilers were probably where I first heard of doing away with the assembler, although they definitely weren't the first to do it (I don't think Turbo Pascal had anything like a separate assembler stage, for example). As the story I heard goes, the Bell Labs people noticed that their C compiler was spending a surprising amount of time turning an internal binary representation of assembly code into a textual assembler source file, only to have the assembler spend a bunch of time turning the textual assembler source file back into another, closely related internal binary representation. So the Bell Labs people decided to cut out this work, having their compilers normally write out a binary format that the linker could directly consume.

The speed increase from avoiding this round trip of encoding to text and decoding from text again is the big reason that modern compilers may skip having a separate assembler stage. In some environments with captive intermediate object formats, the compiler may also be in a position to give the linker more freedom in final instruction selection, by providing it with more information about what the compiler wants.

(Traditionally the linker had to assume that everything the assembler handed it was sacred, and the assembler itself was deliberately quite literal.)

PS: Even when there were separate stages, they didn't necessarily write their output to disk. Gcc has had the -pipe option for a long time, and back when disks were very slow it used to be quite popular (if you had enough memory).
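Using it is as simple as adding the option to your compile commands (a trivial sketch, with 'hello.c' as a stand-in name):

    gcc -pipe -c hello.c      # the stages talk over pipes instead of temporary files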

PPS: 'gcc -v' gives you really quite verbose output, in which it's hard to see what commands the compiler frontend is actually running. Clang is somewhat less verbose and easier to follow. Modern gcc (on Linux) still seems to run an assembler, but clang doesn't seem to.
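If you want to check this yourself, both drivers can be asked what they intend to run. A sketch for a typical Linux system ('hello.c' is a stand-in name; '-###' prints the commands without actually running them):

    gcc -v -c hello.c                           # the verbose output includes a separate 'as' command
    clang -### -c hello.c                       # normally no external assembler shows up
    clang -### -fno-integrated-as -c hello.c    # now a separate assembler invocation appears

(The -fno-integrated-as option is clang's switch for turning off its built in assembler; whether the integrated assembler is the default depends on the target.)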

Sidebar: The question of what is a stage of compilation

What constitutes a distinct and separate 'stage' of turning source code into an executable is a somewhat unclear thing, with no useful formal definition. I was going to say that using a separate executable makes a difference, but then when I looked I discovered that the V7 C compiler actually was three executables (and see also the makefile in c/), not even counting the separate C preprocessor. I don't think many people would consider compiling C programs on V7 to have that many stages.


Comments on this page:

These days things are even more interesting. Even though a lot of 'canonical' stages have indeed been integrated into compilers, compilers have also grown quite a few additional stages.

If you add -save-temps -v to clang's compilation options, it will split the compilation into independent phases and will print out the commands to perform them. Or you can use -ccc-print-phases which will print a graph of data dependencies between the stages.

For something like a CUDA or HIP compilation it will be quite a bit more than three stages and the sequence is not always linear.

If you use LTO, things may get even more interesting.

As for clang and as -- whether to use the integrated assembler depends on the target and on whether the -f[no-]integrated-as option is used.

By Walex at 2021-12-14 09:58:17:

«The Plan 9 compilers were probably where I first heard of doing away with the assembler»

Actually, before UNIX I used quite a few mainframe and mini OSes, and none of them had compilers that emitted assembler code; they generated object code directly. Assembler language was considered a language like the others, and the assembler a compiler like the others; there were also many in-between "MOHLLs" ("Machine Oriented High Level Languages") like Wirth's PL/360 that also generated object code directly.

IIRC, the original UNIX compilers (BCPL, B and C, in particular the second-generation C compiler, S. Johnson's PCC [Portable C Compiler]) generated assembly in part because they also targeted platforms where the object format was not well (or at all) documented, but it could have been simple laziness.

Later on, many compilers kept generating assembly under UNIX/Linux in part because ELF is actually a bit complex (this was before the GNU binary utils), and anyhow many UNIX systems did not use ELF, so a portable compiler was easier to write if it generated assembly. Which is the same reason why many compilers started generating C instead of assembly, or nowadays JavaScript or WebAssembly.

By Walex at 2021-12-14 11:00:28:

«none of them had compilers that emitted assembler code, they generated object code directly»

As a reminder, before UNIX the "done thing" for portability was not to generate assembler, C or webasm, but to generate some p-code/bytecode, also directly in a binary format, and then to provide an interpreter, an (often optimizing) translator to native code in binary format, or both. The p-code story is old and interesting, with many notable aspects, in particular TDF/ANDF, which was part of RSRE's Ten15 and eventually resulted in the TenDRA compiler and framework.

https://www.mca-ltd.com/martin/Ten15/introduction.html
https://www.tendra.org/

The currently more popular equivalents are Clang and LLVM.

As the story I heard goes, the Bell Labs people noticed that their C compiler was spending a surprising amount of time turning an internal binary representation of assembly code into a textual assembler source file, only to have the assembler spend a bunch of time turning the textual assembler source file back into another, closely related internal binary representation.

That's inspiring stuff. I wonder if cutting out wasteful serialization and parsing could improve the efficiency of other UNIX-style software, like pipelines.
