C's main() is one of the places where Unix's user and kernel APIs differ

July 23, 2020

Modern Unixes often like to draw a legalistic distinction between the API provided to user space by the kernel and the Unix API provided to programs by the 'standard library', by which they mean the standard C library. Some people, me included, don't entirely like this (I've written about whether the C runtime and library is a legitimate part of the Unix API). However, regardless of what I might think about it, Unix has long had at least one place where there was a real difference between the normal API that everyone used and the API that the kernel actually implemented. I'm talking about the traditional C style main() entry point that starts your program.

Everyone knows the basic form of main(), with argc and argv; you're called with a count of the arguments and an array of strings. In slightly more advanced usage there is a third argument, envp, an array of environment variables. This format is very old in Unix. The two argument version of main() goes back to at least Research Unix V4's exec(2), while the three argument form with environment variables seems to appear in V7's exec(2).
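As a minimal sketch of what the three argument form hands you: envp is a NULL-terminated array of "NAME=value" strings, so looking up a variable is a linear scan. The helper name env_lookup is made up for illustration and is not something from V7 or any C library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan an envp-style array (NULL-terminated,
 * each entry of the form "NAME=value") and return a pointer to the
 * value for `name`, or NULL if there is no such entry. */
const char *env_lookup(char **envp, const char *name)
{
    size_t len = strlen(name);

    for (char **e = envp; *e != NULL; e++) {
        /* Match the whole name, followed immediately by '='. */
        if (strncmp(*e, name, len) == 0 && (*e)[len] == '=')
            return *e + len + 1;
    }
    return NULL;
}
```

A program declared as main(int argc, char **argv, char **envp) could call env_lookup(envp, "HOME") on the third argument directly.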

However, this is not the actual program entry point that the V7 Unix kernel used when starting your program, and the actual entry point had a somewhat different API than main(). Conventionally, V7 C programs actually started at an assembly symbol called start; the simplest version of the assembly code involved is in crt0.s and it clearly does a certain amount of setup work. There are other versions of this startup code in /usr/src/libc/csu that do varying amounts of additional work, such as arranging to profile your program.

(Research Unix V6 also had a crt0.s, but it's rather different; I think there are no loops, for example. If I understood PDP-11 assembly language I might have a better idea of what it was actually doing.)

In V7, the differences between the user API for main() and the kernel API are not huge. In current Unixes, there's often rather more going on, especially once you include dynamic loaders and things like the 'auxiliary vector' present in some Unixes. I suspect that the simplest version of a modern one to look at is musl libc for Linux, where crt1.c and the main libc bootstrap functions are relatively straightforward.

(Some of the code is there because the C runtime environment needs to be set up (and yes, modern C has a runtime), but a certain amount of it is converting between how the kernel invokes programs and how main() wants to be invoked. For example, notice how musl libc's main start function isn't called with argc as an explicit argument; instead it retrieves argc from memory.)
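A rough sketch of that conversion, assuming a Linux-style startup block in which the start function is handed a pointer to a word holding argc, followed by the argv pointers, a NULL, the envp pointers, and another NULL. This is not musl's actual code; struct main_args and decode_start_block are names invented here, and the block is modelled as an ordinary array of pointers with argc stored in the first slot.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for what main() ultimately wants. */
struct main_args {
    int argc;
    char **argv;
    char **envp;
};

/* Assumed layout (Linux-style), one pointer-sized slot per entry:
 *   p[0]             argc, stored as an integer in a pointer-sized slot
 *   p[1..argc]       argv[0] .. argv[argc-1]
 *   p[argc+1]        NULL terminating argv
 *   p[argc+2...]     envp entries, then a terminating NULL
 */
struct main_args decode_start_block(char **p)
{
    struct main_args a;

    a.argc = (int)(intptr_t)p[0];   /* argc is read from memory */
    a.argv = p + 1;                 /* argv starts right after it */
    a.envp = a.argv + a.argc + 1;   /* skip the argv pointers and their NULL */
    return a;
}
```

A real start function would then do the equivalent of calling main(a.argc, a.argv, a.envp) and handing the result to exit().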

Sidebar: The interesting V7 trick with data address 0

At the end of every version of V7's crt0.s is a little bit that initially puzzled me:

.data
   .=.+2   / loc 0 for I/D; null ptr points here.

This reserves two bytes of space at the start of the data section. V7 Unix ran on PDP-11s that supported split instruction and data address spaces, so the data section starts at (data) address 0. Reserving two bytes at the start ensures that no variable or other thing in the data section can be located at address 0, and so C NULL is always distinct from valid pointers.


Comments on this page:

By Ben Hutchings at 2020-07-25 20:02:16:

The v6 startup code expects the kernel to pass arguments on the stack like this:

sp          -> argc
sp+2        -> argv[0]
...            ...
               NULL

It moves the stack pointer down by 2 and changes the top two entries to produce:

sp          -> argc
sp+2        -> argv
sp+4        -> argv[0]
...            ...
               NULL

If main() returns, it passes its return value to exit(), and finally if exit() returns(!) it calls the exit syscall.

The v7 startup code expects additional arguments on the stack:

               envp[0]
               ...
               NULL

For some reason it doesn't calculate sp+2*argc but instead loops over argv to find the start of the envp array. It also validates that envp < argv[0]; if not (presumably because it's running on an older kernel?) it sets envp to point to the preceding NULL of argv (i.e., an empty environment).

It does a similar adjustment of the stack to pass argc, argv, and envp to main() and also stores envp to the global environ.
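The two ways of locating envp described above can be sketched in C rather than PDP-11 assembly. This is only an illustration under the assumption that the envp pointers sit immediately after argv's terminating NULL; the function names are hypothetical, and the envp < argv[0] sanity check is left out.

```c
#include <stddef.h>

/* Walk argv until its terminating NULL; envp starts in the next slot.
 * This mirrors the looping approach described for the v7 startup code. */
char **envp_by_scan(char **argv)
{
    char **p = argv;

    while (*p != NULL)
        p++;
    return p + 1;
}

/* The arithmetic alternative: skip argc pointers plus the NULL. */
char **envp_by_count(int argc, char **argv)
{
    return argv + argc + 1;
}
```

The two agree whenever argc matches the number of non-NULL argv entries. The scanning version simply does not need to trust argc.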

Written on 23 July 2020.

Last modified: Thu Jul 23 00:18:46 2020