Wandering Thoughts archives

2020-07-21

Contrasting the two common approaches to where programs start running

When a program (in a broad sense) is executed, it must start running somewhere. There are two common approaches for choosing what code is the first code executed, each with different tradeoffs that make some languages and system environments pick one over the other.

The simplest approach is to say that the first code in the program starts running. This is what Python, Perl, many versions of BASIC, and shell scripts all do; execution starts from the top of your file and marches down. This is also something that's done at the machine code level; a simple execution environment might load your code into memory (often at a fixed address) and then simply transfer control to the start of that block of memory. This is, for example, how PC BIOSes load and execute the Master Boot Record (MBR) from hard drives; the first 512-byte sector is loaded into memory and the BIOS jumps to it.
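(As a minimal sketch of 'start at the start' in Python, the following runs a small made-up program from a string and records the order in which things happen. The 'events' list and the program text are my own illustration, not anything standard.)

```python
# A tiny made-up program. Top-level statements run in source order;
# a 'def' only defines a function, it does not run its body.
source = """
events.append("first statement")

def helper():
    events.append("helper called")

events.append("after def")
helper()
"""

events = []
exec(compile(source, "<sketch>", "exec"), {"events": events})
print(events)  # ['first statement', 'after def', 'helper called']
```

Nothing here looks for an entry point; the interpreter just starts at the first statement and keeps going.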

The other common approach is to say that execution starts at a user-defined entry point, at some address or identifier that is set by the program. Sometimes this entry point is specified by you as you build or execute the program; sometimes it is set by convention, such as C's main(). Your main() function doesn't have to be at the start of your program's code or at any specific address in memory; the system will arrange to find it and begin execution there (at least at a conceptual level, with some handwaving). This approach is common in compiled languages, especially ones that support building a single entity from multiple source files.
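(To illustrate the idea, here is a rough simulation of the entry point model written in Python. It is only a sketch: 'run_program' and its 'entry' parameter are hypothetical names of mine, and a real C toolchain does this through the linker and startup code, not anything like this.)

```python
import types

def run_program(source, entry="main"):
    """Load all of a program's code, then start at the named entry point."""
    module = types.ModuleType("program")
    exec(source, module.__dict__)    # this only defines things
    return getattr(module, entry)()  # execution "starts" here

# main() is not last and helper() is defined after it; neither matters,
# because nothing runs until the entry point is looked up by name.
source = """
def main():
    return helper() + 1

def helper():
    return 41
"""

print(run_program(source))  # 42
```

The entry point is found by its name, not by where it sits in the program text.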

The tradeoff of the 'start at the start' approach is that you have to care about the order of your code, both for code within a file and also the order of files (if your program is made up of multiple files). For 'start at the start', layout matters. Many 'start at the start' languages are most naturally used with programs that live in a single file; among other things, this means that you don't need to worry about the order of multiple files. This is commonly the case for interpreted languages, so 'start at the start' is common for them.
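(Here is a small Python sketch of why layout matters. The two program texts and the 'run_top_to_bottom' helper are made up for illustration; the only difference between the programs is the order of their code.)

```python
def run_top_to_bottom(source):
    """Run source in 'start at the start' style and report what happened."""
    log = []
    try:
        exec(source, {"log": log})
    except NameError:
        return "NameError"
    return log

# The call is reached before the 'def' statement has executed.
call_first = """
greet()

def greet():
    log.append("hello")
"""

# Same code, reordered so the definition runs first.
define_first = """
def greet():
    log.append("hello")

greet()
"""

print(run_top_to_bottom(call_first))    # NameError
print(run_top_to_bottom(define_first))  # ['hello']
```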

(It's not universal in interpreted languages, even on Unix. For example, awk is only sometimes ordered; you can put a 'BEGIN' rule anywhere, but code order matters if multiple rules act on a single line.)

The tradeoff of the entry point approach is that you have to define the entry point, either by hand or through convention. If defined by hand (at build or run time), you have to do some extra work and you have an extra thing to keep track of; if defined by convention, it's a bit harder to add some code to run at the start of your program (you have to add it to the front of the code at the defined entry point, respecting any ordering requirements, and can't just add a block at the very start of the file). The advantage of the entry point approach is that the order of code and files no longer matters.

(Also, conventions are arbitrary choices and are essentially magic. The reason your C programs start at main() is 'because', which is unsatisfying to some people and something you just have to memorize.)

It's common for compiled languages to support building programs from multiple source files that have no specific order among themselves, because this is the easiest approach for humans to deal with; we can name our source files whatever makes sense and don't have to maintain them in some careful order. This pretty much forces the entry point model. Supporting the 'start at the start' model would require people to keep their source files (and the resulting object files) in a specific order when building, instead of just using 'cc -o barney *.o' or the equivalent.
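(A rough Python simulation of the difference, with made-up 'files' held in dictionaries. This is a sketch of the idea under my own assumptions, not of how a real compiler or linker works; 'link_and_run' and 'run_in_order' are hypothetical helpers.)

```python
from itertools import permutations

# Hypothetical "source files"; the names and contents are made up.
definitions = {
    "util": "def helper():\n    return 'helped'\n",
    "prog": "def main():\n    return helper()\n",
}

def link_and_run(order, entry="main"):
    """Entry point model: combine every file, then start at `entry`."""
    namespace = {}
    for name in order:
        exec(definitions[name], namespace)
    return namespace[entry]()

# Hypothetical files made of top-level statements instead.
scripts = {
    "a": "trace.append('a ran')",
    "b": "trace.append('b ran')",
}

def run_in_order(order):
    """'Start at the start' model: each file runs as it is reached."""
    trace = []
    for name in order:
        exec(scripts[name], {"trace": trace})
    return trace

# Every file order gives the same result under the entry point model.
print({link_and_run(order) for order in permutations(definitions)})  # {'helped'}

# Under 'start at the start', the file order changes what happens.
print(run_in_order(["a", "b"]))  # ['a ran', 'b ran']
print(run_in_order(["b", "a"]))  # ['b ran', 'a ran']
```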

(This entry was sparked by the Hacker News discussion of my exploration of why Python doesn't require a 'main' function. As mentioned, Python is a 'start at the start' language and it has an execution model to support that.)

PS: On modern Unixes that use ELF format executables, you can see the entry address of an executable by running 'readelf -h <program>' and looking at the 'Entry point address' field. Programs generally have a wide variety of entry point addresses.
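(If you want to see where that number comes from, here is a minimal Python sketch that pulls the entry address out of the start of an ELF header. It relies on the standard ELF header layout, where the e_entry field sits at byte offset 24; the 'elf_entry_point' function name and the synthetic header are mine, and the code does no validation beyond checking the magic number.)

```python
import struct

def elf_entry_point(header: bytes) -> int:
    """Return the e_entry field from the first bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    byte_order = "<" if header[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2)
    # and e_version (4), so it starts at offset 24.
    fmt = byte_order + ("Q" if is_64bit else "I")
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry

# A synthetic 64-bit little-endian header, made up for this example.
fake_header = (
    b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)   # e_ident, 16 bytes
    + struct.pack("<HHIQ", 2, 62, 1, 0x401000)  # e_type, e_machine, e_version, e_entry
)
print(hex(elf_entry_point(fake_header)))  # 0x401000
```

For a real program you would pass in the first 64 bytes of the file, read in binary mode, and the result should match what 'readelf -h' reports.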

tech/ProgramStartTwoApproaches written at 23:36:41


