C was not created as an abstract machine (of course)

February 1, 2023

Today on the Fediverse I saw a post by @nytpu:

Reminder that the C spec specifies an abstract virtual machine; it's just that it's not an interpreted VM *in typical implementations* (i.e. not all, I know there was a JIT-ing C compiler at some point), and C was lucky enough to have contemporary CPUs and executable/library formats and operating systems(…) designed with its VM in mind

(There have also been actual C interpreters, some of which had strict adherence to the abstract semantics, cf (available online in the Usenix summer 1988 proceedings).)

This is simultaneously true and false. It's absolutely true that the semantics of formal standard C are defined in terms of an abstract (virtual) machine, instead of any physical machine. The determined refusal of the specification to tie this abstract machine to concrete CPUs is the source of a significant amount of frustration in people who would like, for example, for there to be some semantics attached to what happens when you dereference an invalid pointer. They note that actual CPUs running C code all have defined semantics, so why can't C? But, well, as is frequently said, C Is Not a Low-level Language (via) and the semantics of C don't correspond exactly to CPU semantics. So I agree with nytpu's overall sentiments, as I understand them.

However, it's absolutely false that C was merely 'lucky' that contemporary CPUs, OSes, and so on were designed with its abstract model in mind. Because the truth is the concrete C implementations came first and the standard came afterward (and I expect nytpu knows this and was making a point in their post). Although the ANSI C standardization effort did invent some things, for the most part C was what I've called a documentation standard, where people wrote down what was already happening. C was shaped by the CPUs it started on (and then somewhat shaped again by the ones it was eagerly ported to), Unix was shaped by C, and by the time that the C standard was producing drafts in the mid to late 1980s, C was shaping CPUs through the movement for performance-focused RISC CPUs (which wanted to optimize performance in significant part for Unix programs written in C, although they also cared about Fortran and so on).

(It's also not the case that C only succeeded in environments that were designed for it. In fact C succeeded in at least one OS environment that was relatively hostile to it and that wanted to be used with an entirely different language.)

Although I'm not absolutely sure, I suspect that the C standard defining it in abstract terms was in part either enabled or forced by the wide variety of environments that C already ran in by the late 1980s. Defining abstract semantics avoided the awkward issue of blessing any particular set of concrete ones, which at the time would have advantaged some people while disadvantaging others. This need for compromise between highly disparate (C) environments is what brought us charming things like trigraphs and a decision not to require two's-complement integer semantics (C23 changes both: it requires two's complement and drops trigraphs).
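(As a small illustration of what that neutrality meant in practice, here is a sketch of the classic trick for telling the three once-permitted signed integer representations apart. The function name is made up for this example.)

```c
/* Classify the signed integer representation from the low two bits of -1.
   Before C23 the standard permitted all three of these. */
const char *signed_representation(void)
{
    switch (-1 & 3) {
    case 3:  return "two's complement";    /* -1 is all one bits */
    case 2:  return "ones' complement";    /* -1 is ...1110 */
    case 1:  return "sign and magnitude";  /* -1 is sign bit plus ...0001 */
    default: return "unknown";
    }
}
```

On essentially any machine you are likely to compile this on today, it will report two's complement, which is part of why the standard could eventually stop hedging.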

Since ANSI C was defined and C compilers became increasingly aggressive about optimizing around 'undefined behavior' (even if this created security holes), you could say that modern software and probably CPUs have been shaped by the abstract C machine. Obviously, software increasingly has to avoid doing things that will blow your foot off in the model of the C abstract machine, because your C compiler will probably arrange to blow your foot off in practice on your concrete CPU. Meanwhile, things that aren't allowed by the abstract machine are probably not generated very much by actual C compilers, and things that aren't generated by C compilers don't get as much love from CPU architects as things that do.
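(A minimal sketch of the kind of foot-blowing involved, using signed integer overflow. Both function names are made up for illustration. The first check reads naturally if you think in terms of what your CPU does, but it depends on behavior that is undefined in the abstract machine.)

```c
#include <limits.h>

/* Relies on a + 1 wrapping around to a negative number. Signed overflow is
   undefined behavior in the C abstract machine, so a compiler is allowed to
   assume it never happens and treat this test as always false. */
int next_would_overflow_ub(int a)
{
    return a + 1 < a;
}

/* Well-defined version: compare against INT_MAX before doing any arithmetic,
   so no overflow can ever occur. */
int next_would_overflow(int a)
{
    return a == INT_MAX;
}
```

Optimizing builds with GCC and Clang will typically reduce the first function to one that always returns 0, which silently removes the overflow check. Both compilers accept `-fwrapv` to make signed overflow wrap, which is one example of stepping back toward what the concrete CPU does.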

(This neat picture is complicated by the awkward fact that many CPUs probably run significantly more C++ code than true C code, since so many significant programs are written in the former instead of the latter.)

It's my view that recognizing that C comes from running on concrete CPUs and was strongly shaped by concrete environments (OS, executable and library formats, etc) matters for understanding the group of C users who are unhappy with aggressively optimizing C compilers that follow the letter of the C standard and its abstract machine. Those origins of C were there first, and it's not irrational for people used to them to feel upset when the C abstract machine creates a security vulnerability in their previously working software because the compiler is very clever. The C abstract machine is not a carefully invented thing that people then built implementations of, an end in and of itself; it started out as a neutral explanation and justification of how actual existing C things behaved, a means to an end.

Comments on this page:

By moshev at 2023-02-02 09:21:48:

I have long thought that C compilers ought to have a standardised "system semantics" mode where "undefined behaviour" means "whatever the system (CPU and OS if any) does". That already de-facto exists as various flags for GCC, Clang, MSVC, ICC - in general any compiler that aggressively optimises based on undefined behaviour has a flag to turn those off. C is currently used mainly for two goals - low-level system programming and high-performance computing. It would be a boon to the former to standardise a mode where the language behaves like the system you're compiling for.

What some people refuse to understand is that true portability requires formal semantics, but low-level languages needn't contort themselves to achieve it. Ada is a language simultaneously lower-level and higher-level than the C language, because it avoids unnecessarily specifying irrelevant details and has many dedicated ways to specify those same details when relevant, whereas the C language specifies irrelevant details and relies on implicit corner cases to permit certain behaviour whenever wanted. It also results in the traditional scattering of documentation everywhere.

All of this could've been solved with foresight, but then it wouldn't be the C language. The clever compilers are needed to work around the gross inefficiency of the language. People point at TCC, but next to no one uses it in any serious way.

By Flatfinger at 2023-02-04 16:07:39:

In many cases, it may be useful to treat a program as running on an abstract machine whose semantics aren't as precise as the underlying hardware, but are still much tighter than "Anything can happen" UB. One major limitation of the Standard's abstract machine is that it has no sensible way of treating a function like `test2()` below.

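   unsigned char arr[65537];
   unsigned test(unsigned x)
   {
     unsigned i=1;
     while((i & 0xFFFF) != x)
       i*=3;  /* loop body lost from the source text; a side-effect-free update like this is assumed */
     if (x < 65536)
       arr[x] = 1;
     return i;
   }
   void test2(unsigned x)
   {
     test(x);  /* result ignored, so the loop has no observable effect here */
   }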

Here, when `test()` is invoked from `test2()`, no iteration of the loop would perform any action that was observably sequenced before the following code, and it would thus be useful to postpone execution of the loop indefinitely (which would, of course, yield observable behavior equivalent to simply omitting the execution of the loop).

Unfortunately, the C Standard's Abstract Machine model requires that any situation where optimizations might yield behavior inconsistent with precise sequential program execution must be classified as Undefined Behavior. Under the abstraction model processed by clang, the code would invoke UB any time `x` exceeds 65535, and thus the store to `arr[x]` may be performed unconditionally.

Written on 01 February 2023.


Last modified: Wed Feb 1 23:18:30 2023