My understanding of modern C undefined behavior and its effects

August 15, 2013

Back in the old days, it was famously said that using undefined behavior in your C program gave the compiler license to delete all of your files if it felt like it. When we heard that, we laughed, nodded sagely, and went cheerfully on our way because of course no actual compiler was ever going to react to undefined behavior in that way and everyone knew it. (The closest real compilers ever came to that was how early versions of GCC reacted to #pragma.)

This left a whole generation of programmers with the attitude that C's large collection of undefined and implementation defined behavior was no big deal. Different CPUs or compilers might behave differently but the whole result would be fundamentally sane and often even predictable in advance (given knowledge of CPU behavior).

In the modern world, as John Regehr has taught me, this is both wrong and dangerous. Modern compilers do not delete your files or launch ICBMs when they encounter undefined behavior, because that would still be very stupid. Instead they do something much more dangerous: modern compilers will assume that undefined behavior can't happen. This knowledge that certain things can't happen is then used in optimization; for example, the compiler may deduce things about variable values which then get fed through into dead code elimination, and pretty soon you are removing a security check because the compiler knows it can 'never' trigger (in proper code).

(That led to a cute Linux kernel security vulnerability, by the way.)

The practical upshot is that it is now basically impossible to reason about how a chunk of code will behave in the face of undefined behavior, and in any case that behavior changes as compilers change. To even start requires a thorough understanding of modern compiler optimizations and a ruthlessly objective skeptic's eye so that you can see what the code actually says, not what you think it does. Only then are you in a position to start following the implications of, say, dereferencing a structure pointer as part of local variable initialization before you explicitly check said pointer to see if it's NULL.
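
As a minimal sketch of that last pattern (the structure and function names here are hypothetical and made up for illustration, not taken from any real code), the first function dereferences the pointer in an initializer before checking it, while the second checks first:

```c
#include <stddef.h>

/* Hypothetical structure, for illustration only. */
struct device {
    int flags;
};

/* Risky ordering: dev is dereferenced in the initializer before the
 * NULL check. If dev is NULL, that dereference is undefined behavior,
 * so an optimizing compiler is allowed to assume dev != NULL and may
 * treat the check below as dead code. */
int get_flags_risky(struct device *dev)
{
    int flags = dev->flags;   /* undefined behavior if dev == NULL */

    if (dev == NULL)          /* may be removed by the optimizer */
        return -1;
    return flags;
}

/* Safer ordering: check the pointer first, dereference afterwards. */
int get_flags_checked(struct device *dev)
{
    if (dev == NULL)
        return -1;
    return dev->flags;
}
```

Both functions behave the same for a valid pointer. Only get_flags_checked() has defined behavior when handed NULL; what get_flags_risky() does in that case depends on the compiler and its optimization settings.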

Or in short, modern C compilers do terrifying things with undefined behavior.

PS: I recommend you read John Regehr's blog. It's hair-raising.

(This was inspired by C J Silverio pointing to this HN comment.)


Comments on this page:

From 68.57.101.136 at 2013-08-15 15:10:47:

I particularly like this quote from Regehr's blog post:

"A sufficiently advanced compiler is indistinguishable from an adversary."

From 91.52.243.226 at 2013-08-17 02:55:26:

I remember reading that the LLVM compiler emits invalid opcodes or stops compilation when hitting undefined behaviors. That would actually be a good thing when writing portable code or even in the presence of dead code elimination.

-- Baruch

By cks at 2013-08-17 05:25:17:

Code that always invokes undefined behavior is the easy case. The kind of code that John Regehr writes about is code that merely may invoke undefined behavior and the problem is that it is everywhere because C has so much undefined behavior. For example, integer overflow during arithmetic is undefined behavior but in most cases a compiler can't definitively prove either that it always happens or that it never happens.
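
A hedged sketch of the integer overflow case (the function names are hypothetical, chosen for illustration): the first check only works if signed arithmetic wraps around, which the C standard does not promise, while the second tests for overflow before doing any arithmetic that could overflow.

```c
#include <limits.h>
#include <stdbool.h>

/* Relies on wraparound. If a == INT_MAX then a + 1 overflows, which is
 * undefined behavior, so a compiler may assume that never happens and
 * reduce this whole function to "return false". */
bool increment_overflows_risky(int a)
{
    return a + 1 < a;
}

/* Checks the operands before adding, so no signed overflow can occur
 * inside the check itself. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow here */
    return a < INT_MIN - b;       /* b <= 0, so INT_MIN - b cannot overflow */
}
```

For example, add_would_overflow(INT_MAX, 1) is true and add_would_overflow(1, 2) is false, without either call ever performing the overflowing addition.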

By nothings at 2013-08-23 05:08:08:

I didn't see much about this on Regehr's blog, at least on the first page, so you might find these slides interesting:

http://www.google.com/search?q=%22dangerous+optimizations+and+a+loss+of+causality%22

(I can't link the PDF directly because the only way I know to find it is google, and google won't show me the direct link, and neither does my browser.)

The argument here is right back to your old discussions of standards vs. de facto standards etc.

Compiler writers (especially gcc) have decided that the written standard is the standard, and they are allowed to do anything it allows (what Seacord calls the "total license model").

People who have good reason to use C (I'd have once called us "system programmers", but the class of what C is used for is more complicated now) believe there is a de facto standard way that this stuff has worked and has always worked (what Seacord calls the "hardware model"). We consider the compiler writers' appeals to the standard as allowing this deviation as (a) terrible for compiling existing code, (b) terrible for writing clear and correct code (e.g. the integer arithmetic overflow stuff, or gcc's strict type aliasing analysis), and (c) a hideous abuse of the system because practicing and highly-skilled C programmers do not appear to be represented on the C standard committees, and so the fact that the standard allows the compiler writers to do what they're doing isn't even a reasonable defense of the situation when you view the whole ecosystem.

And it's weird, because you'd think compiler writers' goal would be to help people write fast and correct code, not to get the maximum possible result on SPEC benchmarks, but the latter priority is the only thing I can think of to explain their behavior.

I'm also reminded of an incident from perhaps a year ago, for which I cannot figure out the right thing to google search, in which someone who was closely following the new C++ standards noticed that by carefully analyzing the language of certain parts of the C++ standard, and carefully analyzing the language of a particular part of the STL (standard template library), an accompanying but separate standard, putting those two pieces together implied a strong requirement on any object that could be used in certain STL-ish contexts. This was a requirement that people were very interested in being able to rely on objects to have, but everyone believed it could only be required by contract--it wasn't required by the language--but now, he argued, it would in fact be invalid C++ for anyone to ever pass you an object of that type in this sort of application.

I don't know what the long-term fallout about this was (whether people agreed or disproved that this was implied by the standards), but my immediate reaction was 'that's a terrible thing to assume, obviously your normal highly-skilled programmer isn't going to have done this careful analysis, that's a stupid way for a spec to work'. (Obviously if the result were well-publicized, you could then expect them to know it, but that's closer to a de facto standard anyway.)

By cks at 2013-08-23 11:12:59:

I think a direct URL is Dangerous Optimizations and the Loss of Causality. It's certainly interesting reading.

I believe that most of the interesting John Regehr blog entries (for this purpose) can be found in his compilers category, although it looks like his major series on undefined C behavior was a while back; a classic couple of entries are his A Guide to Undefined Behavior in C and C++, Part 1, Part 2 and Part 3. As a bonus you'll get interesting entries like this discussion of how to get compilers to correctly optimize certain sorts of casting and C and C++ Aren’t Future Proof.

(He sometimes points to other interesting blog entries, like What Every C Programmer Should Know About Undefined Behavior #2/3. Why yes, I am noting down URLs here partly for my own later reference.)

From 87.79.78.105 at 2013-08-23 11:40:27:

Google won't show me the direct link

Google Tracking-B-Gone will help you. (GreaseMonkey script, rewrites the SERP URLs to remove the Google redirect.)

Aristotle Pagaltzis

Written on 15 August 2013.

Last modified: Thu Aug 15 02:00:56 2013