Why you can't put zero bytes in Unix command line arguments

May 4, 2018

One sensible reaction to all of the rigmarole with 'grep -P' I went through in yesterday's entry in order to search for a zero byte (a null byte) is to ask why I didn't just use a zero byte in the command line argument:

fgrep -e ^@ -l ...

(Using the usual notation for a zero byte.)

You can usually type a zero byte directly at the terminal, along with a number of other unusual control characters (see my writeup of this here), and failing that you could write a shell script in an editor and insert the null byte there. Ignoring character set encoding issues for the moment, this works for any other byte, but if you try it you'll discover that it doesn't work for the zero byte. If you're lucky, your shell will give you an error message about it; if you're not, various weird things will happen. This is because the zero byte can't ever be put into command line arguments in Unix.

The reason is ultimately simple. This limitation exists because the Unix API is fundamentally a C API (whether or not the C library and runtime are part of the Unix API), and in C, strings are terminated by a zero byte. When Unix programs such as the shell pass command line arguments to the kernel as part of the exec*() family of system calls, they do so as an array of null-terminated C strings; if you try to put a null byte in there as data, it will just terminate that command line argument early (possibly reducing it to a zero-length argument, which is legal but unusual). When Unix programs start, they receive their command line arguments as an array of C strings (in C, the argv argument to main()), and again a null byte passed in as data would be seen as terminating that argument early.
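To make the early termination concrete, here is a small sketch in Python that uses the standard ctypes module to look at a byte string the way C code would. The helper name as_c_string is made up for this illustration; it is not part of any real API, and this only imitates what happens to an argument, it doesn't go through exec*() itself.

```python
import ctypes


def as_c_string(data: bytes) -> bytes:
    """Return what C code would see if `data` were handed over as a char *.

    as_c_string is a hypothetical helper written for this example.
    """
    # create_string_buffer() copies the bytes and appends a terminating NUL.
    buf = ctypes.create_string_buffer(data)
    # .value reads up to the first zero byte, just as strlen() or strcpy() would.
    return buf.value


print(as_c_string(b"abc"))       # b'abc'
print(as_c_string(b"ab\x00cd"))  # b'ab'  (everything after the zero byte is lost)
print(as_c_string(b"\x00abc"))   # b''    (a zero-length argument)
```

The last case is the "zero-length argument" situation: the data is still sitting in memory, but nothing that treats it as a C string will ever look past the first byte.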

This is true whether or not your shell and the program you're trying to run are written in C. They can both be written in modern languages that are happy to have zero bytes in strings, but the command line arguments moving between them are being squeezed through an API that requires null-terminated strings. The only way around this would be a completely new set of APIs on both sides, and that's extremely unlikely at this point.
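Python is one such language: its strings can hold zero bytes without complaint, but its standard library refuses to pass them along as arguments. The following is a minimal sketch, assuming a Unix-like system with CPython; try_argument is a name invented for this example, and the exact wording of the error message may differ between Python versions.

```python
import subprocess
import sys


def try_argument(arg: str) -> str:
    """Try to run a child Python process with `arg` as a command line argument.

    try_argument is a hypothetical helper written for this example.
    """
    child_code = "import sys; print(len(sys.argv[1]))"
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, arg],
            capture_output=True,
            text=True,
        )
    except ValueError as exc:
        # Raised before any process is started: the argument can't be
        # represented as a null-terminated C string.
        return f"rejected: {exc}"
    return f"accepted, child saw {result.stdout.strip()} characters"


print(try_argument("ab"))
print(try_argument("a\0b"))
```

Here the language runtime reports the problem up front instead of silently truncating the argument, which is the "lucky" outcome. Either way, the zero byte never reaches the child process.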

Because filenames are also passed to the kernel as C strings, they too can't contain zero bytes. Neither can environment variables, which are passed between programs (through the kernel) as another array of C strings.
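The same refusal shows up for filenames and environment variables. This is a small sketch, again assuming CPython on a Unix-like system; rejects_nul is a made-up helper, and the file and variable names are placeholders.

```python
import os


def rejects_nul(action) -> bool:
    """Run `action` and report whether it was refused with a ValueError.

    rejects_nul is a hypothetical helper written for this example.
    """
    try:
        action()
    except ValueError:
        return True
    return False


# A perfectly ordinary path is passed through to the kernel.
print(rejects_nul(lambda: os.stat(".")))

# A path with an embedded zero byte is refused before any system call is made.
print(rejects_nul(lambda: os.stat("some\0name")))

# So is an environment variable value with an embedded zero byte.
print(rejects_nul(lambda: os.putenv("DEMO_VAR", "some\0value")))
```

In both failing cases Python raises the error itself, because there is no way to hand the kernel a C string that contains the byte used to mark its end.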

As a corollary, certain character set encodings really don't work as locales on Unix because they run into this. Any character set encoding that can generate zero bytes as part of its characters is going to have serious problems with filenames and command line arguments; one obvious example of such a character set is UTF-16. I believe the usual way for Unixes to deal with a filesystem that's natively UCS-2 or UTF-16 is to encode and decode to UTF-8 somewhere in the kernel or the filesystem driver itself.
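A quick way to see the problem with UTF-16 is to encode an ordinary ASCII word and look at the bytes. This sketch uses Python's built-in codecs; the little-endian variant is picked only so that no byte order mark is added.

```python
word = "grep"

utf16 = word.encode("utf-16-le")
utf8 = word.encode("utf-8")

print(utf16)             # b'g\x00r\x00e\x00p\x00'
print(utf8)              # b'grep'

# Every ASCII character picks up a zero byte in UTF-16 ...
print(b"\x00" in utf16)  # True
# ... while UTF-8 only ever produces a zero byte for U+0000 itself.
print(b"\x00" in utf8)   # False
```

Treated as a C string, the UTF-16 version of this word would be cut off after the first byte, which is why an encoding like this can't be used directly for filenames or command line arguments.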

Written on 04 May 2018.
