Why you can't put zero bytes in Unix command line arguments
One sensible reaction to all of the rigmarole with 'grep -P' I went through in yesterday's entry in order to search for a zero byte (a null byte) is to ask why I didn't just use a zero byte in the command line argument:
fgrep -e ^@ -l ...
(Using the usual notation for a zero byte.)
You can usually type a zero byte directly at the terminal, along with a number of other unusual control characters (see my writeup of this here), and failing that you could write a shell script in an editor and insert the null byte there. Ignoring character set encoding issues for the moment, this works for any other byte, but if you try it you'll discover that it doesn't work for the zero byte. If you're lucky, your shell will give you an error message about it; if you're not, various weird things will happen. This is because the zero byte can't ever be put into command line arguments in Unix.
The reason is ultimately simple. This limitation exists because the Unix
API is fundamentally a C API (whether or not the C library and
runtime are part of the Unix API), and in C,
strings are terminated by a zero byte. When Unix programs such as
the shell pass command line arguments to the kernel as part of the
exec*() family of system calls, they do so as an array of
null-terminated C strings; if you try to put a null byte in there
as data, it will just terminate that command line argument early
(possibly reducing it to a zero-length argument, which is legal but
unusual). When Unix programs start they receive their command line
arguments as an array of C strings (in C, the argv argument to
main()), and again a null byte passed in as data would be seen
as terminating that argument early.
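To make the truncation concrete, here is a small Python sketch. The helper names c_string_view and spawn_refuses_nul are made up for this illustration. The first uses ctypes to show what C code sees when it is handed a buffer as a null-terminated string; the second relies on CPython's behavior of refusing to start a process whose argument contains a zero byte (it raises ValueError rather than silently truncating). Other language runtimes may behave differently at this point.

```python
import ctypes
import subprocess
import sys


def c_string_view(data: bytes) -> bytes:
    """Return what C code would see if handed `data` as a char * string."""
    # c_char_p.value reads up to, but not including, the first zero byte.
    return ctypes.c_char_p(data).value


def spawn_refuses_nul(arg: str = "foo\0bar") -> bool:
    """Return True if Python refuses to pass `arg` as a command line argument."""
    try:
        # The child would do nothing; we only care whether it can be started.
        subprocess.run([sys.executable, "-c", "pass", arg], check=True)
    except ValueError:
        # CPython rejects the embedded zero byte before calling exec.
        return True
    return False


if __name__ == "__main__":
    print(c_string_view(b"foo\0bar"))  # the argument is cut off at the zero byte
    print(c_string_view(b"\0foo"))     # a leading zero byte leaves an empty string
    print(spawn_refuses_nul())
```

Here c_string_view(b"foo\0bar") gives b"foo", and a leading zero byte gives the zero-length argument mentioned above.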
This is true whether or not your shell and the program you're trying to run are written in C. They can both be written in modern languages that are happy to have zero bytes in strings, but the command line arguments moving between them are being squeezed through an API that requires null-terminated strings. The only way around this would be a completely new set of APIs on both sides, and that's extremely unlikely at this point.
Because filenames are also passed to the kernel as C strings, they too can't contain zero bytes. Neither can environment variables, which are passed between programs (through the kernel) as another array of C strings.
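The same restriction is easy to observe from a language whose own strings can hold zero bytes. The following is a rough sketch in Python; the function names and the DEMO_NUL_VAR variable name are invented for the example, and it assumes CPython, which raises ValueError for an embedded zero byte before the request ever reaches the kernel.

```python
import os


def rejects_nul_filename(name: str = "foo\0bar") -> bool:
    """Return True if opening a filename with a zero byte is refused outright."""
    try:
        with open(name):
            pass
    except ValueError:
        # Refused by the runtime: the name can't be expressed as a C string.
        return True
    except OSError:
        # The name reached the kernel and failed for some other reason.
        return False
    return False


def rejects_nul_env_value(key: str = "DEMO_NUL_VAR", value: str = "foo\0bar") -> bool:
    """Return True if setting an environment variable with a zero byte is refused."""
    try:
        os.environ[key] = value
    except ValueError:
        return True
    # Clean up if the assignment unexpectedly succeeded.
    os.environ.pop(key, None)
    return False


if __name__ == "__main__":
    print(rejects_nul_filename())
    print(rejects_nul_env_value())
```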
As a corollary, certain character set encodings really don't work as locales on Unix because they run into this. Any character set encoding that can generate zero bytes as part of its characters is going to have serious problems with filenames and command line arguments; one obvious example of such a character set is UTF-16. I believe the usual way for Unixes to deal with a filesystem that's natively UCS-2 or UTF-16 is to encode and decode to UTF-8 somewhere in the kernel or the filesystem driver itself.
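A quick way to see the encoding problem is to look at the bytes involved. This Python sketch uses a made-up helper, has_zero_byte, to check whether a given encoding of some text contains a zero byte; the sample text is arbitrary.

```python
def has_zero_byte(text: str, encoding: str) -> bool:
    """Return True if `text` encoded with `encoding` contains a zero byte."""
    return 0 in text.encode(encoding)


if __name__ == "__main__":
    sample = "grep"
    print(sample.encode("utf-16-le"))          # every ASCII character gains a zero byte
    print(has_zero_byte(sample, "utf-16-le"))  # True
    print(has_zero_byte(sample, "utf-8"))      # False
```

In UTF-16, every ASCII character is encoded with a zero byte as one half of its two-byte unit, so even a plain ASCII filename would be cut off after its first character when treated as a C string. UTF-8 only produces a zero byte for the NUL character itself.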