A surprising reason grep may think a file is a binary file
Recently, 'fgrep THING FILE' has started to periodically
report 'Binary file FILE matches' for files that are not in fact
binary files. At first I thought a stray binary character might
have snuck into one file this was happening to, because it's a log
file that accumulates data partly from the Internet, but then it
happened to a file that is only hand-edited and that definitely
shouldn't contain any binary data. I spent a chunk of time tonight
trying to find the binary characters or mis-encoded UTF-8 or whatever
it might be in the file, before I did the system programmer thing and just fetched the
Fedora debuginfo package for GNU grep so that I could read the
source and set breakpoints.
(I was encouraged into this course of action by this Stackexchange question and answers, which quoted some of grep's source code and in the process gave me a starting point.)
As this answer notes,
there are two cases where grep thinks your file is binary: if there's
an encoding error detected, or if it detects some NUL bytes. Both
of these sound at least conceptually simple, but it turns out that
grep tries to be clever about detecting NULs. Not only does it scan
the buffers that it reads for NULs, but it also attempts to see if
it can determine that a file must have NULs in the remaining data,
in a function helpfully called file_must_have_nulls.
You might wonder how grep or anything can tell if a file has NULs
in the remaining data. Let me answer that with a comment from the
source code:
/* If the file has holes, it must contain a null byte somewhere. */
Reasonably modern versions of Linux (since kernel 3.1) have some
special additional lseek() options, per the manpage. One of them
is SEEK_HOLE, which seeks to the next 'hole' in the file at or
after the given offset.
Holes are unwritten data and Unix mandates that they read as NUL
bytes, so if a file has holes, it's got NULs and so grep will
call it a binary file.
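To make the hole check concrete, here is a minimal Python sketch of the general idea. It is not grep's actual code (grep is written in C and its real logic has more conditions); the function name has_hole is made up for illustration, and it assumes a platform where Python exposes os.SEEK_HOLE, such as Linux.

```python
import os

def has_hole(path):
    """Return True if SEEK_HOLE reports a hole before the end of the file.

    Hypothetical helper for illustration; this is not how grep itself
    is structured, only the same underlying idea.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            # An empty file has no data and no holes to find.
            return False
        try:
            # Ask for the offset of the first hole at or after offset 0.
            hole_start = os.lseek(fd, 0, os.SEEK_HOLE)
        except OSError:
            # Treat a failing lseek() as "no holes known".
            return False
        # Every file has an implicit hole at its end, so only a hole
        # that starts before the end of the file counts.
        return hole_start < size
    finally:
        os.close(fd)
```

On a filesystem that reports holes correctly, a file that was written densely from start to end should come back False here, because the only 'hole' found is the implicit one at the end of the file.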
SEEK_HOLE is not implemented on all filesystems. More to the
point, the implementation of SEEK_HOLE may not be error-free
on all filesystems all of the time. In my particular case, the files
which are being unexpectedly reported as binary are on ZFS on Linux, and it appears that under some mysterious
circumstances the latest development version of ZoL can report that
there are holes in a file when there aren't. It appears that there
is a timing issue, but strace gave me a clear smoking gun and I
managed to reproduce it in a simple test program that gives me a
clear trace:
open("testfile", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=33005, ...}) = 0
read(3, "aaaaaaaaa"..., 32768) = 32768
lseek(3, 32768, SEEK_HOLE) = 32768
The file doesn't have any holes, yet sometimes it's being reported
as having one at the exact current offset (and yes, the read()
is apparently important to reproduce the issue).
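For anyone who wants to try the same sequence of system calls, here is a rough Python sketch of what such a test could look like. This is my reconstruction from the trace, not the original test program; the names make_dense_file and probe_hole_after_read are invented for illustration, and it assumes os.SEEK_HOLE is available.

```python
import os

def make_dense_file(path, size=33005):
    """Write a file of 'a' bytes with no holes (size taken from the trace)."""
    with open(path, "wb") as f:
        f.write(b"a" * size)

def probe_hole_after_read(path, nread=32768):
    """Mirror the traced calls: open, fstat, read, then lseek(SEEK_HOLE).

    Returns (offset, hole_start, size). Assumes the file is larger than
    nread; seeking for a hole at or past the end of the file fails.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # The read before the lseek() appears to matter for reproducing
        # the problem, so keep it.
        offset = len(os.read(fd, nread))
        hole_start = os.lseek(fd, offset, os.SEEK_HOLE)
        return offset, hole_start, size
    finally:
        os.close(fd)
```

On a filesystem that behaves correctly, hole_start should equal size for a file made by make_dense_file(). In the bad case shown in the trace, hole_start instead comes back equal to offset. Since the problem seems to be timing dependent, a single run that looks fine doesn't prove much.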
(Interested parties can see more weirdness in the ZFS on Linux issue.)