A surprising reason grep may think a file is a binary file

April 21, 2017

Recently, 'fgrep THING FILE' for me has started to periodically report 'Binary file FILE matches' for files that are not in fact binary files. At first I thought a stray binary character might have snuck into one file this was happening to, because it's a log file that accumulates data partly from the Internet, but then it happened to a file that is only hand-edited and that definitely shouldn't contain any binary data. I spent a chunk of time tonight trying to find the binary characters or mis-encoded UTF-8 or whatever it might be in the file, before I did the system programmer thing and just fetched the Fedora debuginfo package for GNU grep so that I could read the source and set breakpoints.

(I was encouraged into this course of action by this Stackexchange question and answers, which quoted some of grep's source code and in the process gave me a starting point.)

As this answer notes, there are two cases where grep thinks your file is binary: if there's an encoding error detected, or if it detects some NUL bytes. Both of these sound at least conceptually simple, but it turns out that grep tries to be clever about detecting NULs. Not only does it scan the buffers that it reads for NULs, but it also attempts to see if it can determine that a file must have NULs in the remaining data, in a function helpfully called file_must_have_nulls.
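To make the two cases concrete, here is a minimal sketch of that sort of check. It is not grep's actual code. The function name looks_binary is hypothetical, and it assumes the data is supposed to be UTF-8, whereas grep decides what counts as an encoding error based on your locale.

```python
def looks_binary(buf: bytes, encoding: str = "utf-8") -> bool:
    """Return True if buf has a NUL byte or is not valid in `encoding`.

    Hypothetical helper for illustration only; it assumes buf is the
    whole content, so a multi-byte character is never split at the end.
    """
    # Case 1: any NUL byte at all marks the data as binary.
    if b"\0" in buf:
        return True
    # Case 2: bytes that cannot be decoded are an encoding error.
    try:
        buf.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False
```

Something like looks_binary(open("FILE", "rb").read()) would tell you whether the file's actual contents trip either check, which is what I was effectively doing by hand before I gave up and went to the source.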

You might wonder how grep or anything can tell if a file has NULs in the remaining data. Let me answer that with a comment from the source code:

/* If the file has holes, it must contain a null byte somewhere. */

Reasonably modern versions of Linux (since kernel 3.1) have some special additional lseek() options, per the manpage. One of them is SEEK_HOLE, which seeks to the next 'hole' at or after the given offset. Holes are regions of the file that were never written, and Unix mandates that they read as NUL bytes, so if a file has holes, it's got NULs and so grep will call it a binary file.
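As a rough illustration of the idea (not a transcription of grep's file_must_have_nulls), here is a sketch of asking the kernel whether a file has a hole before its end. The function names are made up for this example, and it assumes a platform where Python exposes os.SEEK_HOLE. It relies on the lseek() manpage's statement that there is an implicit hole at the end of every file, so a file without real holes reports its own size.

```python
import os


def first_hole_offset(path):
    """Return the offset of the first hole at or after offset 0.

    Returns None if SEEK_HOLE is unavailable or the lseek() fails
    (for example ENXIO on an empty file). Hypothetical helper.
    """
    seek_hole = getattr(os, "SEEK_HOLE", None)
    if seek_hole is None:
        return None
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, seek_hole)
    except OSError:
        return None
    finally:
        os.close(fd)


def has_hole_before_eof(path):
    """True if the kernel reports a hole before the end of the file."""
    hole = first_hole_offset(path)
    if hole is None:
        return False
    # With no real holes, SEEK_HOLE lands on the implicit hole at EOF,
    # so the returned offset equals the file size.
    return hole < os.stat(path).st_size
```

On a filesystem that answers SEEK_HOLE correctly, has_hole_before_eof() should be False for any ordinary fully written text file.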

SEEK_HOLE is not implemented on all filesystems. More to the point, the implementation of SEEK_HOLE may not be error-free on all filesystems all of the time. In my particular case, the files which are being unexpectedly reported as binary are on ZFS on Linux, and it appears that under some mysterious circumstances the latest development version of ZoL can report that there are holes in a file when there aren't. It appears that there is a timing issue, but strace gave me a clear smoking gun and I managed to reproduce it in a simple test program that gives me a clear trace:

open("testfile", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=33005, ...}) = 0
read(3, "aaaaaaaaa"..., 32768) = 32768
lseek(3, 32768, SEEK_HOLE)              = 32768

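The file is 33005 bytes long and doesn't have any holes, so the lseek() should have returned 33005 (the implicit hole at the end of the file). Instead, it's sometimes being reported as having a hole at the exact current offset (and yes, the read() is apparently important to reproduce the issue).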
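For anyone who wants to poke at their own filesystem, here is a sketch of a program that makes the same sequence of system calls as the trace. This is not the test program that produced the trace; it's a hypothetical Python equivalent that assumes Linux, os.SEEK_HOLE, and a test file that is larger than the read size.

```python
import os


def seek_hole_after_read(path, nbytes=32768):
    """Mirror the trace: open, fstat, read nbytes, lseek(SEEK_HOLE).

    Returns (file size, bytes read, offset reported by SEEK_HOLE).
    Assumes the file is larger than nbytes; seeking at or past EOF
    would fail with ENXIO instead. Hypothetical helper.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        got = len(os.read(fd, nbytes))
        # Ask for the next hole at or after the current read offset.
        hole = os.lseek(fd, got, os.SEEK_HOLE)
        return size, got, hole
    finally:
        os.close(fd)
```

For a 33005-byte file with no holes, a correct answer is a hole offset equal to the size. The misbehaviour in the trace would show up as a hole offset equal to the number of bytes read. Since the problem appears to be timing dependent, a single clean run doesn't prove much.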

(Interested parties can see more weirdness in the ZFS on Linux issue.)
