2020-04-13
If you use GNU Grep on text files, use the -a (--text) option
Today, I happened to notice that one of my email log scanning scripts
wasn't reporting on a log entry that I knew was there (because another,
related script was reporting it). My log scanning script starts out with
a grep to filter out some things I don't want to include:
grep -hv 'a specific pattern' "$@" | exigrep '...' | [...]
I had all sorts of paranoid thoughts about whether I had misunderstood
exactly what the -v option did, or if exigrep was doing something
peculiar, and so on. But eventually I ran the grep itself alone on the
file, piped to less, and jumped to the end in less because I happened
to know that the missing entry was relatively late in the file. What I
was expecting to happen is that the grep output would just stop at some
point. What I actually found was simple:
2020-04-13 16:07:06 H=(111iu.com) [223.165.241.9] [...]
2020-04-13 16:07:07 unexpected disconnection [...]
Binary file /var/log/exim4/mainlog matches
Ah. Yes. How helpful. While reading through what it had until then treated as a text file, GNU Grep encountered some funny characters (in a line of DKIM signature information, as it happened) and decided that the file was actually binary. From that point on it reported nothing more for the rest of the file except that final line.
(This is a different and much more straightforward cause than the time GNU Grep thought some text files were binary because of a filesystem bug combined with its clever tricks.)
I generally like the GNU versions of standard Unix utilities and the things that they've added, but this is not one of them, especially when GNU Grep's output is not going to a terminal. If it starts out printing text lines, it should continue to do so rather than surprise people this way.
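A minimal sketch of the difference that -a makes, assuming a POSIX shell and a made-up four-line sample file. The NUL byte is only a stand-in for whatever "funny characters" were in the real log, and what GNU Grep prints without -a varies with its version and with where the odd bytes fall in the file, so treat this as an illustration rather than an exact reproduction.

```shell
# Made-up sample: three plain text lines plus one line containing a NUL byte
# (an assumption; the bytes in the real log are not known).
sample=$(mktemp)
printf 'line one\nline two\nsig=\000junk\nline four\n' > "$sample"

# Without -a, GNU Grep may decide the file is binary and hold back matching
# lines, reporting only that a binary file matches. Exactly what is printed
# depends on the Grep version and on where the odd bytes fall.
grep -v 'nomatch' "$sample"

# With -a, every line is treated as text and passed through.
text_lines=$(grep -av 'nomatch' "$sample" | wc -l)
echo "lines with -a: $text_lines"

rm -f "$sample"
```

With -a, all four lines of the sample come through, including the one with the NUL byte in it.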
The valuable learning experience here is that any time I'm processing
a text file with GNU Grep (which is pretty much all of the time in
my scripts), I should explicitly force it to always treat things as
text. This is unfortunately going to make some scripts more awkward,
because sometimes I have pipelines with several greps involved as text
is filtered and manipulated. Either I spray '-a' over all of the greps
or I try to figure out what minimal LC_<something> environment
variable will turn this off, or I reach for the gigantic hammer of
'LC_ALL=C' (as suggested by the GNU Grep manpage).
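One way to make the "spray -a everywhere" option less awkward might be a small wrapper function, sketched below. The name tgrep, the sample data, and the patterns are all hypothetical; this is not taken from the actual scripts. It only assumes a grep that accepts -a, which GNU Grep does.

```shell
# Hypothetical wrapper: always ask grep to treat its input as text.
tgrep() {
    grep -a "$@"
}

# Made-up input with a NUL byte in the third line (an assumption, standing
# in for whatever odd bytes a real log might contain).
sample=$(mktemp)
printf 'keep alpha\ndrop beta\nkeep \000gamma\nkeep delta\n' > "$sample"

# A two-stage pipeline; both stages go through the wrapper, so neither one
# can switch into binary-file mode partway through.
kept=$(tgrep -v 'drop' "$sample" | tgrep -c 'keep')
echo "kept lines: $kept"

rm -f "$sample"
```

The first stage drops one line and the second counts the three that remain. The function keeps the flag in one place, at the cost of having to remember to call tgrep instead of grep.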
PS: This is not just a Linux issue because GNU Grep appears on more
than just Linux machines, depending on what you install and what you
add to your path. A FreeBSD machine I have access to uses GNU Grep
as /usr/bin/grep, for example.
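A rough way to check which grep a script will actually get is to look at the first line of its version output. This is a sketch, assuming the grep on your path either accepts --version or fails on it; the variable name grep_flavor is made up for illustration.

```shell
# Ask the first grep on PATH to identify itself; implementations that do
# not understand --version just leave first_line empty.
first_line=$(grep --version 2>/dev/null | head -n 1)

case "$first_line" in
    *GNU*) grep_flavor="gnu" ;;
    *)     grep_flavor="other" ;;
esac

echo "grep flavor: $grep_flavor"
```

Anything reported as "gnu" here is a candidate for the binary file surprise, whatever operating system it is running on.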