If you use GNU Grep on text files, use the -a (--text) option

April 13, 2020

Today, I happened to notice that one of my email log scanning scripts wasn't reporting on a log entry that I knew was there (because another, related script was reporting it). My log scanning script starts out with a grep to filter out some things I don't want to include:

grep -hv 'a specific pattern' "$@" | exigrep '...' | [...]

I had all sorts of paranoid thoughts about whether I had misunderstood exactly what the -v option did, or if exigrep was doing something peculiar, and so on. But eventually I ran the grep itself alone on the file, piped to less, and jumped to the end in less because I happened to know that the missing entry was relatively late in the file. What I was expecting to happen is that the grep output would just stop at some point. What I actually found was simple:

2020-04-13 16:07:06 H=(111iu.com) [223.165.241.9] [...]
2020-04-13 16:07:07 unexpected disconnection [...]
Binary file /var/log/exim4/mainlog matches

Ah. Yes. How helpful. While reading along in what it had up until then thought was a text file, GNU Grep encountered some funny characters (in a DKIM signature information line, as it happened) and decided that the file was actually binary and so it wouldn't report anything more for the rest of the file than that final line.
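A small reproduction, as a sketch only: the sample file and its contents are made up, and an embedded NUL byte stands in for the odd bytes in the real log. With a NUL near the start of a small file, GNU Grep may suppress all matching lines rather than only the later ones. The wording of the "binary file" message, and whether it goes to standard output or standard error, also varies between GNU Grep versions.

```shell
#!/bin/sh
# Hypothetical sample file: three matching lines, one containing a NUL byte.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'alpha one\nalpha \000 two\nalpha three\n' > "$sample"

# Default mode: GNU Grep classifies the file as binary and holds back
# matching lines, reporting only that the binary file matches.
plain_count=$(grep alpha "$sample" 2>/dev/null | wc -l)

# With -a (--text), every matching line is printed as text.
text_count=$(grep -a alpha "$sample" | wc -l)

echo "without -a: $plain_count line(s) on stdout; with -a: $text_count line(s)"
```

Comparing the two counts is a quick way to check whether a given log file is being affected.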

(This is a different and much more straightforward cause than the time GNU Grep thought some text files were binary because of a filesystem bug combined with its clever tricks.)

I generally like the GNU versions of standard Unix utilities and the things that they've added, but this is not one of them, especially when GNU Grep's output is not going to a terminal. If it has started out printing text lines, it should continue to do so rather than surprise people this way.

The valuable learning experience here is that any time I'm processing a text file with GNU Grep (which is pretty much all of the time in my scripts), I should explicitly force it to always treat things as text. This is unfortunately going to make some scripts more awkward, because sometimes I have pipelines with several greps involved as text is filtered and manipulated. Either I spray '-a' over all of the greps, or I try to figure out what minimal LC_<something> environment variable will turn this off, or I reach for the gigantic hammer of 'LC_ALL=C' (as suggested by the GNU Grep manpage).
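Why every grep in a pipeline needs the option can be sketched as follows. The first 'grep -a' passes the odd bytes through unchanged, so the next grep sees them on its standard input and makes its own binary-or-text decision. The file contents and patterns here are hypothetical.

```shell
#!/bin/sh
# Hypothetical input: four lines, one of which contains a NUL byte.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'alpha one\nalpha \000 two\nalpha three\nbeta four\n' > "$sample"

# Both stages use -a, so the line with the NUL byte survives the
# first filter and is still treated as text by the second.
kept=$(grep -a alpha "$sample" | grep -a -v three | wc -l)

echo "lines kept by the two-stage filter: $kept"
```

Dropping '-a' from the second stage would bring back the original problem, just one step further down the pipeline.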

PS: This is not just a Linux issue because GNU Grep appears on more than just Linux machines, depending on what you install and what you add to your path. A FreeBSD machine I have access to uses GNU Grep as /usr/bin/grep, for example.


Comments on this page:

By Ben Hutchings at 2020-04-14 13:54:02:

I believe a 0 byte will always trigger the "binary file" detection, so there is no locale setting you can use to avoid this.
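This is easy to check with a sketch like the one below, which uses a made-up three-line file with one NUL byte in it. Under 'LC_ALL=C' there are no invalid multi-byte sequences to trip over, but the NUL byte is still there, and only '-a' gets all of the lines back.

```shell
#!/bin/sh
# Hypothetical sample file with a NUL byte in the middle line.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'alpha one\nalpha \000 two\nalpha three\n' > "$sample"

# C locale, no -a: GNU Grep still treats the file as binary.
c_locale_count=$(LC_ALL=C grep alpha "$sample" 2>/dev/null | wc -l)

# C locale plus -a: all three matching lines are printed.
c_locale_text_count=$(LC_ALL=C grep -a alpha "$sample" | wc -l)

echo "LC_ALL=C alone: $c_locale_count; LC_ALL=C with -a: $c_locale_text_count"
```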

By Todd at 2020-04-16 11:03:13:

...or use GREP_OPTIONS at the beginning of your script.

By John Wiersba at 2020-04-16 16:14:52:

$ GREP_OPTIONS="-d skip -Ea" /bin/grep ipsum *
/bin/grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

$ /bin/grep --version | head -1
/bin/grep (GNU grep) 2.25

@Todd I understand why the warning is there, but I wish there were an option to turn off the deprecation warning.

By nknight at 2020-04-21 13:50:42:

I think it is worth considering using LC_ALL=C in your scripts. POSIXish shell scripts are frankly an abomination, an unclean mixing of interactive and programmatic features with no clear separations. One of the things that bites you is that "functions" (in this case grep) can alter their behaviour seemingly at random based on your locale.

This isn't limited to misidentifying plaintext. Consider if you had a script that generated a sequence of dates as part of some output. If you run it in the US, maybe you get 04/20/2020. But if someone in Europe runs it, do they get 20/04/2020?

Forcing LC_ALL=C forces (more) consistent behaviour. Without it, it's really hard to know what results you might get on a given system.

By Polyna at 2020-04-22 12:55:49:

"...or use GREP_OPTIONS at the beginning of your script."

No. It’s deprecated, and for a good reason.

The correct way to always pass the same options to a command in a script is to define a function:

grep() { command grep -a "$@"; }
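A minimal sketch of that function in use, with a hypothetical sample file. Because 'command grep' bypasses the function and runs the real program, the wrapper does not call itself.

```shell
#!/bin/sh
# Wrapper from the comment above: every later 'grep' in this script gets -a.
grep() { command grep -a "$@"; }

# Hypothetical sample file with a NUL byte in the middle line.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'alpha one\nalpha \000 two\nalpha three\n' > "$sample"

# No explicit -a here; the function supplies it.
wrapped_count=$(grep alpha "$sample" | wc -l)

echo "lines printed through the wrapper: $wrapped_count"
```

The function only applies inside the shell that defines it. A grep started by another program, such as xargs or find -exec, still runs without '-a'.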

People who use grep may find ripgrep interesting (it's written in Rust).

ripgrep is a line-oriented search tool that recursively searches your current directory for a regex pattern.

By default, ripgrep will respect your .gitignore and automatically skip hidden files/directories and binary files.

ripgrep has first class support on Windows, macOS and Linux, with binary downloads available for every release.

ripgrep is similar to other popular search tools like The Silver Searcher, ack and grep.

https://github.com/BurntSushi/ripgrep

Written on 13 April 2020.

Last modified: Mon Apr 13 21:26:54 2020