2018-05-03
Using grep to hunt around for null bytes in text files
Suppose, not entirely hypothetically, that you've developed a suspicion that some mailbox files have zero (null) bytes in them. Null bytes are not traditionally found in mail messages, or indeed in any text format, and their presence is not infrequently a sign that something has gone wrong (in email-related things, an obvious suspect is locking issues). So if you suspect null bytes might be lurking, you want to at least check for them and perhaps count them, and once you've found some null bytes you may want to know more about the context they occur in.
On most modern Unixes, searching for null bytes is most easily
done with GNU grep's 'grep -P' option, which lets you supply
a Perl-style regular expression that can include a direct byte
value:
grep -l -P '\x00' ....
If your version of grep doesn't support -P (FreeBSD's doesn't),
you'll have to investigate more elaborate approaches. The general
problem is that a lot of things will interpret a literal null byte
as a C end-of-string; you need to find a way to supply one to your
grep that doesn't fall victim to this, and then hope your (e)grep
does the right thing.
(Really, it might be easier to get and compile the latest GNU grep just for this.)
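If all you need is to detect and count null bytes, one possible fallback that avoids grep entirely is tr plus wc. This is only a sketch; the function names count_nulls and files_with_nulls are made up for illustration, and it assumes a tr that accepts octal escapes such as '\000' (POSIX specifies this).

```shell
# Count the null bytes in a file without relying on grep -P.
# tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
# The trailing tr strips the padding some wc implementations add.
count_nulls() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Print the names of the given files that contain at least one null byte
# (a rough stand-in for 'grep -l').
files_with_nulls() {
    for f in "$@"; do
        if [ "$(count_nulls "$f")" -gt 0 ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

This reads each file in full and tells you nothing about where the null bytes are, so it only replaces the detection and counting steps.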
Once you've found files with null bytes, you might want to do things
like count how many null bytes you have and in how many different
places. As I found out recently, modern versions of awk are perfectly
happy about null bytes in their input, which makes life reasonably
easy when combined with 'grep -o':
grep -ao -P '\x00+' FILE | awk '{cnt += 1; tlen += length($0)} END {print cnt, tlen}'
(More elaborate analysis is up to you. I looked at shortest, longest, and average size.)
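As a rough sketch of that sort of analysis, the following reports the number of runs of null bytes along with the shortest, longest, and average run length. The function name null_run_stats is hypothetical. The LC_ALL=C setting and the tr step are additions that aren't in the one-liner above; they're there so that grep works on raw bytes and so that awk only ever sees printable characters.

```shell
# Summarize runs of null bytes in a file.
# Output: number of runs, shortest run, longest run, average run length.
null_run_stats() {
    LC_ALL=C grep -a -o -P '\x00+' "$1" |
        tr '\000' 'x' |
        awk '{ n = length($0); tlen += n
               if (NR == 1 || n < min) min = n
               if (n > max) max = n }
             END { if (NR > 0) printf "%d %d %d %.2f\n", NR, min, max, tlen / NR
                   else print "0 0 0 0" }'
}
```

Each line that 'grep -o' prints is one run of nulls, so after tr turns every null into an 'x' the length of the line is the length of the run.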
To really dig into what's going on, you need to see the context
that these null bytes occur in. In an ideal world, less would let
you search for null bytes, so you could just do 'less FILE', find
the first null byte, and go look around as usual. Unfortunately
less has no such feature as far as I know, and neither does any
other pager. I will save you the effort and say that the easiest
way to do this is to use grep, telling it to provide some context:
grep -a -n -C 1 -P '\x00' FILE | less
It's worth breaking down why we're doing all of this. First, we're
feeding grep's output to less because less will actually show us
the null byte or bytes, instead of silently not printing them.
Grep's -P argument we're already familiar with. -a forces grep to
treat the file as text despite the null bytes, instead of merely
reporting that a binary file matches. -C N is how many lines of
context we want before and after the line with null bytes. Finally,
-n prints the line numbers involved. We want the line numbers
because with them, we can do 'less FILE' and then jump to the spot
with the null bytes using the line number.
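The line number lookup and the jump can be glued together. This is a minimal sketch, assuming GNU grep and a less that accepts '+N' as an initial "go to line N" command (the less manual page documents this as equivalent to '+Ng'). The names first_null_line and view_first_null are invented for this example.

```shell
# Print the line number of the first line that contains a null byte,
# or nothing at all if there are no null bytes in the file.
first_null_line() {
    LC_ALL=C grep -a -n -P '\x00' "$1" | cut -d: -f1 | head -n 1
}

# Open the file in less, positioned at the first line with a null byte.
# -N makes less show line numbers alongside the text.
view_first_null() {
    line=$(first_null_line "$1")
    if [ -n "$line" ]; then
        less -N "+$line" "$1"
    else
        echo "no null bytes found in $1" >&2
    fi
}
```

With this, 'view_first_null FILE' should drop you into less at the first affected line, and you can scroll around from there as usual.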
When looking at the output from less here, remember that null
bytes will be printed as two characters (as '^@') even though
they're only a single character. Where this came up for me was that
one of our null bytes was in some base64 lines, and initially I was
going to say that the null byte was clearly an addition to it because
the line with it in was a character longer than the other base64
lines around it. Then I realized this expansion, and thus that our
null byte had replaced a base64 character instead of being inserted
between two.
(Unfortunately all of this looking brought us not much closer to having some idea of why null bytes are showing up in people's mailboxes, although there are indications that some of them may have been inserted by the original sender and then passed intact through Exim. What I haven't done yet and should do is actually test how various elements of our overall mail system behave when fed SMTP messages with null bytes, although that would only tell me the system's current behavior, not what it used to do several or many years ago, which is when some of the nulls apparently date from.)
PS: I've tried loading the mailbox file into an editor to use it
to search for nulls. For a sufficiently large mailbox, the result
didn't go all that well, and I had to worry about inadvertently
modifying the file. Perhaps there is an 'editor' that is efficient
for this, but if so I don't have it lying around, while grep is
right there.
(I believe I got the grep -P '\x00' trick from Stack Overflow,
where it's shown up in a number of answers.)