Using grep to hunt around for null bytes in text files

May 3, 2018

Suppose, not entirely hypothetically, that you've developed a suspicion that some mailbox files have zero (null) bytes in them. Null bytes are not traditionally found in mail messages, or indeed in any text format, and their presence is not infrequently a sign that something has gone wrong (in email related things, an obvious suspect is locking issues). So if you suspect null bytes might be lurking you want to at least check for them and perhaps count them, and once you've found some null bytes you may want to know more about the context they occur in.

On most modern Unixes, searching for null bytes is most easily done with GNU grep's 'grep -P' option, which lets you supply a Perl-style regular expression that can include a direct byte value:

grep -l -P '\x00' ....

If your version of grep doesn't support -P (FreeBSD's doesn't), you'll have to investigate more elaborate approaches. The general problem is that a lot of things will interpret a literal null byte as a C end-of-string; you need to find a way to supply one to your grep that doesn't fall victim to this, and then hope your (e)grep does the right thing.

(Really, it might be easier to get and compile the latest GNU grep just for this.)

Once you've found files with null bytes, you might want to do things like count how many null bytes you have and in how many different places. As I found out recently, modern versions of awk are perfectly happy about null bytes in their input, which makes life reasonably easy when combined with 'grep -o':

grep -ao -P '\x00+' FILE |
 awk '{cnt += 1; tlen += length($0)}
      END {print cnt, tlen}'

(More elaborate analysis is up to you. I looked at shortest, longest, and average size.)

To really dig into what's going on, you need to see the context that these null bytes occur in. In an ideal world, less would let you search for null bytes, so you could just do 'less FILE', find the first null byte, and go look around as usual. Unfortunately less has no such feature as far as I know, and neither does any other pager. I will save you the effort and say that the easiest way to do this is to use grep, telling it to provide some context:

grep -a -n -C 1 -P '\x00' FILE | less

It's worth breaking down why we're doing all of this. First, we're feeding the grep's output to less because less will actually show us the null byte or bytes, instead of silently not printing it. Grep's -P argument we're already familiar with. -a forces grep to consider the file printable, despite the null bytes. -C N is how many lines of context we want before and after the line with null bytes. Finally, -n prints the line numbers involved. We want the line numbers because with them, we can do 'less FILE' and then jump to the spot with the null bytes using the line number.

When looking at the output from less here, remember that null bytes will be printed as two characters (as '^@') even though they're only a single character. Where this came up for me was that one of our null bytes was in some base64 lines, and initially I was going to say that the null byte was clearly an addition to it because the line with it in was a character longer than the other base64 lines around it. Then I realized this expansion, and thus that our null byte had replaced a base64 character instead of being inserted between two.

(Unfortunately all of this looking brought us not much closer to having some idea of why null bytes are showing up in people's mailboxes, although there's indications that some of them may have been inserted by the original sender and then passed intact through Exim. What I haven't done yet and should do is actually test how various elements of our overall mail system behave when fed SMTP messages with null bytes, although that would only tell me the system's current behavior, not what it used to do several or many years ago, which is when some of the nulls apparently date from.)

PS: I've tried loading the mailbox file into an editor to use it to search for nulls. For a sufficiently large mailbox, the result didn't go all that well, and I had to worry about inadvertently modifying the file. Perhaps there is an 'editor' that is efficient for this, but if so I don't have it lying around, while grep is right there.

(I believe I got the grep -P '\x00' trick from Stackoverflow, where it's shown up in a number of answers.)

Written on 03 May 2018.
« An interaction of low ZFS recordsize, compression, and advanced format disks
Why you can't put zero bytes in Unix command line arguments »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu May 3 00:38:58 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.