Using grep to hunt around for null bytes in text files

May 3, 2018

Suppose, not entirely hypothetically, that you've developed a suspicion that some mailbox files have zero (null) bytes in them. Null bytes are not traditionally found in mail messages, or indeed in any text format, and their presence is not infrequently a sign that something has gone wrong (in email-related things, an obvious suspect is locking issues). So if you suspect null bytes might be lurking, you want to at least check for them and perhaps count them, and once you've found some null bytes you may want to know more about the context they occur in.

On most modern Unixes, searching for null bytes is most easily done with GNU grep's 'grep -P' option, which lets you supply a Perl-style regular expression that can include a direct byte value:

grep -l -P '\x00' ....

If your version of grep doesn't support -P (FreeBSD's doesn't), you'll have to investigate more elaborate approaches. The general problem is that a lot of things will interpret a literal null byte as a C end-of-string; you need to find a way to supply one to your grep that doesn't fall victim to this, and then hope your (e)grep does the right thing.

(Really, it might be easier to get and compile the latest GNU grep just for this.)
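If Python is available, another way around the end-of-string problem is to skip grep entirely and look at the raw bytes. What follows is only a minimal sketch of the equivalent of 'grep -l'; the names has_null_byte and files_with_nulls are made up for illustration, and the chunk size is an arbitrary choice.

```python
import sys


def has_null_byte(path, chunk_size=1 << 20):
    """Return True if the file at `path` contains at least one 0x00 byte."""
    with open(path, "rb") as f:
        # Read in chunks so a large mailbox doesn't have to fit in memory.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if b"\x00" in chunk:
                return True


def files_with_nulls(paths):
    """Roughly 'grep -l': return the paths that contain a null byte."""
    return [p for p in paths if has_null_byte(p)]


if __name__ == "__main__":
    # Hypothetical usage: python3 findnulls.py mailbox1 mailbox2 ...
    for name in files_with_nulls(sys.argv[1:]):
        print(name)
```

Because the file is opened in binary mode, nothing here ever treats the null byte as a string terminator.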

Once you've found files with null bytes, you might want to do things like count how many null bytes you have and in how many different places. As I found out recently, modern versions of awk are perfectly happy about null bytes in their input, which makes life reasonably easy when combined with 'grep -o':

grep -ao -P '\x00+' FILE |
 awk '{cnt += 1; tlen += length($0)}
      END {print cnt, tlen}'

(More elaborate analysis is up to you. I looked at shortest, longest, and average size.)
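For what it's worth, the same sort of summary can be sketched in Python, assuming the file is small enough to read into memory. The function name null_run_stats and the sample data are invented for the example; this is not the analysis that was actually run.

```python
import re


def null_run_stats(data):
    """Summarize runs of consecutive null bytes in a bytes object.

    Returns (number of runs, total null bytes, shortest, longest, average),
    or None if there are no null bytes at all.
    """
    lengths = [len(m.group()) for m in re.finditer(rb"\x00+", data)]
    if not lengths:
        return None
    total = sum(lengths)
    return (len(lengths), total, min(lengths), max(lengths), total / len(lengths))


# Made-up sample data: one run of two nulls and one run of four.
sample = b"abc\x00\x00def\nghi\x00\x00\x00\x00jkl\n"
print(null_run_stats(sample))  # prints (2, 6, 2, 4, 3.0)
```

The first two numbers correspond to the cnt and tlen values that the awk version prints.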

To really dig into what's going on, you need to see the context that these null bytes occur in. In an ideal world, less would let you search for null bytes, so you could just do 'less FILE', find the first null byte, and go look around as usual. Unfortunately less has no such feature as far as I know, and neither does any other pager. I will save you the effort and say that the easiest way to do this is to use grep, telling it to provide some context:

grep -a -n -C 1 -P '\x00' FILE | less

It's worth breaking down why we're doing all of this. First, we're feeding grep's output to less because less will actually show us the null byte or bytes, instead of silently not printing them. Grep's -P argument we're already familiar with. -a forces grep to treat the file as text despite the null bytes, instead of just reporting that a binary file matches. -C N is how many lines of context we want before and after the line with null bytes. Finally, -n prints the line numbers involved. We want the line numbers because with them, we can do 'less FILE' and then jump to the spot with the null bytes using the line number.
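To make it concrete what -n and -C 1 are selecting, here is a rough Python sketch of the same idea. It assumes the data fits in memory and that lines end in a plain newline; null_context is a made-up name, and the sample data is invented.

```python
def null_context(data, context=1):
    """Yield (line_number, line) for lines within `context` lines of a null byte.

    Line numbers start at 1, like 'grep -n'. Lines are split on newline
    bytes only, and the newline itself is not included in the line.
    """
    lines = data.split(b"\n")
    # split() leaves a trailing empty element if the data ends in a newline.
    if lines and lines[-1] == b"":
        lines.pop()
    wanted = set()
    for i, line in enumerate(lines):
        if b"\x00" in line:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            wanted.update(range(lo, hi))
    for i in sorted(wanted):
        yield i + 1, lines[i]


sample = b"one\ntwo\nthr\x00e\nfour\nfive\n"
for lineno, line in null_context(sample):
    # Printing the bytes object shows the null byte as \x00.
    print(lineno, line)
```

On the sample this reports lines 2 through 4, with the null byte visible in line 3.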

When looking at the output in less here, remember that null bytes will be displayed as two characters (as '^@') even though each one is only a single byte. Where this came up for me was that one of our null bytes was in some base64 lines, and initially I was going to say that the null byte was clearly an addition because the line containing it was a character longer than the other base64 lines around it. Then I realized that this was just the display expansion, and thus that our null byte had replaced a base64 character instead of being inserted between two.
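The expansion effect is easy to demonstrate. The sketch below imitates caret notation in a simplified way (it is not how less is actually implemented, and it also expands tabs and newlines); caret_notation is a made-up name and the base64 line is invented.

```python
def caret_notation(line):
    """Render bytes roughly the way caret notation shows control characters.

    Simplified: bytes 0x00-0x1f become '^@', '^A', and so on, 0x7f becomes
    '^?', and everything else is passed through as a single character.
    """
    out = []
    for b in line:
        if b < 0x20:
            out.append("^" + chr(b + 0x40))
        elif b == 0x7F:
            out.append("^?")
        else:
            out.append(chr(b))
    return "".join(out)


good = b"QUJDREVGR0g="       # a made-up base64 line, 12 bytes long
bad = b"QUJD\x00EVGR0g="     # the same line with one character replaced by a null

print(len(good), len(bad))          # both are 12 bytes long
print(caret_notation(bad))          # displays as QUJD^@EVGR0g=
print(len(caret_notation(bad)))     # 13 characters on screen
```

The damaged line is the same length in bytes as its neighbours and only looks one character longer once the null is displayed as '^@'.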

(Unfortunately all of this looking brought us not much closer to having some idea of why null bytes are showing up in people's mailboxes, although there are indications that some of them may have been inserted by the original sender and then passed intact through Exim. What I haven't done yet and should do is actually test how various elements of our overall mail system behave when fed SMTP messages with null bytes, although that would only tell me the system's current behavior, not what it used to do several or many years ago, which is when some of the nulls apparently date from.)

PS: I've tried loading the mailbox file into an editor to use it to search for nulls. For a sufficiently large mailbox, the result didn't go all that well, and I had to worry about inadvertently modifying the file. Perhaps there is an 'editor' that is efficient for this, but if so I don't have it lying around, while grep is right there.

(I believe I got the grep -P '\x00' trick from Stack Overflow, where it's shown up in a number of answers.)


Comments on this page:

By Ewen McNeill at 2018-05-03 02:21:30:

A couple of other possibilities, for the record, which might be pre-installed:

  1. cat -v to show control characters as two character sequences might then allow searching for them with regular tools (eg, search for "\^@)", which does seem to work in eg, cat -v ... | less).

  2. perl could probably be used to get a more advanced regex language, and it's installed by default on most systems. Eg, "perl -ne 'if (/\000/) { print; }' ..." (which would need a bit more tweaking if you wanted more context than the literal line).

Both of those appear to work on OS X (FreeBSD-derived), and GNU tools on Linux. (In the past I've also done "hexdump -C ..." and looked for " 00 " in the hex output as well, although that's not especially efficient.)

Ewen

PS: FWIW, in the past when I've found NULs in .mbox files, they've ended up there as incompletely fsync() ed writes, but those are usually obvious as being a full sector worth. It sounds like maybe you're getting individual bytes, so I'm curious if you do figure out the actual source of them.

PPS: Your blog formatting seems to match the first pair of the three closing parentheses, rather than the last pair of the three closing parentheses, unless I put a space (or presumably some other character) in.

By Ewen McNeill at 2018-05-03 02:23:43:

Bother, formatting typo. That first example should have been:

search for "\^@"

but ended up with one too many parenthesis in the markup....

Ewen

By David C at 2018-05-03 10:26:24:

I've found the "od" tool useful in this sort of situation - I haven't used it specifically for nulls, but I used it to hunt down character 160s that were being inserted by someone's "smart" editor instead of regular 32/space.

By skeeto at 2018-05-03 11:39:10:

My personal email archive (Maildir format) dates back to 2005, and I keep everything personally addressed to me, as well as every piece of spam since 2013 (for filter training and for curiosity). I was surprised to find not a single null byte in my entire archive, including the spam. I searched for other unusual bytes and discovered that 0xff appears only in spam, and in 1.6% of all my spam. That single byte might be a decent spam indicator.

In Python, it would be only a couple of lines. And showing context is also easy (example in IPython):

In [1]: with open("Mail/saved-messages", "rb") as bin:
   ...:     data = bin.read()
   ...:

In [2]: print(data.count(0))
2

In [3]: data.index(0)
Out[3]: 3746545

In [4]: data[3746540:3746550]
Out[4]: b'49C01\x00\x00eR9'

You can of course make this as fancy as you like. If you have more zeros, you'd have to keep using index, changing the start parameter until it raises ValueError.
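A sketch of that loop, under the same assumption that the whole file has been read into a bytes object. The function name null_offsets is made up for illustration, and the sample is just the ten bytes of context from the session.

```python
def null_offsets(data):
    """Return the offsets of every null byte in a bytes object."""
    offsets = []
    start = 0
    while True:
        try:
            # bytes.index() accepts an integer byte value and a start offset.
            pos = data.index(0, start)
        except ValueError:
            # No more null bytes at or after 'start'.
            return offsets
        offsets.append(pos)
        start = pos + 1


sample = b"49C01\x00\x00eR9"
print(null_offsets(sample))  # prints [5, 6]
```

Each offset can then be sliced around in the same way as In [4] to see the surrounding bytes.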

By Christopher Barts at 2018-05-03 18:48:06:

If your terminal supports it (xterm does for sure) you can type it directly with a little knowledge of ASCII: ^V is a quote character, so type that and then the control character you want. For 0x00, that's ^@ (which has been mentioned) or Ctrl-Space or ^` (Ctrl-backtick... the one at the far left of the row of numbers on a standard QWERTY keyboard); that works well enough under Linux, I believe, but perhaps not on Windows.

ESR has a section explaining why those specific characters all make NUL in this context on his document "Things Every Hacker Once Knew".

In GNU Emacs (probably other Emacsen, if anyone even still uses other Emacsen at this point... I'm old) the quote character is ^Q.

By cks at 2018-05-03 20:15:11:

While I can type a zero byte to less with this method (either with or without quoting), less completely ignores it. It's actually actively doing so; / then ^V can be used to search for other special characters and will show them as you enter them, but if you do ^V then enter ^@, nothing shows (and it doesn't search for nulls).

(Less is handling input character by character in raw mode, so its handling of ^V is simply a convention that it implements itself.)

Written on 03 May 2018.

Last modified: Thu May 3 00:38:58 2018