How I wound up finding a bug in GNU Tar
Every so often something happens in my work that makes me think, even if I don't know what conclusions to really draw from it. I recently mentioned that we'd found a bug in GNU Tar, and the story of how that happened is one of those times.
We back up our fileservers
through Amanda and GNU Tar. For a long time, we've had a problem
where every so often, fortunately quite rarely, tar would freak out
while backing up the filesystem that held /var/mail, producing
huge amounts of output. Most of the time this would go on forever
and we'd have to kill the dump eventually; other times it would
eventually finish, having produced terabyte(s) of output that
fortunately seemed to compress very well. At one point we captured
such a giant tar file and I subjected it to some inspection, which
revealed that the runaway area was a giant sea of null bytes, which
'tar -t' didn't like, but after a while things returned to normal.
(This led to me wondering if null bytes were naturally occurring in people's inboxes. It turns out that hunting for null bytes in text files is not quite as easy as you'd like, and yes, people's inboxes have some.)
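One way to sidestep the awkwardness of text tools here is to count the bytes directly. This is only a minimal sketch, not what I actually ran; the function name `count_nul_bytes` and the throwaway sample file are made up for illustration.

```python
import os
import tempfile


def count_nul_bytes(path, chunk_size=1024 * 1024):
    """Return the number of NUL (0x00) bytes in the file at `path`."""
    total = 0
    # Open in binary mode so no text decoding gets in the way of the raw bytes.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            total += chunk.count(b"\0")
    return total


# Usage on a throwaway file standing in for a mailbox (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"From someone\n\0\0\0body text\n\0")
    sample_path = tmp.name

nul_count = count_nul_bytes(sample_path)
os.unlink(sample_path)
```

Reading in fixed-size chunks keeps memory use flat even on very large inboxes.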
We recently moved the filesystem with
/var/mail to our new Linux
fileservers, which are on Ubuntu 18.04
and so have a more recent and more mainline version of GNU Tar than
our OmniOS machines. We hoped that this would solve our GNU Tar
issues, but then we almost immediately had one of these runaway tar
incidents occur. This time around, with GNU Tar running on an Ubuntu
machine where I felt fully familiar with all of the debugging tools
available, I did some inspection of the running
tar process. This
inspection revealed that
tar was issuing an endless stream of
read()s that were all returning 0 bytes:
    read(6, "", 512) = 0
    read(6, "", 512) = 0
    [...]
    read(6, "", 512) = 0
    write(1, "\0\0\0\0\0"..., 10240) = 10240
    read(6, "", 512) = 0
    [...]
lsof said that file descriptor 6 was someone's mailbox.
With 'apt-get source tar', I fetched the source code to Ubuntu's
version of GNU Tar and went rummaging around through it for
system calls that didn't check for end of file. Once I decoded some
levels of indirection, there turned out to be one obvious place that
seemed to skip it, in the
sparse_dump_region function in sparse.c.
A little light went on in my head.
A few months ago, we ran into an NFS problem with Alpine. While working on that bug, I traced
an Alpine process and noticed, among other things, that it was using
ftruncate() to change the size of mailboxes; sometimes it extended
them, temporarily creating a sparse section of the file until it
filled it in, and perhaps sometimes it shrank them too. This seemed
to match what I'd spotted; sparseness was related, and shrinking a
file's size with
ftruncate() would create a situation where tar
hit end of file before it was expecting to.
(This even provides an explanation for why tar sometimes recovered; if something later delivered more mail to the mailbox, taking it back to or above the size tar expected, tar would stop getting this unexpected end of file.)
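To make the two ftruncate() cases concrete, here is a small sketch on a scratch file. It assumes a Unix-like system; the helper name `ftruncate_demo` and the sizes are arbitrary choices for illustration, and whether the extended region is actually stored sparsely depends on the filesystem.

```python
import os
import tempfile


def ftruncate_demo():
    """Extend, then shrink, a scratch file and report what reads observe."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 100)

        # Extending: bytes 100..4095 were never written and read back as zeros.
        os.ftruncate(fd, 4096)
        extended_size = os.fstat(fd).st_size
        os.lseek(fd, 100, os.SEEK_SET)
        hole = os.read(fd, 50)

        # Shrinking: the file now ends at byte 10.
        os.ftruncate(fd, 10)
        os.lseek(fd, 100, os.SEEK_SET)
        past_end = os.read(fd, 50)  # offset 100 is beyond the new end of file

        return extended_size, hole, past_end
    finally:
        os.close(fd)
        os.unlink(path)


extended_size, hole, past_end = ftruncate_demo()
```

After the shrink, a reader that is still working from the old size gets an empty result, which is the "read() returned 0" case from the strace output.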
I did some poking around in GDB, using Ubuntu's debugging symbols
and the tar package source code I'd fetched, and I was able to reproduce
the bug, although it's somewhat different from my initial theory.
It turns out that
sparse_dump_region is not dumping sparse
regions of a file, it's dumping non-sparse ones (of course), and
it's used on all files (sparse or not) if you run tar with the
--sparse argument. So the actual bug is that if you run GNU Tar with
--sparse and a file shrinks while tar is reading it, tar fails
to properly handle the resulting earlier than expected end of file.
If the file grows again, tar recovers.
(Except if a file that is sparse at the end shrinks purely in that sparse section. In that case you're okay.)
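The general shape of this failure can be sketched in a few lines. To be clear, this is not GNU Tar's code, only an illustration of a copy loop that trusts a previously recorded size, written to match the behavior in the strace output. The names `dump_region` and `run_case` and the `max_blocks` guard are hypothetical; the guard exists only so the unchecked variant stops instead of running forever.

```python
import io
import os
import tempfile

BLOCK = 512  # the read size seen in the strace output


def dump_region(fd, length, out, check_eof, max_blocks=64):
    """Copy `length` bytes from fd to out as zero-padded 512-byte blocks.

    Returns (blocks_written, status), where status is "done", "eof",
    or "runaway" (the hypothetical max_blocks guard tripped).
    """
    remaining = length
    blocks = 0
    while remaining > 0:
        if blocks >= max_blocks:
            return blocks, "runaway"
        data = os.read(fd, min(BLOCK, remaining))
        if check_eof and not data:
            return blocks, "eof"  # file ended earlier than expected
        out.write(data.ljust(BLOCK, b"\0"))  # short or empty reads become NULs
        blocks += 1
        remaining -= len(data)  # an empty read subtracts 0, so no progress
    return blocks, "done"


def run_case(check_eof):
    """Record a file's size, shrink the file, then dump the recorded size."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"m" * 2048)
        expected = os.fstat(fd).st_size  # size noted before reading starts
        os.ftruncate(fd, 1024)           # the file shrinks underneath the reader
        os.lseek(fd, 0, os.SEEK_SET)
        out = io.BytesIO()
        blocks, status = dump_region(fd, expected, out, check_eof)
        return blocks, status, out.getvalue()
    finally:
        os.close(fd)
        os.unlink(path)


unchecked = run_case(check_eof=False)
checked = run_case(check_eof=True)
```

Without the end-of-file check, everything after the first two real blocks is zero-filled output, and only the guard stops the loop. With the check, the loop stops after the two real blocks. If the file grew back to its expected size partway through, the unchecked loop would start making progress again, which fits the recovery I saw.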
What is interesting to me about this is that there's nothing here
I could not have done years ago on our OmniOS fileservers, in theory.
OmniOS has ways of tracing a program's system call activity, and
it has general equivalents of
lsof, and I could have probably
found and looked at the source code for its version of GNU Tar and
run it under some OmniOS debugger (although we don't seem to have
any version of GDB installed), and so on. But I didn't. Instead we
shrugged a bit and moved on. It took moving this filesystem to an
Ubuntu based environment to get me to dig into the issue.
(It wasn't just an issue of tools and environment, either; part of it was that we automatically assumed that the OmniOS version of GNU Tar was some old unsupported version that there was no reason to look at, because surely the issue was fixed in a newer one.)
PS: Our short term fix is likely to be to tell Amanda to run GNU
Tar without --sparse when backing up this filesystem. Mailboxes
shouldn't be sparse, and if they are we're compressing this
filesystem's backups anyway so all those null bytes
will compress really well.
PPS: I haven't tried to report this as a bug to the GNU Tar people because I only confirmed it Friday and the university is now on its winter break. Interested parties should feel free to beat me to it.
Update: The bug has been reported to the GNU Tar people and is now fixed in commit c15c42c.