The tarfile module is too generous about what is considered a tar file
The Python standard library's tarfile module has a tarfile.is_tarfile function that tells you whether or not some file is a tar file, or at least a tar file that the module can read. As is reasonable in Python, it operates by attempting to open the file with tarfile.open; if open() succeeds, clearly this is a good tarfile.
Unfortunately, through what is perhaps a bug, open() fails to report any error for various sorts of things that are not actually tar files. On a Unix system, the simplest reproduction of this problem is:
>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")
This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.
(If you leave off the 'r:', this hangs, ultimately because the lzma module will happily read forever from a stream of zero bytes. Unless you tell it otherwise, the tarfile module normally tries a sequence of decompressors on your potential tarfile, including lzma for .xz files.)
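To make the difference between the open modes concrete, here is a minimal sketch that builds a tiny gzip-compressed tar archive in memory and opens it three ways. The helper names make_tar_gz and try_open are made up for this illustration; the mode strings "r", "r:gz" and "r:" are the documented tarfile.open modes.

```python
import io
import tarfile


def make_tar_gz():
    """Build a tiny gzip-compressed tar archive in memory (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        data = b"hello\n"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def try_open(blob, mode):
    """Return the member names, or the exception class name if open fails."""
    try:
        with tarfile.open(fileobj=io.BytesIO(blob), mode=mode) as tf:
            return tf.getnames()
    except tarfile.TarError as exc:
        return type(exc).__name__


blob = make_tar_gz()
print(try_open(blob, "r"))     # transparent: tarfile tries its decompressors in turn
print(try_open(blob, "r:gz"))  # gzip only
print(try_open(blob, "r:"))    # uncompressed only, so gzip data is rejected
```

Using "r:" is what keeps the /dev/zero reproduction from ever reaching the lzma decompressor.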
One specific thing that will cause this issue is any nominal 'tar file' that starts with 512 zero bytes (after any decompression is applied). Since this applies to /dev/zero, we have our handy and obviously incorrect reproduction case. There may be other initial 512-byte blocks that will cause this; I have not investigated the code deeply, partly because it is tangled.
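The same effect can be sketched without /dev/zero, using an in-memory blob that starts with one all-zero 512-byte block followed by text that is obviously not tar data. The function name and the trailing bytes are invented for this example; only tarfile.open and TarFile.getnames are real API.

```python
import io
import tarfile


def members_of_zero_prefixed_blob(trailing=b"this is definitely not tar data\n"):
    """Open a blob that begins with 512 zero bytes and report its member names."""
    blob = bytes(512) + trailing  # one all-zero block, then arbitrary junk
    # "r:" means "uncompressed only", so no decompressor is tried.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tf:
        return tf.getnames()


print(members_of_zero_prefixed_blob())
```

On an affected Python this is expected to print an empty list instead of raising an exception, which is exactly the "empty tar file" result described above.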
I suspect that this is a bug in the TarFile.next function, which looks like it is missing an 'elif self.offset == 0:' clause (see the block of code starting around here). But whether or not this issue is a bug and will be fixed in a future version of Python 3, it is very widespread in existing versions of Python that are out there in the field, and so any code that cares about this (which we have some of) needs to cope with it.
My current hack workaround is to check whether or not the .members list on the returned TarFile object is empty. This is not a documented attribute, but it's unlikely to change and it works today (and feels slightly less sleazy than checking whether .firstmember is None).
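A minimal sketch of that workaround might look like the following. The function name is_nonempty_tarfile and the default mode of "r:" are assumptions made for this example, not something from the tarfile module; note that it deliberately treats a genuinely empty tar archive as "not a tar file" too, which may or may not be what your code wants.

```python
import tarfile


def is_nonempty_tarfile(path, mode="r:"):
    """Return True only if path opens as a tar file with at least one member.

    Hypothetical helper: the name and the "r:" default are illustrative.
    """
    try:
        with tarfile.open(path, mode) as tf:
            # .members is undocumented; after open() it holds the first
            # member if tarfile found a valid header at the start.
            return bool(tf.members)
    except (tarfile.TarError, OSError):
        # Not readable as a tar file at all (or not readable, period).
        return False
```

Pass mode="r" if you need compressed archives to be accepted, keeping in mind the hang on endless streams of zero bytes mentioned earlier.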
(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)