The tarfile module is too generous about what is considered a tar file

April 10, 2019

The Python standard library's tarfile module has a tarfile.is_tarfile function that tells you whether or not some file is a tar file, or at least is a tar file that the module can read. As is not too silly in Python, it operates by attempting to open the file with tarfile.open(); if open() succeeds, clearly this is a good tarfile.

Unfortunately, through what is perhaps a bug, this fails to report any errors on various sorts of things that are not actually tar files. On a Unix system, the very easiest and simplest reproduction of this problem is:

>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")

This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.

(If you leave off the 'r:', this hangs, ultimately because the lzma module will happily read forever from a stream of zero bytes. Unless you tell it otherwise, the tarfile module normally tries a sequence of decompressors on your potential tarfile, including lzma for .xz files.)

One specific form of thing that will cause this issue is any nominal 'tar file' that starts with 512 zero bytes (after any decompression is applied). Since this applies to /dev/zero, we have our handy and obviously incorrect reproduction case. There may be other initial 512-byte blocks that will cause this; I have not investigated the code deeply, partly because it is tangled.
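
For systems without /dev/zero, a rough sketch of the same reproduction uses a temporary file full of zero bytes. The function name open_zero_filled and its nbytes parameter are made up for this example, and the result may well differ between Python versions, so no expected output is shown:

```python
import os
import tarfile
import tempfile

def open_zero_filled(nbytes=512):
    """Try tarfile.open() on a temporary file of nbytes zero bytes.

    Hypothetical helper for illustration. Returns the number of members
    tarfile reports, or None if tarfile refused to open the file.
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * nbytes)
        try:
            # "r:" means uncompressed only, so no decompressor is tried.
            with tarfile.open(path, "r:") as tf:
                return len(tf.getmembers())
        except tarfile.TarError:
            return None
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(open_zero_filled())
```

A result of 0 means the file was accepted as an empty tar file; None means this version of the module rejected it.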

I suspect that this is a bug in the function, which looks like it is missing an 'elif self.offset == 0:' clause (see the block of code starting around here). But whether or not this issue is a bug and will be fixed in a future version of Python 3, it is very widespread in existing versions of Python that are out there in the field, and so any code that cares about this (which we have some of) needs to cope with it.

My current hack workaround is to check whether or not the .members list on the returned TarFile object is empty. This is not a documented attribute, but it's unlikely to change and it works today (and feels slightly less sleazy than checking whether .firstmember is None).
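
A minimal sketch of that workaround, assuming it is acceptable to treat an archive with no members as not a tar file at all. The name strict_is_tarfile and its mode parameter are invented for this example; .members is the undocumented attribute mentioned above:

```python
import tarfile

def strict_is_tarfile(path, mode="r"):
    """Like tarfile.is_tarfile(), but also require at least one member.

    Hypothetical helper, not part of the standard library.
    """
    try:
        with tarfile.open(path, mode) as tf:
            # .members is undocumented; after open() it holds the first
            # member if tarfile found one, and is empty otherwise.
            return bool(tf.members)
    except tarfile.TarError:
        return False
```

Passing mode="r:" skips the decompressors, which sidesteps the hang on endless streams of zero bytes such as /dev/zero. One tradeoff is that a genuinely empty tar archive is itself nothing but zero blocks, so this check rejects those as well.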

(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)
