tarfile module is too generous about what is considered a tar file
The Python standard library's tarfile module has a function that tells you whether or not some file is a tar file, or at least is a tar file that the module can read. As is not too silly in Python, it operates by attempting to open the file with tarfile.open(); if open() succeeds, clearly this is a good tarfile.
Unfortunately, through what is perhaps a bug, this fails to report any errors on various sorts of things that are not actually tar files. On a Unix system, the very easiest and simplest reproduction of this problem is:
>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")
This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.
(If you leave off the 'r:', this hangs, ultimately because the lzma module will happily read forever from a stream of zero bytes. Unless you tell it otherwise, the tarfile module normally tries a sequence of decompressors on your potential tarfile, including lzma.)
One specific form of thing that will cause this issue is any nominal 'tar file' that starts with 512 zero bytes (after any decompression is applied). Since this applies to /dev/zero, we have our handy and obviously incorrect reproduction case. There may be other initial 512-byte blocks that will cause this; I have not investigated the code deeply, partly because it is tangled.
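The zero-block case can also be reproduced without /dev/zero. What follows is a minimal sketch, assuming that an in-memory file object is treated the same way as a file on disk. The helper name member_names and the trailing junk bytes are made up for illustration and are not part of the tarfile module.

```python
import io
import tarfile


def member_names(data):
    """Open raw bytes as an uncompressed tar archive and list its members.

    Hypothetical helper for illustration; it is not part of tarfile.
    """
    # "r:" restricts tarfile to uncompressed archives, so no decompressor
    # is ever tried on the data.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
        return tf.getnames()


# One 512-byte block of zero bytes followed by bytes that are clearly
# not tar data.
bogus = b"\0" * 512 + b"this is not a tar archive\n" * 40

# No exception is expected here, even though this is not a tar file.
print(member_names(bogus))
```

On the Python versions this entry is about, this should report an empty member list instead of raising tarfile.ReadError.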
I suspect that this is a bug in the TarFile.next function, which looks like it is missing an 'elif self.offset == 0:' clause (see the block of code starting around here). But
whether or not this issue is a bug and will be fixed in a future
version of Python 3, it is very widespread in existing versions of
Python that are out there in the field, and so any code that cares
about this (which we have some of) needs to
cope with it.
My current hack workaround is to check whether or not the list of members on the returned TarFile object is empty. This is not a documented attribute, but it's unlikely to change and it works today (and feels slightly less sleazy than the alternative check).
(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)
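A workaround along these lines might look like the following sketch. It swaps in the documented TarFile.next() method in place of an undocumented attribute, on the assumption that a None result signals the same 'empty archive' condition. The name is_nonempty_tar is hypothetical, and treating a genuinely empty archive as 'not a tar file' is a policy choice that may not suit every caller.

```python
import tarfile


def is_nonempty_tar(path):
    """Return True only if path opens as a tar archive with a first member.

    Hypothetical helper; deliberately stricter than tarfile.is_tarfile().
    """
    try:
        # "r:" means uncompressed only. The default "r" mode also tries
        # decompressors, which is what can hang on something like /dev/zero.
        with tarfile.open(path, mode="r:") as tf:
            # next() returns None when there is no member to return.
            return tf.next() is not None
    except (tarfile.TarError, OSError):
        # Not readable as a tar file, or not readable at all.
        return False
```

Code that also needs to accept compressed archives would have to pick its modes carefully, since this sketch only looks at uncompressed input.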
A Git tool that I'd like and how I probably use Git differently from most people
For a long time now, I've wished for what has generally seemed to me like a fairly straightforward and obvious Git tool. What I want is a convenient way to page through all of the different versions of a file over time, going 'forward' and 'backward' through them. Basically this would be the whole-file version of 'git log -p FILE', although it couldn't have the same interface.
(I know that the history may not be linear. There are various ways to cope with this, depending on how sophisticated an interface you're presenting.)
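The raw material for such a tool is already available from Git's command line. As a rough sketch, assuming the file's history is treated as the simple list that 'git log' prints (which flattens any non-linear history) and that repo is the top-level directory of the working tree, the two building blocks could look like this. The function names are hypothetical.

```python
import subprocess


def file_revisions(path, repo="."):
    """Return the hashes of commits that touched path, newest first."""
    result = subprocess.run(
        ["git", "log", "--format=%H", "--", path],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def file_at(rev, path, repo="."):
    """Return the full contents of path as of commit rev.

    path is taken relative to the top of the repository.
    """
    result = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A pager built on this would keep an index into the list from file_revisions() and redisplay the output of file_at() as you step forward or backward. It does not follow renames, which a real tool would probably want to do.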
When I first started wanting this, it felt so obvious that I couldn't believe it didn't already exist. Going through past versions of a file was something that I wanted to do all the time when I was digging through repositories, and I didn't get why no one else had created this. Now, though, I think that my unusual desire for this is one of the signs that I use Git repositories differently from most people, because I'm coming at them as a sysadmin instead of as a developer. Or, to put it another way, I'm reading code as an outsider instead of an insider.
When you're an insider to code, when you work on the code in the repository you're reading, you have enough context to readily understand diffs and so 'git log -p' and similar diff-based formats (such as 'git show' of a commit) are perfectly good for letting you understand what the code did in the past. But I almost never have that familiarity with a Git repo I'm investigating. I barely know the current version of the file, the one I can read in full in the repo; I completely lack the contextual knowledge to mentally apply a diff and read out the previous behavior of the code. To understand the previous behavior of the code, I need to read the full previous code. So I wind up wanting a convenient way to get that previous version of a file and to easily navigate between versions.
(There are a surprising number of circumstances where understanding something about the current version of a piece of code requires me to look at what it used to do.)
I rather suspect that most people using Git are developers instead of people spelunking the depths of unfamiliar codebases. Developers likely don't have much use for viewing full versions of a file over time (or at least it's not a common need), so it's probably not surprising that there doesn't seem to be a tool for this (or at least not an easily found one).
(GitHub has something that comes close to this, with the 'view blame prior to this change' feature in its blame view of a particular file. But this is not quite the same thing, although it is handy for my sorts of investigations.)