Wandering Thoughts archives

2019-04-10

The tarfile module is too generous about what is considered a tar file

The Python standard library's tarfile module has a tarfile.is_tarfile function that tells you whether or not some file is a tar file, or at least is a tar file that the module can read. As is not too silly in Python, it operates by attempting to open the file with tarfile.open; if open() succeeds, clearly this is a good tarfile.

Unfortunately, through what is perhaps a bug, this fails to report any errors on various sorts of things that are not actually tar files. On a Unix system, the very easiest and simplest reproduction of this problem is:

>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")

This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.

(If you leave off the 'r:', this hangs, ultimately because the lzma module will happily read forever from a stream of zero bytes. Unless you tell it otherwise, the tarfile module normally tries a sequences of decompressors on your potential tarfile, including lzma for .xz files.)

One specific form of thing that will cause this issue is any nominal 'tar file' that starts with 512 bytes of zero bytes (after any decompression is applied). Since this applies to /dev/zero, we have our handy and obviously incorrect reproduction case. There may be other initial 512-byte blocks that will cause this; I have not investigated the code deeply, partly because it is tangled.

I suspect that this is a bug in the TarFile.next function, which looks like it is missing an 'elif self.offset == 0:' clause (see the block of code starting around here). But whether or not this issue is a bug and will be fixed in a future version of Python 3, it is very widespread in existing versions of Python that are out there in the field, and so any code that cares about this (which we have some of) needs to cope with it.

My current hack workaround is to check whether or not the .members list on the returned TarFile object is empty. This is not a documented attribute, but it's unlikely to change and it works today (and feels slightly less sleazy than checking whether .firstmember is None).

(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)

python/TarfileTooGenerous written at 22:12:58; Add Comment

A Git tool that I'd like and how I probably use Git differently from most people

For a long time now, I've wished for what has generally seemed to me like a fairly straightforward and obvious Git tool. What I want is a convenient way to page through all of the different versions of a file over time, going 'forward' and 'backward' through them. Basically this would be the whole file version of 'git log -p FILE', although it couldn't have the same interface.

(I know that the history may not be linear. There are various ways to cope with this, depending on how sophisticated an interface you're presenting.)

When I first started wanting this, it felt so obvious that I couldn't believe it didn't already exist. Going through past versions of a file was something that I wanted to do all the time when I was digging through repositories, and I didn't get why no one else had created this. Now, though, I think that my unusual desire for this is one of the signs that I use Git repositories differently from most people, because I'm coming at them as a sysadmin instead of as a developer. Or, to put it another way, I'm reading code as an outsider instead of an insider.

When you're an insider to code, when you work on the code in the repository you're reading, you have enough context to readily understand diffs and so 'git log -p' and similar diff-based formats (such as 'git show' of a commit) are perfectly good for letting you understand what the code did in the past. But I almost never have that familiarity with a Git repo I'm investigating. I barely know the current version of the file, the one I can read in full in the repo; I completely lack the contextual knowledge to mentally apply a diff and read out the previous behavior of the code. To understand the previous behavior of the code, I need to read the full previous code. So I wind up wanting a convenient way to get that previous version of a file and to easily navigate through versions.

(There are a surprising number of circumstances where understanding something about the current version of a piece of code requires me to look at what it used to do.)

I rather suspect that most people using Git are developers instead of people spelunking the depths of unfamiliar codebases. Developers likely don't have much use for viewing full versions of a file over time (or at least it's not a common need), so it's probably not surprising that there doesn't seem to be a tool for this (or at least not an easily found one).

(Github has something that comes close to this, with the 'view blame prior to this change' feature in its blame view of a particular file. But this is not quite the same thing, although it is handy for my sorts of investigations.)

programming/GitViewFileOverTimeWish written at 01:27:06; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.