Wandering Thoughts archives

2019-09-04

If you use the rarfile module, make sure you're using version 3.0 (or later)

We have a Python program to log mail attachment information for Exim. One of the things it does as part of looking at attachments is try to look inside various sorts of archives to find out what sort of things are in there (occasionally the answer is interesting). Back in the summer of 2016, I added the ability to look inside RAR archives using the rarfile module, in addition to ZIPs and tar files (for which I was using the standard library's modules). At the time our mail machines were running a mixture of Ubuntu 12.04 and Ubuntu 14.04, neither of which had the rarfile module available pre-packaged, but the recently released Ubuntu 16.04 packaged rarfile 2.7. Since the module itself is a single pure Python file, I just copied the 16.04 package's rarfile.py into our program's local source and left things there.

(I believe that rarfile 2.8 had been out for about a month at that point, but it didn't seem worth deviating from the Ubuntu version. At that point I was hoping to switch to the official Ubuntu package when we upgraded all of the mail machines to Ubuntu 16.04, so we could theoretically let Ubuntu worry about its version.)

Over time (starting no later than the fall of 2017), we noticed a slowly increasing number of MIME attachments with .rar extensions that we couldn't get the RAR archive contents for. Often our libmagic-based content sniffing (using the magic module) would say that these really were RAR archives, and frequently our commercial anti-spam system would detect malware in them. Recently this reached a tipping point where I decided to see if updating the rarfile module to the current version would improve the situation and let us look into more RAR archives.

The answer is yes. It turns out that there is a 'new' RAR archive format called RAR5, and rarfile added support for this format in version 3.0 (which was released at the end of 2016); before then, rarfile only supported the RAR3 format. Unsurprisingly, over time more and more RAR archives have been created using RAR5 format instead of RAR3 (although use of RAR3 is still surprisingly frequent in email attachments we get). To be able to read as many RAR archives as possible, you want rarfile 3.0 or later so it supports both RAR3 and RAR5 formats. Right now the 'or later' clause is not really important, since 3.0 is the latest released version.
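(As a rough illustration, the two formats can be told apart by their leading signature bytes. This is a minimal sketch and not what rarfile itself does internally; the function name `rar_format` is made up for the example, and it assumes the signature sits at the very start of the data, which will not be true for things like self-extracting archives.)

```python
# Signature used by the older format (RAR 1.5 through 4.x), which this
# entry calls RAR3.
RAR3_MAGIC = b"Rar!\x1a\x07\x00"
# Signature used by the RAR5 format.
RAR5_MAGIC = b"Rar!\x1a\x07\x01\x00"


def rar_format(data):
    """Return 'RAR5', 'RAR3', or None based on the leading bytes of data.

    Hypothetical helper for illustration; it only looks at the start
    of the data and does not validate the rest of the archive.
    """
    if data.startswith(RAR5_MAGIC):
        return "RAR5"
    if data.startswith(RAR3_MAGIC):
        return "RAR3"
    return None
```

Feeding this the first eight bytes of an attachment is enough to tell whether a pre-3.0 rarfile could have any hope of reading it.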

(WinRAR started supporting RAR5 in late 2013, but my impression is that there are a lot of third party tools and third party RAR code out there. Apparently a fair amount of it has been slow to implement RAR5 or at least to default to it for new archives, much like the rarfile module.)

The rarfile module doesn't move very fast and it kept working for us in general, which is a large part of why I let it just sit there (and had we updated the mail machines to Ubuntu 18.04 and switched to the Ubuntu packaged version, we'd have automatically fixed the problem, as Ubuntu 18.04 packages 3.0). But it's an interesting lesson in quietly outdated dependencies, where a more recent version would have improved things for us (and without us having to do anything).

Locking or otherwise freezing dependencies is a very common way to get stability and guarantee reproducible deployments, and that's very popular with a lot of people (me included). But what happened to us is the drawback of that stability, especially for those programs and apps that are complete and which thus have no natural ongoing changes that provide a push to at least check the state of dependencies.
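(One cheap way to get at least a small push is to have the program check the version of a frozen dependency and complain when it is older than you want. The following is only a sketch under assumptions: `parse_version`, `version_at_least`, and `rarfile_is_new_enough` are hypothetical names, the version parsing is deliberately simplistic, and it assumes the module exposes a `__version__` string, which I believe rarfile does.)

```python
import re


def parse_version(text):
    """Turn a dotted version string like '3.0' into a tuple of ints.

    Simplistic on purpose: stops at the first component that does not
    start with a digit, and ignores suffixes such as 'rc1'.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(text, minimum):
    """True if the version string is >= the minimum tuple, e.g. (3, 0)."""
    have = parse_version(text)
    want = tuple(minimum)
    # Pad the shorter tuple with zeros so '3' compares equal to (3, 0).
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want


def rarfile_is_new_enough(minimum=(3, 0)):
    """Return True/False for the installed rarfile, or None if it is absent."""
    try:
        import rarfile
    except ImportError:
        return None
    # Fall back to "0" if the attribute is missing (an assumption guard).
    return version_at_least(getattr(rarfile, "__version__", "0"), minimum)
```

This doesn't tell you that a newer version exists, only that what you have is older than a minimum you already know you want, so it is a reminder and not a substitute for actually checking on dependencies now and then.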

PS: Updating to use rarfile 3.0 required no changes in our program, although we only use a very small portion of the module's capabilities. As far as I can tell, our code doesn't even notice whether the RAR archive is in RAR3 or RAR5 format.

python/UpdatingToRarfile30 written at 22:29:31

Using Wireshark's Statistics menu to get per-host traffic volume

As part of my casual Internet browsing, I recently read "6 Lessons we learned when debugging a scaling problem on GitLab.com". As sort of an aside (although listed as a lesson), the article mentioned Wireshark's Statistics menu and how it can show you per-conversation information (and thus let you find specific sorts of conversations, such as short ones). I didn't think about it much at the time, but this mention stuck in the back of my mind (as such things often do, at least for a while).

Today I had a situation where we had a saturated OpenBSD firewall and I very much wanted to find out roughly what hosts were responsible for the traffic. OpenBSD has per-interface statistics (which let me see that the firewall's interface was saturated with incoming traffic), but it doesn't have anything more granular by default and we didn't have any traffic accounting stuff set up in our PF rules. I tried a plain tcpdump, but this firewall sits in front of enough hosts that the output was overwhelming. As I was thinking unhappy thoughts about trying to write some awk on the fly, a little light went on; perhaps Wireshark could help. So I used tcpdump to capture a minute or two of traffic to a file, copied the capture file over to my Linux machine, and fired up Wireshark.

(Since I only cared about packet sizes, not packet contents, I was able to let tcpdump truncate packets to keep the file size down.)

The answer is yes, Wireshark absolutely had something that could help; the 'Endpoints' option on the Statistics menu gives you a breakdown of the traffic by various endpoint categories, including IPv4 hosts (it will also do it by host+port combination). This immediately pointed me to the high-volume hosts at work.
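(To make it concrete what this sort of per-host breakdown involves, here is a rough sketch that totals bytes per IPv4 address straight from a capture file. It is not how Wireshark does it and is no substitute for the Endpoints window. The function name `bytes_per_host` is made up, and the sketch assumes a classic pcap file (not pcapng), an Ethernet link type, untagged frames, and IPv4 only. It counts each packet's original on-the-wire length, so truncated captures still give sensible totals.)

```python
import struct
from collections import Counter


def bytes_per_host(pcap_bytes):
    """Return a Counter mapping IPv4 address -> total bytes sent or received.

    Hypothetical helper; handles only classic pcap files with Ethernet
    framing and skips anything that is not a plain IPv4 packet.
    """
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file")

    # The link type is the last 4-byte field of the 24-byte global header.
    (linktype,) = struct.unpack_from(endian + "I", pcap_bytes, 20)
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    totals = Counter()
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: seconds, microseconds, captured length,
        # original length on the wire.
        _sec, _usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len

        # Need the 14-byte Ethernet header plus a 20-byte IPv4 header,
        # and the EtherType must be 0x0800 (IPv4).
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        src = ".".join(str(b) for b in frame[26:30])
        dst = ".".join(str(b) for b in frame[30:34])
        # Count the original length, not the (possibly truncated) capture.
        totals[src] += orig_len
        totals[dst] += orig_len
    return totals
```

Reading the capture file in binary mode, passing its contents to `bytes_per_host()`, and calling `.most_common(10)` on the result would list the ten busiest addresses.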

Using packet captures for this isn't necessarily as useful and precise as real traffic volume information that is measured directly and reliably by the host in some way, and it likely has more overhead. But it has the large virtue that we can use it in any situation where we can run tcpdump for a while, and almost everything has tcpdump. I can use it with our OpenBSD firewalls to find traffic sources, I can use it with our Linux fileservers to figure out which NFS clients are doing a high volume of read or write IO, and I'm sure I can use it in plenty of other situations too.

(One that just occurred to me is trying to find out who is doing an unusually large number of DNS queries to our DNS servers. We don't have query logging, but we can capture a couple of minutes of traffic to port 53.)

Although I wish we hadn't had this problem today, I'm glad that I now have another tool for troubleshooting problems. And I'm glad that I read that article and that its mention of Wireshark stuck in my mind. I really do never know when this stuff will come in handy.

sysadmin/WiresharkTrafficVolume written at 00:48:43


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.