Wandering Thoughts

2016-08-31

Python 3 module APIs and the question of Unicode conversion errors

I have a little Python thing to log MIME attachment type information from Exim; as has been my practice for some time, it's currently written for Python 2. For various reasons beyond the scope of this entry, today I decided to see if I could get it running with Python 3. In the process, I ran into what I have decided to consider a Python 3 API design question.

My Python program peers inside tar, zip, and rar archives in order to get the extensions of files inside them, using the tarfile, zipfile, and rarfile modules for this; the first two are in the standard library, the third is a PyPI add-on. This means that the modules (and I) are dealing with file names that may come in from the outside world in essentially any encoding, or even none (as some joker may have stuffed random bytes into the alleged filenames, especially for tar archives). So, how do the modules behave here?

Neither tarfile nor zipfile makes any special comments about the file names that they return; in Python 3, this means that they should at least be returning regular (Unicode) strings all of the time, with no surprise bytestrings if they can't decode things. Rarfile supports both Python 2.7 and Python 3, so it sensibly explicitly specifies that its filenames are always Unicode.

Tarfile has an explicit section on Unicode issues that answers all of my questions; the default behavior is sensible and you can change it if you want. Both zipfile and rarfile are more or less silent about Unicode issues for reading filenames in archives. Code inspection of zipfile.py in Python 3.5 reveals that it makes no attempt to handle Unicode decoding errors when decoding filenames; if any occur, they will be passed up to you (and there is nothing you can do to set an error handling strategy). Rarfile attempts several encodings and if that fails, tries the default charset with a hard-coded 'replace' error handler.

(On the other hand, many ZIP archives should theoretically not have filename decoding errors because the filenames should explicitly be in UTF-8 and zipfile decodes them as such. But I'm a sysadmin and I deal with network input.)

These three modules represent three different approaches to handling potential Unicode decoding errors in Python 3 in your API (and to documenting them): just assume that you're working in a properly encoded world (zipfile), fully delegate to the user (tarfile), or make a best effort and then punt (rarfile). Since two of these are in the standard library, I'm going to assume that there's no consensus so far on the right sort of API here among the Python 3 community.
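To make the tarfile end of this concrete, here is a minimal sketch of how its errors argument changes what you get back for an undecodable name. The helper names (make_tar_with_raw_name, member_names) and the sample filename are made up for illustration; the encoding and errors arguments to tarfile.open() are the documented ones.

```python
import io
import tarfile


def make_tar_with_raw_name(raw_name: bytes) -> bytes:
    """Build an in-memory tar archive with one empty member whose
    name is exactly the given raw bytes (hypothetical test helper)."""
    buf = io.BytesIO()
    # surrogateescape round-trips arbitrary bytes through a str, and
    # USTAR_FORMAT keeps the name in the plain 100-byte header field.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT,
                      encoding="utf-8", errors="surrogateescape") as tf:
        info = tarfile.TarInfo(name=raw_name.decode("utf-8", "surrogateescape"))
        info.size = 0
        tf.addfile(info)
    return buf.getvalue()


def member_names(data: bytes, errors: str) -> list:
    """List member names, decoding them as UTF-8 with the given
    error handler ('strict', 'replace', 'surrogateescape', ...)."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:",
                      encoding="utf-8", errors=errors) as tf:
        return tf.getnames()


# b"\xe9" is Latin-1 for e-acute and is not valid UTF-8 on its own.
archive = make_tar_with_raw_name(b"caf\xe9.txt")
for handler in ("surrogateescape", "replace", "strict"):
    try:
        print(handler, [ascii(n) for n in member_names(archive, handler)])
    except UnicodeDecodeError as e:
        print(handler, "raised", type(e).__name__)
```

With 'surrogateescape' (the Python 3 default) the stray byte survives as a lone surrogate, with 'replace' it turns into U+FFFD, and with 'strict' the decoding error is handed to you to deal with.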

My personal preference is for the tarfile approach, since it clearly is the most flexible and powerful. However, I think there's a reasonably coherent argument for the zipfile approach under some situations, namely that the module is (probably) not designed to deal with malformed ZIP archives in general. I'd certainly like it if the zipfile module didn't blow up on malformed ZIP archives, but my use case is a somewhat odd one; most people aren't parsing potentially malicious ZIP archives.
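If you do have to feed zipfile input from the network, about all you can do is wrap it. This is a minimal sketch, assuming that treating an unparseable archive as 'no names available' is acceptable for your purposes; safe_zip_names is a hypothetical helper of mine, not anything that zipfile provides.

```python
import io
import zipfile


def safe_zip_names(data: bytes):
    """Return the member names of a ZIP archive held in memory, or
    None if zipfile cannot parse it or cannot decode a filename."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.namelist()
    except (zipfile.BadZipFile, UnicodeDecodeError):
        # BadZipFile covers structurally broken archives; the
        # UnicodeDecodeError case is a filename that is flagged as
        # UTF-8 but does not actually decode as UTF-8.
        return None
```

This throws away the distinction between 'not a ZIP at all' and 'a ZIP with a bad filename'; a real program might well want to log those separately.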

(Tarfile has no choice here, as there is no standard for what the filename encoding is in tar archives. A correctly formed ZIP archive that says 'this filename is UTF-8' should always have a filename that actually is UTF-8 and will decode without errors.)

python/Python3UnicodeAPIQuestion written at 22:54:44

The various IDs of disks, filesystems, software RAID, LVM, et al in Linux

Once upon a time, you put simple /dev/sdX names in your /etc/fstab. These days that's boring and deprecated, and so there are a large number of different identifiers that you can use here. Since I just confused myself on this today, I want to write down what I know and looked up about the various levels and sorts of identifiers, and where they come from. What I care about here are identifiers that are tied to a specific piece of hardware or data, instead of where that hardware is plugged into the system (or the order in which it's recognized during boot, which can totally jump around even when no hardware changes).

Some filesystems have labels, or at least can have labels, and years ago it was common for Linux installs to set labels on your filesystems and use them in /etc/fstab via LABEL=.... This has fallen out of favour since then, for reasons I can only theorize about. ExtN is one such filesystem, and labels can be inspected (and perhaps set) with e2label. Modern Linux distributions seem to no longer set a label on the extN filesystems that they create during installation. Just to confuse you, extN filesystems also keep track of where they were last mounted (or are mounted), which is different from the extN label, and some tools will present this as the 'name' of the filesystem.

(e2label is effectively obsolete today; you should use blkid.)

Many filesystems have UUIDs, as do swap areas, software RAID arrays, LVM objects, and a number of other things. UUIDs are what is commonly used in /etc/fstab these days, and can be displayed with eg 'lsblk -fs'. The blkid command is generally the master source of information about any particular thing. Like labels, UUIDs are embedded in the on-disk metadata of various things; for extN filesystems the filesystem UUID is in the superblock, for example. Where software RAID stores its metadata varies and can matter for some things. Note that software RAID has both a UUID for the overall array and a device UUID for each physical device in the array.

(As blkid will report, GPT partitions themselves have a partition label and a theoretically unique partition UUID. These can also be used in /etc/fstab, per the fstab manpage, but you probably don't want to. The GPT UUID is stored as part of the GPT partition table, not embedded in the partition itself.)

Physical disks have serial numbers (and World Wide Names) that theoretically uniquely identify them. Where they're accessible, Linux reads these via SCSI, SAS, iSCSI, SATA, and so on inquiry commands, and uses this information to populate /dev/disk/by-id. In addition to actual disks, generally anything that appears as a disk-like device with a UUID (or a name) will also show up in /dev/disk/by-id. Thus you can find things like software RAID arrays (by name and UUID), LVM physical volumes, and LVM logical volumes (by name and ID).
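Since the entries under /dev/disk are symlinks to the real device nodes, one way to see which identifiers wind up pointing at the same thing is simply to resolve them. This is a rough sketch; the function name is made up, and it assumes the usual udev-maintained by-id, by-uuid, and by-label directories (only by-id is mentioned above).

```python
import os


def names_by_device(base="/dev/disk", kinds=("by-id", "by-uuid", "by-label")):
    """Map each real device path to the persistent names that point at it.

    Directories that don't exist (for example by-label on a system with
    no labelled filesystems) are silently skipped.
    """
    result = {}
    for kind in kinds:
        directory = os.path.join(base, kind)
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            # realpath() follows the symlink to the actual device node.
            target = os.path.realpath(os.path.join(directory, name))
            result.setdefault(target, []).append(kind + "/" + name)
    return result


if __name__ == "__main__":
    for device, names in sorted(names_by_device().items()):
        print(device, " ".join(names))
```

This only tells you what names exist, not which level they come from; for that you still want blkid.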

(I believe that some USB disk enclosures don't pass through the necessary stuff for Linux to get the disk's serial number.)

Sometimes this can get confusing because the same object winds up with multiple IDs at different levels. A software RAID array or an LVM logical volume that contains an extN filesystem has both a UUID for the filesystem and a UUID for the array or volume, and it may not be clear which UUID you're actually using unless you look in detail. Using blkid is generally fairly clear, fortunately; lsblk's default output is not so much from what I've seen.

(If you're looking at an /etc/fstab generated by an installer or the like, they generally use the filesystem UUID.)

linux/IDsForDisksAndFilesystems written at 00:12:52

2016-08-30

Bourne's getopts sadly makes simple shell scripts more cluttered and verbose

A while back I wrote about how I wanted to use getopts more in my shell scripts in order to have proper real option handling instead of faking it in bad ways. Recently I was modifying a few simple scripts that took arguments, so I decided to do the right thing and switch them from simple hacks to getopts. This worked, but what it showed me is that simple use of getopts is going to make my scripts annoyingly more verbose.

Imagine that I have a script that takes a single option that can change its behavior, say '-f' for 'build Go without also running its native build tests'. In a simple crude script, this is handled like so:

fast=""
if [ "$1" = "-f" ]; then
   fast="y"
fi

This lacks many niceties, but it's short and simple (and in some cases you might just stick the actual extra things to do inside the body of the if). The same version with getopts winds up something like this:

usage() { echo "usage: make-all.sh [-f]" 1>&2; exit 1; }
o_fast=
while getopts f opt; do
  case $opt in
    f) o_fast=y;;
    *) usage;;
  esac
done; shift $(($OPTIND-1)); [ $# -ne 0 ] && usage

I've deliberately compacted some lines here in order to make this smaller. One could golf it a bit further, but there are limits to that (both for readability and just in Bourne syntax). And it's still clearly larger (and more complex) than the simple version.

(The actual simple version also uses --fast instead of -f, but getopts doesn't deal with GNU style long options at all; so much for that.)

If I was writing big scripts with complicated argument handling, this wouldn't matter; getopts would be a clear improvement and the code size would be much more comparable. But I have a lot of little scripts that take one or two arguments and do very simple things with command line options, and for those doing things the right way is unavoidably a bunch more verbose.

(Abstracting this into a function that is then in a function library that gets .-included is not a solution because I want my little scripts to be and remain standalone artifacts that I can just copy around freely.)

The result of this is that I wish there was some sort of simple setopts builtin that basically did the simple case; take a set of options, set standard variables for every option present, complain about usage if necessary, and maybe check and complain if you told it how many arguments your command takes. I would use that a bunch, because about 90% of the time that is a great first stage for option and argument handling in my scripts.

(Maybe I need to do a few additional checks for conflicting options if this is an advanced script.)

PS: I'm going to keep the getopts usage in the scripts I've converted and I'm going to try to keep on using it, even in simple scripts. Maybe I'll get acclimatized eventually, and even if I don't it's clearly the right thing to do. Although I really wish getopts could deal with long options, because long options are much better for reminding you what an infrequently used option might do.

programming/BourneGetoptsTooVerbose written at 00:18:59

2016-08-29

Phones and tablets are going to change what sort of passwords I use

For a fairly long time now I've been using strong random passwords for websites and other Internet authentication needs (as covered here). These random passwords are generated from an alphabet of upper case, lower case, and numbers; a typical twelve-character one is Hx35n7uVmTaS (I have a script that generates a few of them for me). Although cutting and pasting them into browsers is the easiest and best approach, they work out okay even if I have to enter them by hand on a computer.

Then I got an iPad Mini at work and suddenly the pain began. All of those nice random passwords turned out to be a complete pain to enter on the iPad's software keyboard. You see, on the iPad, lower case is one keyboard bank, upper case is another, and digits are a third bank. Every time one of my random passwords had a lower case letter followed by an upper case letter or a number, that's a bank shift, and bank shifts really slow you down (or at least they slow me down). Naturally, all of my strong random passwords had a lot of bank shifts; some of them shifted practically every character.

It's become clear to me that I very much want a different sort of random password for any password I'm going to be entering on a tablet (or on a future smartphone). A mixture of lower case and something else is somewhere between a good idea and necessary, but I don't want very many shifts between the two (or three); instead I probably want relatively large blocks of the same sort of character.
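One way to get such passwords is to generate each block from a single keyboard bank. This is only a sketch: the block layout (eight lower case, four digits, four upper case) is an arbitrary assumption for illustration, not what my actual script does, and because each position is drawn from a smaller alphabet, such a password needs to be somewhat longer than a fully mixed one for similar strength.

```python
import secrets
import string

# (alphabet, length) pairs; each block stays within one keyboard bank.
# This particular layout is an assumption made up for the example.
DEFAULT_BLOCKS = (
    (string.ascii_lowercase, 8),
    (string.digits, 4),
    (string.ascii_uppercase, 4),
)


def blocky_password(blocks=DEFAULT_BLOCKS):
    """Generate a random password made of same-bank blocks."""
    return "".join(secrets.choice(alphabet)
                   for alphabet, count in blocks
                   for _ in range(count))


def bank(ch):
    """Classify a character by the software keyboard bank it lives on."""
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    if ch.isdigit():
        return "digit"
    return "other"


def bank_shifts(password):
    """Count adjacent character pairs that need a bank change."""
    return sum(1 for a, b in zip(password, password[1:]) if bank(a) != bank(b))


print(bank_shifts("Hx35n7uVmTaS"))
pw = blocky_password()
print(pw, bank_shifts(pw))
```

By this count the example password above needs 10 bank shifts across its 11 character transitions, while the blocked layout always needs exactly two.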

All of this is interesting to me because I had not previously really thought about how input methods strongly influence the sort of passwords we want to use. Which, well, of course they do. If you have to enter passwords at all, many people are only going to be willing to put up with so much pain. They're naturally going to pick passwords that are reasonably easy to enter in whatever they're using, whether this is a computer, a phone or a tablet, or something with an even more restricted or awkward text entry method.

(And if you generate random passwords for people, for example for VPN access, you may want to think about how and where people will be entering them. Of course in most situations people only enter them once, but still.)

tech/SoftwareKeyboardsAndPasswords written at 00:13:53

2016-08-28

My logic of blocking certain sorts of attachments outright

Once we started knowing more about what sort of attachments our users get (both good and bad), as we now do, we drifted into the obvious next step of starting to block some of them. Our current set of blocks is conservative and has basically been driven by seeing what our commercial anti-spam package already dislikes. Is it identifying basically everything with a single .js file in a ZIP archive as bad? Well, let's just go ahead and block those outright.
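To illustrate how general a block like this is, here is a minimal sketch of a 'ZIP archive with a single .js file' test. This is not our actual implementation (which isn't shown here); the function name and the choices to ignore directory entries and to let unparseable archives through are assumptions made for the example.

```python
import io
import zipfile


def is_single_js_zip(data: bytes) -> bool:
    """True if data is a ZIP archive whose only file is a .js file."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Directory entries end in '/'; only count actual files.
            files = [n for n in zf.namelist() if not n.endswith("/")]
    except (zipfile.BadZipFile, UnicodeDecodeError):
        # Unparseable archives are a separate policy question; this
        # sketch simply declines to match them.
        return False
    return len(files) == 1 and files[0].lower().endswith(".js")
```

Note that this never looks at what the .js file contains, which is the point; it can't be evaded by changing the malware's contents, only by changing how it's packaged.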

Now, there's a perfectly sensible question that one could ask here: if our anti-spam package is already detecting this stuff as bad, why bother doing anything to preempt it? We're essentially going to some amount of effort to duplicate work that's already being done for us.

(How we do this may result in somewhat less load on the overall system, but our external email gateway system isn't particularly burdened in the first place so this isn't something we care about.)

My biggest reason for arguing in favour of general early blocking is simple: it's unwise to count on recognizing all malware. Unless the commercial package we use has adopted rules that are just as general as ours (and we can't tell, it's a black box), it's using some sort of signatures or pattern analysis or recognition system to pick out malware. It may be batting a thousand on those malware .js files now, but there's no guarantee that this will always be the case; at some point in the future the .js malware may mutate so that the package doesn't recognize it for a while. So where we can use them, our general blocks act as a safety backstop to the more nuanced detection that the commercial package is doing.

(There are cases where we can't use general blocking. Our users get a certain amount of legitimate Java .jars and Office documents with macros, so we can't block either of those outright despite both of them being vectors for malware. Instead we have to cross our fingers that the commercial package's recognition is good enough and our users will be sufficiently suspicious of anything that does slip through.)

In our situation a secondary reason is to shield the commercial anti-spam package from having to apply potentially buggy code to potential malware files. The package definitely has had bugs in the past and probably will in the future; if we're sufficiently confident that we don't want certain things no matter what, we can sidestep any potential problems in the package's parsing and recognition systems by just refusing things first without peering into them.

(Of course if you start thinking too much about this you wind up gibbering in the corner. The infosec world and AFL have been very good at blowing giant holes in complex parsers, which is exactly what pretty much every anti-spam and anti-malware system has.)

spam/BlockingAttachmentTypesLogic written at 00:43:16

2016-08-26

Why ZFS L2ARC is probably not a good fit for our setup

Back when I wrote up our second generation of ZFS fileservers, I mentioned in an aside that we expected to eventually add some L2ARCs to our pools. Since then, I've started to think that L2ARCs are not a good fit for our particular and somewhat peculiar setup. The simple problem is that as far as I know, there is no such thing as an L2ARC that's shared between pools.

The easy and popular way to think about L2ARC is right there in its name; it's a second level cache for the in-RAM ARC. As with traditional kernel buffer caches, the in-RAM ARC is a shared, global cache for all ZFS pools on your system, where space is allocated to active data regardless of where it comes from. If you have multiple pools and one pool is very hot, data from it can wind up taking up most of the ARC; when the pool cools down and others start being active, the ARC shifts to caching them instead.

L2ARC doesn't behave like this, because a given L2ARC device can't be shared between pools. You don't and can't have a global L2ARC, with X GB of fast SSD space that simply holds the overflow from the ARC regardless of where that overflow came from. Instead you must decide up front how much of your total L2ARC space each pool will get (and I believe that how much L2ARC space you have in total has an impact on how much RAM will get used for L2ARC metadata). A hot pool cannot expand its L2ARC usage beyond what you gave it, and a cool pool cannot donate some of its unneeded space to a hot pool.

My impression is that many people operate ZFS servers where there are only one or two active pools (plus maybe the system pool). For these people, a per-pool L2ARC is effectively the same or almost the same as a global L2ARC. We are not in this situation. For administrative reasons we split different people and groups into different pools, which means that each of our three main fileservers has nine or ten ZFS pools.

As far as I can see, adding a decent sized L2ARC for each pool would rapidly put us over the normal recommendations for total L2ARC size (we have only 64 GB of RAM on each fileserver). Adding a small enough L2ARC to each pool to keep total L2ARC size down is likely to result in L2ARCs that are more decorative than meaningful. And splitting the difference probably doesn't really help either side; we might well wind up with excessive RAM use for L2ARC metadata and L2ARCs that are too small to be really useful. If we're going to spend money here at all, it would probably be more sensible and useful to add more RAM.

(RAM costs more than SSDs, but it's automatically global and thus balanced dynamically between pools. Whatever hot data needs it just gets it, no tuning required.)

All of this leads me to the conclusion that L2ARC is probably not a good fit for a situation like ours where you have a bunch of pools and your activity (and desire for L2 caching) is spread relatively evenly over them all. You can maybe make it work and it might improve things a bit, but the effort to performance increase ratio doesn't seem likely to be all that favorable.

(This old entry of mine has some information on L2ARC metadata memory requirements, although I don't know if things have changed since 2013.)

solaris/ZFSMultiPoolL2ARCProblem written at 23:25:54

2016-08-25

The single editor myth(ology)

One of the perniciously popular stories that you'll hear people saying and repeating all around the tech world is that you imprint on your first serious editor (or the first editor you master) and you can never really shift to another one. If you start out with GNU Emacs it's absurd to think that you can really move to vim, and vice versa, and so on. Among other things this leads people at the start of their career to worry about what editor to learn and to try to figure out which one they can stay with for a long time.

This story is garbage.

The reality is that editors are tools, just like many other things you use (such as programming languages). Yes, they're complex tools that you spend a lot of time with, but they're still tools. And you can and will learn and use many tools over the course of your career, editors included. It is both absurd and limiting to believe that you can't (or won't) shift tools repeatedly over the course of your career, or to insist that you can't possibly do so. Computing is far too fluid for that (and I say this as someone with an extremely durable environment).

(I lump IDEs in with editors here.)

Also, it's not as if editors and especially the environments that surround them are going to stand still over the multiple decades of your career. Let's take GNU Emacs as an example. The core behavior of GNU Emacs may be thirty or more years old, but the packages and add-ons around it that make up a high quality editing environment have changed, are changing, and will continue changing and evolving in the future. The high quality GNU Emacs environment of 1996 was quite different from the high quality GNU Emacs environment of 2016, and the 2026 version is likely to be different again. Much as with browsers and addons, this effectively creates a different editor over time, albeit one with broad similarities to its past self.

(One obvious thing driving the changes in sophisticated editor environments is the changes in everything else in computing. There was no git or Mercurial in 1996, for example, so there were no packages to smoothly integrate with them.)

So all of these 'single editor' myths are wrong. Your first editor won't be your entire world (although it will inevitably shape your early views of editing and what you like); you can change your main editor over time (and you probably will); it's perfectly possible to be fluent in multiple editors and editing environments at once; and it's even possible to like different editors for different things instead of trying to do everything in one.

There are sensible reasons to focus on a single editor (or a few) instead of flitting around from editor to editor at the drop of a hat, but these are the same reasons that you might focus your programming efforts on a few languages instead of chasing after every interesting seeming one that catches your eye. And if you're so inclined, you might as well try out editors just as you try out programming languages, either to see if you like them or simply to expand your ideas on what's possible and useful.

(I am not so inclined in either editors or programming languages. With editors it generally feels as if there is very little point in exploring a new one, partly because most of my needs are modest and already pretty well met by editors that I already know. But I know that there are people out there who delight in taking new editors out for test drives.)

tech/SingleEditorMyth written at 23:35:00

2016-08-24

Blindly trying to copy a web site these days is hazardous

The other day, someone pointed a piece of software called HTTrack at Wandering Thoughts. HTTrack is a piece of free software that makes offline copies of things, so I presume that this person for some reason wanted this. I don't think it went as they intended and wanted.

The basic numbers are there in the logs. Over the course of a bit over 18 hours, they made 72,393 requests and received just over 193 MBytes of data. Needless to say, Wandering Thoughts does not have that many actual content pages; at the moment there are a bit over 6400 pages that my sitemap generation code considers to be 'real', some of them with partially duplicated content. How did 6400 pages turn into 72,000? Through what I call 'virtual directories', where various sorts of range based and date based views and so on are layered on top of an underlying directory structure. These dynamic pages multiply like weeds.
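Some quick arithmetic on those figures, treating the rounded numbers above as exact (so the results are only approximate):

```python
# Figures quoted above; 'a bit over' is ignored, so these are rough.
requests = 72393
real_pages = 6400
hours = 18
megabytes = 193

requests_per_page = requests / real_pages
requests_per_second = requests / (hours * 3600)
# Assumes 1 MByte = 1000 KBytes; the conclusion is the same either way.
kbytes_per_request = megabytes * 1000 / requests

print(f"{requests_per_page:.1f} requests per real page")
print(f"{requests_per_second:.2f} requests per second")
print(f"{kbytes_per_request:.2f} KBytes per request")
```

That works out to roughly eleven requests for every real page, a bit over one request a second sustained for the whole 18 hours, and under 3 KBytes of data per request on average.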

(I'm reasonably sure that 72,000 URLs doesn't cover them all by now, although I could be wrong. The crawl does seem to have gotten every real page, so maybe it actually got absolutely everything.)

Dynamic views of things are not exactly uncommon in modern software, and that means that blindly trying to copy a web site is very hazardous to your bandwidth and disk space (and it is likely to irritate the target a lot). You can no longer point a simple crawler (HTTrack included) at a site or a URL hierarchy and say 'follow every link', because it's very likely that you're not going to achieve your goals. Even if you do get 'everything', you're going to wind up with a sprawling mess that has tons of duplicated content.

(Of course HTTrack doesn't respect nofollow, and it also lies in its User-Agent by claiming to be running on Windows 98. For these and other reasons, I've now set things up so that it will be refused service on future visits. In fact I'm in a sufficiently grumpy mood that anything claiming to still be using Windows 98 is now banned, at least temporarily. If people are going to lie in their User-Agent, please make it more plausible. In fact, according to the SSL Server Test, Windows 98 machines can't even establish a TLS connection to this server. Well, I'm assuming that based on the fact that Windows XP fails, as the SSL Server Test doesn't explicitly cover Windows 98.)

PS: DWiki and this host didn't even notice the load from the HTTrack copy. We found out about it more or less by coincidence; a university traffic monitoring system noticed a suspiciously high number of sessions from a single remote IP to the server and sent in a report.

web/SiteCopiesAreHazardous written at 22:18:27

more, less, and a story of typical Unix fossilization

It all started on Twitter:

@palecur: I know enough about Unix to get along but you will never convince me of a meaningful difference between 'less' and 'more'

@thatcks: In the genius Unix tradition, the answer is that less is more.

(Sadly, this is true at about 3 to 4 levels. It's a long story.)

In the beginning, by which we mean V7, Unix didn't have a pager at all. That was okay; Unix wasn't very visual in those days, partly because it was still sort of the era of the hard copy terminal. Then along came Berkeley and BSD. People at Berkeley were into CRT terminals, and so BSD Unix gave us things like vi and the first pager program, more (which showed up quite early, in 3BSD, although this isn't as early as vi, which appears in 2BSD). Calling a pager more is a little bit odd but it's a Unix type of name and from the beginning more prompted you with '--More--' at the bottom of the screen.

All of the Unix vendors that based their work on BSD Unix (like Sun and DEC) naturally shipped versions of more along with the rest of the BSD programs, and so more spread around the BSD side of things. However, more was by no means the best pager ever; as you might expect, it was actually a bit primitive and lacking in features. So fairly early on Mark Nudelman wrote a pager with somewhat more features and it wound up being called less as somewhat of a joke. When less was distributed via Usenet's net.sources in 1985 it became immediately popular, as everyone could see that it was clearly nicer than more, and pretty soon it was reasonably ubiquitous on Unix machines (or at least ones that had some degree of access to stuff from Usenet). In 4.3 BSD, more itself picked up the 'page backwards' feature that had motivated Mark Nudelman to write less, cf the 4.3BSD manpage, but this wasn't the only attraction of less. And this is where we get into Unix fossilization.

In a sane world, Unix vendors would have either replaced their version of more with the clearly superior less or at least updated their version of more to the 4.3 BSD version. Maybe less wouldn't have replaced more immediately, but certainly over say the next five years, when it kept on being better and most people kept preferring it when they had a choice. This would have been Unix evolving to pick a better alternative. In this world, basically neither happened. Unix fossilized around more; no one was willing to outright replace more and even updating it to the 4.3 BSD version was a slow thing (which of course drove more and more people to less). Eventually the Single Unix Specification came along and standardized more with more features than it originally had but still with a subset of less's features (which had kept growing).

This entire history has led to a series of vaguely absurd outcomes on various modern Unixes. On Solaris derivatives more is of course the traditional version with source code that can probably trace itself all the way back to 3BSD, carefully updated to SUS compliance. Solaris would never dream of changing what more is, not even if the replacement is better. Why, it might disturb someone.

(I am not a fan of Solaris's long standing refusal to touch anything. Well, Solaris before Oracle took it over. I haven't looked at Solaris 11, just at Solaris 10 and derivatives like Illumos.)

Oddly, FreeBSD has done the most sensible thing; they've outright replaced more with less. There is a /usr/bin/more but it's the same binary as less and as you can see the more manpage is just the less manpage. OpenBSD has done the same thing but has a specific manpage for more instead of just giving you the less manpage.

On Linux, more is part of the util-linux package but its manpage outright tells you to use less instead:

more is a filter for paging through text one screenful at a time. This version is especially primitive. Users should realize that less(1) provides more(1) emulation plus extensive enhancements.

Given the comments in the manpage, it appears that this version of more is directly derived from the source code of one of the BSD versions. It might even have fewer changes from the original than the Solaris version.

So, now you can see why I say that less is more, or more, or both, at several levels. less is certainly more than more, and sometimes less literally is more (or rather more is less, to put it the right way around).

unix/MoreAndUnixFossilization written at 00:49:57

2016-08-23

Link: File crash consistency and filesystems are hard

Dan Luu's File crash consistency and filesystems are hard is about what it says it's about, with examples and interesting academic citations (that are kind of depressing) and the whole lot. If you want more reading, it has a whole bunch of links to further papers that you can go through. Most of them are apparently depressing reading, because the capsule summary is 'this stuff is hard and almost no one gets it completely right'; not user programs, not filesystems, and not the actual disk drives.

(Sometimes the 'not getting it right' bit is more accurately called 'cheating in the name of good benchmarks'.)

(I have a long standing interest in this area, as do many sysadmins. People tend to get upset if our systems lose their email during power failures and so on. Sometimes we cheat, but we like to do this with our eyes open.)

links/FileConsistencyHard written at 17:28:19

Link: Git from the inside out

Mary Rose Cook's Git from the inside out is a highly detailed and thus fascinating recounting of exactly how Git's graph structure and on disk tracking of things works as you evolve a repository. I knew many of the broad strokes just from general git knowledge but the details are illuminating and quite useful, especially the details around what exactly happens and gets recorded where during more advanced operations like merges (especially with conflicts) and pulls.

(I care about git things at this level of detail because they let me understand what's going on and what I can do about it when things don't go the way I expect them to. I'm not left poking futilely at a black box; instead I have the reassuring feeling that I can at least peek inside.)

links/GitInsideOut written at 17:16:12


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.