2016-07-03
An irritating little bug in the latest GNU Emacs Python autoindent code
I really like having smart autoindent in my editor when writing code, Python code included. When it works, autoindent does exactly what I would do by hand, does it more easily, and in the process shows me errors in my code (if the autoindent is 'wrong', it is a signal that something earlier is off). But the flip side of this is that when autoindent goes wrong it can be a screaming irritation, as I flip from working with my editor to actively fighting it.
Unfortunately the latest official version of GNU Emacs has such an
issue in its Python autoindent code, under conditions that are
probably rare. To see the bug, set Emacs up with python-indent-offset
set to 8 and indent-tabs-mode set to t, and then enter:
def abc():
	if d in e:
		pass	# Hit return here:
If you put your cursor on the end of the comment and hit return,
autoindent doesn't add any indentation at all. It should add one
level of indentation. Also, once you have this code in a .py file
you don't need to set anything in Emacs; Emacs will auto-guess that
the indent offset is 8 and then the mere presence of tabs will cause
things to explode. This makes this issue especially annoying and/or
hazardous.
Some people will say that this serves me right for still using tabs for indentation in my Python code. I'm aware that there's been a general movement in the Python community to indent all Python code with only spaces, regardless of how much you indent it by, but for various reasons I have strongly resisted this. One of them is that I edit Python code in multiple editors, not all of them ones with smart autoindentation, and space-based indenting is painful in an editor that doesn't do it for you. Well, at least using generous indents with manual spaces is painful, and I'm not likely to give that up any time soon.
(I like generous indents in code. Small indent levels make everything feel crammed together and it's less obvious if something is misindented when everything is closer. Of course Python's many levels of nesting doesn't necessarily make this easy; by the time I'm writing an inner function in a method in a class, I'm starting to run out of horizontal space.)
PS: I suspect that I'm going to have to give up my 'indent with tabs' habits some day, probably along with my 8-space indents. The modern Python standard seems to be 4-space indent with spaces and there's a certain amount to be said for the value of uniformity.
(People are apparently working on Python equivalents of Go's gofmt,
eg yapf. This doesn't entirely
make my issues go away, but at least it would give me some tools
to more or less automatically convert existing code over so that I
don't have to deal with a mishmash of old and new formatting in
different files or projects.)
2016-07-02
cal's unfortunate problem with argument handling
Every so often I want to see a calendar just to know things like what
day of the week a future date will be (or vice versa). As an old Unix
person, my tool for this is cal. Cal is generally a useful program,
but it has one unfortunate usage quirk that arguably shows a general
issue with Unix style argument handling.
By default, cal just shows you the current month. Suppose that
you are using cal at the end of June, and you decide that you
want to see July's calendar. So you absently do the obvious thing
and run 'cal 7' (because cal loves its months in decimal form).
This does not do what you want; instead of seeing the month calendar
for July of this year, you see the nominal full year calendar for
AD 7. To see July, you need to do something like 'cal 7 2016'
or 'cal -m 7'.
On the one hand, this is regrettably user hostile. 'cal N' for N
in the range of 1 to 12 is far more likely to be someone wanting
to see the given month for the current year than it is to be someone
who wants to see the year calendar for AD N. On the other hand,
it's hard to get out of this without resorting to ugly heuristics.
It's probably just as common to want a full year calendar from cal
as it is to want a different month's calendar, and both of these
operations would like to lay claim to the single argument 'cal N'
invocation because that's the most convenient way to do it.
If we were creating cal from scratch, one reasonably decent option
would be to declare that all uses of cal without switches to
explicitly tell it what you wanted were subject to heuristics. Then
cal would have a license to make 'cal 7' mean July of this year
instead of AD 7, and maybe 'cal 78' mean 'cal 1978' (cf the
note in the V7 cal manpage). If
you really wanted AD 7's year calendar, you'd give cal a switch
to disambiguate the situation; in the mean time, you'd have no
grounds for complaint. But however nice it might be, this would
probably strike people as non-Unixy. Unix commands traditionally
have predictable argument handling, even if it's not friendly,
because that's what Unix considers more important (and also easier,
if we're being honest).
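(To illustrate, the sort of heuristic I have in mind might look like this; this is a Python sketch of the idea, not how any real cal behaves, and the two-digit year cutoff in particular is entirely my own assumption:)

# Sketch of a heuristic for a bare 'cal N' argument. Purely
# illustrative; no actual cal implementation works this way.
import datetime

def interpret_bare_arg(n):
    today = datetime.date.today()
    if 1 <= n <= 12:
        # Far more likely to be a month of the current year
        # than the year calendar for AD N.
        return ("month", n, today.year)
    elif n < 100:
        # Per the V7 manpage's note, 'cal 78' almost certainly
        # means 1978. (The exact cutoff is my assumption.)
        return ("year", 1900 + n)
    else:
        return ("year", n)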
In a related issue, I have now actually read the manpages for modern
versions of cal (FreeBSD and Linux use different implementations)
and boy has it grown a lot of options by now (options that will
probably make my life easier if I can remember them and remember
to use them). Reassuringly, the OmniOS version of cal still takes
no switches; it's retained the V7 'cal [[month] year]' usage
over all of these years.
2016-07-01
How backwards compatibility causes us pain with our IMAP servers
One of the drawbacks of operating a general purpose Unix environment for decades is that backwards compatibility can wind up causing you to get trapped in troublesome situations. In particular, the weight of backwards compatibility has wound up requiring us to configure our IMAP server environment in a way that causes various problems.
Unix IMAP servers generally have a setting for where the various
IMAP mailboxes and folders are stored on disk. Back when we first set up UW-IMAP at least two decades ago,
we wound up with a situation where UW-IMAP looked for and expected
to find people's mail under their home directory, $HOME. People
could manually put them in a subdirectory of $HOME if they wanted,
or they could just drop them straight in $HOME.
(Based on old historical UW-IMAP source, this wasn't even a configuration option at the time. It was just what UW-IMAP was hard-coded to assume about mailbox and folder layout for everything except your INBOX mailbox.)
Some people left things as they were and had various mailboxes in
$HOME. Some people decided to be neater and put everything in a
subdirectory, but which subdirectory they picked varied; some
people used $HOME/Mail, some people used $HOME/IMAP, and so on.
As we upgraded our IMAP server software over the years, eventually
moving from UW-IMAP to Dovecot, we had to keep this configuration
setting intact. If we dared change it, for example to say that all
IMAP mailboxes and folders would henceforth be in $HOME/IMAP, we
would be forcing lots of people to either change their client's
IMAP configuration or relocate files and directories at the Unix
level (and probably both for some people). This would have been a
massive flag day and a massive disruption to our entire user base,
not all of whom are even on campus, with serious effects on their
access to much of their email if things didn't go exactly right.
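(For concreteness, in Dovecot this is the mail_location setting. A configuration in the spirit of ours might look something like the following, although this is a sketch based on Dovecot's documented mbox syntax rather than our actual configuration:)

# Folders and mailboxes live directly under $HOME; INBOX stays in
# the traditional system mail spool. A sketch, not our real config.
mail_location = mbox:~/:INBOX=/var/mail/%u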
Now, there are two problems with an IMAP server that thinks your
mailboxes and folders start in $HOME. The lesser problem is that
if you ask the IMAP server for a list of all of your top level
folders and mailboxes, you get an ls of $HOME (complete with
all of your dotfiles). This is at least a bit annoying and it turns
out that some software doesn't cope well with this, including our
webmail system.
(We wound up having to force our webmail system to confine itself
to a subfolder of the IMAP namespace and thus a subdirectory of
$HOME. People who wanted to use webmail had to do some Unix and
IMAP rearrangement, but at least this was an opt-in change; people
who didn't care about webmail were unaffected.)
The more serious problem is that there is an IMAP operation that
requires recursively finding all of your folders, subfolders, and
mailboxes. This obviously requires recursing through the actual
directory structure, and Dovecot will do this without limit,
following directory symlinks as it goes. If you have a symlink somewhere
under your $HOME that creates a cycle, Dovecot will follow this
endlessly. If you have a symlink that escapes from your $HOME
into the wider filesystem, Dovecot will also follow this and start
trying to walk around (where it may hit someone else's symlink
cycle). In either case, your Dovecot process basically hangs there
and hammers away at our fileservers.
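(For illustration, the operation involved is presumably IMAP's LIST command with a '*' wildcard, which asks the server to enumerate every folder recursively. A minimal Python sketch of a client triggering it, with the host and credentials as placeholders:)

# With a '*' pattern, the server must walk the entire directory
# tree under the mail root (here, $HOME), and Dovecot follows
# symlinks as it goes.
import imaplib

conn = imaplib.IMAP4_SSL("imap.example.org")  # placeholder host
conn.login("someuser", "somepassword")        # placeholder credentials
typ, folders = conn.list('""', "*")           # the recursive LIST
conn.logout()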
We're very fortunate in that very few clients seem to invoke this IMAP operation and so hung Dovecot processes using up CPU and NFS bandwidth are pretty uncommon. But they're not unknown; we get a few every so often. And it's mostly because of this backwards compatibility need.
2016-06-30
Some advantages of using argparse to handle arguments as well as options
I started using Python long enough ago that there was only the getopt module, which was okay because that's what I was used to from C and other Unix languages (shell, Perl, etc), and then evolved for a bit through optparse; I only started using argparse relatively recently. As a result of all of this background, I'm used to thinking of 'argument handling' as only processing command line switches and their arguments for you, and giving you back basically a list of the remaining arguments, which is your responsibility to check how many there are, parse, and so on.
Despite being very accustomed to working this way, I'm starting to abandon it when using argparse. Part of this is what I discovered the first time I used argparse, namely that it's the lazy way. But I've now used argparse a second time and I'm feeling that there are real advantages to letting it handle as many positional arguments as possible in as specific a way as possible.
For instance, suppose that you're writing a Python program that takes exactly five positional arguments. The lazy way to handle this is simply:
parser.add_argument("args", metavar="ARGS", nargs=5)
If you take exactly five arguments, they probably mean different things. So the better way is to add them separately:
parser.add_argument("eximid", metavar="EXIMID")
parser.add_argument("ctype", metavar="CONTENT-TYPE")
parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION")
parser.add_argument("mname", metavar="MIMENAME")
parser.add_argument("file", metavar="FILE")
Obviously this gives you easy separate access to each argument in
your program, but the really nice thing is that it adds
some useful descriptive context to your program's usage message.
If you choose the metavar values well, your usage message will
strongly hint at what needs to be supplied as each argument. But
we can do better, because argparse is perfectly happy to let you
attach help to positional arguments as well as to switches (and it
will then print it out again in the usage message, all nicely
formatted and so on).
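(For instance, here's the same set of arguments with help text attached; the help strings are my guesses at what these arguments mean, purely for illustration:)

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("eximid", metavar="EXIMID",
                    help="Exim message ID of the email involved")
parser.add_argument("ctype", metavar="CONTENT-TYPE",
                    help="Content-Type of the MIME part")
parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION",
                    help="Content-Disposition of the MIME part")
parser.add_argument("mname", metavar="MIMENAME",
                    help="filename claimed in the MIME headers")
parser.add_argument("file", metavar="FILE",
                    help="file the decoded MIME part was saved to")
args = parser.parse_args()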
You can do the same thing by hand, of course; there's nothing preventing you from writing the same documentation with manual argument parsing and printing it out appropriately (although argparse does do a good job of formatting it). But it feels easier with argparse and it feels more natural, because argparse lets me put everything to do with a positional argument in one spot; I can name the internal variable, specify its visible short name, and then add help, all at once. If nothing else, this is likely to keep all of these things in sync with each other.
(And I'm not going to underestimate the importance of automatic good formatting, because that removes a point of friction in writing the help message for a given positional argument.)
The result of all of this is that using argparse for positional arguments in my latest program has effortlessly given me not just a check for having the right number of positional arguments but a bunch of useful help text as well. Since I frequently don't touch programs for a year or two, I foresee this being a useful memory jog for future me.
In summary, if I can get argparse to handle my positional arguments in future Python programs, I'm going to let it. I've become convinced that it's not just the lazy way, it's the better way.
(This is where some Python people may laugh at me for having taken so long to start using argparse. In my vague defense, we still have some machines without Python 2.7.)
What makes an email MIME part an attachment?
If you want to know what types of files your users are getting
in email, it's likely that an important
prerequisite is being able to recognize attachments in the first
place. In a sane and sensible world, this would be easy; it would
just be any MIME part with a Content-Disposition header of
attachment.
I regret to tell you that this is not a sane world. There are mail clients that
give every MIME part an inline Content-Disposition, so naturally
this means that most mail clients can't trust an inline C-D and
make their attachment versus non-attachment decisions based on other
things. (I expect that there are mail clients that ignore a C-D of
attachment, too, and will display some of those parts inline if
they feel like it, but for logging we don't care much about that.)
MIME parts may have (proposed, nominal) filenames associated with
them, from either Content-Type or Content-Disposition. However,
neither the presence nor the absence of a MIME filename determines
something's attachment status. Real attachments may have no proposed
filename, and there are mail clients that attach filenames to things
like inline images. And really, I can't argue with them; if the user
told you that this (inline) picture is mydog.jpg, you're certainly
within your rights to pass this information on in the MIME headers.
The MIME Content-Type provides at least hints, in that you can
probably assume that most mail clients will treat things with any
application/* C-T as attachments and not try to show them inline.
And if you over-report here (logging information on 'attachments'
that will really be shown inline), it's relatively harmless. It's
possible that mail clients do some degree of content sniffing, so
the C-T is not necessarily going to determine how a mail client
processes a MIME part.
(At one point web browsers were infamous for being willing to
do content sniffing on HTTP replies, so that what you served
as eg text/plain might not be interpreted that way by some
browsers. One can hope that mail clients are more sane, but
I'm not going to hold my breath there.)
One caution here: trying to make decisions based on things having
specific Content-Type values is a mug's game. For example, if you're
trying to pick out ZIP files based on them having a C-T of
application/zip, you're going to miss a ton of them; actual real
email has ZIP files with all sorts of MIME types (including the
catch-all value of application/octet-stream). My impression is that
the most reliable predictor of how a mail client will interpret an
attachment is actually the extension of its MIME filename.
(While the gold standard for figuring out if something is a ZIP
file or whatever is actually looking at the data for the MIME part,
please don't use file (or libmagic) for general file classification.)
One solution is certainly to just throw up our hands and log everything; inline, attachment, whatever, just log it all and we can sort it out later. The drawback on this is that it's going to be pretty verbose, even if you exclude inline text/plain and text/html, since lots of email comes with things like attached images and so on.
The current approach I'm testing is to use a collection of signs
to pick out attachment-like things, with some heuristics attached.
Does a MIME part declare a MIME filename ending in .zip? Then
we'll log some information about it. Ditto if it has a Content-Disposition
of attachment, ditto if it has a Content-Type of application/*,
and so on. I'm probably logging information about some things that
mail clients will display inline, but it's better than logging too
little and missing things.
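(A rough sketch of this sort of multi-sign check, using Python's email module; the real thing has more signs and actually logs information instead of just yielding it:)

import email

def interesting_parts(msg):
    # Yield (content-type, filename) for attachment-like parts.
    for part in msg.walk():
        if part.is_multipart():
            continue
        fname = part.get_filename()
        cdisp = part.get("Content-Disposition", "")
        if (fname and fname.lower().endswith(".zip")) \
           or cdisp.lstrip().lower().startswith("attachment") \
           or part.get_content_maintype() == "application":
            yield (part.get_content_type(), fname)

with open("message.eml") as fp:   # placeholder input file
    msg = email.message_from_file(fp)
for ctype, fname in interesting_parts(msg):
    print(ctype, fname)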
(Then there is the fun game of deciding how to exclude frequently emailed attachment types that you don't care about because you'll never be able to block them, like PDFs. Possibly the answer is to log them anyways just so you know something about the volume, rather than to try to be clever.)
2016-06-29
Modern DNS servers (especially resolvers) should have query logging
Since OpenBSD has shifted to using Unbound as their resolving DNS server, we've been in the process of making this shift ourselves as we upgrade, for example, our local OpenBSD-based resolver machines. One of the things this caused me to look into again is what Unbound offers for logging, and this has made me just a little bit grumpy.
So here is my opinion:
Given the modern Internet environment, every DNS server should be capable of doing compact query logging.
By compact query logging, I mean something that logs a single line with the client IP, the DNS lookup, and the resolution result for each query. This logging is especially important for resolving nameservers, because they're the place where you're most likely to want this data.
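(To make this concrete, I'm imagining something like this hypothetical log line; the exact format here is entirely made up, and the addresses are documentation placeholders:)

client 192.0.2.51 query www.example.org. IN A -> NOERROR 203.0.113.7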
(How the logging should be done is an interesting question. Sending it to syslog is probably the easiest way; the best is probably to provide a logging plugin interface.)
What you want this for is pretty straightforward: you want to be able to spot and find compromised machines that are trying to talk to their command & control nodes. These machines leave traces in their DNS traffic, so you can use logs of that traffic to try to pick them out (either at the time or later, as you go back through log records). Sometimes what you want to know and search is the hosts and domains being looked up; other times, you want to search and know what IP addresses are coming back (attackers may use fast-flux host names but point them all at the same IPs).
(Quality query logging will allow you relatively fine grained control over what sort of queries from who get logged. For example, you might decide that you're only interested in successful A record lookups and then only for outside domains, not your own ones.)
Query logging for authoritative servers is probably less useful, but I think that it should still be included. You might not turn it on for your public DNS servers, but there are other cases such as internal DNS.
As for Unbound, it can sort of do query logging but it's rather verbose. Although I haven't looked in detail, it seems to be just the sort of thing you'd want when you have to debug DNS name resolution problems, but not at all what you want to deal with if you're trying to do this sort of DNS query monitoring.
2016-06-27
Today's lesson on the value of commenting your configuration settings
We have a relatively long-standing Django web application that was first written for Django 1.2 and hadn't been substantially revised since. Earlier this year I did a major rework in order to fully update it for Django 1.9; not just to be compatible, but to be (re)structured into the way that Django now wants apps to be set up.
Part of Django's app structure is a settings.py file that contains,
well, all sorts of configuration settings for your system; you
normally get your initial version of this by having Django create
it for you. What Django wants you to have in this file and how it's
been structured has varied over Django versions, so if you have a
five year old app its settings.py file can look nothing like what
Django would now create. Since I was doing a drastic restructuring
anyways, I decided to deal with this issue the simple way. I'd have
Django write out a new stock settings.py file for me, as if I was
starting a project from scratch, and then I would recreate all of
the settings changes we needed. In the process I would drop any
settings that were now quietly obsolete and unnecessary.
(Since the settings file is just ordinary Python, it's easy to wind up setting 'configuration options' that no longer exist. Nothing complains that you have some extra variables defined, and in fact you're perfectly free to define your own settings that are used only by your app, so Django can't even tell.)
In the process of this, I managed to drop (well, omit copying) the
ADMINS setting that makes Django send us email if there's an
uncaught exception (see Django's error
reporting documentation).
I didn't spot this when we deployed the new updated version of the
application (I'm not sure I even remembered this feature). I only
discovered the omission when Derek's question here
sent me looking at our configuration file to find out just what
we'd set and, well, I discovered that our current version didn't
have anything. Oops, as they say.
Looking back at our old settings.py, I'm pretty certain that I omitted
ADMINS simply because it didn't have any comments around it to tell
me what it did or why it was there. Without a comment, it looked like
something that old versions of Django set up but new versions didn't
need (and so didn't put into their stock settings.py). Clearly if I'd
checked what, if anything, ADMINS meant in Django 1.9, I'd have spotted
my error, but, well, people take shortcuts, myself included.
(Django does have global documentation for settings, but there is no global alphabetical index of settings so you can easily see what is and isn't a Django setting. Nor are settings grouped into namespaces to make it clear what they theoretically affect.)
This is yet another instance of some context being obvious to me
at the time I did something but it being very inobvious to me much
later. I'm sure that when I put the ADMINS setting into the initial
settings.py I knew exactly what I was doing and why, and it was
so obvious I didn't think it needed anything. Well, it's five years
later and all of that detail fell out of my brain and here I am,
re-learning this lesson yet again.
(When I put the ADMINS setting and related bits and pieces back
into our settings.py, you can bet that I added a hopefully clear
comment too. Technically it's not in settings.py, but that's
a topic for another blog entry.)
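(For illustration, the restored setting plus its comment might look something like this; the addresses are placeholders and the comment text is paraphrased:)

# Django emails uncaught exceptions from the running app to
# everyone listed in ADMINS. Don't drop this when restructuring;
# see Django's 'Error reporting' documentation.
ADMINS = [("CS point person", "app-admin@example.com")]

# The From: address that Django's error email uses.
SERVER_EMAIL = "django@example.com"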
If you send email, don't expect people to help you with abuse handling
I'll start with the tweets:
@thatcks: I see these spammers used @MailChannels to hit us once before, in April. I reported them then, but I have no time for this shit any more.
Back in April, a persistent long-term spammer of one of our addresses attempted to send it spam via MailChannels, a commercial email sending outfit. I complained to MC's abuse contacts at the time, because I'm an optimist, and someone at MC got back to me to tell me this spammer had been fixed. Then they came back now (well, a couple of days ago).
@thatcks: As has been said many, many times before, expecting the receivers of email to be your anti-spam detection method is utterly broken.
Some people might say that I should do the 'responsible' thing and once again report this incident to MailChannels. These people are wrong. It is always the sender's responsibility to detect that they are sending spam and take steps to deal with it; as has been said many years ago, abuse reports are a gift (one that comes from fewer and fewer people these days). In my case, my only real interest is in making the spam stop and generally I have far more effective ways of doing this than sending in complaints.
(By the way, I hope we can agree that there is absolutely no moral basis for saying that people have a responsibility to report spam. If your service is spamming me, I am getting absolutely nothing out of this and I accordingly owe you absolutely nothing. In fact, morally speaking you owe me for inflicting costs on me.)
In this specific situation, it's also clear that sending in complaints is not effective (cf). After all, I already did that once, got an assurance that it was dealt with, and the spammer came back a couple of months later. A repeat report is likely to net exactly the same result at best.
Then MailChannels popped up:
@MailChannels: @thatcks We don't take abuse of our network lightly and are keen to investigate. Please send us sample messages to support@mailchannels.com
This is a form tweet. It betrays at least an inability to read my original message.
(Replying to aggravated people with form tweets that betray a lack of thinking human involvement is, at the least, going to aggravate them further. So it proved here.)
@thatcks: .@MailChannels You're asking me to do more work to help you out. Why would I do that? If you want, you have enough information already.
I gave the form tweet all the response I felt that it deserved. And it's true that MailChannels has all the information they need; they could just search their April abuse reports for my name, find the address here that I reported was hit, and see if that address was sent to recently. Why yes, yes it was. MailChannels' email to it was even rejected this time around too, which really ought to be one of a number of danger signs for MailChannels. Certainly this would take some work on MailChannels' part, but you know, they're the people that this benefits, not me; I've already taken effective steps on our side.
(MailChannels benefits because they get rid of a spammer who may drag their reputation down and damage the deliverability of email for other paying customers, which would cost MailChannels money.)
Of course, I expect that MailChannels did nothing here. That's the easy way to blow off problem indicators while feeling good about yourself; you can say 'well, if it was real the person would have totally taken us up on our offer'. They can tick off the 'we tried' box and consider the matter done. And really, what mail sending service can afford to actually do a good job with spam?
(Applications of this pattern to, say, bug reports and bug trackers are left as an exercise for the reader.)
2016-06-26
How not to maintain your DNS (part 22)
As with the previous installment, this example of bad DNS setup is sufficiently complicated that it's best illustrated in text instead of trying to show DNS output.
We start with the domain zshine.com. At the moment its WHOIS registration says that it has the DNS servers ns1.gofreeserve.com and ns2.gofreeserve.com. If you query the nameservers for .com, they will agree with this and give you an IP for each nameserver, 192.196.158.106 and 192.196.159.106 respectively.
According to WHOIS, gofreeserve.com's registered nameservers are (ns1 ns2 ns11 ns12).lampnetworks.com, and the .com nameservers agree with this. All of these nameservers report themselves as authoritative for gofreeserve.com. None of them know about either ns1.gofreeserve.com or ns2.gofreeserve.com; in fact they authoritatively claim that neither exist.
As the capstone, neither 192.196.158.106 nor 192.196.159.106 respond to DNS requests, so even if you accept the glue records from the .com nameservers you can't actually resolve anything about zshine.com. Nor do the lampnetworks.com nameservers have any information about zshine.com.
The results of this are somewhat interesting. Obviously, zshine.com essentially doesn't exist in DNS; you can't look up an A or MX record for it. Working out why can be a little bit tricky, though. With at least some resolving DNS servers, all you get is a timeout when you query for even just zshine.com's NS records. In order to hunt things down I had to go digging in WHOIS data and then looking at gofreeserve.com's own DNS data.
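(One way to hunt this down is to poke the glue record IPs directly instead of going through a resolver. A sketch with the dnspython module, which times out exactly as described:)

# Ask each listed glue IP for zshine.com's NS records directly.
import dns.exception
import dns.message
import dns.query

q = dns.message.make_query("zshine.com", "NS")
for server in ("192.196.158.106", "192.196.159.106"):
    try:
        resp = dns.query.udp(q, server, timeout=5)
        print(server, resp.rcode())
    except dns.exception.Timeout:
        print(server, "timed out")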
As far as I can guess, this is a version of glue record hell. Gofreeserve does appear to offer DNS handling as one of their services, and at some point it was clearly done through those ns1 and ns2 DNS names. However, things have changed since then, and not all domains that used them have had their WHOIS data updated. In fact, perhaps some domains have been dropped entirely by Gofreeserve but haven't changed anything. Without glue records in the DNS, we'd probably get a failure to resolve the listed nameservers. With glue records, well, clearly some of the time we get a timeout trying to query them.
(Some casual Internet searches suggest that there are any number
of domains still using ns[12].gofreeserve.com as their DNS servers.
I won't speculate why the people behind these domains don't seem
to have noticed that they don't work any more, although this case
may have a relatively sensible reason, namely that this is probably
a secondary domain name for a firm with their primary domain name
in .cn.)
PS: Since the occasion for me noticing this issue with zshine.com is something claiming to be it trying to send email to my spamtraps, I'm not too upset about its DNS issues.
2016-06-25
What Python 3 versions Django supports, and when this changes
I was idly skimming the in-progress release notes for Django 1.10 when one of the small sections that I usually ignore jumped out at me instead:
Like Django 1.9, Django 1.10 requires Python 2.7, 3.4, or 3.5. [...]
Since I've recently been thinking about running Django on Python 3, the supported Python 3 versions caught my eye. More exactly, that it was a short list. This made me wonder what Django versions will support what Python 3 versions, and for how long.
In mid 2015, the Django project published a roadmap and said:
We will support a Python version up to and including the first Django LTS release whose security support ends after security support for that version of Python ends. For example, Python 3.3 security support ends September 2017 and Django 1.8 LTS security support ends April 2018. Therefore Django 1.8 is the last version to support Python 3.3.
So we need to look at both the Django release schedule and the Python release and support schedule. On the Django side, Django's next LTS release is '1.11 LTS', scheduled to release in April 2017 and be supported through April 2020 (and it's expected to be the last version supporting Python 2.7, since official Python 2.7 security support ends in 2020). After that is Django 2.2 in April 2019, supported through April 2022. On the Python side, the Python team appears to be doing 3.x releases roughly every 18 months (see eg PEP 494 on Python 3.6's release schedule) and giving them security support for five years after their initial release. If this is right, Python 3.4 will be supported through March 2019 and Python 3.5 through September 2020; 3.6 is expected in December 2016 (supported through December 2021) and thus 3.7 in roughly May of 2018 (supported through May 2023).
Putting all of this together, I get an expected result of:
- Python 3.4 will be supported through Django 1.11; Django 2.0
(nominally due December 2017) will drop support for it.
- Python 3.5 and 3.6 will probably be supported through Django 2.2.
Django 1.11 will almost certainly be the first release to support
Python 3.6.
- Python 3.7's exact Django support range is up in the air since at this point I'm projecting both Python and Django release schedules rather far into the misty future.
Ubuntu 14.04 LTS has Python 3.4 and Ubuntu 16.04 LTS has 3.5. Both will be supported long enough to run into the maximum likely Django version that supports their Python 3 version, although only more or less at the end of each Ubuntu LTS's lifespan.
(I'm going to have to mull over what this means for Python 3 migration plans for our Django app. Probably a real Python 3 migration attempt is closer than I thought it would be.)