Wandering Thoughts archives

2020-01-30

Some notes on Python's email.header.decode_header()

I've recently been investigating some oddly encoded MIME Content-Disposition headers that were turned up by our Python-based system for recording email attachment type information. As part of this I wanted to decode those RFC 2047 encoded-words, obviously using Python because that's what we were using to start with.

Under normal circumstances, you're apparently supposed to read a whole email message into an email.message.EmailMessage and then dig through it. I did not have a whole email message; I didn't even have a whole isolated MIME header. I just had a chunk of RFC 2047 encoded data to decode. The first thing to know is that if you care about good handling of RFC 2047 encoded things, you should be using Python 3. I had an existing old Python 2 program using email.header.decode_header(), and it turned out to mis-decode a header value that Python 3 handled fine using the same function.

Now that I've actually read all of the documentation for email.header, the way you should probably use it to generate a decoded form is to take advantage of its convenience functions: explicitly decode the header, make an email.header.Header instance from the result, and then get the string form of that:

import email.header

def decode_header_value(headerstr):  # illustrative function name
    dcd = email.header.decode_header(headerstr)
    hdr = email.header.make_header(dcd)
    return str(hdr)

(This omits error checking. As is documented in the docstring for decode_header but not in the module's documentation, it can raise at least email.errors.HeaderParseError in some situations, such as a base64 decoding problem.)
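As a rough sketch of what that error checking might look like, the function below wraps both calls and returns None when decoding fails. The name safe_decode and the choice to return None are assumptions made for illustration, not anything the email package provides, and the set of exceptions caught is probably not exhaustive: it covers a parse failure such as bad base64 (HeaderParseError), bytes that are invalid in the claimed character set (UnicodeError), and a character set that Python does not know (LookupError).

```python
import email.errors
import email.header


def safe_decode(headerstr):
    """Decode an RFC 2047 header value to str, or return None on failure.

    safe_decode is a hypothetical helper name, not part of the email package.
    """
    try:
        dcd = email.header.decode_header(headerstr)
        hdr = email.header.make_header(dcd)
        return str(hdr)
    except email.errors.HeaderParseError:
        # For example, a base64 decoding problem in an encoded-word.
        return None
    except (UnicodeError, LookupError):
        # Bytes invalid for the claimed charset, or an unknown charset name.
        return None


print(safe_decode("=?utf-8?q?caf=C3=A9?="))
print(safe_decode("=?x-no-such-charset?q?abc?="))
```

Whether returning None is the right policy depends on the program; logging the raw header value and carrying on may be just as reasonable.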

This makes the module do all the hard work of decoding the somewhat arcane results of calling decode_header. But let's assume that you first wrote your program to directly interpret and use those results, and you'd like to know what you get (in Python 3, which is different from Python 2). What you get back from decode_header is a list of tuples:

[(data1, charset1), (data2, charset2), ...]

Often the list will have only one tuple for various reasons beyond the scope of this entry, but it's always possible to get multiple ones (and in different character sets). There are three main cases of what the tuples can be:

  • the data is a bytestring and the character set is a non-blank normal character set (as a Python 3 string). To produce Unicode, you need to do 'data1.decode(charset1)'. The error handling policy you want to use on decoding is up to you.

  • the data is a Python string and the character set is None. This is what you get back if the entire header is not encoded at all, and probably in some other cases. You can use the data as is, since it's already a string.

  • the data is a bytestring and the character set is None. This is what you get back for a non-encoded portion of a header with some encoded portion (and possibly in other circumstances). In theory this is pure ASCII, but don't hold your breath; you probably want to decode this to a string as UTF-8, perhaps with some liberal error handling policy.
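The three cases above can be sketched as a small function that walks the list of tuples and produces a single string. The name chunks_to_text, the default 'replace' error policy, and the plain concatenation of the pieces are all assumptions for illustration; this is not what make_header itself does internally.

```python
import email.header


def chunks_to_text(chunks, errors="replace"):
    """Join decode_header() results into one str (hypothetical helper)."""
    parts = []
    for data, charset in chunks:
        if isinstance(data, str):
            # The whole header was unencoded; data is already a str.
            parts.append(data)
        elif charset is None:
            # Unencoded bytes next to an encoded-word; nominally ASCII,
            # decoded here as UTF-8 (an assumption, following the text).
            parts.append(data.decode("utf-8", errors))
        else:
            # Bytes plus a named charset. An unknown or blank charset
            # name raises LookupError, which is not handled here.
            parts.append(data.decode(charset, errors))
    return "".join(parts)


chunks = email.header.decode_header("Re: =?utf-8?q?caf=C3=A9?=")
print(chunks_to_text(chunks))
```

Simply concatenating the pieces is a simplification. The string form of a Header applies its own rules about spaces between encoded and unencoded chunks, so the two approaches may not always agree on whitespace.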

If the RFC 2047 encoding is sufficiently mangled in the right way, you may get back a tuple with a character set of '' (a blank string) instead of the exception that you may have been expecting. On the one hand, this will make .decode fail. On the other hand, it fails with an 'unknown encoding' error (a LookupError), which you can also get when people simply claim that their header is in some weird character encoding that Python has never heard of, so you already need to handle it.
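One way to handle both the blank and the unknown character set in the same place is to fall back to a default encoding. This is only a sketch: decode_chunk is a made-up name, and treating such data as UTF-8 with 'replace' error handling is an assumption about what is acceptable, not a recommendation from the email module.

```python
import email.header


def decode_chunk(data, charset, fallback="utf-8"):
    """Decode one (data, charset) tuple to str (hypothetical helper)."""
    if isinstance(data, str):
        return data
    try:
        # A charset of None or '' falls through to the fallback encoding.
        return data.decode(charset or fallback, errors="replace")
    except LookupError:
        # The header named a charset that Python does not know.
        return data.decode(fallback, errors="replace")


# An encoded-word whose charset field is empty.
for data, charset in email.header.decode_header("=??q?abc?="):
    print(repr(charset), decode_chunk(data, charset))
```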

All of this is a mess. I suggest that you just call make_header, because then you get to file bugs with the Python people if it doesn't work (and doesn't raise a clear error exception), as opposed to patching your own code for yet more special cases.

In general, unfortunately, the email.header module is probably not designed to deal well with arbitrary input from the general Internet; I suspect that it tacitly assumes that it's mostly dealing with well-formed email. There are a lot of mail-generating programs out there with bugs and generous interpretations of what they can get away with, especially if you have to deal with spam and malware (which are often generated by programs with more than the usual number of bugs).

python/DecodeEmailHeaderNotes written at 23:35:10

