Some notes on Python's email.header module
I've recently been investigating some oddly encoded MIME Content-Disposition headers that were turned up by our Python-based system for recording email attachment type information. As part of this I wanted to decode those RFC 2047 encoded-words, obviously using Python because that's what we were using to start with.
Under normal circumstances, you're apparently supposed to read a
whole email message into a message object
and then dig through it. I did not have a whole email message; I
didn't even have a whole isolated MIME header. I just had a chunk
of RFC 2047 encoded data to decode. The first thing to know is that
if you care about good handling of RFC 2047 encoded things, you
should be using Python 3. I had an existing old Python 2 program
and it turned out to mis-decode a header value that Python 3 handled fine
using the same function.
Now that I've actually read all of the documentation for email.header, the way to generate a decoded form is probably to lean on its convenience functions: explicitly decode the header, make an email.header.Header instance from the result, and then get the string form of that:
    dcd = email.header.decode_header(headerstr)
    hdr = email.header.make_header(dcd)
    return str(hdr)
(This omits error checking. As is documented in the docstring for decode_header but not in the module's documentation, it can raise at least email.errors.HeaderParseError in some situations, such as a base64 decoding problem.)
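As a minimal sketch of the same three steps with some error checking added: the function name decode_rfc2047 and the choice to return None on failure are my own assumptions for illustration, not anything the module prescribes. Besides HeaderParseError, it also catches LookupError and UnicodeError, on the assumption that an unknown character set or undecodable bytes will surface as one of those when the Header is built.

```python
import email.errors
import email.header


def decode_rfc2047(headerstr):
    """Return headerstr with RFC 2047 encoded-words decoded.

    Returns None if decoding fails (a hypothetical policy for this sketch).
    """
    try:
        dcd = email.header.decode_header(headerstr)
        hdr = email.header.make_header(dcd)
        return str(hdr)
    except email.errors.HeaderParseError:
        # For example, a base64 decoding problem in decode_header.
        return None
    except (LookupError, UnicodeError):
        # Unknown character set name, or bytes that do not decode in it.
        return None


if __name__ == "__main__":
    print(decode_rfc2047("=?utf-8?q?caf=C3=A9?="))
```

Whether None, the raw input, or a re-raised exception is the right failure result depends on what your program does with the header afterward.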
This makes the module do all the hard work of decoding the somewhat
arcane results of calling
decode_header. But let's assume that
you first wrote your program to directly interpret and use those
results, and you'd like to know what you get (in Python 3, which is
different from Python 2). What you get back from decode_header
is a list of tuples:
[(data1, charset1), (data2, charset2), ...]
Often the list will have only one tuple for various reasons beyond the scope of this entry, but it's always possible to get multiple ones (and in different character sets). There are three main cases of what the tuples can be:
- the data is a bytestring and the character set is a non-blank normal
character set (as a Python 3 string). To produce Unicode, you need
to do 'data1.decode(charset1)'. The error handling policy you want to use on decoding is up to you.
- the data is a Python string and the character set is
None. This is what you get back if the entire header is not encoded at all, and probably in some other cases. You can use the data as is, since it's already a string.
- the data is a bytestring and the character set is
None. This is what you get back for a non-encoded portion of a header with some encoded portion (and possibly in other circumstances). In theory this is pure ASCII, but don't hold your breath; you probably want to decode this to a string as UTF-8, perhaps with some liberal error handling policy.
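If you do want to interpret the tuples yourself, the three cases above can be sketched roughly as follows. The function name tuples_to_str, the default 'replace' error policy, and joining the pieces by plain concatenation are all assumptions made for this illustration. The sketch also makes no attempt to cope with a character set name that Python doesn't recognize.

```python
import email.header


def tuples_to_str(pairs, errors="replace"):
    """Turn decode_header()-style (data, charset) tuples into one string."""
    out = []
    for data, charset in pairs:
        if isinstance(data, str):
            # Already a string (the header was not encoded at all).
            out.append(data)
        elif charset is None:
            # Unencoded bytes; in theory ASCII, decoded here as UTF-8.
            out.append(data.decode("utf-8", errors))
        else:
            # Bytes with a declared character set.
            out.append(data.decode(charset, errors))
    return "".join(out)


if __name__ == "__main__":
    pairs = email.header.decode_header("=?utf-8?q?caf=C3=A9?=")
    print(pairs)
    print(tuples_to_str(pairs))
```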
If the RFC 2047 encoding is sufficiently mangled in the right way,
you may get back a tuple with a character set of '' (a blank string)
instead of the exception that you may have been expecting. On the
one hand this will make .decode fail; on the other hand, it fails
with an 'unknown encoding' error and you can get that if people
just claim their header is in some weird character encoding Python
has never heard of before, so you already need to handle it.
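Since a blank character set and an unknown one fail the same way, one handler can cover both. Here is a hedged sketch for decoding a single tuple; the name decode_piece and the fallback to UTF-8 with 'replace' error handling are assumptions for illustration, and you may well prefer to reject such headers instead.

```python
def decode_piece(data, charset, fallback="utf-8"):
    """Decode one (data, charset) tuple, tolerating bad charset names."""
    if isinstance(data, str):
        return data
    if charset is None:
        return data.decode(fallback, "replace")
    try:
        return data.decode(charset, "replace")
    except LookupError:
        # 'unknown encoding': raised for '' and for names Python
        # has never heard of alike.
        return data.decode(fallback, "replace")


if __name__ == "__main__":
    print(decode_piece(b"abc", ""))
    print(decode_piece(b"abc", "x-no-such-charset"))
```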
All of this is a mess. I suggest that you just call email.header.make_header,
because then you get to file bugs with the Python people if it doesn't
work (and doesn't raise a clear error exception), as opposed to patching
your own code for yet more special cases.
In general, unfortunately, the email.header module is probably not designed to deal well with arbitrary input from the general Internet; I suspect that it tacitly assumes that it's mostly dealing with well-formed email. There are a lot of mail-generating programs out there with bugs and generous interpretations of what they can get away with, especially if you have to deal with spam and malware (which are often generated by programs with more than the usual number of bugs).