tarfile module is too generous about what is considered a tar file
The Python standard library's
tarfile module has a
function, tarfile.is_tarfile(), that tells you whether or not some file is a tar file, or
at least is a tar file that the module can read. As is not too silly
in Python, it operates by attempting to open the file with
tarfile.open(); if open() succeeds, clearly this is a good tar file.
Unfortunately, through what is perhaps a bug, this fails to report any errors on various sorts of things that are not actually tar files. On a Unix system, the very easiest and simplest reproduction of this problem is:
>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")
This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.
(If you leave off the 'r:', this hangs, ultimately because the
lzma module will happily read forever from a stream of zero bytes.
Unless you tell it otherwise, the tarfile module normally tries a
sequence of decompressors on your potential tar file, including
lzma.)
One specific form of thing that will cause this issue is any nominal
'tar file' that starts with 512 zero bytes (after any decompression
is applied). Since this applies to /dev/zero, we have our handy and
obviously incorrect reproduction case. There may be other initial
512-byte blocks that will cause this; I have not investigated the
code deeply, partly because it is tangled.
I suspect that this is a bug in the TarFile.next function, which
looks like it is missing an 'elif self.offset == 0:' clause (see
the block of code starting around here). But
whether or not this issue is a bug and will be fixed in a future
version of Python 3, it is very widespread in existing versions of
Python that are out there in the field, and so any code that cares
about this (which we have some of) needs to
cope with it.
My current hack workaround is to check whether or not the members
list on the returned TarFile object is empty. This is not a documented
attribute, but it's unlikely to change and it works today (and feels
slightly less sleazy than some of the alternatives).
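As a sketch of the workaround (the function name is mine, and I'm using the documented getmembers() here instead of the undocumented members attribute, which amounts to the same check):

```python
import tarfile

def is_really_tarfile(path):
    """Treat a 'tar file' with no members at all (such as a stream
    of zero bytes) as not being a tar file in the first place."""
    try:
        with tarfile.open(path, "r:") as tf:
            # A genuine tar file has at least one member.
            return len(tf.getmembers()) > 0
    except tarfile.TarError:
        return False
```

This also returns False for files that tarfile.open() rejects outright, so it works whether or not a given Python version has this bug.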
(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)
Going from a bound instance method to its class instance in Python
In response to yesterday's entry on how I feel callable classes are better than closures, a commentator suggested:
If you need something callable, why not use a bound method? They have a reference to the parent too.
This raises a question: how easy and reliable is it to go from a bound method on an instance to the instance itself?
In both Python 2 and Python 3, a bound method is an instance of a
special type (how this happens is described in my entry on how
functions become bound methods). Although
the Python 3 documentation is not explicit about it, this type is
what is described in the "Instance methods" section of the Python
3 data model.
This description of the (bound) method type officially documents
the __self__ attribute, which is a reference to the original
instance that the bound method is derived from. So the answer is
that given an object
x that is passed to you as a bound method,
you can recover the actual instance as
x.__self__ and then
inspect it from there.
(In Python 2.7, there is also the
im_self attribute, which
contains the same information.)
If you want your code to check whether it has a bound method, you can
compare against types.MethodType. This name for the type can also be
used to look at its help(), although that really won't tell you much;
you're better off reading the "Instance methods" section of the data
model.
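Putting the pieces together in a small sketch (the class here is my own example):

```python
import types

class Counter:
    """A tiny class so we have a bound method to inspect."""
    def __init__(self):
        self.n = 0
    def bump(self):
        self.n += 1
        return self.n

c = Counter()
bound = c.bump

# Checking that what we were handed is a bound method:
assert isinstance(bound, types.MethodType)
# Recovering the instance that the method is bound to:
assert bound.__self__ is c
# The underlying plain function is also available:
assert bound.__func__ is Counter.bump
```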
I'm not sure how I feel about relying on this. On the one hand, it
is officially documented and it works the same in Python 3 and
Python 2 (ignoring Python 2's
im_self and the possibility of
unbound methods on Python 2). On the other hand, this is a dunder
attribute, and using those generally feels somewhat like I'm peeking
into implementation details. I don't know if the Python developers
consider this a stable API or something that very definitely isn't
guaranteed over the long term.
(If nothing else, now I know a little bit more about Python than I did before I decided to look this up. I was actually expecting the answer to be more obscure than it turned out to be.)
Callable class instances versus closures in Python
At first, like every operator overload, this seems like a nifty idea. And then, like most operator overload cases, we need to ask: why? Why is this better than a named method?
I wholeheartedly agree with this, and in the beginning I agreed
with the whole article. But then I began thinking about my usage
of __call__ and something that the article advocated as a
replacement, and found that I partially disagree with it. To quote
the article:
If something really is nothing more than a function call with some extra arguments, then either a closure or a partial would be appropriate.
(By 'partial', the article means the use of functools.partial
to construct a partially applied function.)
My view is that if you have to provide something that's callable,
a callable class is better than a closure because it's more
amenable to inspection. A class instance is a clear thing; you
can easily see what it is, what it's doing, and inspect the state
of instances (especially if you remember to give your class a
__repr__). You can
even easily give them (and their methods) docstrings, so that
help() provides helpful information about them.
None of this is true of closures (unless you go well out of your way) and only a bit of it is true of partially applied functions. Even if you go out of your way to provide a docstring for your closure function, the whole assemblage is basically an opaque blob. A partially applied function is somewhat better because the resulting object exposes some information, but it's still not as open and transparent as an object.
This becomes especially important if your callable thing is going to be called repeatedly and hold internal state. It's far easier to make this internal state visible, potentially modifiable, and above all debuggable if you're using an object than if you try to wrap all of this up inside a function (or a closure) that manipulates its internal variables. Python objects are designed to be transparent (at least by default), as peculiar as this sounds in general.
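To illustrate the difference in inspectability, here's a toy callable class and its closure equivalent (all the names are mine):

```python
import functools
import operator

class Adder:
    """Add a fixed increment; the state is visible via attributes."""
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        return x + self.n
    def __repr__(self):
        return "Adder(n=%r)" % self.n

def make_adder(n):
    # The closure equivalent: same behavior, buried state.
    def add(x):
        return x + n
    return add

a = Adder(10)
c = make_adder(10)
assert a(5) == c(5) == 15

# The instance announces what it is; the closure's repr is just an
# opaque '<function make_adder.<locals>.add at 0x...>'.
assert repr(a) == "Adder(n=10)"
# You can dig the closure's state out, but only awkwardly:
assert c.__closure__[0].cell_contents == 10

# A partially applied function sits in between: it exposes some state.
p = functools.partial(operator.add, 10)
assert p.func is operator.add and p.args == (10,)
assert p(5) == 15
```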
(After all, one of the usual stated purposes of objects is to encapsulate things away from the outside world.)
Callable classes are unquestionably more verbose than closures, partially applied functions, or even lambdas, and sometimes this is annoying. But I think you should use them for anything that is not trivial by itself, and maybe even for small things depending on how long the resulting callable entities are going to live and how far away they are going to propagate in your program. The result is likely to be more maintainable and more debuggable.
PS: This somewhat biases me toward providing things with the entire
instance and using
__call__ over providing a method on the
instance. If you're trying to debug something, it's harder to go
from a method to inspecting the instance it comes from. Providing
just a method is probably okay if the use is 'close' to the class
definition (eg, in the same file or the same module), because then
you can look back and forth easily. Providing the full instance is
what I'd do if I was passing the callable thing around to another
module or returning it as part of my public API.
Using default function arguments to avoid creating a class
Recently I was writing some Python code to print out Prometheus metrics about whether or not we could log in to an IMAP server. As an end to end test, this is something that can fail for a wide assortment of reasons; we can fail to connect to the IMAP server, experience a TLS error during TLS session negotiation, have the server's TLS certificate fail to validate, there could be an IMAP protocol problem, or the server could reject our login attempt. If we fail, we would like to know why for diagnostic purposes (especially, some sorts of failures are more important than others in this test). In the Prometheus world, this is traditionally done by emitting a separate metric for every different thing that can fail.
In my code, the metrics are all prepared by a single function that gets called at various points. It looks something like this:
def logingauges(host, ok, ...):
    [...]

def logincheck(host, user, pw):
    try:
        c = ssl.create_default_context()
        m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
    except ssl.CertificateError:
        return logingauges(host, 0, ...)
    except [...]:
        [...]

    try:
        r = m.login(user, pw)
        [...]
    except imaplib.IMAP4.error:
        return logingauges(host, 0, ...)
    except [...]:
        [...]

    # success, finally.
    return logingauges(host, 1, ...)
When I first started writing this code, I only distinguished a
couple of different reasons that we could fail, so I passed the state
of those reasons directly as additional parameters to logingauges().
As the number of failure reasons rose, this got both unwieldy and
annoying, partly because adding a new failure reason required going
through all existing calls to logingauges() to add a new parameter
to each of them.
So I gave up. I turned all of the failure reasons into keyword arguments that defaulted to 0:
def logingauges(host, ok, connerr=0, loginerr=0,
                certerr=0, sslerr=0, imaperr=0):
    [...]
Now to call
logingauges() on failure I only needed to supply an
argument for the specific failure:
return logingauges(host, 0, sslerr=1)
Adding a new failure reason became much more localized; I only had
to add a new gauge metric to
logingauges(), with a new keyword
argument, and then call it from the right place.
This strikes me as pretty much a hack. The proper way is probably
to create a class to hold all of this status information as attributes
on instances, create an instance of it at the start of logincheck(),
manipulate the attributes as appropriate, and return the instance
when done. The class can even have a
to_gauges() function that
generates all of the actual metrics from its current values.
(In Python 3.7, I would use a dataclass, but this has to run on Ubuntu 18.04 with Python 3.6.7, so it needs to be a boring old class.)
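A sketch of what that status class might look like (all the names here are my invention, not the real program's):

```python
class LoginStatus:
    """One attribute per failure reason, all defaulting to 0."""
    GAUGES = ("ok", "connerr", "loginerr", "certerr", "sslerr", "imaperr")

    def __init__(self, host):
        self.host = host
        for name in self.GAUGES:
            setattr(self, name, 0)

    def to_gauges(self):
        # The real version would emit Prometheus metrics; this one
        # just produces (name, value) pairs for illustration.
        return [(name, getattr(self, name)) for name in self.GAUGES]

st = LoginStatus("imap.example.com")
st.sslerr = 1
assert ("sslerr", 1) in st.to_gauges()
assert ("ok", 0) in st.to_gauges()
```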
However, not only do I already have the version that uses default function arguments, but the class based version would require a bunch more code and bureaucracy for what is basically a simple situation in a small program. I like doing things the right way, but I'm not sure I like it that much. As it stands, the default function arguments approach is pleasantly minimal and low overhead.
(Or maybe this is considered an appropriate use of default function
arguments in Python these days. Arguments with default values are
often used to set default initial values for instance attributes,
and that is kind of what I'm doing here. One version of the class
based approach could actually look the same; instead of calling a
function, I'd return a just-created instance of my status class.)
(This is only somewhat similar to using default function arguments to merge several APIs together. Here it would be a real stretch to say that there are multiple APIs, one for each failure reason.)
The cliffs in the way of adding tests to our Django web app
Back in August of last year, I wrote that it was time for me to start adding tests to our Django web app. Since then, the number of tests I have added is zero, and in fact the amount of work that I have done on our Django web app's code is also essentially zero (partly because it hasn't needed any modifications). Part of the reason for that is that adding tests feels like make-work, even though I know perfectly well that it's not really, but another part of it is that I'm staring at two reasonably substantial cliffs in my way.
Put simply, in order to add tests that I actually want to keep, I need to learn how to write Django tests and then I need to figure out what we want to test in our Django web app (and how). Learning how to write tests means reading through the Django documentation on this, both the quick tutorial and the real documentation. Unfortunately I think that I need to read all of the documentation before I start writing any tests, and possibly even plan to throw away the first round of tests as a learning experience. Testing a Django app is not as simple as testing standalone code; there is a test database you need to construct, an internal HTTP client so that you can write end to end tests, and so on. This is complicated by the fact that by now I've forgotten a lot of my general Django knowledge and I know it, so to some extent I'm going to have to re-learn Django (and re-learn our web app's code too).
(It's possible that I can find some quick-start tests I can write more or less in isolation. There are probably some stand-alone functions that I can poke at, and perhaps even stand-alone model behavior that doesn't depend on the database having a set of interlinked base data.)
Once I sort of know how to write Django tests, I need to figure out what tests to write and how much of them. There are two general answers here that I already know; we need tests that will let us eventually move to Python 3 with some confidence that the app won't blow up, and I'd like tests that will do at least basic checks that everything is fine when we move from Django version to Django version. Tests for a Python 3 migration should probably concentrate on the points where data moves in and out of our app, following the same model I used when I thought about DWiki's Python 3 Unicode issues. Django version upgrade tests should probably start by focusing on end to end testing (eg, 'can we submit a new account request through the mock HTTP client and have it show up').
All of this adds up to a significant amount of time and work to invest before we start to see real benefits from it. As a result I've kept putting it off and finding higher priority work to do (or at least more interesting work). And I'm pretty sure I need to find a substantial chunk of time in order to get anywhere with this. To put it one way, the Django testing documentation is not something that I want to try to understand in fifteen minute blocks.
PS: It turns out that our app actually has one tiny little test that I must have added years ago as a first step. It's actually surprisingly heartening to find it there and still passing.
(As before, I'm writing this partly to push myself toward doing it. We now have less than a year to the nominal end of Python 2, which is not much time with everything going on.)
Sidebar: Our database testing issue
My impression is that a decent amount of Django apps can be tested with basically empty databases, perhaps putting in a few objects. Our app doesn't work that way; its operation sits on top of a bunch of interlinked data on things like who can sponsor accounts, how those accounts should be created, and so on. Without that data, the app does nothing (in fact it will probably fail spectacularly, since it assumes that various queries will always return some data). That means we need an entire set of at least minimal data in our test database in order to test anything much. So I need to learn all about that up front, more or less right away.
How to handle Unicode character decoding errors depends on your goals
In a comment on my entry mulling over DWiki's Python 3 Unicode issues and what I plan to do about them, Sean A. asked a very good question about how I'm planning to handle errors when decoding things from theoretical UTF-8 input:
Out of curiosity, why use backslashreplace instead of surrogateescape? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)
Oh. And is trivial to convert back into the original binary data.
The reason I think I want Python's 'backslashreplace' error handling instead of 'surrogateescape' is that my ultimate goal is not to reproduce the input (in all its binary glory) in my output, but to produce valid UTF-8 output (for HTML, Atom syndication feeds, and so on) even if some of the input isn't valid.
(Another option is to abort processing if the input isn't valid, which is not what I want. It would be the most conservative and safe choice in some situations.)
Given that I'm going to produce valid UTF-8 no matter what, the choice
comes down to what generates more useful results for the person
reading what was invalid UTF-8 input. You can certainly do this
with 'surrogateescape' by just encoding to straight UTF-8 using the
'surrogatepass' handler, but the resulting directly encoded surrogate
characters are not going to show up as anything useful and might produce
outright errors from some things (and possibly be misinterpreted in
some contexts).
(With 'surrogateescape', bad characters are encoded to U+DC80 to U+DCFF, which is the 'low' part of the Unicode surrogates range. As Wikipedia notes, 'isolated surrogate code points have no general interpretation', and certainly they don't have a distinct visual representation.)
Out of all of Python's available codecs error handlers that
can be used when decoding from UTF-8 to Unicode, 'backslashreplace'
is the one that preserves the most information in a visually clear
manner while still allowing you to easily produce valid UTF-8 output
that everyone is going to accept. The 'replace' handler has the
drawback of making all invalid characters look the same and so
leaves you with no clues as to what they look like in the input,
while 'ignore' just tosses them away entirely, leaving everyone
oblivious to the fact that bad characters were there in the first
place.
(In some situations this makes 'ignore' the right choice, because
you may not want to give people any marker that something is wrong;
such a marker might only confuse them about something they can't
do anything about. But since I'm going to be looking at the rendered
HTML and so on myself, I want to have at least a chance to know
that DWiki is seeing bad input. And 'replace' has the advantage
that it's visible but is less peculiar and noisy than 'backslashreplace';
you might use it when you want some visual marker present that
things are a bit off, but don't want to dump a bucket of weird
backslashes on people.)
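A quick comparison of the handlers on a short burst of invalid UTF-8 (decoding with 'backslashreplace' needs Python 3.5 or later):

```python
raw = b"valid \xc3\xa9 then bad \xff bytes"

# 'backslashreplace' keeps a visible, greppable representation:
assert raw.decode("utf-8", "backslashreplace") == "valid \u00e9 then bad \\xff bytes"
# 'replace' flattens every bad byte into the same U+FFFD marker:
assert raw.decode("utf-8", "replace") == "valid \u00e9 then bad \ufffd bytes"
# 'ignore' silently drops the bad bytes entirely:
assert raw.decode("utf-8", "ignore") == "valid \u00e9 then bad  bytes"

# 'surrogateescape' round-trips the original bytes, but the result
# cannot be encoded to strict UTF-8:
s = raw.decode("utf-8", "surrogateescape")
assert s.encode("utf-8", "surrogateescape") == raw
rejected = False
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    rejected = True  # the lone surrogate U+DCFF is refused
assert rejected
```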
PS: This does mean that my choice here is a bit focused on what's useful for me. For me, having some representation of the actual bad characters visible in what I see gives me some idea of what to look for in the page source and what I'm going to have to fix. For other people, it's probably more going to be noise.
Two annoyances I have with Python's imaplib module
As I mentioned yesterday, I recently wrote
some code that uses the
imaplib module. In the process
of doing this, I wound up experiencing some annoyances, one of them
a traditional one and one a new one that I've only come to appreciate
recently.
The traditional annoyance is that the
imaplib module doesn't wrap
errors from other modules that it uses. This leaves you with at
least two problems. The first is that you get to try to catch a
bunch of exception classes to handle errors:
try:
    c = ssl.create_default_context()
    m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
    [...]
except (imaplib.IMAP4.error, ssl.SSLError, OSError) as e:
    [...]
The second is that, well, I'm not sure I'm actually catching all
of the errors that calling the
imaplib module can raise. The
module doesn't document them, and so this list is merely the ones
that I've been able to provoke in testing. This is the fundamental
flaw of not wrapping exceptions that I wrote about many years ago; by not wrapping exceptions, you make what
modules you call an implicit part of your API. Then you usually
don't document it.
I award the imaplib module bonus points for having its error exception
class accessed via an attribute on another class. I'm sure there's
a historical reason for this, but I really wish it had been cleaned
up as part of the Python 3 migration. In the current Python 3 code,
these exception classes are actually literally classes inside the
IMAP4 class:
class IMAP4:
    [...]
    class error(Exception): pass
    class abort(error): pass
    class readonly(abort): pass
    [...]
The other annoyance is that the imaplib module doesn't implement
any sort of timeouts, either on individual operations or on a whole
sequence of them. If you aren't prepared to wait for potentially
very long amounts of time (if the IMAP server has something go wrong
with it), you need to add some sort of timeout yourself through
means outside of imaplib, either something like a SIGALRM handler
or through manipulating the underlying socket to set timeouts on
it (although I've read that this causes problems, and anyway you're
normally going to be trying to work through SSL as well). For my
own program I opted to go the SIGALRM route, but I have the advantage
that the only thing I'm doing is IMAP. A more sophisticated program
might not want to blow itself up with a SIGALRM just because the
IMAP side of things was too slow.
Timeouts aren't something that I used to think about when I wrote
programs that were mostly run interactively and did only one thing,
where the timeout is most sensibly imposed by the user hitting
Ctrl-C to kill the entire program. Automated testing programs and
other, similar things care a lot about timeouts, because they don't
want to hang if something goes wrong with the server. And in fact
it is possible to cause
imaplib to hang for quite a long time in
a very simple way:
m = imaplib.IMAP4_SSL(host=host, port=443)
You don't even need something that actually responds and gets as far as establishing a TLS session; it's enough for the TCP connection to be accepted. This is reasonably dangerous, because 'accept the connection and then hang' is more or less the expected behavior for a system under sufficiently high load (accepting the connection is handled in the kernel, and then the system is too loaded for the IMAP server to run).
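A sketch of the SIGALRM approach (the function names and timeout policy are my own; this assumes a Unix system and a single-threaded, single-purpose program):

```python
import imaplib
import signal

class IMAPTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise IMAPTimeout("IMAP operation took too long")

def check_login(host, user, pw, timeout=30):
    # SIGALRM is Unix-only and assumes the whole program can afford
    # to be interrupted anywhere, which is fine for a simple checker.
    old = signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(timeout)
    try:
        m = imaplib.IMAP4_SSL(host=host)
        m.login(user, pw)
        m.logout()
        return True
    except (imaplib.IMAP4.error, OSError, IMAPTimeout):
        # ssl.SSLError is a subclass of OSError in Python 3.
        return False
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)
```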
Overall I've wound up feeling that the
imaplib module is okay for
simple, straightforward uses but it's not really a solid base for
anything more. Sure, you can probably use it, but you're also
probably going to be patching things and working around issues.
For us, using
imaplib and papering over these issues is the easiest
way forward, but if I wanted to do more I'd probably look for a third
party module (or think about switching languages).
A few notes on using SSL in Python 3 client programs
I was recently writing a Python program to check whether a test account could log into our IMAP servers and to time how long it took (as part of our new Prometheus monitoring). I used Python because it's one of our standard languages and because it includes the imaplib module, which did all of the hard work for me. As is my usual habit, I read as little of the detailed module documentation as possible and used brute force, which means that my first code looked kind of like this:
try:
    m = imaplib.IMAP4_SSL(host=host)
    m.login(user, pw)
    m.logout()
except ...:
    [...]
When I tried out this code, I discovered that it was perfectly
willing to connect to our IMAP servers using the wrong host name.
At one level this is sort of okay (we're verifying that the IMAP
TLS certificates are good through other checks), but at another
it's wrong. So I went and read the module documentation with a bit
more care, where it pointed me to the ssl module's "Security
considerations" section, which
told me that in modern Python, you want to supply an SSL context and you
should normally get that context from ssl.create_default_context().
The default SSL context is good for a client connecting to a server.
It does certificate verification, including hostname verification,
and has officially reasonable defaults, some of which you can see
by looking at the ctx.options of a created context, and also at
ctx.get_ciphers() (although the latter is rather verbose). Based on
the module documentation, Python 3 is not entirely relying on the
defaults of the underlying TLS library. However, the underlying TLS
library (and its version) affects what module features are available;
you need OpenSSL 1.1.0g or later to get the minimum_version and
maximum_version attributes on SSL contexts.
It's good that people who care can carefully select ciphers, TLS versions, and so on, but it's better that this seems to have good defaults (especially if we want to move away from the server dictating cipher order). I considered explicitly disabling TLSv1 in my checker, but decided that I didn't care enough to tune the settings here (and especially to keep them tuned). Note that explicitly setting a minimum version is a dangerous operation over the long term, because it means that someday you're lowering the minimum version instead of raising it.
(Today, for example, you might set the minimum version to TLS v1.2
and increase your security over the defaults. Then in five years,
the default version could change to TLS v1.3 and now your unchanged
code is worse than the defaults. Fortunately the TLS version constants
do compare properly so far, so you can write code that uses max()
over them to do it more or less right.)
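A sketch of that pattern, assuming Python 3.7 or later for ssl.TLSVersion and SSLContext.minimum_version:

```python
import ssl

ctx = ssl.create_default_context()
# The defaults already verify certificates and hostnames:
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname

# Only ever raise the minimum TLS version, never lower it; max()
# works because the TLSVersion constants compare properly (so far):
ctx.minimum_version = max(ctx.minimum_version, ssl.TLSVersion.TLSv1_2)
assert ctx.minimum_version >= ssl.TLSVersion.TLSv1_2
```

If the defaults ever move past TLS 1.2, this code keeps them instead of silently dragging the minimum back down.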
Python 2.7 also has SSL contexts and ssl.create_default_context(),
starting in 2.7.9. However, use of SSL contexts is less widespread
than it is in Python 3 (for instance the Python 2
imaplib doesn't seem to
support them), so I think it's clear you want to use Python 3 here
if you have a choice.
(It seems a little bit odd to still be thinking about Python 2 now that it's less than a year to it being officially unsupported by the Python developers, but it's not going away any time soon and there are probably people writing new code in it.)
I have somewhat mixed feelings about Python 3's
socket module errors
Many years ago I wrote about some things that irritated me about
Python's socket module. One of my
complaints was that many instances of
socket.error should actually have been
OSError instead of a separate type, because that's
what they really were. Today I was reading AdamW’s Debugging
Adventures: Python 3 Porting 201
(via), where I discovered in a passing
mention that in Python 3,
socket.error is a deprecated alias of OSError.
(Well, from Python 3.3 onwards, due to PEP 3151.)
On the one hand, this is a change that I cautiously approve of.
Many socket errors are just operating system errors, especially on
Unix. On the other hand, in some ways this makes socket.herror
and socket.gaierror feel worse. Both of these violate the rule
of leaving IOError and OSError alone, because
they are subclasses of OSError that do not have authentic errno
values and are not quite genuine OS errors in the same way (they
are errors from the C library, but they don't come from errno).
They do have
strerror fields, which is something, but
then I think all subclasses of
OSError do these days.
Somewhat to my surprise, when I looked at the Python 2 socket
module I discovered that
socket.error is now a subclass of
IOError (since Python
2.6, which in practice means 'on any system with Python 2 that you
actually want to use'). Python 2 also has the same issue where
socket.herror and socket.gaierror are subclasses of IOError
but are not real operating system errors.
Unfortunately for my feelings about leaving OSError alone, the
current situation in the
socket module is probably the best
pragmatic tradeoff. Since the module has high level interfaces that
can fail in multiple ways that result in different types of errors,
in practice people want to be able to just catch one overall error
and be done with it, which means that socket.gaierror (and company)
needs to be a subclass of socket.error. When you combine this
with socket.error really being some form of
OSError, you arrive
at the current state of affairs.
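The current state of affairs is easy to see interactively:

```python
import socket

# In Python 3, socket.error is literally OSError:
assert socket.error is OSError
# and the resolver errors are subclasses of it (without real
# errno values behind them):
assert issubclass(socket.gaierror, OSError)
assert issubclass(socket.herror, OSError)

# So a single except clause catches OS-level errors and resolver
# errors alike:
try:
    socket.getaddrinfo("nonexistent.invalid", 80)
except OSError:
    pass  # a socket.gaierror lands here too
```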
I've decided that I don't have a strong opinion on socket.error
changing from being a subclass of IOError/OSError to being an alias
for it. I can imagine Python code that might want to use a try at
a high level, call both socket functions and other OS functions
within that high-level try, and distinguish between the two sources
of errors, which is now impossible in Python 3, but I'm not sure
that this is a desirable pattern. I don't think I have anything like
this in my own Python code, but it's something that I should keep an
eye out for as I convert things over to Python 3.
(I do have some Python 2 code that catches both socket.error and
EnvironmentError, but fortunately it treats them the same.)
Thinking about DWiki's Python 3 Unicode issues
DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).
The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set the HTML will say, but it pretty much treats everything as bytes and makes no attempts to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads and writes data from, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).
The primary source of text data for DWiki is the text of pages and
comments. Here in 2018, the only sensible encoding for these is
UTF-8, and I should probably just hardcode that assumption into
reading them from the filesystem (and writing comments out to the
filesystem). Relying on Python's system encoding setting, whatever
it is, seems not like a good idea, and I don't think this should
be settable in DWiki's configuration file. UTF-8 also has the
advantage for writing things out that it's a universal encoder; you
can encode any Unicode
str to UTF-8, which isn't true of all encodings (ASCII being the
obvious example).
Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.
(In general, I probably want to use the 'backslashreplace' error
handling option when decoding to Unicode, because that's the option
that both produces correct results and preserves as much information
as possible. Since this introduces extra backslashes, I'll have to
make sure they're all handled properly.)
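For the page-reading side, the decoding I have in mind looks something like this (a hypothetical helper, not DWiki's actual code):

```python
def read_page(path):
    # Hard-code UTF-8 and keep any bad bytes visible as backslash
    # escapes instead of blowing up partway through rendering.
    with open(path, "rb") as f:
        return f.read().decode("utf-8", "backslashreplace")
```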
For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).
DWiki interacts with the HTTP world through WSGI, although it's all
my own WSGI implementation in a normal setup. PEP 3333 clarifies
WSGI for Python 3, and it specifies two sides of things here; what
types are used where, and
some information on header encoding. For
output, generally my header values will be in ISO-8859-1; however,
for some redirections, the
Location: header might include UTF-8
derived from filenames, and I'll need to encode it properly. Handling
incoming HTTP headers and bodies is going to be more annoying and
perhaps more challenging; people and programs may well send me
incorrectly formed headers that aren't properly encoded, and for
POST requests (for example, for comments) there may be various
encodings in use and also the possibility that the data is not
correctly encoded (eg it claims to be UTF-8 but doesn't decode
properly). In theory I might be able to force people to use UTF-8
on comment submissions, and probably most
browsers would accept that.
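For the Location: header case, the encoding step might look like this (a hypothetical sketch; the URL prefix is made up):

```python
from urllib.parse import quote

def location_for(filename):
    # quote() encodes the str to UTF-8 and then percent-encodes it,
    # so the header value stays plain ASCII, which is safe under
    # PEP 3333's Latin-1 rule for header values.
    return "/dwiki/" + quote(filename, safe="/")

assert location_for("caf\u00e9") == "/dwiki/caf%C3%A9"
```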
Since I don't actually know what happens in the wild here, probably a sensible first pass Python 3 implementation should log and reject with an HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode. If I see any significant quantity of them that appears legitimate, I can add code that tries to handle the situation.
(Possibly I should start by adding code to the current Python 2
version of DWiki that looks for this situation and logs information
about it. That would give me a year or two of data at a minimum.
I should also add an
accept-charset attribute to the current comment form.)
DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects using bytestrings and in binary mode, without trying to encode or decode it (in my current code, I read and write the pickled data myself, not through the pickle module).
The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).
Since in real life most things are properly encoded and even mostly
ASCII, mistakes in all of this might lurk undetected for some time.
To deal with this, I should set up two torture test environments
for DWiki, one where there is UTF-8 everywhere I can think of
(including in file and directory names) and one where there is
incorrectly encoded UTF-8 everywhere I can think of (or things just
not encoded as UTF-8, but instead Latin-1 or something). Running
DWiki against both of these would smoke out many problems and areas
I've missed. I should also put together some HTTP tests with badly
encoded headers and comment
POST bodies and so on, although I'm
not sure what tools are available to create deliberately incorrect
HTTP requests like that.
All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).
PS: Rereading my old entry has also reminded me that there's DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).
Sidebar: DWiki's rendering templates and static file serving
DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, with it being a fatal internal error if they fail to decode properly.
DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.