Wandering Thoughts


The tarfile module is too generous about what is considered a tar file

The Python standard library's tarfile module has a tarfile.is_tarfile function that tells you whether or not some file is a tar file, or at least is a tar file that the module can read. As is not unreasonable in Python, it operates by attempting to open the file with tarfile.open; if open() succeeds, clearly this is a good tar file.

Unfortunately, through what is perhaps a bug, this fails to report any errors on various sorts of things that are not actually tar files. On a Unix system, the very easiest and simplest reproduction of this problem is:

>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")

This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.

(If you leave off the 'r:', this hangs, ultimately because the lzma module will happily read forever from a stream of zero bytes. Unless you tell it otherwise, the tarfile module normally tries a sequence of decompressors on your potential tar file, including lzma for .xz files.)

One specific form of thing that will cause this issue is any nominal 'tar file' that starts with 512 zero bytes (after any decompression is applied). Since this applies to /dev/zero, we have our handy and obviously incorrect reproduction case. There may be other initial 512-byte blocks that cause this too; I have not investigated the code deeply, partly because it is tangled.

I suspect that this is a bug in the TarFile.next function, which looks like it is missing an 'elif self.offset == 0:' clause (see the block of code starting around here). But whether or not this issue is a bug and will be fixed in a future version of Python 3, it is very widespread in existing versions of Python that are out there in the field, and so any code that cares about this (which we have some of) needs to cope with it.

My current hack workaround is to check whether or not the .members list on the returned TarFile object is empty. This is not a documented attribute, but it's unlikely to change and it works today (and feels slightly less sleazy than checking whether .firstmember is None).
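In code, the whole check looks something like this; is_real_tarfile is my own name for it, and for simplicity this sketch only handles uncompressed tar files (mode "r:"):

```python
import tarfile

def is_real_tarfile(path):
    # Open without trying any decompressors, to avoid the lzma hang
    # described above.
    try:
        tf = tarfile.open(path, "r:")
    except tarfile.TarError:
        return False
    # The workaround: an opened 'tar file' with no members at all is
    # almost certainly not a real tar file (eg 512 zero bytes).
    # .members is an undocumented attribute, so this could break someday.
    return len(tf.members) > 0
```

On versions of Python where the bug has been fixed, the open() itself fails on the all-zeros case and the .members check simply never fires, so the function keeps working either way.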

(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)

TarfileTooGenerous written at 22:12:58


Going from a bound instance method to its class instance in Python

In response to yesterday's entry on how I feel callable classes are better than closures, a commentator suggested:

If you need something callable, why not use a bound method? They have a reference to the parent too.

This raises a question: how easy and reliable is it to go from a bound method on an instance to the instance itself?

In both Python 2 and Python 3, a bound method is an instance of a special type (how this happens is described in my entry on how functions become bound methods). Although the Python 3 documentation is not explicit about it, this type is what is described in the "Instance methods" section of the Python 3 data model. This description of the (bound) method type officially documents the __self__ attribute, which is a reference to the original instance that the bound method is derived from. So the answer is that given an object x that is passed to you as a bound method, you can recover the actual instance as x.__self__ and then inspect it from there.

(In Python 2.7, there is also the im_self attribute, which contains the same information.)

If you want your code to check whether it has a bound method, you can use isinstance() with types.MethodType. You can also use this name to look at the type's help(), which really won't tell you much; you're better off reading the "Instance methods" section of the data model.
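Put together, a quick sketch of both the isinstance() check and the __self__ recovery (the Counter class and instance_of are just my illustrations):

```python
import types

class Counter:
    """A small illustrative class; not from the entry itself."""
    def __init__(self):
        self.count = 0
    def bump(self):
        self.count += 1
        return self.count

def instance_of(obj):
    # If obj is a bound method, recover the instance it is bound to
    # via the documented __self__ attribute; otherwise return None.
    if isinstance(obj, types.MethodType):
        return obj.__self__
    return None
```

Given c = Counter(), instance_of(c.bump) is c, while instance_of() of a plain function returns None. The bound method also exposes __func__, the underlying function object.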

I'm not sure how I feel about relying on this. On the one hand, it is officially documented and it works the same in Python 3 and Python 2 (ignoring Python 2's im_self and the possibility of unbound methods on Python 2). On the other hand, this is a __ attribute, and using those generally feels somewhat like I'm peeking into implementation details. I don't know if the Python developers consider this a stable API or something that very definitely isn't guaranteed over the long term.

(If nothing else, now I know a little bit more about Python than I did before I decided to look this up. I was actually expecting the answer to be more obscure than it turned out to be.)

BoundMethodToInstance written at 23:57:39


Callable class instances versus closures in Python

Recently I read Don't Make It Callable (via), which advocates avoiding having your class instances be callable (by defining __call__ on your classes). Let me quote its fundamental thesis on using __call__:

At first, like every operator overload, this seems like a nifty idea. And then, like most operator overload cases, we need to ask: why? Why is this better than a named method?

I wholeheartedly agree with this, and in the beginning I agreed with the whole article. But then I began thinking about my usage of __call__ and something that the article advocated as a replacement, and found that I partially disagree with it. To quote it again:

If something really is nothing more than a function call with some extra arguments, then either a closure or a partial would be appropriate.

(By 'partial', the article means the use of functools.partial to construct a partially applied function.)

My view is that if you have to provide something that's callable, a callable class is better than a closure because it's more amenable to inspection. A class instance is a clear thing; you can easily see what it is, what it's doing, and inspect the state of instances (especially if you remember to give your class a useful __str__ or __repr__). You can even easily give them (and their methods) docstrings, so that help() provides helpful information about them.

None of this is true of closures (unless you go well out of your way) and only a bit of it is true of partially applied functions. Even if you go out of your way to provide a docstring for your closure function, the whole assemblage is basically an opaque blob. A partially applied function is somewhat better because the resulting object exposes some information, but it's still not as open and transparent as an object.
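As a deliberately trivial illustration of the difference, here is the same accumulator written both ways (the names are mine):

```python
class Accumulator:
    """Add up the values it is called with."""
    def __init__(self):
        self.total = 0
    def __call__(self, n):
        self.total += n
        return self.total
    def __repr__(self):
        return "Accumulator(total=%r)" % self.total

def make_accumulator():
    # The closure version: same behavior, but the state lives in an
    # anonymous cell instead of a named attribute.
    total = 0
    def accum(n):
        nonlocal total
        total += n
        return total
    return accum
```

With the class version, repr(acc) and acc.total show you the current state directly; with the closure you are left digging through accum.__closure__[0].cell_contents.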

This becomes especially important if your callable thing is going to be called repeatedly and hold internal state. It's far easier to make this internal state visible, potentially modifiable, and above all debuggable if you're using an object than if you try to wrap all of this up inside a function (or a closure) that manipulates its internal variables. Python objects are designed to be transparent (at least by default), as peculiar as this sounds in general.

(After all, one of the usual stated purposes of objects is to encapsulate things away from the outside world.)

Callable classes are unquestionably more verbose than closures, partially applied functions, or even lambdas, and sometimes this is annoying. But I think you should use them for anything that is not trivial by itself, and maybe even for small things depending on how long the resulting callable entities are going to live and how far away they are going to propagate in your program. The result is likely to be more maintainable and more debuggable.

PS: This somewhat biases me toward providing things with the entire instance and using __call__ over providing a method on the instance. If you're trying to debug something, it's harder to go from a method to inspecting the instance it comes from. Providing just a method is probably okay if the use is 'close' to the class definition (eg, in the same file or the same module), because then you can look back and forth easily. Providing the full instance is what I'd do if I was passing the callable thing around to another module or returning it as part of my public API.

CallableClassVsClosure written at 22:59:13


Using default function arguments to avoid creating a class

Recently I was writing some Python code to print out Prometheus metrics about whether or not we could log in to an IMAP server. As an end to end test, this is something that can fail for a wide assortment of reasons; we can fail to connect to the IMAP server, experience a TLS error during TLS session negotiation, have the server's TLS certificate fail to validate, run into an IMAP protocol problem, or have the server reject our login attempt. If we fail, we would like to know why for diagnostic purposes (especially since some sorts of failures are more important than others in this test). In the Prometheus world, this is traditionally done by emitting a separate metric for every different thing that can fail.

In my code, the metrics are all prepared by a single function that gets called at various points. It looks something like this:

def logingauges(host, ok, ...):
  [...]

def logincheck(host, user, pw):
  try:
    c = ssl.create_default_context()
    m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
  except ssl.CertificateError:
    return logingauges(host, 0, ...)
  except [...]

  try:
    r = m.login(user, pw)
  except imaplib.IMAP4.error:
    return logingauges(host, 0, ...)
  except [...]

  # success, finally.
  return logingauges(host, 1, ...)

When I first started writing this code, I only distinguished a couple of different reasons that we could fail so I passed the state of those reasons directly as additional parameters to logingauges(). As the number of failure reasons rose, this got both unwieldy and annoying, partly because adding a new failure reason required going through all existing calls to logingauges() to add a new parameter to each of them.

So I gave up. I turned all of the failure reasons into keyword arguments that defaulted to 0:

def logingauges(host, ok,
                connerr=0, loginerr=0, certerr=0,
                sslerr=0, imaperr=0):

Now to call logingauges() on failure I only needed to supply an argument for the specific failure:

  return logingauges(host, 0, sslerr=1)

Adding a new failure reason became much more localized; I only had to add a new gauge metric to logingauges(), with a new keyword argument, and then call it from the right place.

This strikes me as pretty much a hack. The proper way is probably to create a class to hold all of this status information as attributes on instances, create an instance of it at the start of logincheck(), manipulate the attributes as appropriate, and return the instance when done. The class can even have a to_gauges() method that generates all of the actual metrics from its current values.
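The class-based version I have in mind would look roughly like this; IMAPStatus, the attribute names, and the metric naming scheme are all my own invention here:

```python
class IMAPStatus:
    """Holds the success/failure state of one IMAP login check."""
    _failures = ("connerr", "loginerr", "certerr", "sslerr", "imaperr")

    def __init__(self, host):
        self.host = host
        self.ok = 0
        for name in self._failures:
            setattr(self, name, 0)

    def to_gauges(self):
        # Generate (metric name, host, value) tuples for every gauge,
        # whether or not that particular failure fired.
        yield ("imap_login_ok", self.host, self.ok)
        for name in self._failures:
            yield ("imap_login_" + name, self.host, getattr(self, name))
```

Adding a new failure reason is then a matter of adding one name to _failures and setting the attribute in the right place.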

(In Python 3.7, I would use a dataclass, but this has to run on Ubuntu 18.04 with Python 3.6.7, so it needs to be a boring old class.)

However, not only do I already have the version that uses default function arguments, but the class based version would require a bunch more code and bureaucracy for what is basically a simple situation in a small program. I like doing things the right way, but I'm not sure I like it that much. As it stands, the default function arguments approach is pleasantly minimal and low overhead.

(Or maybe this is considered an appropriate use of default function arguments in Python these days. Arguments with default values are often used to set default initial values for instance attributes, and that is kind of what I'm doing here. One version of the class based approach could actually look the same; instead of calling a function, I'd return a just-created instance of my IMAPStatus class.)

(This is only somewhat similar to using default function arguments to merge several APIs together. Here it would be a real stretch to say that there are multiple APIs, one for each failure reason.)

DefaultArgumentsAvoidClass written at 22:27:44


The cliffs in the way of adding tests to our Django web app

Back in August of last year, I wrote that it was time for me to start adding tests to our Django web app. Since then, the number of tests I have added is zero, and in fact the amount of work that I have done on our Django web app's code is also essentially zero (partly because it hasn't needed any modifications). Part of the reason for that is that adding tests feels like make-work, even though I know perfectly well that it's not really, but another part of it is that I'm staring at two reasonably substantial cliffs in my way.

Put simply, in order to add tests that I actually want to keep, I need to learn how to write Django tests and then I need to figure out what we want to test in our Django web app (and how). Learning how to write tests means reading through the Django documentation on this, both the quick tutorial and the real documentation. Unfortunately I think that I need to read all of the documentation before I start writing any tests, and possibly even plan to throw away the first round of tests as a learning experience. Testing a Django app is not as simple as testing standalone code; there is a test database you need to construct, an internal HTTP client so that you can write end to end tests, and so on. This is complicated by the fact that by now I've forgotten a lot of my general Django knowledge and I know it, so to some extent I'm going to have to re-learn Django (and re-learn our web app's code too).

(It's possible that I can find some quick-start tests I can write more or less in isolation. There are probably some stand-alone functions that I can poke at, and perhaps even stand-alone model behavior that doesn't depend on the database having a set of interlinked base data.)

Once I sort of know how to write Django tests, I need to figure out what tests to write and how much of them. There are two general answers here that I already know; we need tests that will let us eventually move to Python 3 with some confidence that the app won't blow up, and I'd like tests that will do at least basic checks that everything is fine when we move from Django version to Django version. Tests for a Python 3 migration should probably concentrate on the points where data moves in and out of our app, following the same model I used when I thought about DWiki's Python 3 Unicode issues. Django version upgrade tests should probably start by focusing on end to end testing (eg, 'can we submit a new account request through the mock HTTP client and have it show up').

All of this adds up to a significant amount of time and work to invest before we start to see real benefits from it. As a result I've kept putting it off and finding higher priority work to do (or at least more interesting work). And I'm pretty sure I need to find a substantial chunk of time in order to get anywhere with this. To put it one way, the Django testing documentation is not something that I want to try to understand in fifteen minute blocks.

PS: It turns out that our app actually has one tiny little test that I must have added years ago as a first step. It's actually surprisingly heartening to find it there and still passing.

(As before, I'm writing this partly to push myself toward doing it. We now have less than a year to the nominal end of Python 2, which is not much time with everything going on.)

Sidebar: Our database testing issue

My impression is that a decent number of Django apps can be tested with basically empty databases, perhaps putting in a few objects. Our app doesn't work that way; its operation sits on top of a bunch of interlinked data on things like who can sponsor accounts, how those accounts should be created, and so on. Without that data, the app does nothing (in fact it will probably fail spectacularly, since it assumes that various queries will always return some data). That means we need an entire set of at least minimal data in our test database in order to test anything much. So I need to learn all about that up front, more or less right away.

DjangoMyTestingCliffs written at 00:20:30


How to handle Unicode character decoding errors depends on your goals

In a comment on my entry mulling over DWiki's Python 3 Unicode issues and what I plan to do about them, Sean A. asked a very good question about how I'm planning to handle errors when decoding things from theoretical UTF-8 input:

Out of curiosity, why use backslashreplace instead of surrogateescape? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)

Oh. And is trivial to convert back into the original binary data.

The reason I think I want Python's 'backslashreplace' error handling instead of 'surrogateescape' is that my ultimate goal is not to reproduce the input (in all its binary glory) in my output, but to produce valid UTF-8 output (for HTML, Atom syndication feeds, and so on) even if some of the input isn't valid.

(Another option is to abort processing if the input isn't valid, which is not what I want. It would be the most conservative and safe choice in some situations.)

Given that I'm going to produce valid UTF-8 no matter what, the choice comes down to what generates more useful results for the person reading what was invalid UTF-8 input. You can certainly do this with 'surrogateescape' by just encoding to straight UTF-8 using the 'surrogatepass' handler, but the resulting directly encoded surrogate characters are not going to show up as anything useful and might produce outright errors from some things (and possibly be misinterpreted under some circumstances).

(With 'surrogateescape', bad characters are encoded to U+DC80 to U+DCFF, which is the 'low' part of the Unicode surrogates range. As Wikipedia notes, 'isolated surrogate code points have no general interpretation', and certainly they don't have a distinct visual representation.)

Out of all of Python's available codecs error handlers that can be used when decoding from UTF-8 to Unicode, 'backslashreplace' is the one that preserves the most information in a visually clear manner while still allowing you to easily produce valid UTF-8 output that everyone is going to accept. The 'replace' handler has the drawback of making all invalid characters look the same and so leaves you with no clues as to what they look like in the input, and 'ignore' just tosses them away entirely, leaving everyone oblivious to the fact that bad characters were there in the first place.

(In some situations this makes 'ignore' the right choice, because you may not want to give people any marker that something is wrong; such a marker might only confuse them about something they can't do anything about. But since I'm going to be looking at the rendered HTML and so on myself, I want to have at least a chance to know that DWiki is seeing bad input. And 'replace' has the advantage that it's visible but is less peculiar and noisy than 'backslashreplace'; you might use it when you want some visual marker present that things are a bit off, but don't want to dump a bucket of weird backslashes on people.)
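The difference between the handlers is easy to see on a small sample of invalid UTF-8:

```python
bad = b"abc\xff\xfe def"    # \xff and \xfe can never appear in valid UTF-8

# 'backslashreplace' keeps a distinct, visible representation of each byte:
assert bad.decode("utf-8", "backslashreplace") == "abc\\xff\\xfe def"
# 'replace' turns every bad byte into the same U+FFFD marker:
assert bad.decode("utf-8", "replace") == "abc\ufffd\ufffd def"
# 'ignore' silently drops them entirely:
assert bad.decode("utf-8", "ignore") == "abc def"
# 'surrogateescape' maps them into U+DC80..U+DCFF, and can round-trip
# back to the original bytes:
s = bad.decode("utf-8", "surrogateescape")
assert s == "abc\udcff\udcfe def"
assert s.encode("utf-8", "surrogateescape") == bad
```

(Using 'backslashreplace' on decoding requires Python 3.5 or later; before that it was an encode-only handler.)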

PS: This does mean that my choice here is a bit focused on what's useful for me. For me, having some representation of the actual bad characters visible in what I see gives me some idea of what to look for in the page source and what I'm going to have to fix. For other people, it's probably more going to be noise.

UnicodeDecodeErrorChoice written at 01:19:32


Two annoyances I have with Python's imaplib module

As I mentioned yesterday, I recently wrote some code that uses the imaplib module. In the process of doing this, I wound up experiencing some annoyances, one of them a traditional one and the other a new one that I've only come to appreciate recently.

The traditional annoyance is that the imaplib module doesn't wrap errors from other modules that it uses. This leaves you with at least two problems. The first is that you get to try to catch a bunch of exception classes to handle errors:

try:
  c = ssl.create_default_context()
  m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
except (imaplib.IMAP4.error, ssl.SSLError, OSError) as e:

The second is that, well, I'm not sure I'm actually catching all of the errors that calling the imaplib module can raise. The module doesn't document them, and so this list is merely the ones that I've been able to provoke in testing. This is the fundamental flaw of not wrapping exceptions that I wrote about many years ago; by not wrapping exceptions, you make what modules you call an implicit part of your API. Then you usually don't document it.

I award the imaplib module bonus points for having its error exception class accessed via an attribute on another class. I'm sure there's a historical reason for this, but I really wish it had been cleaned up as part of the Python 3 migration. In the current Python 3 source, these exception classes are actually literally classes inside the IMAP4 class:

class IMAP4:
  class error(Exception): pass
  class abort(error): pass
  class readonly(abort): pass

The other annoyance is that the imaplib module doesn't implement any sort of timeouts, either on individual operations or on a whole sequence of them. If you aren't prepared to wait for potentially very long amounts of time (if the IMAP server has something go wrong with it), you need to add some sort of timeout yourself through means outside of imaplib, either something like signal.setitimer() with a SIGALRM handler or through manipulating the underlying socket to set timeouts on it (although I've read that this causes problems, and anyway you're normally going to be trying to work through SSL as well). For my own program I opted to go the SIGALRM route, but I have the advantage that the only thing I'm doing is IMAP. A more sophisticated program might not want to blow itself up with a SIGALRM just because the IMAP side of things was too slow.
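The SIGALRM route can be sketched as a little wrapper function (Unix only, main thread only; Timeout and with_timeout are my own names, not anything from imaplib):

```python
import signal

class Timeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise Timeout("operation timed out")

def with_timeout(seconds, func, *args, **kwargs):
    # Impose an overall deadline on func(); if it runs too long,
    # SIGALRM fires and the handler raises Timeout out of whatever
    # func() was blocked in.
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
```

The whole check then runs as, say, with_timeout(30, logincheck, host, user, pw), accepting that a timeout abandons the IMAP conversation wherever it happens to be.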

Timeouts aren't something that I used to think about when I wrote programs that were mostly run interactively and did only one thing, where the timeout is most sensibly imposed by the user hitting Ctrl-C to kill the entire program. Automated testing programs and other, similar things care a lot about timeouts, because they don't want to hang if something goes wrong with the server. And in fact it is possible to cause imaplib to hang for a quite long time in a very simple way:

m = imaplib.IMAP4_SSL(host=host, port=443)

You don't even need something that actually responds and gets as far as establishing a TLS session; it's enough for the TCP connection to be accepted. This is reasonably dangerous, because 'accept the connection and then hang' is more or less the expected behavior for a system under sufficiently high load (accepting the connection is handled in the kernel, and then the system is too loaded for the IMAP server to run).

Overall I've wound up feeling that the imaplib module is okay for simple, straightforward uses but it's not really a solid base for anything more. Sure, you can probably use it, but you're also probably going to be patching things and working around issues. For us, using imaplib and papering over these issues is the easiest way forward, but if I wanted to do more I'd probably look for a third party module (or think about switching languages).

ImaplibTwoAnnoyances written at 00:33:00


A few notes on using SSL in Python 3 client programs

I was recently writing a Python program to check whether a test account could log into our IMAP servers and to time how long it took (as part of our new Prometheus monitoring). I used Python because it's one of our standard languages and because it includes the imaplib module, which did all of the hard work for me. As is my usual habit, I read as little of the detailed module documentation as possible and used brute force, which means that my first code looked kind of like this:

try:
  m = imaplib.IMAP4_SSL(host=host)
  m.login(user, pw)
except ...:

When I tried out this code, I discovered that it was perfectly willing to connect to our IMAP servers using the wrong host name. At one level this is sort of okay (we're verifying that the IMAP TLS certificates are good through other checks), but at another it's wrong. So I went and read the module documentation with a bit more care, where it pointed me to the ssl module's "Security considerations" section, which told me that in modern Python, you want to supply a SSL context and you should normally get that context from ssl.create_default_context().

The default SSL context is good for a client connecting to a server. It does certificate verification, including hostname verification, and has officially reasonable defaults, some of which you can see in ctx.options of a created context, and also ctx.get_ciphers() (although the latter is rather verbose). Based on the module documentation, Python 3 is not entirely relying on the defaults of the underlying TLS library. However the underlying TLS library (and its version) affects what module features are available; you need OpenSSL 1.1.0g or later to get SSLContext.minimum_version, for example.
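You can verify a couple of these client-side defaults directly on a freshly created context:

```python
import ssl

ctx = ssl.create_default_context()
# The defaults for a client: verify the peer's certificate chain and
# check that the hostname matches the certificate.
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname is True
# ctx.options and ctx.get_ciphers() show the rest of the tuning.
```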

It's good that people who care can carefully select ciphers, TLS versions, and so on, but it's better that this seems to have good defaults (especially if we want to move away from the server dictating cipher order). I considered explicitly disabling TLSv1 in my checker, but decided that I didn't care enough to tune the settings here (and especially to keep them tuned). Note that explicitly setting a minimum version is a dangerous operation over the long term, because it means that someday you're lowering the minimum version instead of raising it.

(Today, for example, you might set the minimum version to TLS v1.2 and increase your security over the defaults. Then in five years, the default version could change to TLS v1.3 and now your unchanged code is worse than the defaults. Fortunately the TLS version constants do compare properly so far, so you can write code that uses max() to do it more or less right.)
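The max() version looks like this (SSLContext.minimum_version needs Python 3.7 and a recent enough OpenSSL, as noted above):

```python
import ssl

ctx = ssl.create_default_context()
# Raise the floor to TLS 1.2 if the default is lower, but never lower
# it below whatever the default already is.
ctx.minimum_version = max(ctx.minimum_version, ssl.TLSVersion.TLSv1_2)
```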

Python 2.7 also has SSL contexts and ssl.create_default_context(), starting in 2.7.9. However, use of SSL contexts is less widespread than it is in Python 3 (for instance the Python 2 imaplib doesn't seem to support them), so I think it's clear you want to use Python 3 here if you have a choice.

(It seems a little bit odd to still be thinking about Python 2 now that it's less than a year to it being officially unsupported by the Python developers, but it's not going away any time soon and there are probably people writing new code in it.)

Python3SSLInClients written at 01:53:36


I have somewhat mixed feelings about Python 3's socket module errors

Many years ago I wrote about some things that irritated me about Python 2's socket module. One of my complaints was that many instances of socket.error should actually be IOError or OSError instead of a separate type, because that's what they really were. Today I was reading AdamW’s Debugging Adventures: Python 3 Porting 201 (via), where I discovered in a passing mention that in Python 3, socket.error is a deprecated alias of OSError.

(Well, from Python 3.3 onwards, due to PEP 3151.)

On the one hand, this is a change that I cautiously approve of. Many socket errors are just operating system errors, especially on Unix. On the other hand, in some ways this makes socket.herror and socket.gaierror feel worse. Both of these violate the rule of leaving IOError and OSError alone, because they are subclasses of OSError that do not have authentic errno values and are not quite genuine OS errors in the same way (they are errors from the C library, but they don't come from errno). They do have errno and strerror fields, which is something, but then I think all subclasses of OSError do these days.
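All of this is easy to confirm from the interpreter:

```python
import socket

# In Python 3.3 and later, socket.error is literally OSError.
assert socket.error is OSError
# The resolver errors are OSError subclasses, even though they carry
# C library errors rather than genuine errno values.
assert issubclass(socket.gaierror, OSError)
assert issubclass(socket.herror, OSError)
```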

Somewhat to my surprise, when I looked at the Python 2 socket module I discovered that socket.error is now a subclass of IOError (since Python 2.6, which in practice means 'on any system with Python 2 that you actually want to use'). Python 2 also has the same issue where socket.herror and socket.gaierror are subclasses of socket.error but are not real operating system errors.

Unfortunately for my feelings about leaving OSError alone, the current situation in the socket module is probably the best pragmatic tradeoff. Since the module has high level interfaces that can fail in multiple ways that result in different types of errors, in practice people want to be able to just catch one overall error and be done with it, which means that socket.gaierror really needs to be a subclass of socket.error. When you combine this with socket.error really being some form of OSError, you arrive at the current state of affairs.

I've decided that I don't have a strong opinion on socket.error changing from being a subclass of IOError/OSError to being an alias for it. I can imagine Python code that might want to use try at a high level, call both socket functions and other OS functions within that high level try, and distinguish between the two sources of errors, which is now impossible in Python 3, but I'm not sure that this is a desirable pattern. I don't think I have anything like this in my own Python code, but it's something that I should keep an eye out for as I convert things over to Python 3.

(I do have some Python 2 code that catches both socket.error and EnvironmentError, but fortunately it treats them the same.)

Python3SocketErrors written at 00:27:21


Thinking about DWiki's Python 3 Unicode issues

DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).

The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set that the HTML will declare, but it pretty much treats everything as bytes and makes no attempt to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads and writes data, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).

The primary source of text data for DWiki is the text of pages and comments. Here in 2018, the only sensible encoding for these is UTF-8, and I should probably just hardcode that assumption into reading them from the filesystem (and writing comments out to the filesystem). Relying on Python's system encoding setting, whatever it is, seems not like a good idea, and I don't think this should be settable in DWiki's configuration file. UTF-8 also has the advantage for writing things out that it's a universal encoder; you can encode any Unicode str to UTF-8, which isn't true of all character encodings.
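That universality is easy to demonstrate; any str encodes to UTF-8, while narrower encodings can fail:

```python
s = "caf\u00e9 \u2260 3"    # "café ≠ 3"

# Any Unicode str can be encoded to UTF-8...
assert s.encode("utf-8") == b"caf\xc3\xa9 \xe2\x89\xa0 3"
# ...but encodings like Latin-1 or ASCII reject characters outside
# their repertoire.
try:
    s.encode("latin-1")
except UnicodeEncodeError:
    pass
```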

Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.

(In general, I probably want to use the 'backslashreplace' error handling option when decoding to Unicode, because that's the option that both produces correct results and preserves as much information as possible. Since this introduces extra backslashes, I'll have to make sure they're all handled properly.)
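For instance, decoding a byte-string filename this way never raises and keeps the values of any bad bytes visible (decode_name is a hypothetical helper):

```python
def decode_name(name_bytes):
    # Filenames on Unix are bytes; assume UTF-8, but turn any byte
    # sequence that doesn't decode into visible \xNN escapes instead
    # of failing.
    return name_bytes.decode("utf-8", errors="backslashreplace")
```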

For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).
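One way to keep the ETag tied to the served bytes is to encode the rendered HTML exactly once and hash that (a sketch with made-up names, using SHA-1 purely as an example hash):

```python
import hashlib

def finish_page(html_text):
    # Encode the Unicode HTML to UTF-8 once, then compute the ETag
    # from those same bytes, so the validator always matches the body
    # that actually goes over the wire.
    body = html_text.encode("utf-8")
    etag = '"%s"' % hashlib.sha1(body).hexdigest()
    return body, etag
```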

DWiki interacts with the HTTP world through WSGI, although it's all my own WSGI implementation in a normal setup. PEP 3333 clarifies WSGI for Python 3, and it specifies two sides of things here: what types are used where, and some information on header encoding. For output, generally my header values will be in ISO-8859-1; however, for some redirections, the Location: header might include UTF-8 derived from filenames, and I'll need to encode it properly. Handling incoming HTTP headers and bodies is going to be more annoying and perhaps more challenging. People and programs may well send me malformed headers that aren't properly encoded, and for POST requests (for example, for comments) there may be various encodings in use, along with the possibility that the data is not correctly encoded (eg it claims to be UTF-8 but doesn't decode properly). In theory I might be able to force people to use UTF-8 on comment submissions, and probably most browsers would accept that.
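One safe way to handle the Location: case is to percent-encode the UTF-8 bytes of the path, so the header value is pure ASCII and trivially satisfies PEP 3333's latin-1 requirement (a sketch; the function name is made up):

```python
from urllib.parse import quote

def utf8_location(path):
    # quote() percent-encodes the path's UTF-8 bytes while leaving
    # '/' alone, producing a plain-ASCII header value.
    return quote(path, safe="/")
```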

Since I don't actually know what happens in the wild here, a sensible first pass Python 3 implementation should probably log and reject with an HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode. If I see any significant quantity of these that appear legitimate, I can add code that tries to handle the situation.
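That first-pass policy might look like this sketch (the name is made up, and turning a failure into a logged HTTP 400 is left to the caller):

```python
def decode_comment(raw_body):
    # Strict first pass: insist on UTF-8 and let the caller turn a
    # None result into a logged HTTP 400 rejection.
    try:
        return raw_body.decode("utf-8")
    except UnicodeDecodeError:
        return None
```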

(Possibly I should start by adding code to the current Python 2 version of DWiki that looks for this situation and logs information about it. That would give me a year or two of data at a minimum. I should also add an accept-charset attribute to the current comment form.)

DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects using bytestrings and in binary mode, without trying to encode or decode the data (in my current code, I read and write the pickled data myself, not through the pickle module).
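Reading and writing the pickled data as raw bytes might look like this sketch (function names are made up):

```python
import pickle

def write_cache(path, obj):
    # Binary mode end to end: pickle produces bytes and we write them
    # untouched, with no text encoding layer in between.
    with open(path, "wb") as f:
        f.write(pickle.dumps(obj))

def read_cache(path):
    # Read the raw bytes back and unpickle them; again no text
    # decoding is ever applied.
    with open(path, "rb") as f:
        return pickle.loads(f.read())
```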

The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).
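In a Unicode world the escaping can operate directly on code points; here is a sketch of filtering out the control characters that XML 1.0 forbids (my assumption of what the escaping needs to cover, not DWiki's actual code):

```python
import re

# Control characters other than tab, newline, and carriage return
# are not valid in XML 1.0 and would break Atom feeds.
_CTRL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def escape_ctrl(text):
    # Replace each forbidden control character with a visible \xNN
    # escape instead of emitting it raw.
    return _CTRL.sub(lambda m: "\\x%02x" % ord(m.group()), text)
```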

Since in real life most things are properly encoded and even mostly ASCII, mistakes in all of this might lurk undetected for some time. To deal with this, I should set up two torture test environments for DWiki, one where there is UTF-8 everywhere I can think of (including in file and directory names) and one where there is incorrectly encoded UTF-8 everywhere I can think of (or things just not encoded as UTF-8, but instead Latin-1 or something). Running DWiki against both of these would smoke out many problems and areas I've missed. I should also put together some HTTP tests with badly encoded headers and comment POST bodies and so on, although I'm not sure what tools are available to create deliberately incorrect HTTP requests like that.
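Setting up the badly-encoded side of such a torture test is straightforward on Unix, where filenames are ultimately just bytes (a sketch; the function takes a bytes path so Python applies no filename encoding of its own):

```python
import os

def make_torture_files(root):
    # 'root' is a bytes path. Create one valid UTF-8 filename and one
    # deliberately broken one; using bytes names bypasses any filename
    # encoding Python would otherwise apply.
    os.makedirs(root, exist_ok=True)
    for name in (b"caf\xc3\xa9.txt", b"not-utf8-\xff.txt"):
        with open(os.path.join(root, name), "wb") as f:
            f.write(b"test content\n")
```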

All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).

PS: Rereading my old entry has also reminded me that there's DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).

Sidebar: DWiki's rendering templates and static file serving

DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, and a template that fails to decode properly should be treated as a fatal internal error.
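A sketch of that policy, with a hypothetical TemplateError standing in for whatever internal-error mechanism is actually used:

```python
class TemplateError(Exception):
    # Hypothetical stand-in for a fatal internal error.
    pass

def load_template(path):
    # Templates are declared UTF-8; a decode failure is a problem with
    # the wiki's own files, not with user input, so it's fatal.
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            return f.read()
    except UnicodeDecodeError as e:
        raise TemplateError("template %s is not valid UTF-8: %s" % (path, e))
```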

DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.
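Once the storage layer is split this way, the static-file path reduces to a plain binary read (a sketch):

```python
def load_static(path):
    # Images and other static files are opaque bytes; read them in
    # binary mode and serve them back without any decode step.
    with open(path, "rb") as f:
        return f.read()
```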

DWikiPython3UnicodeIssues written at 01:01:29
