What encoding the syslog module uses in Python 3

September 17, 2016

Suppose that you're writing a Python 3 program that is going to syslog() some information through the syslog module. One of the cardinal rules of Python 3 is that you should explicitly consider encoding issues any time you send data to the outside world, and syslog() definitely does this. So the immediate questions to ask are how syslog encodes the Unicode string you're giving it and whether it can ever raise a Unicode encoding error.

(Anything that can raise a Unicode encoding error on output needs to be carefully guarded so that it doesn't blow up your program some day in an obscure corner case. It would suck to have your entire program abort with an uncaught exception as it tried to syslog() some little monitoring message that wound up with peculiar content this time.)
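
As an illustration of that sort of guarding, here is a minimal sketch of a wrapper around any "send this string to the outside world" function. The names guarded_emit and emit are hypothetical, invented for this example; the fallback of logging a backslash-escaped ASCII version of the message is one possible choice, not the only reasonable one.

```python
def guarded_emit(emit, message):
    """Call emit(message), but never let an encoding error escape.

    'emit' is any callable that takes a str and sends it somewhere
    (a hypothetical stand-in for your real output function).
    """
    try:
        emit(message)
    except UnicodeError:
        # Re-render the message as pure ASCII. Anything that is not
        # ASCII (including lone surrogates) becomes a \x.., \u.... or
        # \U........ escape, which every encoding can handle.
        fallback = message.encode("ascii", "backslashreplace").decode("ascii")
        emit(fallback)
```

With the syslog module, emit could be something like lambda m: syslog.syslog(syslog.LOG_INFO, m). The escaped fallback loses the original characters but keeps the message readable and keeps the program running.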

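The answer turns out to be that the syslog module is effectively hard-coded to use UTF-8. In particular, it does not use whatever Python thinks is your system's default encoding (or default filesystem encoding). Since UTF-8 can represent every real Unicode character, I believe this means that syslog.syslog() can't raise a Unicode encoding error for any ordinary string. The one exception I know of is a string that contains lone surrogate code points (such as one produced by decoding with "surrogateescape"), which a strict UTF-8 encoder refuses to encode.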

This may be documented somewhere by implication, but if so I couldn't find it in the Python 3(.5.2) documentation.

(As an occasional system programmer, I worked this out by reading the CPython source code.)

This issue is in the same general area of concern as PEP 383, but isn't really the same issue, since the syslog module only outputs data to the outside world. Note that as far as I can see, syslog.syslog() (and in fact any code in CPython that's like it) doesn't use the special "surrogateescape" error handler introduced in PEP 383. If you take stuff in from the outside world that winds up escaped this way and try to syslog it out, you will not get the exact initial bytes syslog'd. As far as I can tell, the UTF-8 conversion is strict, so the escaped surrogates are rejected and you get a UnicodeEncodeError instead.
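
To see what different UTF-8 error handlers do with a surrogate-escaped string, here is a small pure Python sketch. It uses str.encode() as a stand-in for the C-level conversion, which is an assumption on my part; the error handler names themselves are the standard ones.

```python
# Bytes from the outside world that are not valid UTF-8 (Latin-1 'café').
raw = b"caf\xe9"

# PEP 383 style decoding: the bad byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Strict UTF-8 encoding (the default) refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "UnicodeEncodeError"

# Only the surrogateescape handler recovers the exact original bytes.
round_trip = text.encode("utf-8", "surrogateescape")

# surrogatepass encodes the surrogate itself as a three-byte sequence.
passed = text.encode("utf-8", "surrogatepass")

print(repr(text), strict_result, round_trip, passed)
```

PyUnicode_AsUTF8 takes no error handler argument, so code that calls it can't opt in to either of the last two behaviors.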

Sidebar: How this works in the CPython source code

The syslog module is written in C. It turns the message into a C string by calling _PyUnicode_AsString, which is really PyUnicode_AsUTF8. This uses the encoding you expect it to given its name.

This implies that anything in the CPython source code that's turning a Python string into a C string through this function is actually using UTF-8. There seem to be a decent number of them, although I haven't looked in detail. This doesn't particularly surprise me, as it seems that CPython has made an executive decision here that UTF-8 will be the encoding for a number of cases where it needs to pick some encoding (ideally one that will never produce errors).
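
If you want to poke at PyUnicode_AsUTF8 without writing C, one way is to call it through ctypes. This is only a sketch and it is CPython-specific; the variable names are mine, and it assumes your CPython build exports the function to ctypes.pythonapi (the usual case, but I haven't checked every build).

```python
import ctypes

# Look up the C API function in the running interpreter.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]   # takes a Python str object
as_utf8.restype = ctypes.c_char_p       # returns a NUL-terminated C string

# Non-ASCII text comes back as UTF-8 bytes, whatever your locale is.
encoded = as_utf8("na\u00efve \u2603")
print(encoded)

# A lone surrogate makes the C function fail; ctypes.pythonapi turns
# the pending C-level error into a normal Python exception.
try:
    as_utf8("bad \udcff")
    surrogate_ok = True
except UnicodeEncodeError:
    surrogate_ok = False
print(surrogate_ok)
```

Because the result is converted from a C string, anything after an embedded NUL character would be cut off, which is a separate hazard from encoding errors.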

Written on 17 September 2016.

Last modified: Sat Sep 17 00:17:35 2016