What encoding the syslog module uses in Python 3
Suppose that you're writing a Python 3 program that is going to
syslog() some information through the syslog module. Given that one of
the cardinal rules of Python 3 is that you should explicitly consider
encoding issues any time you send data to the outside world and
syslog() definitely does this, the immediate questions to ask are how
syslog handles encoding the Unicode string you're giving it and if it
can ever raise a Unicode encoding error.
(Anything that can raise a Unicode encoding error on output needs
to be carefully guarded so that it doesn't blow up your program
some day in an obscure corner case. It would suck to have your
entire program abort with an uncaught exception as it tried to
syslog() some little monitoring message that wound up with peculiar
content this time.)
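As a rough illustration of what "carefully guarded" might look like, here is a minimal sketch of a wrapper that retries with an escaped, pure-ASCII version of the message if the first attempt raises a Unicode error. The names safe_syslog and emit are hypothetical and exist only for this example; emit is there so that the logging call can be swapped out in tests.

```python
import syslog


def safe_syslog(priority, message, emit=syslog.syslog):
    """Log `message` without letting a Unicode error escape to the caller.

    `safe_syslog` and its `emit` parameter are hypothetical names used
    only in this sketch; `emit` defaults to the real syslog.syslog().
    """
    try:
        emit(priority, message)
    except UnicodeError:
        # Fall back to a lossy but always-encodable rendering: anything
        # that is not plain ASCII becomes a backslash escape.
        fallback = message.encode("ascii", "backslashreplace").decode("ascii")
        emit(priority, fallback)
```

The fallback loses readability for non-ASCII text, but it only kicks in when the straightforward call has already failed, which seems like the right trade for a monitoring message.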
The answer turns out to be that the syslog module is effectively
hard-coded to use UTF-8. In particular, it does not use whatever
Python thinks is your system's default encoding (or default filesystem
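encoding). I believe this means that syslog.syslog() can never raise a
Unicode encoding error for ordinary text. The one case I would still
test is a string containing lone surrogates, since as far as I know
Python's strict UTF-8 codec rejects those.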
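To see why the choice of UTF-8 matters, the following sketch compares it with other encodings Python might plausibly have picked. It does not call syslog at all; it only shows which encodings can represent a sample message. The names message and encodable are made up for this example.

```python
import locale
import sys

# A sample message with characters outside ASCII.
message = "temp is 21\u00b0C in Z\u00fcrich \u2713"


def encodable(text, encoding):
    """Return True if `text` can be encoded in `encoding` without error."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# These are the encodings the syslog module does NOT consult.
print("locale encoding:    ", locale.getpreferredencoding(False))
print("filesystem encoding:", sys.getfilesystemencoding())

# UTF-8 can represent every ordinary Unicode character; ASCII cannot.
print("UTF-8 ok:", encodable(message, "utf-8"))
print("ASCII ok:", encodable(message, "ascii"))
utf8_bytes = message.encode("utf-8")
```

If syslog used the locale encoding, a program run under a plain C or POSIX locale could fail on a message like this one; with UTF-8 it cannot.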
This may be documented somewhere by implication, but if so I couldn't find it in the Python 3(.5.2) documentation.
(As an occasional system programmer, I worked this out by reading the CPython source code.)
This issue is in the same general area of concern as PEP 383, but isn't really dealing
with the same issue since the syslog module only outputs data
to the outside world. Note that as far as I can see, syslog()
(and in fact any code in CPython that's like it) doesn't use the
special "surrogateescape" encoding mechanism introduced in PEP 383.
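If you take stuff in from the outside world that winds up escaped
this way and try to syslog it out, you will not get the exact initial
bytes syslog'd. At best you get a conventional UTF-8 encoding of the
string, and since Python's strict UTF-8 codec refuses lone surrogates,
such a string may well raise a Unicode encoding error instead.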
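Here is a small sketch of what surrogateescape does and why it interacts badly with a plain UTF-8 encode. It uses only str and bytes methods, not the syslog module itself, so what it shows is the behavior of Python's codecs; whether syslog() behaves the same way on your Python version is worth checking directly. The variable names are just for illustration.

```python
# Bytes from the outside world that are not valid UTF-8 (Latin-1 "café").
raw = b"caf\xe9"

# PEP 383: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with surrogateescape recovers the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")

# A plain (strict) UTF-8 encode refuses the lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# One way to make such a string safe to log: escape the surrogate.
printable = text.encode("utf-8", "backslashreplace").decode("utf-8")
print(repr(text), round_trip, strict_ok, printable)
```

Escaping with backslashreplace before logging gives up the original bytes but keeps a visible record of what they were.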
Sidebar: How this works in the CPython source code
The syslog module is written in C. It turns the message into a C
string by calling _PyUnicode_AsString, which is really
PyUnicode_AsUTF8. This uses the encoding you expect it to given
its name.
This implies that anything in the CPython source code that's turning a Python string into a C string through this function is actually using UTF-8. There seem to be a decent number of such places, although I haven't looked in detail. This doesn't particularly surprise me, as it seems that CPython has made an executive decision here that UTF-8 will be the encoding for a number of cases where it needs to pick some encoding (ideally one that will never produce errors).
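If you want to poke at PyUnicode_AsUTF8 without reading C, one option is to call it through ctypes. This is a CPython-only sketch: ctypes.pythonapi exposes the C API of the running interpreter, and the argtypes and restype settings below are my own declarations for the call, not anything the syslog module does. The names as_utf8, message and c_bytes are illustrative.

```python
import ctypes

# Bind the C API function from the running CPython interpreter.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p  # ctypes copies the C string into bytes

message = "na\u00efve \u2603"
c_bytes = as_utf8(message)
print(c_bytes)  # the same bytes as message.encode("utf-8")

# Check how the function treats a lone surrogate: ctypes.pythonapi
# re-raises any Python exception that the C function sets.
try:
    as_utf8("bad \udcff byte")
    surrogate_ok = True
except UnicodeEncodeError:
    surrogate_ok = False
print("lone surrogate accepted:", surrogate_ok)
```

The second half is the interesting part for anyone relying on this function never failing, since it shows directly what the function does with surrogate-escaped input.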