Wandering Thoughts archives

2024-01-19

A Django gotcha with Python 3 and the encoding of CharFields

We have a long standing Django web application, which has been static for a long time in Python 2. I've recently re-started working on moving it to Python 3 (an initial round of work was done some time ago), and in the process of this I ran into a surprising issue involving text encoding and database text fields (a 'CharField' in Django terminology).

As part of one database model, we have a random key, which for various reasons is represented in the database model as a modest size string:

class Request(models.Model):
  [...]
  access_hash = models.CharField(..., default=gen_random_hash)

(We use SQLite as our database, which may be relevant here.)

The actual access hash is a 64-bit random value read from /dev/urandom. We could represent this in a variety of ways; for instance, I could have just treated it as a 64-bit unsigned decimal number in string form, or a 64-bit (unsigned) hex number. But for no particularly strong reason, long ago I decided to base64 encode the raw random value. Omitting error checking, the existing version of this is:

def gen_random_hash():
  fp = open("/dev/urandom", "rb")
  c = fp.read(8)
  fp.close()
  # Trim off trailing awkward '=' character
  return base64.urlsafe_b64encode(c)[:-1]

(The use of "rb" as the open mode stems from my first round of updates for Python 3.)

When I ran our web application under Python 3 in testing mode and looked at the uses of this access hash, I discovered that the URLs we were generating for it in email included a couple of ''' in them. Inspection of the database table entry itself in the Django admin interface showed that the actual value for access hash was, for example (and this is literal):

b'm5AWGlUSR1c'

(0x27 is ', so I was getting "b'm5AWGlUSR1c&#x27" in the URL when written out for email in HTML. Once I looked I could see that giveaway leading 'b' too.)

There are two things going on here. The first is that in Python 3, base64.urlsafe_b64encode operates on bytes (which we're giving it since we read /dev/urandom in binary mode, making c a bytes object) and returns a bytes object, not a string. The second thing is that when we ask Django to store a bytes object in a CharField (possibly only as the result of a callable default value), Django string-ized it, yielding the b'...' form as the stored value.

At one level this is reasonable; Django is doing its best to store some random Python object into a string field, probably by just doing a str() on it. At another level, I wish Django would specifically refuse to do this conversion for bytes objects, because one Python 3 issue is definitely bytes/str confusions and this specific representation conversion is almost certainly a bad idea, unlike str()'ing things in general. Raising an exception by default would be much more useful.

The solution is to explicitly convert to a Unicode str and specify a suitable character set encoding, which for base64 can be 'ascii':

return str(base64.urlsafe_b64encode(c)[:-1], "ascii")

This causes the CharField values to look like they should, which means that URLs using the access hash no longer have ''' in them.

Hopefully there aren't any other cases of this lurking in our Django web application, but I suppose I should do some more testing and examine the database for alarming characters (which is relatively readily done with the management dumpdata command).

python/DjangoPython3FieldEncodingGotcha written at 22:36:51;


Page tools: See As Normal.
Search:
Login: Password:

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.