Python 3 supports not churning memory on IO

September 17, 2018

I am probably late to this particular party, just as I am late to many Python 3 things, but today (in the course of research for another entry) I discovered the pleasant fact that Python 3 now supports read and write IO to and from appropriate pre-created byte buffers. This is supported at the low level and also at the high level with file objects (as covered in the io module).

In Python 2, one of the drawbacks of Python for relatively high performance IO-related code was that reading data always required allocating a new string to hold it, and changing what you were writing also required new strings (you could write the same byte string over and over again without memory allocation, although not necessarily a Unicode string). Python 3's introduction of mutable bytestring objects (aka 'read-write bytes-like objects') means that we can bypass both issues now. For reading, you can read data into an existing mutable bytearray (or a suitable memoryview), or a set of them. For writing, you can write a mutable bytestring and then mutate it in place to write different data a second time. This probably doesn't help much if you're generating entirely new data (unless you can do it piece by piece), but it's great if you only need to change a bit of the data to write a new chunk of stuff.
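As a minimal sketch of both halves of this, using an in-memory stream to stay self-contained (the names 'buf', 'stream', and 'chunk' here are illustrative, not anything from the io module itself):

```python
import io

# Reading: fill a pre-allocated bytearray in place instead of having
# each read allocate a fresh bytes object.
stream = io.BytesIO(b"hello world")
buf = bytearray(64)            # allocated once, reused for every read
n = stream.readinto(buf)       # fills buf in place, returns bytes read
first = bytes(buf[:n])         # copy out only if you need an immutable copy

# Writing: mutate one bytearray in place between writes instead of
# building a new bytes object for every chunk.
out_stream = io.BytesIO()
chunk = bytearray(b"chunk-0\n")
for i in range(3):
    chunk[6] = ord("0") + i    # change just the byte that differs
    out_stream.write(chunk)
```

The same `readinto()` call works on real binary file objects (and `os.readv()` takes a sequence of such buffers at the low level).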

One obvious question here is how you limit how much data you read. Python modules in the standard library appear to have taken two different approaches to this. The os module and the io module use the total size of the pre-allocated buffer or buffers you've provided as the only limit. The socket module defaults to the size of the buffer you provide, but allows you to further limit the amount of data read to below that. This initially struck me as odd, but then I realized that network protocols often have situations where you know you want only a few more bytes in order to complete some element of a protocol. Limiting the amount of data read below the native buffer size means that you can have a single maximum-sized buffer while still doing short reads if you only want the next N bytes.

(If I'm understanding things right, you could do this with a memoryview of explicitly limited size. But this would still require a new memoryview object, and they take up a non-trivial amount of space; sys.getsizeof() on a 64-bit Linux machine says they're 192 bytes each. A bytearray's fixed overhead is actually smaller, apparently coming in at 56 bytes for an empty one and 58 bytes for one with a single byte in it.)
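You can check the relative overheads for yourself; the exact numbers vary by Python version and platform, so treat the figures above as illustrative:

```python
import sys

# Per-object overhead of a memoryview versus a bytearray. On CPython
# a memoryview is noticeably bigger than an empty bytearray.
mv_size = sys.getsizeof(memoryview(bytearray(1)))
ba_empty = sys.getsizeof(bytearray())
ba_one = sys.getsizeof(bytearray(b"x"))
```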

Sidebar: Subset memoryviews

Suppose you have a big bytearray object, and you want a memoryview of the first N bytes of it. As far as I can see, you actually need to make two memoryviews:

>>> b = bytearray(200)
>>> b[0:4]
bytearray(b'\x00\x00\x00\x00')
>>> m = memoryview(b)
>>> ms = m[0:30]
>>> ms[0:4] = b'1234'
>>> b[0:4]
bytearray(b'1234')

It is tempting to do 'memoryview(b[0:30])', but that creates a copy of the bytearray that you then get a memoryview of, so your change doesn't actually change the original bytearray (and you're churning memory). Of course if you intend to do this regularly, you'd create the initial memoryview up front and keep it around for the lifetime of the bytearray itself.
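The copy pitfall is easy to demonstrate; this is a sketch of the wrong and right orderings:

```python
b = bytearray(200)

# Wrong: b[0:30] is a new bytearray (a copy), so the memoryview
# wraps the copy and writes through it never reach b.
wrong = memoryview(b[0:30])
wrong[0:4] = b"1234"

# Right: make the memoryview of b first, then slice the memoryview.
# The slice is still a view onto b's storage.
right = memoryview(b)[0:30]
right[0:4] = b"5678"
```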

I'm a little bit surprised that memoryview objects don't have support for creating subset views from the start, although I'm sure there are good reasons for it.
