Python 3 supports not churning memory on IO
I am probably late to this particular party, just as I am late to
many Python 3 things, but today (in the course of research for
another entry) I discovered the pleasant fact that Python 3 now
supports read and write IO to and from appropriate pre-created byte
buffers. This is supported at the low level and also at the high
level with file objects (as covered in the io
module).
In Python 2, one of the drawbacks of Python for relatively high-performance IO-related code was that reading data always required allocating a new string to hold it, and writing different data also required building new strings (you could write the same byte string over and over again without memory allocation, although not necessarily a Unicode string).
Python 3's mutable bytestring objects (aka 'read-write bytes-like objects') let us bypass both issues. When reading, you can read data into an existing mutable bytearray (or a suitable memoryview), or a set of them. When writing, you can write a mutable bytestring and then mutate it in place to write different data a second time. This probably doesn't help much if you're generating entirely new data (unless you can do it piece by piece), but it's great if you only need to change a bit of the data to write a new chunk of stuff.
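As a concrete sketch of both sides (the file names here are invented for illustration, and the bytes '%' formatting needs Python 3.5+):

buf = bytearray(4096)
with open("/etc/hostname", "rb") as f:
    # readinto() fills our existing buffer and returns the byte
    # count, instead of allocating a new bytes object per read.
    n = f.readinto(buf)
    data = memoryview(buf)[:n]   # view of just the bytes read, no copy

rec = bytearray(b"record 0000\n")
with open("/tmp/records.log", "wb") as f:
    for i in range(3):
        # Patch only the changing field in place; no new byte
        # string is built per iteration.
        rec[7:11] = b"%04d" % i
        f.write(rec)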
One obvious question here is how you limit how much data you read.
Python modules in the standard library appear to have taken two
different approaches to this. The os
module and the io
module use the total size of
the pre-allocated buffer or buffers you've provided as the only
limit. The socket
module defaults to the
size of the buffer you provide, but allows you to further limit the
amount of data read to below that. This initially struck me as odd,
but then I realized that network protocols often have situations
where you know you want only a few more bytes in order to complete
some element of a protocol. Limiting the amount of data read below
the native buffer size means that you can have a single maximum-sized
buffer while still doing short reads if you only want the next N
bytes.
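As a sketch of what this looks like with sockets (the host and port are placeholders), recv_into() takes an optional nbytes argument that caps the read below the buffer's size:

import socket

buf = bytearray(65536)    # one maximum-sized buffer, reused forever
sock = socket.create_connection(("example.com", 80))
sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")

# Read up to the buffer's full size:
n = sock.recv_into(buf)

# Or read at most the next 4 bytes, despite the large buffer,
# when that's all the protocol calls for right now:
n = sock.recv_into(buf, 4)
sock.close()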
(If I'm understanding things right, you could do this with a
memoryview of explicitly limited size. But this would still require
a new memoryview object, and they actually take up a non-trivial amount
of space; sys.getsizeof()
on a 64-bit Linux machine says they're
192 bytes each. A bytearray's fixed overhead is actually smaller,
apparently coming in at 56 bytes for an empty one and 58 bytes for
one with a single byte in it.)
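You can check these numbers yourself; the exact values will vary with the Python version and build:

import sys

b = bytearray(200)
print(sys.getsizeof(memoryview(b)))    # 192 here, on 64-bit Linux
print(sys.getsizeof(bytearray()))      # 56
print(sys.getsizeof(bytearray(b"x")))  # 58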
Sidebar: Subset memoryviews
Suppose you have a big bytearray object, and you want a memoryview of the first N bytes of it. As far as I can see, you actually need to make two memoryviews:
>>> b = bytearray(200)
>>> b[0:4]
bytearray(b'\x00\x00\x00\x00')
>>> m = memoryview(b)
>>> ms = m[0:30]
>>> ms[0:4] = b'1234'
>>> b[0:4]
bytearray(b'1234')
It is tempting to do 'memoryview(b[0:30])', but that creates
a copy of the bytearray that you then get a memoryview of, so your
change doesn't actually change the original bytearray (and you're
churning memory). Of course, if you intend to do this regularly,
you'd create the initial memoryview up front and keep it around for
the lifetime of the bytearray itself.
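For contrast, here's the copying version; note that the original bytearray stays untouched:

>>> b = bytearray(200)
>>> mc = memoryview(b[0:30])
>>> mc[0:4] = b'9999'
>>> b[0:4]
bytearray(b'\x00\x00\x00\x00')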
I'm a little bit surprised that memoryview objects don't have support for creating subset views from the start, although I'm sure there are good reasons for it.