Binary Buffers and Python Array Performance

Pre-allocate and Re-use for Performance

Working with binary packed data is typically reserved for high-performance situations or for passing data into and out of extensions. You can optimize further by avoiding the overhead of allocating a new buffer for each structure. The pack_into() and unpack_from() methods let you write to and read from a pre-allocated buffer, such as a ctypes string buffer or a Python array, directly.

Here’s a common example of packing that I’ve fleshed out to include a few extra data types. You should probably use a ctypes string buffer for the container whenever performance is critical, but the difference between a Python array and a ctypes string buffer was small in my testing, so either works for most cases.

#!/usr/bin/python -Ott
import array
import ctypes
import struct
import binascii

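# '4s' reserves four bytes for the string; a bare 's' holds only one byte
# and would silently truncate 'word' to 'w'.
s = struct.Struct('c ? I f 4s')
values = ('z', True, 4, 2.54, 'word')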
print 'ctypes.string_buffer:'

b = ctypes.create_string_buffer(s.size)
print 'Before packing  :', binascii.hexlify(b.raw)

s.pack_into(b, 0, *values)
print 'After packing  :', binascii.hexlify(b.raw)
print 'Unpacked:', s.unpack_from(b, 0)
print 'array:'

a = array.array('c', '\0' * s.size)
print 'Before packing :', binascii.hexlify(a)

s.pack_into(a, 0, *values)
print 'After packing  :', binascii.hexlify(a)
print 'Unpacked:', s.unpack_from(a, 0)
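
The listing above targets Python 2. For readers on Python 3, where the array typecode 'c' no longer exists and the 'c' and 's' formats require bytes, a rough equivalent might look like the sketch below. The names record, c_buf and a_buf are mine, not part of the original example, and the unsigned-byte typecode 'B' is my substitute for 'c'.

```python
import array
import binascii
import ctypes
import struct

# '4s' reserves four bytes for the string field; Python 3 needs bytes
# literals for the 'c' and 's' formats.
record = struct.Struct('c ? I f 4s')
values = (b'z', True, 4, 2.54, b'word')

# Container 1: a ctypes string buffer, zero-filled when created.
c_buf = ctypes.create_string_buffer(record.size)
record.pack_into(c_buf, 0, *values)
print('ctypes  :', binascii.hexlify(c_buf.raw))
print('Unpacked:', record.unpack_from(c_buf, 0))

# Container 2: an array of unsigned bytes (assumed stand-in for typecode 'c').
a_buf = array.array('B', bytes(record.size))
record.pack_into(a_buf, 0, *values)
print('array   :', binascii.hexlify(a_buf))
print('Unpacked:', record.unpack_from(a_buf, 0))
```

The float comes back as the nearest 32-bit value rather than exactly 2.54, because the 'f' format stores single precision.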

Performance Testing:

I tested both buffer types by filling a Python array and a ctypes string buffer in a loop, calling pack_into() and unpack_from() 250,000 times with an incrementing double. It’s a silly example, but the Python array took a little over 25 seconds to fill, while the ctypes buffer took 22 seconds.

For comparison, I also tested reallocating the array on each iteration of the loop. The total time jumped to a minute and a half.
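
The original test script isn’t shown, so the following is only a sketch of how such a comparison could be set up in Python 3 with timeit. The function names, the loop size COUNT, and my reading of "reallocation" (building a fresh full-size array on every iteration) are all assumptions, and absolute timings will differ from the numbers above.

```python
import array
import ctypes
import struct
import timeit

DOUBLE = struct.Struct('d')   # one 8-byte double per record
COUNT = 5000                  # assumed loop size; the test above used 250,000

def fill_preallocated(buf):
    """Pack and unpack COUNT doubles using a buffer allocated once."""
    total = 0.0
    for i in range(COUNT):
        offset = i * DOUBLE.size
        DOUBLE.pack_into(buf, offset, float(i))
        total += DOUBLE.unpack_from(buf, offset)[0]
    return total

def fill_reallocating():
    """Same work, but build a fresh full-size array on every iteration."""
    total = 0.0
    for i in range(COUNT):
        buf = array.array('B', bytes(COUNT * DOUBLE.size))
        offset = i * DOUBLE.size
        DOUBLE.pack_into(buf, offset, float(i))
        total += DOUBLE.unpack_from(buf, offset)[0]
    return total

# Both containers are allocated once, outside the timed code.
c_buf = ctypes.create_string_buffer(COUNT * DOUBLE.size)
a_buf = array.array('B', bytes(COUNT * DOUBLE.size))

cases = (
    ('ctypes, pre-allocated', lambda: fill_preallocated(c_buf)),
    ('array, pre-allocated', lambda: fill_preallocated(a_buf)),
    ('array, reallocated', fill_reallocating),
)
for label, func in cases:
    seconds = timeit.timeit(func, number=3)
    print('%-22s %.4f s' % (label, seconds))
```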

A more specific example:

Packed binary data that is decoded on both ends of a messaging interface can convey a lot of information, and it is quite a bit more efficient than sending structures full of larger items such as additional arrays and doubles. A binary payload saves a lot of bandwidth when you’re sending messages very frequently.
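
To get a feel for the size difference, here is a small Python 3 sketch that encodes one message both as a packed struct and as JSON text. The message layout and field names (sensor_id, sequence, timestamp, readings) are hypothetical, and JSON is only one of many text encodings you might compare against.

```python
import json
import struct

# Hypothetical message: sensor id, sequence number, timestamp, three readings.
# '<' selects little-endian with no padding, so both ends agree on the layout.
MESSAGE = struct.Struct('<H I d 3f')

sample = (7, 1024, 1700000000.25, 1.5, -2.25, 3.125)

# Binary form: 2 + 4 + 8 + 3 * 4 = 26 bytes.
binary_payload = MESSAGE.pack(*sample)

# Text form of the same fields, for comparison.
text_payload = json.dumps({
    'sensor_id': sample[0],
    'sequence': sample[1],
    'timestamp': sample[2],
    'readings': list(sample[3:]),
}).encode('utf-8')

print('struct bytes:', len(binary_payload))
print('JSON bytes  :', len(text_payload))
```

The packed form carries no field names or delimiters, which is where most of the saving comes from; the cost is that both sides have to agree on the format string ahead of time.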

And besides, you can’t always control the incoming data type when you need to talk to another piece of code through a messaging interface.
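
When the layout is dictated by the other side, decoding usually comes down to an agreed format string plus unpack_from() on whatever bytes arrive. The frame layout below (a type byte, a count, then that many big-endian doubles) is invented for illustration, as are the names HEADER, recv_buf and decode_frame. The frame is written into the buffer by hand here, standing in for a call such as socket.recv_into().

```python
import struct

# Hypothetical header in network byte order: message type, number of doubles.
HEADER = struct.Struct('!B H')

# Receive buffer allocated once and reused for every incoming frame.
recv_buf = bytearray(1024)

def decode_frame(buf):
    """Decode one frame: the header, then `count` big-endian doubles."""
    msg_type, count = HEADER.unpack_from(buf, 0)
    values = struct.unpack_from('!%dd' % count, buf, HEADER.size)
    return msg_type, values

# Simulate a frame arriving from the other end of the interface.
frame = HEADER.pack(2, 3) + struct.pack('!3d', 0.5, 1.5, 2.5)
recv_buf[:len(frame)] = frame

print(decode_frame(recv_buf))
```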