Categories
Software Development

On Python and Pickles

Currently foremost in my mind has been my annoyances with Python.

My current gripes have been with pickle.

Rather than taking a conventional approach and devising a fixed protocol/markup for describing the objects and their state, they invented a small stack based machine which the serialisation library writes bytecode to drive in order to restore the object state.

If this sounds like overengineering, that’s because it is. It’s also overengineering that’s introduced potential security problems which are difficult to protect against.

Worse than this, rather than throwing out this mess and starting again when it was obvious that it wasn’t meeting their requirements, they just continued to extend it, introducing more opcodes.

Nevermind that when faced up against simpler serialisation approaches, such as state marshalling via JSON, it’s inevitably slower, and significantly more dangerous.

And then people like the celery project guys go off and make pickle the default marshalling format for their tools rather than defaulting to JSON (which they also support).

Last week, I got asked to assist with interpreting pickle data so we could peek into job data that had been queued with Celery. From Ruby.  The result was about 4 hours of swearing and a bit of Ruby coding to produce unpickle. I’ve since tidied it up a bit, written some more documentation, and published it (with permission from my manager of course).

For anybody else who ever has to face off against this ordeal, there’s enough documentation inside the python source tree (see Lib/pickletools.py and Lib/pickle.py) that you can build the pickle stack machine without having to read too much of the original source.  It also helps if you are familiar with Postscript as the pickle machine’s dictionary, tuple and list constructors work very similarly to Postscript’s array and dictionary constructs (right down to the use of a stack mark during construction).