Working with compressed JSON in Python

R A
Jan 22, 2019

[This post is entirely GPT-free]

A long JSON file can contain many redundancies, especially repeating keys.

For example, the real-time bus GPS data for Taipei during daytime contains hundreds of records of the following kind:

The whole unformatted file is 1'260'575 bytes long. Zipping the file results in an archive of 78'478 bytes. Thus, even before dumping the whole JSON record to disk, it makes sense to compress it. We will discuss how to do that in this note. The code is available here and here.
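To get a feel for how much repeated keys help the compressor, here is a small sketch on synthetic data. The records and their field names are made up for illustration and are not the actual fields of the Taipei bus feed:

```python
import json
import random
import zlib

# Hypothetical records: the field names below are assumptions,
# chosen only to mimic a list of objects that all share the same keys.
random.seed(0)
records = [
    {
        "PlateNumber": f"BUS-{i:04d}",
        "Latitude": round(25.0 + random.random() / 10, 6),
        "Longitude": round(121.5 + random.random() / 10, 6),
        "Speed": random.randint(0, 60),
    }
    for i in range(1000)
]

raw = json.dumps(records).encode("utf-8")  # unformatted JSON as bytes
packed = zlib.compress(raw)                # same bytes, compressed

# Print both sizes and the compression ratio
print(len(raw), len(packed), round(len(raw) / len(packed), 1))
```

The exact ratio depends on the data; the point is only that the repeated key strings cost very little once compressed.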

Compression

We first discuss the compression procedure in the interactive Python shell. We will use the following packages:

>>> import zlib, json
>>> from base64 import b64encode, b64decode

Let’s start with a JSON object, like this:

>>> original = {'a': "A", 'b': "B"}

This is the object to compress (evaluating it also stores it in the shell's _ variable, which the following steps use):

>>> original
{'a': 'A', 'b': 'B'}

First, serialize the original object into a string:

>>> json.dumps(_)
'{"a": "A", "b": "B"}'

Next, encode it as a bytes object:

>>> _.encode('utf-8')
b'{"a": "A", "b": "B"}'

Compress the bytes into another bytes object:

>>> zlib.compress(_)
b'x\x9c\xabVJT\xb2RPrT\xd2QPJ\x02\xb1\x9c\x94j\x01-\xea\x04O'

Shrink the character range to 64 human-readable characters for easier inspection and handling:

>>> b64encode(_)
b'eJyrVkpUslJQclTSUVBKArGclGoBLeoETw=='

Interpret the bytes object as a string:

>>> _.decode('ascii')
'eJyrVkpUslJQclTSUVBKArGclGoBLeoETw=='

Package the result in a dictionary to include some mnemonic meta-info about the object:

>>> {'base64(zip(o))': _}
{'base64(zip(o))': 'eJyrVkpUslJQclTSUVBKArGclGoBLeoETw=='}

Coming back to the real-time bus data, the size of the compressed object is down to about 155 kB from 1.26 MB. This is roughly twice the size of the plain zipped file, partly because base64 encoding adds about a third to the size, but it is still about eight times smaller than the original.
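To see how much base64 contributes by itself, here is a small check on arbitrary bytes (random data as a stand-in, not the bus data). It relies only on the fact that base64 maps every 3 input bytes to 4 output characters:

```python
import os
from base64 import b64encode

payload = os.urandom(30000)   # stand-in for some compressed bytes
encoded = b64encode(payload)

# Every group of 3 bytes (the last one padded) becomes 4 characters.
expected = 4 * ((len(payload) + 2) // 3)

print(len(payload), len(encoded), len(encoded) / len(payload))
```

So base64 alone accounts for a growth factor of about 4/3, whatever the input.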

This is the whole compression procedure in a function:
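A minimal sketch of such a function, chaining the shell steps above. The names json_zip and ZIPJSON_KEY are chosen here for illustration and may differ from the linked code:

```python
import json
import zlib
from base64 import b64encode, b64decode

# Key under which the compressed payload is stored (same as in the shell session)
ZIPJSON_KEY = "base64(zip(o))"


def json_zip(obj):
    """Serialize, compress and base64-encode obj; wrap the result in a dict."""
    serialized = json.dumps(obj).encode("utf-8")  # object -> str -> bytes
    packed = zlib.compress(serialized)            # bytes -> compressed bytes
    return {ZIPJSON_KEY: b64encode(packed).decode("ascii")}


original = {"a": "A", "b": "B"}
zipped = json_zip(original)
print(zipped)

# Sanity check: reversing the steps gives back the original object
restored = json.loads(zlib.decompress(b64decode(zipped[ZIPJSON_KEY])))
print(restored == original)
```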

Decompression

The reverse process of decompression is then:

>>> json.loads(zlib.decompress(b64decode(_['base64(zip(o))'])))
{'a': 'A', 'b': 'B'}

Note that

  • the compression works on any object serializable by json.dumps, not just JSON objects;
  • the decompressed object may differ from the original one, for example json.loads(json.dumps((1, 2))) returns the list [1, 2] instead of the original tuple (1, 2).
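The second point is a property of JSON serialization itself, not of the compression. A short check (the helper name roundtrip is made up here; the integer-key case is an extra example of the same effect):

```python
import json


def roundtrip(obj):
    """Serialize obj to JSON and parse it back."""
    return json.loads(json.dumps(obj))


print(roundtrip((1, 2)))      # the tuple comes back as the list [1, 2]
print(roundtrip({1: "one"}))  # the integer key comes back as the string '1'
```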

In some instances, however, we may want to admit compressed or uncompressed input, as the case may be. Therefore, in the decompression function we first check if the input is in the compressed format. If it is not, a RuntimeError is raised, unless this error is explicitly suppressed by passing insist=False, in which case the input object is returned unmodified.
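A minimal sketch of a decompression function with this behavior. The name json_unzip and the exact format check (a dictionary containing the key 'base64(zip(o))') are assumptions made here and may differ from the linked code:

```python
import json
import zlib
from base64 import b64decode, b64encode

ZIPJSON_KEY = "base64(zip(o))"


def json_unzip(obj, insist=True):
    """Decompress obj if it is in the compressed format.

    Otherwise raise RuntimeError, or return obj unmodified if insist=False.
    """
    # Assumed format check: a dict that carries the payload under ZIPJSON_KEY
    if not (isinstance(obj, dict) and ZIPJSON_KEY in obj):
        if insist:
            raise RuntimeError("Input is not in the compressed format")
        return obj
    return json.loads(zlib.decompress(b64decode(obj[ZIPJSON_KEY])))


# Build a compressed object by hand, then decompress it
packed = {ZIPJSON_KEY: b64encode(zlib.compress(b'{"a": "A"}')).decode("ascii")}
print(json_unzip(packed))                # the original dictionary
print(json_unzip([1, 2], insist=False))  # uncompressed input passes through
```

With the default insist=True, calling json_unzip([1, 2]) raises a RuntimeError instead.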

JavaScript

To compress/decompress client-side, check out this StackOverflow post.

The complete Python module with unit tests
