Loading JSON formatted strings into Google BigQuery (Python3)

Let’s say you have a list of 100 nested objects that you want to write to a Google BigQuery table. Using a custom JSON encoder, it is relatively straightforward to serialize these objects into a list of JSON formatted strings:

import json
json_strs = [json.dumps(o, cls=ComplexEncoder) for o in nested_objs]
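
For illustration, a custom encoder might look something like the sketch below; the types it needs to handle depend entirely on your own objects:

import datetime
import decimal
import json

class ComplexEncoder(json.JSONEncoder):
    """Hypothetical encoder handling a few types json cannot serialize natively."""
    def default(self, obj):
        if isinstance(obj, (datetime.datetime, datetime.date)):
            return obj.isoformat()
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        if hasattr(obj, "__dict__"):
            return obj.__dict__  # fall back to the object's attributes
        return super().default(obj)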

The Python Client for Google BigQuery offers the insert_rows_json method, which seems like a good fit but requires values to be JSON-compatible representations. This means deserializing our freshly serialized JSON formatted strings:

import json
rows_json = [json.loads(s) for s in json_strs]
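
For illustration, a minimal sketch of streaming these rows with insert_rows_json could look like this (assuming the destination table already exists; "dataset_id" and "table_id" are placeholders, as in the full example further below):

from google.cloud import bigquery as bq

client = bq.Client()
# Fetch the destination table (placeholder dataset and table IDs)
table = client.get_table(client.dataset("dataset_id").table("table_id"))
# Returns a list of per-row errors; an empty list means all rows were inserted
errors = client.insert_rows_json(table, rows_json)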

Nothing wrong with this, of course, but perhaps there is another way to write the JSON formatted strings to BigQuery directly.

There is plenty of general documentation on loading JSON data into BigQuery and some specific guidance on loading newline-delimited JSON data. Let’s represent our own data as a newline-separated concatenation of JSON formatted strings:

json_strs = "\n".join(json_strs)

The load_table_from_file method can be used to upload the contents of a file-like object. If there is no archival use case, we probably want to save ourselves the filesystem round trip. Since the method accepts any file-like object, we can create an in-memory I/O stream for our JSON strings using StringIO:

import io
file_obj = io.StringIO(json_strs)

Or, to combine the last two statements into a single line:

import io
file_obj = io.StringIO("\n".join(json_strs))

Now we just need to specify a BigQuery job config to set the correct source format and to make sure we append to an existing table rather than overwrite it:

from google.cloud import bigquery as bq
job_config = bq.LoadJobConfig()
job_config.write_disposition = "WRITE_APPEND"
job_config.source_format = "NEWLINE_DELIMITED_JSON"
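
As a side note, recent versions of the client library also expose these settings as constants, which avoids typos in the string literals:

job_config.write_disposition = bq.WriteDisposition.WRITE_APPEND
job_config.source_format = bq.SourceFormat.NEWLINE_DELIMITED_JSON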

Putting it all together with the load_table_from_file method looks something like this (don’t forget to authenticate your application):

import json
import io
from google.cloud import bigquery as bq
# BigQuery destination table
DATASET_ID = "dataset_id"
TABLE_ID = "table_id"
# Assuming you have a custom JSON encoder named ComplexEncoder
json_strs = [json.dumps(o, cls=ComplexEncoder) for o in nested_objs]
# Create the in-memory IO stream
file_obj = io.StringIO("\n".join(json_strs))
# Set up BigQuery
client = bq.Client()
table_ref = client.dataset(DATASET_ID).table(TABLE_ID)
job_config = bq.LoadJobConfig()
job_config.write_disposition = "WRITE_APPEND"
job_config.source_format = "NEWLINE_DELIMITED_JSON"
# Load table from our file-like object
client.load_table_from_file(
    file_obj,
    table_ref,
    job_config=job_config,
).result()
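
If the destination table does not exist yet, the load job can create it for you. One option, sketched here, is to enable schema autodetection on the job config before submitting the job (alternatively, set job_config.schema explicitly):

# Optional: let BigQuery infer the table schema from the JSON payload
job_config.autodetect = True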

It is always nice to have options in theory, but performance is probably going to matter. I ran a quick performance test writing a list of 100 JSON formatted strings to a Google BigQuery table using both methods:

  • insert_rows_json: 5.2 sec
  • load_table_from_file: 8.3 sec
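
For reference, a rough sketch of how such a comparison can be timed, reusing the names from the snippets above (your numbers will depend on network, table size and schema):

import time

start = time.perf_counter()
client.insert_rows_json(table, rows_json)
print(f"insert_rows_json: {time.perf_counter() - start:.1f} sec")

start = time.perf_counter()
client.load_table_from_file(
    file_obj,
    table_ref,
    job_config=job_config,
).result()
print(f"load_table_from_file: {time.perf_counter() - start:.1f} sec")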

So definitely use insert_rows_json where you can, but the load_table_from_file approach could still prove useful to tackle some known issues around table creation (Nov 2018).