Building a Schemaless BigQuery Data Warehouse to Manage Data Pipelines with a Dynamic Schema

Rakeshmohandas
5 min read · Aug 7, 2022


In January 2022, Google announced that BigQuery now supports the JSON data type for data storage and manipulation. This is an important addition to BigQuery because it eliminates the need for data to be in a fixed schema format when stored in BigQuery native storage, bringing BigQuery even closer to the hybrid world of SQL and NoSQL databases. It also eliminates the need for customers to create complex change-handling automation, pause streaming ingestion to allow for manual changes to the schema or data types, or write unplanned data to a catch-all String field that must later be parsed in a post-processing step.

In the next few minutes, we’ll build a streaming, schemaless pipeline to see the real benefits of having this feature in BigQuery, and how an organisation can build a very elastic and flexible schemaless data pipeline that ingests data into BigQuery and uses the power of BigQuery’s Dremel engine to run queries on this JSON-typed data.

Now, let’s look at a very simple example of how to stream semi-structured data generated by an IoT application and ingest it into BigQuery. We’ll start by creating a new table called json.json_events that holds the IoT event data pushed from Pub/Sub into BigQuery. Each event has three columns: the IoT timestamp, the IoT ID, and the IoT attributes (the IoT data). We will send data from the IoT sensors to Pub/Sub and then on into BigQuery.

We are not transforming any data in the pipeline; this is an EL (extract and load) task, bringing data from the IoT application to BigQuery in real time. The idea is that sensors are constantly being modified and new attributes are added frequently, and the downstream systems should not be impacted by these changes at the source.

There are three steps to achieve this:

1. Create a table with the schema: Let’s build the table structure for the data that will be pushed from the data pipeline. You can use the bq command-line tool or run a SQL data definition language (DDL) statement like the one below to do so.
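A minimal sketch of that DDL, assuming the three columns described earlier are named iot_timestamp, iot_id, and iot_data (the names are illustrative):

```sql
-- Illustrative column names; the JSON column holds the raw sensor attributes
CREATE TABLE json.json_events (
  iot_timestamp TIMESTAMP,  -- event time reported by the device
  iot_id        STRING,     -- device/sensor identifier
  iot_data      JSON        -- schemaless payload carrying the sensor attributes
);
```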

2. Ingestion of streaming data from Pub/Sub to BigQuery directly: With the latest release of Pub/Sub-to-BigQuery direct ingestion, you no longer need to write or run your own Dataflow pipelines for data ingestion from Pub/Sub into BigQuery. With the recently released BigQuery subscription in Pub/Sub, users can write events directly from Cloud Pub/Sub to BigQuery. In this example, we will use this integration to move data from the IoT devices to BigQuery via IoT Core and Pub/Sub.
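Once the BigQuery subscription is attached to the topic, a quick way to confirm that events are landing (using the illustrative table and column names from the sketch above) is to query the table directly:

```sql
-- Sanity-check the most recent events written by the BigQuery subscription
SELECT iot_timestamp, iot_id, iot_data
FROM json.json_events
ORDER BY iot_timestamp DESC
LIMIT 10;
```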

Sample data stream ingested from the sensor to Pub/Sub
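A reading along these lines, with purely hypothetical field names, gives an idea of the kind of payload the sensor publishes; expressed as a BigQuery JSON literal, it is what ends up in the illustrative iot_data column:

```sql
-- Hypothetical sensor payload before any location attributes were added
SELECT JSON '{"temperature": 22.4, "humidity": 61, "battery": 0.93}' AS iot_data;
```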

After ingesting the initial sensor readings, we also added location capture to the sensor. We now read the latitude and longitude from the sensor and ingest them into the same streaming pipeline, which pushes them to BigQuery directly.
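As a sketch (again with hypothetical field names), both the old and the new payload shapes land in the same JSON column without any DDL change:

```sql
-- Both payload shapes are accepted by the same JSON column; no schema change is required
INSERT INTO json.json_events (iot_timestamp, iot_id, iot_data)
VALUES
  (CURRENT_TIMESTAMP(), 'sensor-01',
   JSON '{"temperature": 22.4, "humidity": 61}'),
  (CURRENT_TIMESTAMP(), 'sensor-01',
   JSON '{"temperature": 22.6, "humidity": 60, "location": {"lat": 12.97, "lon": 77.59}}');
```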

The advantage we gain here is that our pipeline doesn’t break and data isn’t diverted to an error table; instead, it keeps flowing into BigQuery even after the schema changes.

3. Querying the JSON-typed data in BigQuery:

Now let us query this data to run our analysis.

If you observe carefully, we ran two different queries against two different JSON schemas and were still able to retrieve the information without making any edits to the table schema. We will now save this query as a view and use it in our dashboards.
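A sketch of what such a view could look like, assuming the illustrative payload fields used earlier (temperature, humidity, and the later-added location object):

```sql
-- Field access works whether or not an event carries the location attributes;
-- fields that are absent from a given payload simply come back as NULL
CREATE OR REPLACE VIEW json.v_iot_events AS
SELECT
  iot_timestamp,
  iot_id,
  FLOAT64(iot_data.temperature)  AS temperature,
  FLOAT64(iot_data.humidity)     AS humidity,
  FLOAT64(iot_data.location.lat) AS latitude,
  FLOAT64(iot_data.location.lon) AS longitude
FROM json.json_events;
```

Dashboards can then read from json.v_iot_events while the underlying JSON payload keeps evolving.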

Important note:

Native support in BigQuery storage for JSON data using the JSON data type is in preview as of August 8, 2022, and thus may not be immediately available to everyone. However, as is customary with Google, I believe the feature will soon be available everywhere.

Limitations:

The up-to-date limitations of using JSON data are given in the link; below are a few limitations that you need to be aware of before using JSON data:

1. To ingest JSON data into a table using a batch load job, the source data must be in CSV, Avro, or JSON format. Other batch load formats are incompatible.

2. External tables cannot have JSON data type columns defined.

3. The nesting limit of the JSON data type is 500.

4. Legacy SQL cannot be used to query a table containing JSON types.

5. The SQL editor in the Google Cloud console does not support auto-completion of JSON-related keywords.

6. Tables containing JSON types are not supported by Google Data Studio, so you have to use views on top of the queries to run your reports.

7. JSON columns do not support row-level access policies.

With the previous example, you saw how, by using the JSON data type, you can ingest semi-structured JSON into BigQuery without first providing a schema for the JSON data. This allows you to store and query data that does not always follow strict schemas and data types. BigQuery can encode and process each JSON field individually by ingesting JSON data as a JSON data type.

You can then use the field access operator to query the values of fields and array elements within the JSON data, making JSON queries simple and cost-effective, as in the short sketch below. Please let us know what other applications of this feature on BigQuery you have in mind; we’d love to hear your ideas.
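Here, the readings array and its element positions are purely illustrative:

```sql
-- Dot notation reaches nested fields; JSON_VALUE with a JSONPath pulls out
-- a single array element as a string (the "readings" array is hypothetical)
SELECT
  iot_id,
  iot_data.location.lat                 AS lat_as_json,    -- still a JSON value
  JSON_VALUE(iot_data, '$.readings[0]') AS first_reading   -- extracted as a STRING
FROM json.json_events;
```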

Thank you for your time, and please try this with other examples.
