Schema-on-read and Schema-on-write

Deepa Vasanthkumar
2 min read · Dec 7, 2021


In the traditional approach, we create a relational database by specifying the schema and creating tables up front. Once the schemas are configured and the tables created, we can begin to load the data. Whether we load in bulk or in small batches, we ingest data according to the configured schema. Once the data is loaded into the tables, we can execute analytical queries against them. This is schema-on-write.
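The schema-on-write flow can be sketched with SQLite (table and column names here are illustrative, not from any particular system):

```python
import sqlite3

# Schema-on-write: define the schema first; data loaded later must conform to it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        product  TEXT NOT NULL,
        amount   REAL NOT NULL
    )
""")

# Data is validated against the schema at write time.
rows = [(1, "widget", 9.99), (2, "gadget", 24.50)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Analytical queries run against the predefined structure.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

Note that a row with a missing `amount` or a non-numeric `order_id` would be rejected at insert time; that is the defining property of schema-on-write.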

Now comes a problem: we have data, but we don't know its schema. To create a table and analyze the data, we need to know the schema ahead of time, so in practice schema-on-write fails in this situation.

What if we load the files first and only then analyze their structure and characteristics? This is schema-on-read.
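As a minimal sketch of schema-on-read, consider raw JSON-lines records whose fields (the records and field names below are made up for illustration) are discovered only when we read them:

```python
import io
import json

# Schema-on-read: raw records are stored as-is; structure is discovered at query time.
raw = io.StringIO(
    '{"user": "alice", "clicks": 3}\n'
    '{"user": "bob", "clicks": 5, "referrer": "email"}\n'
)

records = [json.loads(line) for line in raw]

# The "schema" is whatever keys the records happen to contain;
# note the second record has a field the first one lacks.
inferred_fields = sorted({key for rec in records for key in rec})
print(inferred_fields)

# Queries tolerate missing fields instead of rejecting the record.
total_clicks = sum(rec.get("clicks", 0) for rec in records)
print(total_clicks)
```

Nothing was rejected at write time; the reader decides how to interpret (or skip) each field, which is exactly the flexibility schema-on-read trades for up-front validation.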

As an example, Amazon Athena is a SQL engine that runs on top of an S3 data lake. We can use Athena to run ad-hoc queries in ANSI SQL, without needing to aggregate the data or load it into Athena first. Athena can process unstructured, semi-structured, and structured data sets, including CSV, JSON, and Avro, as well as columnar formats such as Apache Parquet and Apache ORC.

Both schema-on-read and schema-on-write have their own pros and cons.

Schema-on-read provides scalability and the flexibility to apply any analytical operation, while schema-on-write helps model structured data, enabling faster reads against a predefined schema.

Schema-on-read provides much-needed flexibility to an analytical project. Not having to rely on physical schemas improves overall performance for high-volume data loads. Because there are no physical constraints on the data, it works well with datasets whose structure is undefined, and it gives users the freedom to experiment with different options.
