Schema on Demand [Read]
I am from a generation that saw the transition from TV to online video platforms. Earlier, we had to follow a schedule to watch Shaktiman or Ben10. Now we watch whenever we want, wherever we want, and however we want. This is the power of an on-demand service: we consume things when we need them.
A similar transition happened in data processing and analysis. Boom. Schema-on-read. I call it schema on demand.
Big data is stored in raw format, and the schema is created when we actually consume/read it, not while storing/writing it. While writing data, we usually don’t care about what’s coming in. In ELT (Extract, Load, Transform), data is loaded first and stored at the leaf level in an untransformed state. The schema is defined only when the data is pulled and accessed to fulfill an analysis requirement.
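To make the idea concrete, here is a minimal sketch in Python. It is not how any particular engine implements this; the `raw_store` list, the `schema` dictionary, and the `read_with_schema` helper are all made up for illustration. Records are written as raw JSON strings with no checks, and the reader decides which fields and types matter only at read time.

```python
import json

# Write side: records are stored exactly as they arrive.
# Nothing validates field names or types here.
raw_store = [
    '{"user": "asha", "age": "31", "city": "Pune"}',
    '{"user": "ravi", "age": "27"}',
    '{"user": "meera", "signup": "2020-01-05", "age": "45"}',
]

# Read side: a hypothetical schema chosen by the consumer,
# mapping field name -> function that converts the raw value.
schema = {"user": str, "age": int}


def read_with_schema(lines, schema):
    """Parse raw lines and project them onto the given schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields missing from a raw record become None instead of failing.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(raw_store, schema)

# A different consumer can apply a different schema to the same raw data.
city_rows = read_with_schema(raw_store, {"user": str, "city": str})
```

The same raw data serves both consumers, and neither schema had to exist when the data was written.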
When we store data in S3, we don’t worry about the type of data being uploaded. We treat it as an object. There is no underlying file system. Athena is a service that applies a schema on demand, at query time, to the data sitting in S3.
In the Hadoop ecosystem, Spark supports schema on demand by inferring the schema using reflection. HBase treats everything as a byte array and does not create any schema while storing data; the schema is applied on read. Kibana also uses a schema-on-read approach to let users run searches and aggregations on top of raw logs.
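The byte-array idea can be sketched in a few lines of plain Python. This does not use the HBase client; the `cell` variable and the two reader functions are hypothetical, and the big-endian layout is an assumption made for the example. The point is only that the stored bytes carry no type, and the reader picks the interpretation.

```python
import struct

# Write side: the value is stored as four opaque bytes.
# (Big-endian 32-bit layout is an assumption for this sketch.)
cell = struct.pack(">i", 1024)


def read_as_int(raw):
    """Interpret the bytes as one 32-bit signed integer."""
    return struct.unpack(">i", raw)[0]


def read_as_shorts(raw):
    """Interpret the same bytes as two 16-bit signed integers."""
    return struct.unpack(">hh", raw)


as_int = read_as_int(cell)
as_shorts = read_as_shorts(cell)
```

Both readers succeed on the same stored bytes; which one is "right" depends entirely on the schema the reader brings.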
Hope this helps in understanding schema on demand and what it means.