ORC files in OpenStreetMap
ORC (Optimized Row Columnar) is, as the name implies, a columnar file format. It was developed by Hortonworks and is now part of the larger Apache umbrella. Unlike flat files, which store data row by row, ORC stores data in columns.
The entire file is divided into three parts: header, body and footer. The body contains the actual data, grouped into sets of rows called stripes. Per the Hive documentation, the default stripe size is 250 MB. Each stripe is further divided into three sections:
- index — indexes for the stored data
- actual data
- stripe footer
Within each stripe, both the index and the actual data are stored column by column.
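To make the column-wise layout concrete, here is a minimal sketch in plain Python. It is not the ORC encoding itself: the sample rows and the `to_columns` helper are made up for illustration, and real ORC additionally compresses and encodes each column.

```python
# Toy rows standing in for OpenStreetMap nodes (hypothetical values).
rows = [
    {"id": 1, "lat": 52.52, "lon": 13.40},
    {"id": 2, "lat": 48.85, "lon": 2.35},
    {"id": 3, "lat": 51.51, "lon": -0.13},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query that needs only latitudes reads one list and ignores the rest.
max_lat = max(columns["lat"])
```

With a row-oriented layout the same query would have to walk every row and pick out one field from each.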
ORC indexes help in locating data, since they record the minimum and maximum values of each column and the row positions within each column.
The stripe footer in each stripe and the file footer hold metadata and statistics about the stripes.
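The following sketch shows why min and max values are useful: a reader can skip any chunk whose range cannot contain the value it is looking for. It is a simplified model in plain Python, not the ORC reader; `build_stripes`, `find_in_range` and the tiny stripe size are assumptions made for the example.

```python
def build_stripes(values, stripe_size):
    """Split values into fixed-size stripes and record min/max for each."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def find_in_range(stripes, low, high):
    """Return matching values and the number of stripes actually scanned."""
    hits, scanned = [], 0
    for stripe in stripes:
        # The statistics alone prove this stripe has no match, so skip it.
        if stripe["max"] < low or stripe["min"] > high:
            continue
        scanned += 1
        hits.extend(v for v in stripe["rows"] if low <= v <= high)
    return hits, scanned


node_ids = [1, 5, 9, 12, 15, 18, 21, 25, 30]  # hypothetical ids
stripes = build_stripes(node_ids, stripe_size=3)
hits, scanned = find_in_range(stripes, 13, 16)
```

Here only one of the three stripes has to be read; the other two are ruled out by their statistics.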
Why use ORC for OpenStreetMap?
ORC files allow data to be read in parallel, because each stripe can be processed independently of the others. For a large data resource like OpenStreetMap, this makes deeper analysis of the data practical.
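As a rough illustration of parallel reads, the sketch below treats each stripe as an independent unit of work and combines the partial results at the end. The stripes are small in-memory lists invented for the example, and the thread pool merely stands in for the workers a real query engine would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the element-type column of one stripe
# (hypothetical data).
stripes = [
    ["node", "way", "node"],
    ["relation", "node", "way"],
    ["way", "way", "node"],
]


def count_ways(stripe):
    """Count the 'way' elements in a single stripe."""
    return sum(1 for element_type in stripe if element_type == "way")


# Every stripe is handled by its own task; no task needs another's data.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_ways, stripes))

total_ways = sum(partial_counts)
```

Because no stripe depends on another, adding more workers only changes how the work is scheduled, not the result.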
Recently, OpenStreetMap data has been made publicly available on Amazon S3 in OSM file format. Seth Fitzsimmons has written about how to deploy the entire dataset on Amazon Athena and run custom queries. One critical step for deployment on Athena is converting the file to ORC format. Athena then runs SQL queries against the dataset in parallel to produce the desired results.
References:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC