ORC files in OpenStreetMap

Ramya Ragupathy
2 min read · May 19, 2018


ORC (Optimized Row Columnar) is, as the name implies, a columnar file format. It was developed by Hortonworks and is now part of the larger Apache umbrella. Unlike flat files, which store data row by row, ORC stores data in columns.
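To make the row versus column distinction concrete, here is a toy sketch in plain Python. It is not the ORC encoding itself (real ORC also encodes and compresses each column), and the field names are hypothetical, chosen only to resemble map data.

```python
# Toy illustration only: field names are hypothetical, and real ORC
# additionally encodes and compresses every column.
rows = [
    {"id": 1, "type": "node", "lat": 12.97, "lon": 77.59},
    {"id": 2, "type": "node", "lat": 13.08, "lon": 80.27},
    {"id": 3, "type": "way", "lat": None, "lon": None},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record has to be visited to collect a single field.
types_from_rows = [record["type"] for record in rows]

# Column layout: the field is already stored together as one list.
types_from_columns = columns["type"]
```

A query that needs only the type field can read that one list and leave the other columns alone, which is the main appeal of a columnar layout.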

The entire file is divided into 3 parts: header, body and footer. The body contains the actual data, split into groups of rows called stripes. Each stripe is 250 MB in size by default and is further divided into 3 sections:

  • index — indexes for the stored data
  • actual data
  • stripe footer

As shown here, both the index and the actual data are stored in columns.

ORC File Structure

ORC indexes help in locating data, since they contain the minimum and maximum values of each column and the row positions of data within each column.
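The idea of using min and max values to locate data can be sketched in a few lines of Python. This is a simplified model, not the actual ORC index format: the IndexEntry class, the chunk size and the sample changeset values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical index entry: value range and row position of one chunk."""
    min_value: int
    max_value: int
    first_row: int
    row_count: int


def build_index(values, chunk_size):
    """Record min, max and row position for each chunk of a column."""
    index = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        index.append(IndexEntry(min(chunk), max(chunk), start, len(chunk)))
    return index


def find_rows(values, index, target):
    """Return matching row numbers and how many chunks had to be scanned."""
    matches = []
    scanned = 0
    for entry in index:
        if target < entry.min_value or target > entry.max_value:
            continue  # the range rules this chunk out, so its values are never read
        scanned += 1
        for row in range(entry.first_row, entry.first_row + entry.row_count):
            if values[row] == target:
                matches.append(row)
    return matches, scanned


# Made-up column of changeset ids, indexed in chunks of 4 rows.
changesets = [10, 11, 12, 13, 50, 51, 52, 53, 90, 91, 92, 93]
index = build_index(changesets, chunk_size=4)
rows, chunks_scanned = find_rows(changesets, index, target=52)
```

Here only one of the three chunks is scanned, because the other two have value ranges that cannot contain the target.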

The stripe footer in each stripe and the file footer hold metadata and statistics on the stripes.

Why use ORC for OpenStreetMap?

ORC files allow data to be read in parallel. For a large data resource like OpenStreetMap, this makes deeper analysis of the data practical.
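As a rough sketch of what reading in parallel can look like, the Python below splits a column into stripe-like chunks and counts matches in each chunk on a separate worker. It assumes stripes are the unit of parallel work, and the function names, stripe size and sample values are hypothetical; it does not use an ORC library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_stripes(values, stripe_rows):
    """Cut a column into fixed-size chunks standing in for stripes."""
    return [values[i:i + stripe_rows] for i in range(0, len(values), stripe_rows)]


def count_in_stripe(stripe, wanted):
    """Count occurrences of one value inside a single stripe."""
    return sum(1 for value in stripe if value == wanted)


def parallel_count(values, wanted, stripe_rows=3, workers=4):
    """Count a value across all stripes, one task per stripe."""
    stripes = split_into_stripes(values, stripe_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda s: count_in_stripe(s, wanted), stripes)
        # Each stripe is processed independently; only the totals are combined.
        return sum(partial_counts)


# Made-up column of element types.
element_types = ["node", "way", "node", "relation", "node", "way", "node", "node"]
node_count = parallel_count(element_types, "node")
```

Because no stripe depends on another, the per-stripe work can be handed to as many workers as are available, and only the small partial results need to be merged at the end.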

OpenStreetMap data was recently made publicly available on Amazon S3 in OSM file format. Seth Fitzsimmons has written about how to deploy the entire dataset on Amazon Athena and run custom queries. One critical step for deploying on Athena is converting the file to ORC format. Amazon Athena then runs SQL queries in parallel against the dataset to return the desired results.

References:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

