Hive has this wonderful feature called partitioning, a way of dividing a table into related parts based on the values of certain columns. Partitions make it easy to query just a portion of the data: when a query filters on a partition column, Hive can read only the matching partitions instead of scanning the whole table.
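To make the idea concrete, here is a minimal Python sketch of how partitioning narrows down what gets read. It is only a model of the concept, not Hive's actual implementation. It relies on Hive's convention of storing each partition in a `column=value` subdirectory; the table root `/warehouse/events` and the partition columns `dt` and `country` are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def partition_path(table_root: str, spec: Dict[str, str]) -> str:
    """Build the directory for one partition using the column=value layout."""
    parts = [f"{col}={val}" for col, val in spec.items()]
    return "/".join([table_root.rstrip("/")] + parts)


def prune(partitions: List[Dict[str, str]], wanted: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep only the partitions whose values match every filter column."""
    return [
        p for p in partitions
        if all(p.get(col) == val for col, val in wanted.items())
    ]


# Hypothetical partitions of an "events" table, partitioned by dt and country.
partitions = [
    {"dt": "2015-01-01", "country": "US"},
    {"dt": "2015-01-01", "country": "IN"},
    {"dt": "2015-01-02", "country": "US"},
]

# A filter on dt alone leaves two of the three directories to scan.
selected = prune(partitions, {"dt": "2015-01-01"})
paths = [partition_path("/warehouse/events", p) for p in selected]

for path in paths:
    print(path)
```

A query that filters on `dt` never has to touch the `dt=2015-01-02` directory, which is where the saving comes from.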
AWS Data Pipeline does a fine job of scheduling data processing activities. It spawns an EMR cluster and executes a Hive script when the data becomes available. After all the jobs have completed, the pipeline shuts down the EMR resource and exits. Since the cluster is only created and…