Partitioning & Bucketing
Data organization plays a crucial role in query performance. Whether your warehouse follows a star schema or a denormalized model (preferred), partitioning and bucketing help improve query performance.
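As a minimal sketch, a table partitioned by date and bucketed on a user id might be declared as below (the table and column names are illustrative, not from the original text):

```sql
-- Hypothetical table: partitioned by load date, bucketed on user_id.
-- Partitioning lets queries prune whole directories; bucketing hashes
-- rows into a fixed number of files, which also enables bucketed joins.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

A query with `WHERE dt = '2020-01-01'` then reads only that partition's directory instead of scanning the whole table.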
Optimized Row Columnar (ORC) vs Parquet
Columnar file formats allow you to reduce read operations in queries by letting each column be accessed individually. Both ORC and Parquet are good choices for different use cases; run your own benchmarks and zero in on the best file format. The Hortonworks distribution leans towards ORC while CDH leans towards Parquet. As a rule of thumb, start with ORC.
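The storage format is chosen per table at creation time; a minimal sketch (table names are illustrative):

```sql
-- The same schema stored in each columnar format, for benchmarking.
CREATE TABLE sales_orc     (id BIGINT, amount DOUBLE) STORED AS ORC;
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE) STORED AS PARQUET;
```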
Tez
Tez is a data processing framework that represents each MapReduce-style task as a node in a single DAG, so a query runs as one job with its stages pipelined, rather than as a chain of separate MapReduce jobs with intermediate results serialized in between.
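Switching Hive to the Tez engine is a one-line setting:

```sql
SET hive.execution.engine=tez;
```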
Vectorized query execution
A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with very few function calls. So if your table has 5,000 rows and you run a string-based filter on some column, you will see INPUT_RECORDS_PROCESSED reported as only 5, one per 1024-row block.
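Vectorization is controlled by session settings; a minimal sketch:

```sql
SET hive.vectorized.execution.enabled=true;         -- vectorize the map side
SET hive.vectorized.execution.reduce.enabled=true;  -- vectorize the reduce side too
```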
Predicate push down
The basic idea of predicate pushdown is that the predicates of a SQL query can be pushed closer to where the data lives, meaning we can apply them while reading the data rather than after it has been loaded. In simpler terms, the optimizer tries to evaluate each filter expression as early as possible in the plan.
select a.*, b.* from a join b on (a.col1 = b.col1)
where a.col1 > 20 and b.col2 > 40
In the above case, if predicate pushdown is enabled, the filters on both tables are applied first and the join is then performed on the smaller data sets. Without predicate pushdown, the two tables are joined first and the rows are filtered by the WHERE condition afterwards. It should arguably have been called predicate pushup, but anyway.
Note: predicate pushdown is also available for ORC and Parquet files, and in Spark and Redshift Spectrum.
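In Hive, predicate pushdown is on by default in recent versions; the settings below control it explicitly:

```sql
SET hive.optimize.ppd=true;          -- push predicates down in the query plan
SET hive.optimize.ppd.storage=true;  -- push predicates to the storage handler/reader
```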
Use a single scan
FROM source_table
INSERT INTO TABLE target_table_1 SELECT * WHERE attribute1 = 'XYZ'
INSERT INTO TABLE target_table_2 SELECT * WHERE attribute1 = 'PQR';
In the above multi-insert statement the source table is read only once; with two separate INSERT statements it would be read twice.
Enable compression of intermediate data
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Hive jobs invoke many MapReduce stages and generate a lot of intermediate data; setting the above parameter makes Hive compress that intermediate data before writing it out.
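The codec setting takes effect together with the flag that turns intermediate compression on; a minimal sketch:

```sql
SET hive.exec.compress.intermediate=true;  -- enable compression of intermediate data
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```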
Map joins
If the smaller table in a join fits into memory, map joins are very efficient: the small table is loaded into memory on every mapper and the join happens without a shuffle. If both tables are bucketed and sorted on the join key, an SMB (sort-merge-bucket) join should be converted to an SMB map join.
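The common session settings for these conversions are sketched below; size thresholds are illustrative and should be tuned:

```sql
SET hive.auto.convert.join=true;                   -- auto-convert to map join when the small table fits
SET hive.auto.convert.join.noconditionaltask.size=10000000;  -- small-table size threshold, in bytes
SET hive.optimize.bucketmapjoin=true;              -- allow bucket map joins on bucketed tables
SET hive.optimize.bucketmapjoin.sortedmerge=true;  -- convert SMB join to SMB map join
```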
Skewed tables
CREATE TABLE <table_name> (col1 STRING, col2 STRING)
SKEWED BY (col1) ON (col_val1, col_val2, col_val3)
[STORED AS DIRECTORIES];
We can create a table as skewed, and Hive will automatically split it into separate files (or directories, in the case of list bucketing) and take this into account during queries, so that it can skip or include a whole file or directory where possible.
If your table is skewed and SKEWED BY was supplied at creation, leverage the skew join optimization: it splits the join in two, with one set of mappers handling the skewed keys and another handling the rest.
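Skew join handling is enabled per session; the key threshold below is illustrative:

```sql
SET hive.optimize.skewjoin=true;  -- process skewed keys in a separate map-join pass
SET hive.skewjoin.key=100000;     -- row count per key above which a key counts as skewed
```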
Cost-based optimization
The CBO generates an efficient execution plan by analyzing the query: it converts the query into an operator tree, assigns a cost to each operator, and finally selects the plan with the lowest cost. For it to work well, you should generate column statistics for your tables.
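A minimal sketch of enabling the CBO and gathering the statistics it relies on (the table name is illustrative, and exact property names can vary across Hive versions):

```sql
SET hive.cbo.enable=true;                   -- turn on the cost-based optimizer
SET hive.stats.fetch.column.stats=true;     -- let the CBO read column statistics
SET hive.stats.fetch.partition.stats=true;  -- let the CBO read partition statistics

-- Gather table- and column-level statistics:
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```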