Performance Tuning tips for Bigdata Tables in HQL/IMPALA

Ratnark Gandhi
4 min read · Jun 26, 2022


We have all had our patience tested while running a series of aggregations and window functions on big tables. Sometimes a query can take double-digit minutes, and in worse scenarios the cluster can run out of memory. Let’s look at a few optimization techniques to keep in mind while declaring big data tables:

Partitioning

· Partitioning divides the data physically during loading.

· It could be based on values from one or more columns.

· Partitioning speeds up queries that filter on the columns included in the PARTITIONED BY clause.

For example, with a school_records table partitioned on a year column, there is a separate data directory for each different year value, and all the data for that year is stored in a data file in that directory. A query that includes a WHERE condition such as YEAR=1966, YEAR IN (1989,1999), or YEAR BETWEEN 1984 AND 1989 can examine only the data files from the appropriate directory or directories, greatly reducing the amount of data to read and test and thus increasing the efficiency.

· Syntax: CREATE TABLE census (name STRING) PARTITIONED BY (year SMALLINT);
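
To make this concrete, here is a minimal sketch around the school_records example above; the extra columns (name, grade) and the sample values are assumptions for illustration, not part of the original example.

-- Partitioned table: one HDFS directory per distinct year value
CREATE TABLE school_records (name STRING, grade INT)
PARTITIONED BY (year SMALLINT)
STORED AS PARQUET;

-- Load rows into a specific partition (static partitioning)
INSERT INTO school_records PARTITION (year=1989)
VALUES ('Jane Doe', 10);

-- Only the directories for year=1984..1989 are scanned (partition pruning)
SELECT COUNT(*) FROM school_records WHERE year BETWEEN 1984 AND 1989;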

Bucketing

· In Hive, bucketing is the division of data into ranges, or buckets, to give the data additional structure and allow for more efficient querying.

· Although bucketing and partitioning both separate and store data, there are some significant differences between the two. In partitioning, data is grouped by the specific values of the partition column, which is typically a column whose values repeat throughout the dataset. Bucketing, by contrast, distributes a column's values into a fixed number of buckets and is usually applied to primary keys or other largely non-repeated values.

· Partitioning is most effective when the partition column has a limited number of distinct values; otherwise there is a risk of creating too many little partitions and directories. Additionally, bucketing produces roughly equal amounts of data in each bucket, which makes map-side joins speedier.

· Syntax:

CREATE TABLE IF NOT EXISTS bucketsample
(trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP)
CLUSTERED BY (trip_id)
INTO 20 BUCKETS
STORED AS PARQUET;

In this example, the bucketing column (trip_id) is specified by the CLUSTERED BY (trip_id) clause, and the number of buckets (20) is specified by the INTO 20 BUCKETS clause.

· Bucketing is helpful when working with huge datasets that need to be divided into clusters for more effective management and for running join queries against other large datasets. The main use case is joining two sizable datasets under resource limitations such as memory limits, as sketched below.
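
To make the join use case concrete, here is a sketch in Hive that joins bucketsample against a second table bucketed on the same key and bucket count; the table name trip_details and the session setting shown are assumptions for illustration.

-- Hypothetical second table, bucketed on the same key into the same number of buckets
CREATE TABLE IF NOT EXISTS trip_details
(trip_id INT, fare DOUBLE)
CLUSTERED BY (trip_id)
INTO 20 BUCKETS
STORED AS PARQUET;

-- Ask Hive to use a bucket map join, since both sides are bucketed on the join key
SET hive.optimize.bucketmapjoin = true;

SELECT s.vendor_id, d.fare
FROM bucketsample s
JOIN trip_details d ON s.trip_id = d.trip_id;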

Compression

· Impala supports several familiar file formats, and each of them can have significant performance consequences.

· Some file formats include compression support that affects the size of data on the disk and, consequently, the amount of I/O and CPU resources required to deserialize data.

· The amounts of I/O and CPU resources required can be a limiting factor in query performance since querying often begins with moving and decompressing data. To reduce the potential impact of this part of the process, data is often compressed.

· By compressing data, a smaller total number of bytes are transferred from disk to memory. This reduces the amount of time taken to transfer the data, but a trade-off occurs when the CPU decompresses the content.

· Syntax: CREATE TABLE table_name (x INT, y STRING) STORED AS PARQUET;
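
For example, Impala lets you pick the Parquet compression codec per session through the COMPRESSION_CODEC query option; the staging_table name below is a hypothetical source for the insert.

-- Write Snappy-compressed Parquet files (other accepted values include NONE and GZIP)
SET COMPRESSION_CODEC=snappy;

CREATE TABLE table_name (x INT, y STRING) STORED AS PARQUET;
INSERT INTO table_name SELECT x, y FROM staging_table;

Snappy is a common middle ground here: it costs relatively little CPU to decompress while still cutting the number of bytes read from disk.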

Refresh Statement

· To accurately respond to queries, Impala must have current metadata about the databases and tables that are referenced in its queries.

· Generally, a table created and filled inside Impala already has its metadata stored in the metastore, but the REFRESH statement comes into play when files are loaded or inserted via an external platform.

· This statement should be used after loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs to be refreshed.)

· It should also be issued after an ALTER TABLE, INSERT, LOAD DATA, or other table-modifying SQL statement is run in Hive.

· Note: In Impala 2.3 and higher, the syntax ALTER TABLE table_name RECOVER PARTITIONS is a faster alternative to REFRESH when the only change to the table data is the addition of new partition directories through Hive or manual HDFS operations.

· Syntax: REFRESH [db_name.]table_name
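
A typical sequence looks like the sketch below; the database and table names (sales_db.transactions) are hypothetical.

-- New files were added outside Impala, e.g. by Hive or a manual HDFS put.
-- Tell Impala to reload the file and block metadata for just this table:
REFRESH sales_db.transactions;

-- If the only change was new partition directories (Impala 2.3+), this is faster:
ALTER TABLE sales_db.transactions RECOVER PARTITIONS;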

Compute Stats

· The COMPUTE STATS statement gathers information about volume and distribution of data in a table and all associated columns and partitions.

· The information is stored in the metastore database and used by Impala to help optimize queries.

· For example, if Impala can determine that a table is large or small or has many or few distinct values it can organize and parallelize the work appropriately for a join query or insert operation.

· Syntax: COMPUTE STATS [db_name.]table_name
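
For example, using the same hypothetical table name as above, you would typically run COMPUTE STATS after each sizeable load and then inspect what the planner knows:

-- Gather table, partition, and column statistics for the query planner
COMPUTE STATS sales_db.transactions;

-- Inspect the collected statistics
SHOW TABLE STATS sales_db.transactions;
SHOW COLUMN STATS sales_db.transactions;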

Memory Limit

· You can limit the amount of memory available to Impala.

· You can specify the memory limit using absolute notation such as 500m or 2G, or as a percentage of physical memory such as 60%.

· For example, to allow Impala to use no more than 70% of system memory, change

export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}}

to:

export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}
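
The limit can also be set per session from impala-shell with the MEM_LIMIT query option instead of restarting the daemons; the 2g value below is just an example.

-- Cap per-node memory for queries run in this session at 2 GB
SET MEM_LIMIT=2g;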

If you have read this article to this point, I hope the techniques above were helpful. Most of this information comes from https://impala.apache.org/docs/, and for more details on performance tuning see https://impala.apache.org/docs/build/html/topics/impala_performance.html.

Thanks, have a great day.
