From Compression to Query Pruning: Exploring Snowflake Micro-Partitions

Saisuman Singamsetty
4 min read · Jul 6, 2024


A micro-partition is a file stored in the Snowflake storage service.

Snowflake Micro-partitions

Micro-partitions use a proprietary, closed-source file format created by Snowflake.

Each micro-partition contains a header of metadata describing the data it stores; the actual data is grouped by column and stored in a compressed format.

Micro-Partition Anatomy

Groups of rows in a table are mapped to individual micro-partitions and organized in columnar fashion: within each micro-partition, data is stored and compressed column by column rather than row by row.

Data organization in micro-partitions

This lets Snowflake scan the values of a column without touching the rest of the data in each row. The size and structure of micro-partitions make query processing both efficient and easy to optimize. Micro-partitioning is performed automatically on all Snowflake tables.

Columnar storage in Snowflake
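A small sketch can show why grouping values by column pays off. The layout below is purely illustrative (it is not Snowflake's actual file format): because a column often holds many similar or repeated values, storing it contiguously produces long repetitive runs that a general-purpose compressor handles far better than interleaved row data.

```python
import zlib

# Illustrative rows (id, country, status): the non-id columns have very low
# cardinality, so a column-wise layout yields highly repetitive byte runs.
rows = [(i, "US" if i % 2 else "DE", "active") for i in range(10_000)]

# Row-oriented layout: values interleaved record by record.
row_bytes = "|".join(f"{i},{c},{s}" for i, c, s in rows).encode()

# Column-oriented layout (as in a micro-partition): each column contiguous.
cols = list(zip(*rows))
col_bytes = "|".join(",".join(map(str, col)) for col in cols).encode()

print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```

On data like this, the columnar layout compresses to a fraction of the row layout's size, which is one reason micro-partitions store data by column.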

A single micro-partition can contain up to 16 megabytes of compressed data, which typically corresponds to between 50 and 500 megabytes uncompressed. There is no limit to the number of micro-partitions a table can have.

Compressed and Uncompressed data storage in Snowflake

Small tables with less than 500 megabytes of uncompressed data may consist of a single micro-partition, while for large tables Snowflake can create thousands or even millions of micro-partitions. Because micro-partitions are immutable, newly arriving data never modifies existing micro-partitions; instead, Snowflake adds new micro-partitions as data arrives.
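The append-only behavior described above can be sketched in a few lines. The class and method names here are hypothetical, not Snowflake APIs; the point is only that an insert creates new immutable partitions rather than rewriting old ones.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen mirrors the immutability of a written partition
class MicroPartition:
    rows: tuple

@dataclass
class Table:
    partitions: list = field(default_factory=list)

    def insert(self, rows, max_rows_per_partition=3):
        # Chunk incoming rows into brand-new partitions; never touch old ones.
        for i in range(0, len(rows), max_rows_per_partition):
            chunk = tuple(rows[i:i + max_rows_per_partition])
            self.partitions.append(MicroPartition(chunk))

t = Table()
t.insert([1, 2, 3, 4, 5])        # creates two partitions
before = list(t.partitions)
t.insert([6, 7])                 # appends a third; the first two are untouched
assert t.partitions[:2] == before
```

This append-only design is also what makes features like time travel cheap: old partitions remain valid snapshots until they age out.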

Unlike tables in traditional data warehouses or databases, which typically have a relatively small number of partitions, a table in Snowflake can easily have millions of partitions.

Traditional databases vs Snowflake

The micro-partition header stores the byte ranges of each column within the micro-partition, allowing Snowflake to retrieve only the columns relevant to a query using a byte-range get. This is why queries run faster when you reduce the number of columns selected.

Micro-partition header
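A toy file format can make the byte-range idea concrete. The layout below is an assumption for illustration (a length-prefixed JSON header followed by column blobs), not Snowflake's real format: the header records each column's offset and length, so a reader can fetch just the bytes of the columns a query selects.

```python
import json
import struct

# Hypothetical layout: [4-byte header length][JSON header][column data blobs].
def write_partition(columns: dict) -> bytes:
    body, ranges, offset = b"", {}, 0
    for name, data in columns.items():
        ranges[name] = (offset, len(data))  # per-column byte range
        body += data
        offset += len(data)
    header = json.dumps(ranges).encode()
    return struct.pack(">I", len(header)) + header + body

def read_column(buf: bytes, name: str) -> bytes:
    # Read the header, then fetch only the requested column's byte range.
    hlen = struct.unpack(">I", buf[:4])[0]
    ranges = json.loads(buf[4:4 + hlen])
    off, length = ranges[name]
    start = 4 + hlen + off
    return buf[start:start + length]

part = write_partition({"id": b"123", "name": b"ab,cd", "city": b"NY,LA"})
print(read_column(part, "name"))  # touches only "name"'s byte range
```

With the partition in object storage, `read_column` would translate into a single ranged GET request, skipping every unselected column entirely.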

In simpler terms, if a query specifies a filter predicate that matches only 30% of the values in a range, Snowflake should ideally scan only the micro-partitions containing that data, in this case roughly 30% of the values.

Query pruning in Snowflake

Micro-partitions are derived automatically as data is ingested into Snowflake; users never need to define them upfront or maintain them.

Snowflake stores metadata about each column of every micro-partition in a metadata cache in the cloud services layer.

Column metadata stored in Cloud services layer

This metadata is used to return results for basic analytical queries extremely quickly. Micro-partitions also allow users to perform highly efficient DML and fine-grained pruning on very large tables, which can be composed of millions or even hundreds of millions of micro-partitions.

Efficient DML and fine-grained pruning on very large tables

Crucially, the min and max values recorded for each column are what give Snowflake the ability to perform pruning.

min and max values used by Snowflake to perform pruning
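The pruning check itself is simple. This sketch (with a hypothetical in-memory structure standing in for the metadata cache) compares a filter's range against each partition's stored min/max bounds and skips any partition that provably cannot contain a match, so only overlapping partitions are actually scanned.

```python
# Each entry pairs a partition's rows with its per-column (min, max) metadata.
partitions = [
    {"rows": [(1, "a"), (5, "b")],   "meta": {"id": (1, 5)}},
    {"rows": [(6, "c"), (9, "d")],   "meta": {"id": (6, 9)}},
    {"rows": [(10, "e"), (20, "f")], "meta": {"id": (10, 20)}},
]

def scan(parts, column, lo, hi):
    scanned, hits = 0, []
    for p in parts:
        cmin, cmax = p["meta"][column]
        if cmax < lo or cmin > hi:   # bounds prove no row can match: prune
            continue
        scanned += 1                  # only overlapping partitions are read
        hits += [r for r in p["rows"] if lo <= r[0] <= hi]
    return scanned, hits

scanned, hits = scan(partitions, "id", 7, 9)
print(scanned, hits)  # only 1 of 3 partitions is scanned
```

Because the min/max check runs against cached metadata in the cloud services layer, partitions are eliminated before any data files are ever read.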
