Efficient Data Storage — Key to Gold: Part 2

Yogesh
7 min read · Mar 20, 2024


For the background on this story, see Part 1 of this series.

In this part, we explore whether sorting the data on specific columns, as suggested in the referenced story, can deliver significant storage savings with zstd.

Experiment Continued

Throughout the next steps, I will be using Parquet with zstd compression at the default compression level (CL) of 3, and continue with the TPC-H generated CSV files.
In Rust, stored as Parquet with zstd (CL 3), the file size is 184,214,037 bytes.
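
For reference, a minimal pandas/PyArrow sketch of an equivalent baseline write (the numbers above come from Rust/Polars, so the exact byte count will differ; this assumes the TPC-H lineitem.tbl file with the '|' separator, and that pandas forwards compression_level to PyArrow's Parquet writer):

import os
import pandas as pd

# Read the TPC-H lineitem table: '|' separated, no header; the trailing '|'
# produces an extra empty column, which we drop.
df = pd.read_csv('lineitem.tbl', sep='|', header=None)
df = df[list(df.columns)[:-1]]

# Write Parquet with zstd at compression level 3 (the default used in this story)
path = 'baseline.parquet'
df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=3)
print(f"Baseline file size: {os.path.getsize(path)}")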

Here are the column details from the spec for the lineitem table.

LINEITEM Table Layout: 16 columns
----------------------
Column Name Datatype Requirements Comment
----------- --------------------- -------
L_ORDERKEY identifier Foreign Key to O_ORDERKEY
L_PARTKEY identifier Foreign key to P_PARTKEY, first part of the compound Foreign Key to (PS_PARTKEY, PS_SUPPKEY) with L_SUPPKEY
L_SUPPKEY Identifier Foreign key to S_SUPPKEY, second part of the compound Foreign Key to (PS_PARTKEY, PS_SUPPKEY) with L_PARTKEY
L_LINENUMBER integer
L_QUANTITY decimal
L_EXTENDEDPRICE decimal
L_DISCOUNT decimal
L_TAX decimal
L_RETURNFLAG fixed text, size 1
L_LINESTATUS fixed text, size 1
L_SHIPDATE date
L_COMMITDATE date
L_RECEIPTDATE date
L_SHIPINSTRUCT fixed text, size 25
L_SHIPMODE fixed text, size 10
L_COMMENT variable text size 44
Primary Key: L_ORDERKEY, L_LINENUMBER
  1. I decided to sort by all columns, and also by some leading subsets of columns, as listed below (columns 1, 2, 3, and so on).
all_columns
["column_1", "column_2","column_3", "column_4", "column_5","column_6", "column_7"]
["column_1", "column_2","column_3", "column_4"]
// with all columns
let all_cols = df.get_column_names();
let sort_order = vec![false; all_cols.len()];
let mut odf = df.sort(&all_cols, sort_order, false).unwrap();
println!("Total Rows:{:?}", odf.shape());
let mut file = File::create("sorted.tbl").unwrap();
let fsize = ParquetWriter::new(&mut file).finish(&mut odf).unwrap();
println!("Parquet Write Zstd size: {}", fsize);
Parquet Write Zstd  size: 185959838

Yes, straight sorting by columns 1, 2, 3, etc. increased the file size by 991,580 bytes (0.536%) compared to the default size (184,968,258 bytes).

2. Next, I decided to sort by the primary key columns first, and then some other columns. Some of the sort orders used were:

["column_1", "column_4"]
["column_1", "column_4", "column_2", "column_3"]
["column_1", "column_4", "column_2", "column_3", "column_5","column_6", "column_7"]
["column_1", "column_4", "column_9", "column_10"]
Parquet Write Zstd  size: 184968258

It looks like the primary keys (PKs) have high cardinality and give the best possible file size, matching the default file size. Let’s explore further.

3. I looked at the cardinality of all columns, because according to the story it is the key criterion in selecting the sort order. I ignored the last 3 text columns when getting the unique counts (a sketch of how such counts can be computed follows the table).

Column Name  nunique
----------- -------
column_1 1500000
column_2 200000
column_3 10000
column_4 7
column_5 50
column_6 933900
column_7 11
column_8 9
column_9 3
column_10 2
column_11 2526
column_12 2466
column_13 2554
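
The unique counts above can be reproduced with a minimal pandas sketch along these lines (same lineitem.tbl file, trailing empty column dropped, last 3 text columns ignored):

import pandas as pd

df = pd.read_csv('lineitem.tbl', sep='|', header=None)
df = df[list(df.columns)[:-1]]          # drop the trailing empty column

# Unique counts for every column except the last 3 text columns
for c in list(df.columns)[:-3]:
    print(f"column_{c + 1}: {df[c].nunique()}")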

Looking at the cardinalities, PK1 (column_1: L_ORDERKEY) has high cardinality, while PK2 (column_4: L_LINENUMBER) has low cardinality. I tested sorting with different combinations of columns.

a) I ran the test with Pandas/PyArrow

import os
import numpy as np
import pandas as pd

df = pd.read_csv('lineitem.tbl', sep='|', header=None)
path = 'data.parquet'
# Drop the trailing empty column produced by the trailing '|' separator
df[list(df.columns)[:-1]].to_parquet(path, engine='pyarrow', compression='zstd')
print(f"File size: {os.path.getsize(path)}")
File size: 168436855

Now, the resulting file sizes with different sort orders.

path = 'data-sorted.parquet'
for cols in ([1, 6], [1, 13], [1, 6, 2, 3], [1, 6, 4, 2, 3], [2, 1, 6],
             [6, 1, 4, 2, 3], [6, 2, 1, 3], [6, 2], [6, 2, 1], [6, 1, 2, 3],
             [6, 1, 3], [6, 13], [6, 3], [2, 6], [2, 6, 3],
             [2, 6, 3, 1], [2, 6, 1, 3]):
    # 1-based column numbers -> 0-based pandas column labels
    sdf = df[list(df.columns)[:-1]].sort_values(by=list(np.array(cols) - 1))
    sdf.to_parquet(path, engine='pyarrow', compression='zstd')
    print(f"Sorted {cols} - File size: {os.path.getsize(path)}")
Sorted [1, 6] - File size: 176012283
Sorted [1, 13] - File size: 175904840
Sorted [1, 6, 2, 3] - File size: 176012298
Sorted [1, 6, 4, 2, 3] - File size: 176012283
Sorted [2, 1, 6] - File size: 178486162
Sorted [6, 1, 4, 2, 3] - File size: 169493721
Sorted [6, 2, 1, 3] - File size: 167551595
Sorted [6, 2] - File size: 167551595
Sorted [6, 2, 1] - File size: 167551595
Sorted [6, 1, 2, 3] - File size: 169493715
Sorted [6, 1, 3] - File size: 169493717
Sorted [6, 13] - File size: 169312180
Sorted [6, 3] - File size: 169394919
Sorted [2, 6] - File size: 176381828
Sorted [2, 6, 3] - File size: 176366768
Sorted [2, 6, 3, 1] - File size: 176366768
Sorted [2, 6, 1, 3] - File size: 176381828
Sort by Columns    File size change (%)
--------------- ---------------------
[1, 6] 4.3039%
[1, 13] 4.2455%
[1, 6, 2, 3] 4.3039%
[1, 6, 4, 2, 3] 4.3039%
[2, 1, 6] 5.6303%
[6, 1, 4, 2, 3] 0.6235%
[6, 2, 1, 3] -0.5284%
[6, 2] -0.5284%
[6, 2, 1] -0.5284%
[6, 1, 2, 3] 0.6235%
[6, 1, 3] 0.6235%
[6, 13] 0.5170%
[6, 3] 0.5656%
[2, 6] 4.5044%
[2, 6, 3] 4.4963%
[2, 6, 3, 1] 4.4963%
[2, 6, 1, 3] 4.5044%
  • It looks like the best reduction we achieve is a 0.53% smaller file when sorted by columns [6, 2] (adding more columns didn’t help). Those 2 columns are not the top 2 by cardinality, but the 2nd and 3rd.
  • Interestingly, choosing the two highest-cardinality columns [1, 6] increased the file size by 4.3%.
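
The percentages in these tables (and in the Rust table further below) are computed relative to the sorted file size, i.e. (sorted - unsorted) / sorted; a minimal sketch of that calculation using the PyArrow numbers above:

# File size change after sorting, relative to the sorted file size;
# negative values mean the sorted file is smaller than the unsorted baseline.
UNSORTED = 168436855                          # PyArrow unsorted baseline

def size_change_pct(sorted_size):
    return (sorted_size - UNSORTED) / sorted_size * 100

print(f"[1, 6]: {size_change_pct(176012283):+.4f}%")   # +4.3039%
print(f"[6, 2]: {size_change_pct(167551595):+.4f}%")   # -0.5284%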

b) With Rust

let fname = "lineitem.tbl";
let df = CsvReader::from_path(fname)
    .expect("CSV file read error")
    .has_header(false)
    .with_separator(b'|')
    .finish()
    .unwrap();
// Drop the trailing empty column produced by the trailing '|' separator
let mut all_cols = df.get_column_names();
all_cols.pop();
let mut df = df.select(all_cols).unwrap();

let mut file = File::create("sorted.tbl").unwrap();
let fsize = ParquetWriter::new(&mut file).finish(&mut df).unwrap();
println!("Unsorted - Parquet file size: {}", fsize);

// "None" pads index 0 so the 1-based column numbers below map directly
let cols = ["None", "column_1", "column_2", "column_3", "column_4",
            "column_5", "column_6", "column_7", "column_8",
            "column_9", "column_10", "column_11", "column_12", "column_13"];
let scols = [vec![1, 6], vec![1, 13], vec![1, 6, 2, 3], vec![1, 6, 4, 2, 3],
             vec![2, 1, 6], vec![6, 1, 4, 2, 3], vec![6, 2, 1, 3],
             vec![6, 2], vec![6, 2, 1],
             vec![6, 1, 2, 3], vec![6, 1, 3], vec![6, 13], vec![6, 3],
             vec![2, 6], vec![2, 6, 3], vec![2, 6, 1, 3], vec![2, 6, 3, 1]];

for col in scols {
    let mut sort_cols: Vec<&str> = vec![];
    for itm in col {
        sort_cols.push(cols[itm]);
    }
    // All sort columns ascending (descending = false)
    let sort_order = vec![false; sort_cols.len()];
    let mut odf = df.sort(&sort_cols, sort_order, false).unwrap();
    let mut file = File::create("sorted.tbl").unwrap();
    let fsize = ParquetWriter::new(&mut file).finish(&mut odf).unwrap();
    println!("{:?} - Parquet file size: {}", sort_cols, fsize);
}
Unsorted - Parquet file size: 184214037
["column_1", "column_6"] - Parquet file size: 185320520
["column_1", "column_13"] - Parquet file size: 185342978
["column_1", "column_6", "column_2", "column_3"] - Parquet file size: 185320525
["column_1", "column_6", "column_4", "column_2", "column_3"] - Parquet file size: 185320520
["column_2", "column_1", "column_6"] - Parquet file size: 169215522
["column_6", "column_1", "column_4", "column_2", "column_3"] - Parquet file size: 169974045
["column_6", "column_2", "column_1", "column_3"] - Parquet file size: 169589117
["column_6", "column_2"] - Parquet file size: 169589117
["column_6", "column_2", "column_1"] - Parquet file size: 169589117
["column_6", "column_1", "column_2", "column_3"] - Parquet file size: 169974079
["column_6", "column_1", "column_3"] - Parquet file size: 169974073
["column_6", "column_13"] - Parquet file size: 169814690
["column_6", "column_3"] - Parquet file size: 170000661
["column_2", "column_6"] - Parquet file size: 166440218
["column_2", "column_6", "column_3"] - Parquet file size: 166416403
["column_2", "column_6", "column_1", "column_3"] - Parquet file size: 166440218
["column_2", "column_6", "column_3", "column_1"] - Parquet file size: 166416403
Sort by Columns    File size change (%)
--------------- ---------------------
[1, 6] 0.5971%
[1, 13] 0.6091%
[1, 6, 2, 3] 0.5971%
[1, 6, 4, 2, 3] 0.5971%
[2, 1, 6] -8.8636%
[6, 1, 4, 2, 3] -8.3777%
[6, 2, 1, 3] -8.6237%
[6, 2] -8.6237%
[6, 2, 1] -8.6237%
[6, 1, 2, 3] -8.3777%
[6, 1, 3] -8.3777%
[6, 13] -8.4794%
[6, 3] -8.3608%
[2, 6] -10.6788%
[2, 6, 3] -10.6788%
[2, 6, 3, 1] -10.6788%
[2, 6, 1, 3] -10.6946%
  • It looks like the best reduction we achieve is 10.6946%, when sorted by columns [2, 6, 1, 3]. Those 4 columns are the 3rd, 2nd, 1st and 4th by cardinality.
  • Choosing the two highest-cardinality columns [1, 6] increased the file size by 0.5971%.

Observations:

  • The Parquet file sizes with zstd compression differ between PyArrow and Rust (Polars), suggesting implementation differences.
  • The relative effect of each sort order also differs between the two.
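
One way to dig into such implementation differences is to inspect the Parquet metadata each writer produces (row groups, encodings, per-column compressed sizes); a minimal PyArrow sketch, using the data.parquet file written earlier:

import pyarrow.parquet as pq

# Row-group layout, encodings and compressed sizes all influence the final file size
meta = pq.ParquetFile('data.parquet').metadata
print(meta)                                   # overall summary
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)        # first column of each row group
    print(rg, col.encodings, col.compression, col.total_compressed_size)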

Rust: with a 7.5GB CSV file and sort columns [2, 6, 1, 3] and [2, 6, 3]

let fname = "lineitem-7.5G.tbl";
// Rest of the code similar to above

Result

Unsorted - Parquet file size: 1889330905
["column_2", "column_6", "column_1", "column_3"] - Parquet file size: 1677102164
["column_2", "column_6", "column_3"] - Parquet file size: 1676901584

That’s a 12.668% file size reduction with sort columns [2, 6, 3].

Rust: 724MB CSV file, compression level 15, with Int32 and Float32 data types, and sort columns [2, 6, 1, 3] and [2, 6, 3]

// highlighting only the changes in the code
let mut df = df
    .lazy()
    .select([
        // Downcast the integer columns to Int32 and the decimal columns to Float32
        col("column_1").cast(DataType::Int32),
        col("column_2").cast(DataType::Int32),
        col("column_3").cast(DataType::Int32),
        col("column_4").cast(DataType::Int32),
        col("column_5").cast(DataType::Int32),
        col("column_6").cast(DataType::Float32),
        col("column_7").cast(DataType::Float32),
        col("column_8").cast(DataType::Float32),
        col("column_9"),
        col("column_10"),
        col("column_11"),
        col("column_12"),
        col("column_13"),
        col("column_14"),
        col("column_15"),
        col("column_16"),
    ])
    .collect()
    .expect("Unable to cast datatypes");

// Write with zstd at compression level 15 instead of the default level 3
let clvl = ZstdLevel::try_new(15).expect("Incorrect compression level");
let fsize = ParquetWriter::new(&mut file)
    .with_compression(ParquetCompression::Zstd(Some(clvl)))
    .finish(&mut odf)
    .unwrap();
Unsorted - Int32, Float32, Parquet file size: 177535910
Unsorted - compression level 15, Parquet file size: 166617931
Unsorted - compression level 15, Int32, Float32, Parquet file size: 164928078
["column_2", "column_6", "column_1", "column_3"] - Parquet file size: 142688631
["column_2", "column_6", "column_3"] - Parquet file size: 142645530
  • The file size reduction due to the data type changes and the increased compression level is 11.6936% (164,928,078 vs 184,214,037 bytes).
  • With sorted columns on top of that, we see a 29.1411% (142,645,530 vs 184,214,037 bytes) file size reduction.
  • The compression level has a higher impact than the data type changes (166,617,931 vs 177,535,910 bytes), at the cost of increased write time.
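
For comparison, the same two changes (32-bit numeric types and zstd level 15) can be sketched in pandas/PyArrow as well; this is a rough equivalent for illustration, not the code behind the numbers above, and it assumes pandas forwards compression_level to PyArrow:

import os
import pandas as pd

df = pd.read_csv('lineitem.tbl', sep='|', header=None)
df = df[list(df.columns)[:-1]]

# Downcast the first 8 columns, mirroring the Rust casts above
df = df.astype({0: 'int32', 1: 'int32', 2: 'int32', 3: 'int32',
                4: 'int32', 5: 'float32', 6: 'float32', 7: 'float32'})

# Sort by columns [2, 6, 3] (0-based labels 1, 5, 2) and write with zstd level 15
path = 'data-sorted-cl15.parquet'
df.sort_values(by=[1, 5, 2]).to_parquet(
    path, engine='pyarrow', compression='zstd', compression_level=15)
print(f"File size: {os.path.getsize(path)}")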

Rust: 7.5GB CSV file, compression level 15, with Int32 and Float32 data types, and sort columns [2, 6, 1, 3] and [2, 6, 3]

Unsorted - compression level 15, Int32, Float32, Parquet file size: 1690037177
["column_2", "column_6", "column_1", "column_3"] - Parquet file size: 1423624455
["column_2", "column_6", "column_3"] - Parquet file size: 1423409579
  • The file size reduction due to the data type changes and the increased compression level is 11.7923% (1,690,037,177 vs 1,889,330,905 bytes).
  • With sorted columns on top of that, we see a 32.7328% (1,423,409,579 vs 1,889,330,905 bytes) file size reduction.

Final thoughts

  • With Rust, on the 724MB CSV file, we see 10.69% storage savings when the data is sorted and saved as Parquet with zstd compression. Nowhere close to the 78% savings.
  • On the 7.5GB CSV file, we see 12.67% storage savings, an indication that bigger files, with higher cardinalities, can lead to relatively smaller file sizes.
  • With a higher compression level and reduced-precision data types (Int32/Float32 vs Int64/Float64), we see a very good 32.73% storage savings.
  • The huge reduction in file size was probably a combination of very high cardinalities, really large files, specific kinds of data types, a higher compression level, etc.
  • If sorted incorrectly, the file size can even increase.
  • There are implementation differences across libraries/languages, and hence the best sort order is not universal.
  • There is no formula/algorithm to identify the correct sort order other than trial and error specific to a library. This approach is generally not scalable; it can be used only in special cases with experimentation, time and resources.
  • Maybe there is a way to achieve high storage savings in most general cases. If someone explores further, makes it work and publishes the approach, it will be a huge win for all.
