Efficient Data Storage — Key to Gold: Part 1

Yogesh
Mar 20, 2024 · 6 min read


Data is the new oil, the new gold; we have heard this enough times. Enterprises are sitting on data of all kinds and sizes, and they are actively working to leverage and monetize it, each at a different phase of that journey. While the business thinks about strategy, customers and monetization, it falls to the architects and engineers to build the robust, performant systems that turn that data into gold.

One of the key aspects of data design is how we store and retrieve the data. A few important knobs are the data format, the compression codec and the compression level. One size does not fit all; there are trade-offs, and engineers need different knob settings for different kinds and sizes of data. These engineering decisions have a big impact on overall infrastructure utilization (storage, CPU, network), costs and, ultimately, user satisfaction.

On small datasets, the implications of these storage knobs may not be obvious. As we scale into the high gigabytes, terabytes and petabytes, an inefficient storage setup will openly show the wounds of bad design decisions and will be costly to the business.

This post was inspired by a storage savings story. I wanted to explore those savings and get a deeper understanding of the zstd format and its compression levels. There are many good resources on zstd; here is one such good read.

Experiment

Note: this is not a controlled, scientific setup.

I decided to use TPC-H data. There are many good resources on TPC-H; here is one on how to use dbgen to generate the data.

I generated only the lineitem table, at two scale factors: 1 and 10.

./dbgen -vf -s 1 -T L
./dbgen -vf -s 10 -T L

The generated CSV files (pipe-delimited) were 759,863,287 bytes (724 MB) and 7,775,727,688 bytes (7.24 GB).

Setup

I decided to use Polars for its performance; it is also gaining attention from the community familiar with pandas.

rustc 1.76.0

Cargo.toml dependencies

[dependencies]
polars = { version = "0.38.3", features = ["json", "parquet", "performant", "lazy"] }
sysinfo = "0.30.7"

Read the CSV file and capture time and CPU-related metrics:

use std::time::Instant;
use polars::prelude::*;
use sysinfo::System;

let fname = "lineitem.tbl";
let mut sys = System::new();

let now = Instant::now();
sys.refresh_cpu();
let cpus: f32 = sys.cpus().len() as f32;
// Eagerly read the pipe-delimited, headerless lineitem file into a DataFrame
let mut df = CsvReader::from_path(fname)
    .expect("CSV file read error")
    .has_header(false)
    .with_separator(b'|')
    .finish()
    .unwrap();
// Refresh CPU stats again and average the per-core usage accumulated during the read
sys.refresh_cpu();
let mut cpu_use: f32 = 0.0;
for cpu in sys.cpus() {
    cpu_use += cpu.cpu_usage();
}
let elapsed = now.elapsed();
println!("CSV Read time: {:.4?}, Avg CPU: {}", elapsed, cpu_use / cpus);

Result

CSV Read time: 0.6067 Avg CPU: 62.2635

I decided to use the Parquet format with Snappy compression as the baseline, to stay aligned with the savings story. The results below are for the 724 MB CSV file.

use std::fs::File;

// Not showing the code for capturing metrics; it is the same as above
// Write the DataFrame loaded from the CSV as Snappy-compressed Parquet.
// finish() returns the size in bytes of the written file.
let snappyfile: String = format!("{fname}_Snappy");
let mut file = File::create(&snappyfile).unwrap();
let fsize = ParquetWriter::new(&mut file)
    .with_compression(ParquetCompression::Snappy)
    .finish(&mut df)
    .unwrap();

// Read the Parquet file back
let mut file = File::open(snappyfile).unwrap();
let _df = ParquetReader::new(&mut file).finish().unwrap();

Result

Parquet Write Snappy  size: 245697483 time: 1.4070 Avg CPU: 16.4287
Parquet Read Snappy time: 0.1911 Avg CPU: 68.8687

I wanted to explore the trade-offs of the zstd compression levels. Polars' default zstd compression level is 3; Spark's default is 1.

// Not showing the code for capturing metrics (same pattern as above);
// elapsed and cpu_use are recomputed around each write and each read.
// Store the file size, time and CPU usage metrics in vectors, one entry per level.
let mut time_ary: Vec<f32> = vec![];
let mut cpu_ary: Vec<f32> = vec![];
let mut stored_size_ary: Vec<u64> = vec![];
let mut rtime_ary: Vec<f32> = vec![];
let mut rcpu_ary: Vec<f32> = vec![];

let zstdfile: String = format!("{fname}_Zstd");
for comp_lvl in 1..=22 {
    // Write the DataFrame as Parquet with zstd at the current compression level
    let mut file = File::create(&zstdfile).unwrap();
    let clvl = ZstdLevel::try_new(comp_lvl).expect("Incorrect compression level");
    let fsize = ParquetWriter::new(&mut file)
        .with_compression(ParquetCompression::Zstd(Some(clvl)))
        .finish(&mut df)
        .unwrap();
    time_ary.push(elapsed.as_secs_f32());
    cpu_ary.push(cpu_use / cpus);
    stored_size_ary.push(fsize);

    // Read the Parquet zstd file back
    let mut file = File::open(&zstdfile).unwrap();
    let _df = ParquetReader::new(&mut file).finish().unwrap();
    rtime_ary.push(elapsed.as_secs_f32());
    rcpu_ary.push(cpu_use / cpus);
}
println!("Write Time: {:?}", time_ary);
println!("Write CPU: {:?}", cpu_ary);
println!("Storage size: {:?}", stored_size_ary);
println!("Read Time: {:?}", rtime_ary);
println!("Read CPU: {:?}", rcpu_ary);

Result

Write Time: [1.78240863,  1.90289617,  2.13884197,  2.56447107,  3.69585093,
4.89662223, 5.93436913, 7.30333377, 7.81703857, 11.273445 ,
17.147927 , 17.29071133, 25.60719433, 33.752106 , 39.18368533,
54.97149033, 58.38481367, 80.789633 , 91.91843667, 92.00044667,
91.788495 , 94.81770167]
Write CPU: [13.539226 , 12.864854 , 12.264953 , 11.091775 , 8.97849733,
7.8851272 , 7.4261729 , 6.88323817, 6.9484599 , 6.215962 ,
5.6230638 , 5.56050417, 5.46666307, 5.15514553, 5.11037843,
5.4794368 , 5.10199417, 4.78517787, 4.74846753, 4.78315 ,
4.7442983 , 4.7602315]
Storage size: [206036928, 191669263, 184968258, 183435776, 179710183,
177680059, 175451396, 174576682, 174425777, 172263632, 170651985,
170651849, 170328984, 168133948, 167372150, 164283881, 164262129,
162550800, 162457483, 162457483, 162457483, 161842182]
Read Time: [0.22145777, 0.24182903, 0.22568557, 0.22586064, 0.22580933,
0.21945846, 0.21445663, 0.20893256, 0.21090803, 0.2107277 ,
0.2096266 , 0.20804636, 0.20814353, 0.2096809 , 0.2066144 ,
0.213794 , 0.2094142 , 0.20733613, 0.20627586, 0.20990697,
0.20317144, 0.2036086]
Read CPU: [73.86175667, 73.35632467, 78.85849333, 81.56851333, 78.93547333,
75.818878 , 77.24738333, 78.74891733, 77.272888 , 75.72904833,
78.51852667, 76.83330467, 78.48518733, 74.94689667, 74.43814 ,
79.17049833, 76.88399 , 76.46573667, 77.26183667, 75.504567 ,
76.92140467, 76.27791667]

When we visualize the output, we see a few things:

  • A low compression level of 1, 2 or 3 is a good choice. Higher levels increase write times steeply (up to roughly 50x at the top levels) with no noticeable difference in read times, while the file size shrinks only modestly; the sketch after the charts quantifies this.
  • Above compression level 15, the extra write time will very likely outweigh the gains from the reduced storage space.
[Chart] Results against the 724 MB CSV file

I saw similar, consistent results with the 7.24 GB CSV file, with the maximum compression level capped at 15.

[Chart] Results against the 7.24 GB CSV file
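
To put rough numbers on the first bullet, here is a small, standalone sketch (not part of the experiment code) that restates a few of the 724 MB results printed above relative to level 1:

// A quick sketch using values copied from the printed results above (724 MB file):
// write-time multiplier and storage reduction versus zstd level 1.
fn main() {
    // (level, write time in seconds, parquet size in bytes)
    let samples = [
        (1, 1.78240863_f32, 206_036_928_u64),
        (3, 2.13884197, 184_968_258),
        (15, 39.18368533, 167_372_150),
        (22, 94.81770167, 161_842_182),
    ];
    let (_, base_time, base_size) = samples[0];
    for (level, time, size) in samples {
        let time_x = time / base_time;
        let saved = 100.0 * (1.0 - size as f32 / base_size as f32);
        println!("level {level:>2}: {time_x:>5.1}x write time, {saved:>4.1}% smaller");
    }
}

For these numbers, level 3 costs about 1.2x the write time of level 1 for roughly 10% smaller files, while level 22 costs over 50x the write time for about 21% smaller files.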

Datatype Importance

Using the right data type can lead to good savings on storage space.

We use the same 724 MB lineitem CSV file, save it as Parquet (zstd, via pandas/pyarrow this time) and compare the Parquet file sizes before and after changing the data types.

import pandas as pd
import os

# dbgen ends every row with a '|', which pandas reads as an empty last column,
# hence the [:-1] to drop it before inspecting dtypes and writing Parquet
df = pd.read_csv('lineitem.tbl', sep='|', header=None)
for i in df[list(df.columns)[:-1]].dtypes:
    print(i, end=',')
print()
path = 'data.parquet'
df[list(df.columns)[:-1]].to_parquet(path, engine='pyarrow', compression='zstd')
print(f"File size: {os.path.getsize(path)}")

int64,int64,int64,int64,int64,float64,float64,float64,object,object,object,object,object,object,object,object,
File size: 168436855

If, instead of int64 and float64, we choose int32 and float32 (assuming no impact on application or data precision):

# Downcast the integer and decimal columns (pandas labels them 0..15 here)
df[[0, 1, 2, 3, 4]] = df[[0, 1, 2, 3, 4]].astype('int32')
df[[5, 6, 7]] = df[[5, 6, 7]].astype('float32')
path = 'data.parquet'
df[list(df.columns)[:-1]].to_parquet(path, engine='pyarrow', compression='zstd')
print(f"File size: {os.path.getsize(path)}")

File size: 150208751

That is a good ~10.8% saving in file size, with the added benefits of faster reads, a smaller memory footprint and faster data processing.
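
For completeness, roughly the same downcast can be sketched with the Rust Polars API used earlier. This is only a minimal sketch, assuming the headerless CSV was loaded as in the earlier snippets (Polars auto-names such columns column_1, column_2, ...) and that the lazy feature listed in the Cargo.toml above is enabled:

use polars::prelude::*;

// Minimal sketch (not from the original experiment code): downcast the lineitem columns
// with the lazy API before writing Parquet. column_1..=column_5 are the integer fields,
// column_6..=column_8 hold extendedprice, discount and tax.
fn downcast(df: DataFrame) -> PolarsResult<DataFrame> {
    df.lazy()
        .with_columns([
            col("column_1").cast(DataType::Int32),
            col("column_2").cast(DataType::Int32),
            col("column_3").cast(DataType::Int32),
            col("column_4").cast(DataType::Int32),
            col("column_5").cast(DataType::Int32),
            col("column_6").cast(DataType::Float32),
            col("column_7").cast(DataType::Float32),
            col("column_8").cast(DataType::Float32),
        ])
        .collect()
}

The returned DataFrame can then be written with the same ParquetWriter call as before; the exact byte savings depend on the writer defaults, so they may not match the pandas/pyarrow numbers above exactly.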

Final note

  • zstd is definitely a good choice over Snappy. The packages' default zstd compression levels of 1 or 3 are good defaults.
  • Choosing the right data type is important. High-precision data types require more storage, memory and computing resources.
  • Choosing the right compression level is important. For data that is written once (and rarely thereafter), where write time is not a concern, choose a higher compression level, e.g. 15. At terabyte and petabyte scales, the differences in storage space add up, and we gain better performance per unit of cost.
  • We still need to explore how to achieve bigger storage savings (78%, as reported in the savings story) using sorting. Head over to Part 2 for the details.
