Parquet Files in Object Storage for Efficient Analytics

Obed Vega
4 min read · Jul 6, 2023

NOS Capability Overview

I started college in 2005. Back then, I had to turn in my homework on a floppy disk. Yes, you read that right, a floppy disk. Don’t know what a floppy disk is? This is a floppy disk.

Technology has changed so much in the last few years, from the way we watch movies and listen to music, to rockets that can land and be relaunched, to the way we store our data.

We don’t rely on old-school storage methods anymore. Back in the day, we used big, clunky hard disk drives (HDDs) to store all our important stuff. These machines had spinning disks and read/write heads to handle data. But they couldn’t keep up with our growing data demands — they had limited space and were slower than molasses!

But the storage revolution keeps going strong! Cloud storage came onto the scene and changed the game completely. It gave regular folks and businesses the power to store their data in secure and scalable remote environments. Services like Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage offered flexible and affordable options that freed us from the constraints of physical storage setups.

NOS

Here is where NOS (Native Object Store) comes into the picture.

Native Object Store (NOS) is a Vantage feature that lets you query datasets stored in CSV, JSON, and Parquet formats. These datasets reside in external S3-compatible object storage such as AWS S3, Google Cloud Storage, Azure Blob Storage, or on-prem implementations. NOS is useful when you want to explore data without building a data pipeline to bring it into Vantage. The tutorial linked below shows how to export data from Vantage to object storage in the Parquet file format.

First Things First

Before jumping into the tutorial, let’s explain a couple of things. You probably already know this, but here is a quick reminder of what each of these is:

  1. Parquet Files.
  2. Object Store.

Parquet Files

Parquet is a columnar storage file format that is optimized for big data processing and analytics. It is designed to improve query performance and reduce storage costs for large datasets. Parquet files are commonly used in big data frameworks like Apache Hadoop, Apache Spark, and Apache Arrow.

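Parquet stores data in a compressed, columnar format, which provides several benefits. First, values in the same column tend to be similar, so they compress efficiently, resulting in smaller files than row-based formats. This reduces storage costs and improves I/O performance. Second, a query that needs only a few columns can read just those columns instead of scanning every row in full.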
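To make the columnar idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the Parquet format itself: the sample rows and the helper names `to_columns` and `run_length_encode` are made up for this example. It pivots row-style records into columns and then collapses repeated neighboring values, which is the intuition behind why similar values stored side by side compress well.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys (hypothetical sample data).
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented records, the way a JSON file would hold them.
rows = [
    {"id": 1, "country": "MX", "amount": 10.5},
    {"id": 2, "country": "MX", "amount": 7.0},
    {"id": 3, "country": "MX", "amount": 3.25},
    {"id": 4, "country": "US", "amount": 8.0},
]

columns = to_columns(rows)

# Reading a single column touches only that list, not every full row.
amounts = columns["amount"]

# A low-cardinality column shrinks to a handful of (value, count) pairs.
country_runs = run_length_encode(columns["country"])
```

With this sample data, the four `country` values collapse into two runs. Real Parquet writers use more elaborate encodings and general-purpose compression on top, so treat this only as a mental model.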

This is how a Parquet file looks compared to a JSON file:

Parquet files are binary and are composed of a header, one or more row groups, and a footer. Each row group holds a slice of the rows, and within a row group the values are stored column by column.
JSON file compared to a Parquet file
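As a rough sketch of the header and footer mentioned above: a Parquet file starts and ends with the 4-byte marker `PAR1`, and the 4 bytes just before the closing marker hold the length of the footer metadata as a little-endian integer. The function name `footer_metadata_length` and the synthetic bytes below are assumptions for illustration only. The example does not parse real metadata; it just checks the markers and reads the length field.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker found at both ends of a Parquet file


def footer_metadata_length(data: bytes) -> int:
    """Return the size in bytes of the footer metadata block.

    Raises ValueError if the leading or trailing magic marker is missing.
    """
    # Smallest possible layout: magic (4) + length field (4) + magic (4).
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (length,) = struct.unpack("<I", data[-8:-4])
    return length


# Synthetic stand-in for a file, NOT a valid Parquet payload:
# header marker, placeholder row-group bytes, placeholder metadata,
# metadata length, footer marker.
fake_metadata = b"\x00" * 20
fake_file = (
    MAGIC
    + b"row-group-bytes"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

metadata_size = footer_metadata_length(fake_file)
```

Because the footer records where each row group and column lives, a reader can jump to the end of the file first and then fetch only the parts it needs.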

Object Store

Object storage is a type of data storage architecture that manages data as objects rather than as a traditional file hierarchy or as blocks. In object storage, data is stored as individual objects, each with a unique identifier or key. These objects can be of any type, such as files, images, videos, or documents, and they can vary in size from a few bytes to terabytes or more.

Unlike file or block storage, object storage does not rely on a hierarchical file system structure. Instead, objects are stored in a flat address space and organized based on their unique identifiers. This provides a highly scalable and flexible storage solution that can handle large volumes of data across distributed systems.

Object storage comparison
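The flat address space described above can be pictured with a small in-memory sketch. The class `ToyObjectStore` and its methods are hypothetical and are not any vendor's API. The point is that each object lives under a single key, and what looks like a folder is only a shared key prefix.

```python
class ToyObjectStore:
    """In-memory stand-in for a bucket: a flat mapping from key to object."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Store the payload together with optional descriptive metadata.
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        # Look the object up directly by its key; no directory traversal.
        return self._objects[key][0]

    def list_keys(self, prefix=""):
        # "Folders" are emulated by filtering keys on a common prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("sales/2023/07/part-0.parquet", b"...", {"format": "parquet"})
store.put("sales/2023/08/part-0.parquet", b"...", {"format": "parquet"})
store.put("logs/app.json", b"{}", {"format": "json"})

july_keys = store.list_keys("sales/2023/07/")
all_sales = store.list_keys("sales/")
```

Listing by prefix is how tools commonly treat a group of Parquet files under one path as a single dataset.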

Tutorial!

Let’s dive right into the tutorial and explore the process of creating Parquet files in object storage. The tutorial, which I wrote, is featured in the official Teradata documentation and provides a comprehensive guide to help you along the way.

To learn more about how to create Parquet files, access the tutorial through the following link: Create Parquet Files in NOS

Conclusion

In conclusion, object storage has emerged as a highly beneficial solution for managing and storing data. Its inherent scalability, durability, and cost-effectiveness make it a preferred choice for individuals and businesses alike. With object storage able to handle massive amounts of data and adapt to changing storage needs, Teradata's Native Object Store (NOS) provides an efficient way to query that data where it lives and to export data from Vantage to it.

More information about NOS is available for you.
