Running parquet-tools on a Red Hat-Based Distribution (Amazon Linux 2)

Preshen Goobiah
1 min read · Apr 4, 2019

To get started with inspecting Parquet files, you can use parquet-tools from the parquet-mr project on GitHub.

I built the .jar file with Maven and ran it with the Java Runtime Environment (JRE).

This was done on an Amazon Linux 2 (ami-07683a44e80cd32c5) EC2 Instance.

Let's dive straight into the setup and use of this project!

Step 1: Update the environment and install the JRE, Maven, and Git using yum

The default ec2-user can run commands as root through sudo on Amazon Linux 2.

sudo yum update -y

sudo yum install java-1.8.0-openjdk maven git -y
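Before moving on, it can help to confirm that the three tools are actually on the PATH. The snippet below is a minimal sketch; check_tools is a hypothetical helper name, not part of yum or parquet-tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given commands are on PATH.
# Returns 0 when every command is found, 1 if any is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The build below needs java, mvn and git.
check_tools java mvn git || echo "Install the missing tools before building."
```

If anything is reported as missing, re-run the yum install command above and check its output for errors.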

Step 2: Clone the parquet-mr repo

git clone https://github.com/apache/parquet-mr.git

Step 3: Build the .jar file (this may take a while)

cd parquet-mr/parquet-tools/

mvn clean package -Plocal

Step 4: Run the .jar file

cd target/

# List the build output; the parquet-tools jar name contains the version

ls

java -jar parquet-tools-<VERSION>-SNAPSHOT.jar
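If you would rather not type the version by hand, a small shell function can pick the jar out of the build directory. This is a sketch under the assumption that the build output follows the parquet-tools-*.jar naming seen above; find_tools_jar is a hypothetical name, and jars ending in -tests or -sources are skipped in case the build produces them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the first parquet-tools jar in the
# given directory (default: current directory), skipping "-tests" and
# "-sources" jars. Returns 1 if no matching jar exists.
find_tools_jar() {
  local dir="${1:-.}"
  local jar
  for jar in "$dir"/parquet-tools-*.jar; do
    case "$jar" in
      *-tests.jar|*-sources.jar) continue ;;
    esac
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no parquet-tools jar found in $dir" >&2
  return 1
}

# Example, run from inside target/:
#   JAR="$(find_tools_jar .)" && java -jar "$JAR"
```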

That’s it! You can now inspect Parquet files. For example, you can print a Parquet file as JSON, using the version from your own jar name (1.12.0 when this was written):

java -jar parquet-tools-<VERSION>-SNAPSHOT.jar cat --json <PARQUET_FILE>
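Besides cat, parquet-tools has other subcommands such as schema, head and meta. The wrapper below is a rough sketch to save typing; pq and the PARQUET_TOOLS_JAR variable are names made up for this example, not something parquet-tools reads itself.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run parquet-tools using the jar path stored in
# PARQUET_TOOLS_JAR (an assumed variable name chosen for this example).
pq() {
  if [ -z "${PARQUET_TOOLS_JAR:-}" ]; then
    echo "set PARQUET_TOOLS_JAR to the path of the built jar" >&2
    return 1
  fi
  java -jar "$PARQUET_TOOLS_JAR" "$@"
}

# Example calls (data.parquet is a placeholder file name):
#   pq schema data.parquet      # print the file's schema
#   pq head -n 5 data.parquet   # print the first 5 records
#   pq meta data.parquet        # print file and row-group metadata
```

Set PARQUET_TOOLS_JAR once per shell session, for example to the jar path inside target/, and the calls above work from any directory.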

Thanks for reading my first article, hit clap if it worked for you :)
