Running parquet-tools on a Red Hat-Based Distribution (Amazon Linux 2)
To get started with inspecting Parquet files, you can use parquet-tools from the parquet-mr project on GitHub.
I built the .jar file with Maven and ran it with the Java Runtime Environment (JRE).
This was done on an Amazon Linux 2 (ami-07683a44e80cd32c5) EC2 Instance.
Let's dive straight into the setup and use of this project!
Step 1: Update the environment and install the JRE, Maven, and Git using yum
The default ec2-user can run commands as root with sudo on Amazon Linux 2.
sudo yum update -y
sudo yum install java-1.8.0-openjdk maven git -y
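Before moving on, it can help to confirm that the three tools actually ended up on the PATH. The snippet below is a minimal sketch; `check_tools` is a hypothetical helper name introduced here for illustration and is not part of parquet-mr.

```shell
# Hypothetical helper: report whether each named tool is on the PATH.
# Returns the number of missing tools (0 means everything was found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: NOT FOUND"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Check the tools installed in Step 1.
check_tools java mvn git || echo "Install the missing tools before continuing."
```

If any line says NOT FOUND, rerun the yum install command above and check its output for errors.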
Step 2: Clone the parquet-mr repo
git clone https://github.com/apache/parquet-mr.git
Step 3: Build the .jar file (this may take a while)
cd parquet-mr/parquet-tools/
mvn clean package -Plocal
Step 4: Run the .jar file
cd target/
# The jar's file name includes the parquet-tools version; check it with ls
ls
java -jar parquet-tools-<VERSION>-SNAPSHOT.jar
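If you would rather not type the version by hand, the jar can be picked up with a glob. This is only a sketch: `find_parquet_tools_jar` is a hypothetical helper name, and it assumes the build produced a file matching parquet-tools-*-SNAPSHOT.jar, as in the command above.

```shell
# Hypothetical helper: print the first parquet-tools SNAPSHOT jar found in a
# directory (defaults to the current directory).
find_parquet_tools_jar() {
  dir="${1:-.}"
  for jar in "$dir"/parquet-tools-*-SNAPSHOT.jar; do
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no parquet-tools jar found in $dir" >&2
  return 1
}

# Run the tool only if a jar was found (run this from the target/ directory).
if JAR=$(find_parquet_tools_jar .); then
  java -jar "$JAR"
fi
```

After this, `$JAR` holds the path to the jar and can be reused in later commands.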
That’s it! You can now inspect Parquet files. For example, you can print a Parquet file as JSON:
java -jar parquet-tools-1.12.0-SNAPSHOT.jar cat --json <PARQUET_FILE>
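Besides cat, parquet-tools also has schema, meta, and head subcommands, which are handy for a quick look at an unfamiliar file. The wrapper below is a rough sketch rather than anything shipped with the project: `inspect_parquet` is a hypothetical function name, and it assumes you pass the path to the jar built in Step 3.

```shell
# Hypothetical wrapper: print the schema, metadata, and first rows of a
# Parquet file using the parquet-tools jar built earlier.
inspect_parquet() {
  jar="$1"
  file="$2"
  # Refuse to run unless both the jar and the Parquet file exist.
  if [ ! -f "$jar" ] || [ ! -f "$file" ]; then
    echo "usage: inspect_parquet <JAR> <PARQUET_FILE>" >&2
    return 2
  fi
  java -jar "$jar" schema "$file"      # column names and types
  java -jar "$jar" meta "$file"        # file and row-group metadata
  java -jar "$jar" head -n 5 "$file"   # first five records
}
```

You would call it as inspect_parquet parquet-tools-1.12.0-SNAPSHOT.jar <PARQUET_FILE>, swapping in your own jar version and file.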
Thanks for reading my first article, hit clap if it worked for you :)