Leverage Plugins to Ingest Parquet Files from S3 in Pinot

Kartik Khare
Apache Pinot Developer Blog
4 min read · Aug 18, 2020

One of the primary advantages of using Pinot is its pluggable architecture. Plugins make it easy to add support for any third-party system, whether it is an execution framework, a filesystem, or an input format.

In this tutorial, we will use three such plugins to ingest data and push it to our Pinot cluster. The plugins we will be using are:

  • pinot-batch-ingestion-spark
  • pinot-s3
  • pinot-parquet

See the Batch Ingestion, File Systems, and Input Formats documentation pages for all the available plugins.

Setup

This tutorial uses Apache Pinot, Apache Spark, Apache Parquet, and Amazon S3.

Figure: Pinot ingestion and query flow

Input Data

We first need some input data to ingest. For our demo, we’ll just create some small Parquet files and upload them to our S3 bucket. The easiest way is to create CSV files and then convert them to Parquet. CSV is human-readable, which makes it easier to modify the input if something fails in our demo. We will call this file students.csv.
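As a rough sketch, a file like this could be created from the shell. The column names and the rows below are made up for illustration, and the DEMO_DIR working directory is a hypothetical stand-in for the /path/to placeholder used elsewhere in this article.

```shell
# Hypothetical working directory for the demo files (an assumption of this sketch).
DEMO_DIR="${DEMO_DIR:-./pinot-demo}"
mkdir -p "$DEMO_DIR"

# Made-up sample rows; the quoted 'EOF' keeps the text exactly as written.
cat > "$DEMO_DIR/students.csv" <<'EOF'
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
201,Bob,King,Male,Maths,3.2,1571036400000
202,Nick,Young,Male,Physics,3.6,1572418800000
203,Anna,Lee,Female,English,3.5,1572505200000
EOF

# Quick sanity check: show the header line of the file we just wrote.
head -n 1 "$DEMO_DIR/students.csv"
```

Any columns will do as long as the Pinot schema created later uses the same names.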

Now, we’ll create Parquet files from the above CSV file using Spark. Since this is a small program, we will use the Spark shell instead of writing a full-fledged Spark application.

The .parquet files can now be found in the /path/to/batch_input directory. You can upload this directory to S3 using either the AWS console or the following command:
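A minimal sketch of the conversion follows. It saves two Spark shell statements to a file so they can be replayed. The file name csv_to_parquet.scala and the DEMO_DIR directory are hypothetical, and the inferSchema option is an assumption of this sketch rather than something the demo requires.

```shell
# Hypothetical working directory for the demo files (an assumption of this sketch).
DEMO_DIR="${DEMO_DIR:-./pinot-demo}"
mkdir -p "$DEMO_DIR"

# Save the Spark shell statements; `spark` is the session the Spark shell provides.
cat > "$DEMO_DIR/csv_to_parquet.scala" <<'EOF'
// Read the CSV with its header row; inferSchema asks Spark to detect numeric
// columns instead of treating every value as a string.
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/students.csv")

// Write the rows as Parquet files into the batch input directory.
df.write.mode("overwrite").parquet("/path/to/batch_input")
EOF

# To run it, paste the two statements into spark-shell, or feed the file on
# standard input (requires a local Spark installation):
#   spark-shell < "$DEMO_DIR/csv_to_parquet.scala"
```

Replace the /path/to placeholders with real locations before running the statements.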

aws s3 cp /path/to/batch_input s3://my-bucket/batch-input/ --recursive

Create Schema and Table

We need to create a table to query the data that will be ingested. All tables in Pinot are associated with a schema. You can check out Table configuration and Schema configuration for more details on creating configurations.

For our demo, we will have the following schema and table configs
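The two files might look roughly like the sketch below. The field names and types are assumptions chosen to match a students dataset with an ID, name, gender, subject, score, and an epoch-millisecond timestamp; the tenant names and replication value are the usual defaults for a local cluster, not settings taken from this article. DEMO_DIR is a hypothetical working directory.

```shell
# Hypothetical working directory for the demo files (an assumption of this sketch).
DEMO_DIR="${DEMO_DIR:-./pinot-demo}"
mkdir -p "$DEMO_DIR"

# Schema: column names and types are assumed; adjust them to your own data.
cat > "$DEMO_DIR/student_schema.json" <<'EOF'
{
  "schemaName": "students",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "INT"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
EOF

# Offline table that uses the schema above (tenant names are assumed defaults).
cat > "$DEMO_DIR/student_table.json" <<'EOF'
{
  "tableName": "students",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "students"
  },
  "tenants": {"broker": "DefaultTenant", "server": "DefaultTenant"},
  "tableIndexConfig": {"loadMode": "MMAP"},
  "metadata": {}
}
EOF
```

The table name students matches the query used later in this article; the schema name in the table config must match the schemaName in the schema file.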

We can now upload these configurations to Pinot and create an empty table. We will use the pinot-admin.sh CLI for this purpose.

pinot-admin.sh AddTable -tableConfigFile /path/to/student_table.json -schemaFile /path/to/student_schema.json -controllerHost localhost -controllerPort 9000 -exec

You can check out Command-Line Interface (CLI) for all the available commands.

Our table will now be available in the Pinot data explorer.

Ingest Data

Now that our data is available in S3 and our table exists in Pinot, we can start ingesting the data. Data ingestion in Pinot involves the following steps:

  • Read data and generate compressed segment files from input
  • Upload the compressed segment files to the output location
  • Push the location of the segment files to the controller

Once the location is available to the controller, it can notify the servers to download the segment files and populate the tables.

The above steps can be performed using any distributed executor of your choice such as Hadoop, Spark, Flink, etc. For this demo, we will be using Apache Spark to execute the steps.

Pinot provides runners for Spark out of the box. So as a user, you don’t need to write a single line of code. You can write runners for any other executor using our provided interfaces.

First, we will create a job spec configuration file for our data ingestion process.
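A job spec along these lines is one possibility. It is a sketch under several assumptions: the class names follow the package layout of the Pinot plugins as I understand them, the staging and output paths and the us-west-2 region are hypothetical, and the controller is assumed to run on localhost:9000 as in the AddTable command above. Check each value against the Ingestion Job Spec documentation for your Pinot version.

```shell
# Hypothetical working directory for the demo files (an assumption of this sketch).
DEMO_DIR="${DEMO_DIR:-./pinot-demo}"
mkdir -p "$DEMO_DIR"

cat > "$DEMO_DIR/spark_job_spec.yaml" <<'EOF'
# Run every step on Spark; stagingDir is a temporary directory for the job.
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 's3://my-bucket/spark/staging/'   # hypothetical path

# Generate segments, then push their URIs to the controller.
jobType: SegmentCreationAndUriPush
inputDirURI: 's3://my-bucket/batch-input/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/output/'            # hypothetical path
overwriteOutput: true

# S3 filesystem plugin; the region is an assumed example value.
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'

# Parquet reader plugin.
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'

tableSpec:
  tableName: 'students'
  schemaURI: 'http://localhost:9000/tables/students/schema'
  tableConfigURI: 'http://localhost:9000/tables/students'

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
EOF
```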

In the job spec, we have set the execution framework to spark and configured the appropriate runners for each of our steps. We also need a temporary stagingDir for our Spark job. This directory is cleaned up after the job has finished.

We also specify the S3 filesystem and Parquet reader implementations to use. You can refer to the Ingestion Job Spec documentation for a complete list of configurations.

We can now run our Spark job to execute all the steps and populate data in Pinot.
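The launch could be wrapped in a small script such as the sketch below. The Pinot version 0.4.0 is only a stand-in, and the install directory and plugin JAR paths assume the layout of a stock binary distribution; verify them against your own installation. The script is written to disk here but not executed.

```shell
# Hypothetical working directory for the demo files (an assumption of this sketch).
DEMO_DIR="${DEMO_DIR:-./pinot-demo}"
mkdir -p "$DEMO_DIR"

cat > "$DEMO_DIR/run_ingestion.sh" <<'EOF'
#!/bin/sh
# Assumed values: replace with your own Pinot version and install location.
PINOT_VERSION="${PINOT_VERSION:-0.4.0}"
PINOT_DISTRIBUTION_DIR="${PINOT_DISTRIBUTION_DIR:-/path/to/apache-pinot-bin}"
PLUGINS_DIR="${PINOT_DISTRIBUTION_DIR}/plugins"
PINOT_ALL_JAR="${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar"

# Plugin JARs for the driver classpath (paths assume the stock layout).
PLUGIN_JARS="${PLUGINS_DIR}/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar"
PLUGIN_JARS="${PLUGIN_JARS}:${PLUGINS_DIR}/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar"
PLUGIN_JARS="${PLUGIN_JARS}:${PLUGINS_DIR}/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar"

# Launch the ingestion job described by the job spec file.
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PLUGINS_DIR}" \
  --conf "spark.driver.extraClassPath=${PLUGIN_JARS}:${PINOT_ALL_JAR}" \
  "local://${PINOT_ALL_JAR}" \
  -jobSpecFile /path/to/spark_job_spec.yaml
EOF

# Run it once Spark, Pinot, and AWS credentials are in place:
#   sh "$DEMO_DIR/run_ingestion.sh"
```

The local master shown here suits a demo on one machine; on a real cluster you would point --master at your cluster manager instead.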

In the command, we have included the JARs of all the required plugins in the Spark driver’s classpath. In practice, you only need to do this if you get a ClassNotFoundException.

Voila! Now our data is successfully ingested. Let’s try to query it from Pinot’s broker.

bin/pinot-admin.sh PostQuery -brokerHost localhost -brokerPort 8000 -queryType sql -query "SELECT * FROM students LIMIT 10"

If everything went right, you should receive the following output

You can also view the results in the Data explorer UI.

Pinot’s powerful pluggable architecture allowed us to ingest Parquet records from S3 with just a few configurations. The process described in this article is highly scalable and can be used to ingest billions of records with minimal latency.

You can check out Pinot on the official website. Refer to our documentation to get started with the setup, and if you run into any issues, the community is there to help on the official Slack channel.

Kartik Khare
Software Engineer @StarTree | Previously @WalmartLabs, @Olacabs | Committer @ApachePinot