Exploring OLAP on Kubernetes with Apache Pinot

Greg Simons
Apache Pinot Developer Blog
6 min read · Jun 10, 2020

It was April 2020 when I first heard about Apache Pinot through a tweet from Kenny Bastani, who happens to be my podcast co-host and an open source contributor to Pinot.

Online Analytical Processing (OLAP) has been around for many years, but there aren’t many data warehouse products out there designed specifically for the “cloud-native era”. Apache Pinot was forged from engineering at LinkedIn and is currently incubating at the Apache Software Foundation. As for production users? The tool already boasts a deep bench of users, including Uber and Microsoft.

Considering the long history of OLAP and data warehousing, I found myself asking: what separates Pinot from its alternatives? In this article, I’ll explore the initial setup experience I had with Pinot and talk about some sample cloud-native use cases.

Getting started

I decided to try Pinot out locally on a Kubernetes cluster running on my machine. The first thing I needed to do was get an up-to-date (at the time of writing) minikube installation, so I have included some quick links to the instructions I used to set this up on macOS.

The full instructions for getting started with Apache Pinot on Kubernetes are available here.

If you are installing a new Kubernetes cluster on your machine and want to go with minikube, make sure you have VirtualBox installed.

If this initial setup has worked as expected, executing minikube status results in the following output:

minikube status (output)
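The embedded screenshot shows the command output; on a healthy cluster it reports something along these lines (the exact fields vary by minikube version):

$ minikube status
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured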

One element currently missing from the guide is the installation of Helm. I carried this out through Homebrew using the following reference.
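On macOS, that reference boils down to a one-liner:

$ brew install helm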

Minikube needs to be started with the following memory settings in order for the pinot-broker component to start (thanks to this GitHub repo for helping me find the required settings!).

Configure your minikube installation for Pinot
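The embedded settings did not survive the export, but the start command takes the shape below; treat the memory and CPU values as an approximation of what that repo suggests and adjust for your machine:

$ minikube start --vm-driver=virtualbox --memory=8g --cpus=4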

Once Helm is installed, switch your kubectl context to minikube.
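That is a single command:

$ kubectl config use-context minikube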

And as per the getting started guide for Pinot…
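The embedded commands did not survive the export; at the time, the guide’s Helm-based install looked roughly like the following (the chart repository URL and release name are from memory, so defer to the linked guide if they have moved):

$ helm repo add pinot https://raw.githubusercontent.com/apache/incubator-pinot/master/kubernetes/helm
$ kubectl create ns pinot-quickstart
$ helm install pinot pinot/pinot -n pinot-quickstart
$ kubectl get all -n pinot-quickstart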

output from verifying the status of the pinot-quickstart namespace
minikube dashboard (pinot quickstart namespace)

Accessing some of the Pinot component UIs can be carried out via port forwarding. An example of this exists in incubator-pinot/kubernetes/helm/pinot/query-pinot-data.sh:

$ kubectl port-forward service/pinot-controller 9000:9000 -n pinot-quickstart > /dev/null &

Components

When first getting started with Apache Pinot, I looked through the documentation on the architectural responsibilities of each component. There is a detailed outline of the clustering setup for Pinot (https://docs.pinot.apache.org/basics/architecture), which I will not repeat here. To get started, I have summarised and greatly simplified the components into the following descriptions.

Segments

As per the documentation:

  • a horizontal shard representing a chunk of table data with some number of rows

Because Pinot expects tables to scale infinitely, the data is broken down into smaller chunks, and each chunk (a segment) can be seen as a time-based partition of the table.

Pinot Server

There are two types of servers you can set up for Pinot: offline and realtime. We will focus on the latter in this article.

Offline servers will be covered in a follow-up article on ingesting data and defining the table configuration and specifications. Realtime servers ingest data from real-time streams such as Apache Kafka; the data can be transformed and then persisted, ready to serve queries via the Pinot Broker.

Pinot Broker

The broker accepts queries from clients, routes them to the appropriate Pinot Servers, and merges the results before returning a response.

Pinot Controller

The controller component’s primary responsibility is the management of the other Pinot components. You can view some of the APIs available to you via the help endpoint of the controller.

http://localhost:9000/help

http://localhost:9000/query

Pinot data explorer

Streaming events (the write path)

To get started with ingesting data and adding associated queries, I focused on the following components and paths highlighted in the diagram below:

Primary Architectural Components of Pinot in a read / write split

The Pinot documentation already includes a great guide demonstrating how to set up a local Kafka cluster and topic for realtime ingestion, and that is what this example uses.

With the growing need to standardise event structure, the CloudEvents specification was formed. The example uses the CloudEvents event structure to create a schema in Apache Pinot. These events will be consumed from the Kafka topic and persisted to the segment stores via the Pinot Server, so that queries can be invoked via the Broker.

Ingest from Apache Kafka

To get started (even with the examples guide), there is an assumption that you have progressed through the manual cluster setup guide. As this guide uses minikube to host the Pinot cluster, I will add Kafka to Kubernetes using the Helm charts listed in the getting started guide. Run these three commands in your terminal.

$ helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
$ helm install -n pinot-quickstart kafka incubator/kafka --set replicas=1
$ kubectl get all -n pinot-quickstart | grep kafka
Running Kafka services and pods inside Kubernetes

Creating the schema

I decided to use the CloudEvents JSON data format example for the schema. It is specified as follows:

JSON data format from Cloud Events
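For reference, the embedded image is the (abridged) example event from the CloudEvents v1.0 specification:

{
    "specversion" : "1.0",
    "type" : "com.github.pull.create",
    "source" : "https://github.com/cloudevents/spec/pull",
    "subject" : "123",
    "id" : "A234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "datacontenttype" : "text/xml",
    "data" : "<much wow=\"xml\"/>"
}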

Create Topic

Run the following command to create the Kafka topic.

$ kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --topic cloudevent-topic --create --partitions 1 --replication-factor 1
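You can confirm the topic exists with the matching list command:

$ kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --list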

Upload Schema and table config

The sample cloud event topic schema, table config, and Kubernetes manifest for the example are available on GitHub. If you clone the repository and change into its directory, you can run the following command to create the schema and table config in Pinot, ready for the next example, where I will take a closer look at ingesting events.

$ kubectl apply -f cloud-event-quickstart.yml

Table and schema creation successful
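The real definitions live in the repository above, but to give a flavour, a Pinot schema mapping the CloudEvents attributes might look something like this illustrative sketch (not the exact file from the repo):

{
  "schemaName": "cloudevent",
  "dimensionFieldSpecs": [
    { "name": "id", "dataType": "STRING" },
    { "name": "source", "dataType": "STRING" },
    { "name": "type", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "time", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
  ]
}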

If you wish to test out the help API, you can now interact with your schema and table via the Swagger interface:

http://localhost:9000/help
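If you prefer the command line, the controller’s REST API exposes the same information; listing the schemas should now include the new one (assuming the port-forward from earlier is still running):

$ curl http://localhost:9000/schemas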

The Pinot data explorer is now ready to serve queries against the table created above:

Pinot data explorer showing the cloudevent table
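As a quick smoke test, you can also submit a query over REST via the controller’s /sql endpoint, which is what the query console itself uses as far as I can tell (expect a count of zero until events are ingested in the follow-up):

$ curl -X POST http://localhost:9000/sql -H "Content-Type: application/json" -d '{"sql": "SELECT COUNT(*) FROM cloudevent"}'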

Summary

At this point you will have:

  • A Pinot cluster running inside Kubernetes
  • Kafka running inside Kubernetes
  • A topic from which a Pinot realtime server can consume cloud events, ready to persist data and serve queries
  • A schema and table config for persisting ingested data

In a follow-up to this article, we will look in more detail at the table configuration and the persistence mechanisms behind realtime servers, and begin to consume cloud events from the stream before exploring queries on the data. If you’re interested in learning more about Pinot before my next post, please take a look at the other articles in the Apache Pinot Developer Blog.
