Polaris & Apache Druid: The Fast Lane to Data Ingestion

Moti Zamir · Israeli Tech Radar · Jul 23, 2023
Data pipeline architecture

Are you tired of staring at your Excel sheets for hours on end, wondering if there’s a better way to analyze your data? Well, fear not, my dear reader! Apache Druid is here to rescue you from the mundane world of spreadsheets and take you on a wild ride of real-time data ingestion and querying.

But what is Apache Druid? And why is this a good solution for this case?

Apache Druid is an open-source OLAP database that stores its data in a columnar format and partitions it into segments, which is the magic behind querying 5 million rows in less than 1 second! You can read more about it here.

In this article, we’ll dive deep into the world of Excel migration and explore the benefits, challenges, and best practices for a successful migration to Druid. We’ll even throw in some humorous anecdotes to keep things interesting.

So buckle up, grab your favorite caffeinated beverage, and get ready to say goodbye to those boring spreadsheets and hello to the exciting world of Apache Druid!

Apache Druid

On my journey I used a SaaS product called Polaris, a fully managed cloud database service built on top of Apache Druid. This makes everything easier and removes the infrastructure configuration work.

Let’s kick things off from the beginning, shall we?

In my previous project, I was tasked with implementing an ML pipeline that processes multiple data sources in order to predict system failures.

The raw data was saved in Excel sheets and passed through a calculation and analysis process, including computing eigenvectors and Hilbert space values (I know, sounds like Chinese..), which ended with a training table for running the prediction model. The training table was also saved as an Excel file and uploaded to the script responsible for running the model with the training data and creating the final prediction table.

So what is the problem anyway? And why not stay with Excel sheets?

First of all, querying data from Excel sheets, or even from CSV files, is extremely slow, and the client required a fast database with near real-time responses. Since Apache Druid can provide that, it was chosen as the database engine for our project.

I am pulling your data, my dear friend

My job started here. The solution was to take those CSV files, create a Druid table for the raw data, and then query that table for the ETL process. The transformed data would also be stored in a separate Druid table.

Polaris is a managed cloud service based on Apache Druid that frees us from allocating compute and storage resources. It has a lot of useful features, like Data Cube and Pivot for easy creation of dashboards, and we can choose to use it through the UI or its API.

Polaris — The North Star

In this post I will demonstrate uploading a CSV file to the Polaris engine, creating a new table, and ingesting the data from the file into the new table.

Remember Covid-19? Great! I used a public CSV file from the WHO as my raw data source. It has 305,019 rows of Covid cases from around the world, cataloged by WHO region and country.
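Before uploading, it's worth taking a quick look at the raw file locally. Here is a minimal Pandas sketch for doing that; the file name and column names (Date_reported, WHO_region, and so on) are assumptions based on the public WHO dataset and may differ in your download.

```python
import pandas as pd

# Quick sanity check of the raw WHO Covid-19 CSV before ingesting it.
# File name and column names are assumptions and may differ in your copy.
df = pd.read_csv("WHO-COVID-19-global-data.csv", parse_dates=["Date_reported"])

print(len(df))                     # roughly 305,019 rows
print(df.dtypes)                   # column names and inferred types
print(df["WHO_region"].unique())   # the WHO regions the cases are cataloged by
```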

Here is the landing page of the Polaris UI:

Landing page of Polaris

Let’s create a new table and call it covid_19_WHO:

Table covid_19_WHO creation menu

Now we need to ingest the data into the new table. The Polaris UI has a great feature that lets a new table without a defined schema build its schema from the ingested data, so we can use the ingestion file and the table will identify each column and its type. In addition, there is an option to edit the schema by adding, removing, or renaming columns.

I already downloaded the Covid dataset to my machine, and now I will upload it to the Polaris engine. Let's press the 'Sources' link and go to that screen:

Sources uploading screen

Let's choose 'Files' and upload the CSV file from our machine. There are many other source types, such as streaming ingestion from Apache Kafka or batch ingestion from an Amazon S3 bucket, but for the sake of this post I will use the CSV file.

I uploaded the CSV file to the Polaris engine, and it now appears in the Sources tab in the menu:

Files uploaded to Polaris engine

I keep the default schema and continue.

The identified schema before ingestion

In Druid tables there is a required column called "__time" that helps the Druid engine create its internal partitions, which is what makes queries against a Druid table so fast.

The __time column is used for rolling the table up into a smaller number of rows and for time-based indexing, which accelerates searches in the database.
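To make that concrete, here is a sketch of the kind of time-bucketed Druid SQL query that benefits from __time partitioning, written as a plain Python string. This is not the exact query used in this post; the table name matches covid_19_WHO, while the column names (New_cases, WHO_region) are assumed from the WHO CSV.

```python
# A sketch of a time-bucketed Druid SQL query (not the post's exact query).
# Filtering and grouping on __time lets Druid prune whole time partitions
# (segments) instead of scanning every row.
# Column names ("New_cases", "WHO_region") are assumed from the WHO CSV.
weekly_cases_sql = """
SELECT
  TIME_FLOOR(__time, 'P1W') AS report_week,
  "WHO_region",
  SUM("New_cases") AS weekly_cases
FROM "covid_19_WHO"
WHERE __time >= TIMESTAMP '2021-01-01'
GROUP BY 1, 2
ORDER BY 1, 2
"""
# The statement can be pasted into the Polaris SQL editor or sent through the API.
```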

We are about to start ingestion!

Everything is ready for ingesting the data into our table. The ingestion process runs as a job, and the data partitioning and division into segments happen while the job is running.

That's it! The table is ready for querying. Let's take a quick look at it:

As we can see, the Polaris engine identified the data types of the columns without any manual configuration, based on the values in the CSV file. The "Date reported" column in the source table was identified as a timestamp and replaced by the "__time" column with the same values.

Awesome! Now I want to query the table and use it for analytics. We can pull the data as a JSON object, or we can query part of it with SQL syntax. That makes the whole business easier for data analysts who are not so familiar with programming but have experience with SQL.
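If you prefer scripting over the dashboard, here is a rough sketch of that path: posting a Druid SQL query to Polaris over HTTP and loading the JSON result into a Pandas DataFrame. The endpoint URL and the bearer-token header are placeholders, and the column names are assumptions from the WHO dataset; check the Polaris documentation for the exact SQL API details for your organization (I will cover this properly in the next post).

```python
import pandas as pd
import requests

# Hypothetical Polaris SQL endpoint and API key -- placeholders only.
# The real URL and auth scheme come from your Polaris organization settings.
POLARIS_SQL_URL = "https://<your-org>.api.imply.io/v1/query/sql"
API_KEY = "<your-api-key>"

sql = """
SELECT "Country", MAX("Cumulative_cases") AS total_cases
FROM "covid_19_WHO"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

resp = requests.post(
    POLARIS_SQL_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"query": sql},
    timeout=30,
)
resp.raise_for_status()

# Druid's SQL API typically returns a JSON array of row objects,
# which drops straight into a DataFrame.
df = pd.DataFrame(resp.json())
print(df)
```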

Here is a simple query that I created in the SQL menu in Polaris dashboard:

SQL query from our new table

And the result:

Query results

In the Polaris SQL dashboard we can see the number of rows in the query result, as well as the execution time of the query:

Execution time in Polaris

By contrast, executing an identical query on the Excel sheet using its query function took approximately 3 seconds, nearly ten times slower than the Apache Druid implementation.

Now I can start transforming my raw data with Pandas or Polars, or use it to create analysis dashboards with the Polaris Data Cube and Pivot services. You are more than welcome to check them out.
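For example, here is a minimal Pandas sketch of a transformation step that could follow the query above: rolling daily rows up into a month-by-region summary. The inline sample data stands in for real query results, and the column names are again assumed from the WHO dataset.

```python
import pandas as pd

# Stand-in for rows pulled from the covid_19_WHO table; in practice `df`
# would come from a query result. Column names are assumed from the WHO data.
df = pd.DataFrame(
    {
        "__time": pd.to_datetime(["2021-01-03", "2021-01-10", "2021-02-07"]),
        "WHO_region": ["EURO", "EURO", "AMRO"],
        "New_cases": [1200, 950, 3100],
    }
)

# Roll the daily rows up into a month x region summary table.
monthly = (
    df.assign(month=df["__time"].dt.to_period("M"))
      .groupby(["month", "WHO_region"], as_index=False)["New_cases"]
      .sum()
)
print(monthly)
```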

In conclusion:

This was a basic demonstration of how to use the Polaris engine for batch ingestion of a CSV file from our local machine. In the next post I will walk through the same process, but with the help of the Polaris API and Python. Stay tuned!
