Polaris & Apache Druid: The Fast Lane to Data Ingestion

Moti Zamir · Israeli Tech Radar · Jul 23, 2023
Data pipeline architecture

Are you tired of staring at your Excel sheets for hours on end, wondering if there’s a better way to analyze your data? Well, fear not, my dear reader! Apache Druid is here to rescue you from the mundane world of spreadsheets and take you on a wild ride of real-time data ingestion and querying.

But what is Apache Druid? And why is this a good solution for this case?

Apache Druid is an open-source OLAP database that stores its data in a columnar format and partitions it into segments, which is the magic behind querying 5 million rows in less than 1 second! You can read more about it here.

In this article, we’ll dive deep into the world of Excel migration and explore the benefits, challenges, and best practices for a successful migration to Druid. We’ll even throw in some humorous anecdotes to keep things interesting.

So buckle up, grab your favorite caffeinated beverage, and get ready to say goodbye to those boring spreadsheets and hello to the exciting world of Apache Druid!

Apache Druid

On my journey I used a SaaS product called Polaris, a fully managed cloud database service built on top of Apache Druid. This makes everything easier and removes the infrastructure configuration work.

Let’s kick things off from the beginning, shall we?

In my previous project, I was tasked with implementing an ML pipeline that processes multiple data sources in order to predict system failures.

The raw data was saved in Excel sheets and passed through a calculation and analysis process, including computing eigenvectors and Hilbert space values (I know, sounds like Chinese..), which ended with a training table for running the prediction model. The training table was also saved as an Excel file and uploaded to the script responsible for running the model with the training data and creating the final prediction table.

So what is the problem anyway? And why not stay with Excel sheets?

First of all, querying data from Excel sheets, or even from CSV files, is extremely slow, and the client required a fast database with near real-time responses. Since Apache Druid can provide that, it was chosen as the database engine for our project.

I am pulling your data, my dear friend

My job started here. The solution was to take those CSV files, create a Druid table for the raw data, and then query that table for the ETL process. The transformed data would also be stored in a separate Druid table.

Polaris is a managed cloud service based on Apache Druid that frees us from allocating compute and storage resources. It has a lot of useful features, like Data Cube and Pivot for easy creation of dashboards, and we can choose to use it through the UI or its API.

Polaris — The North Star

In this post I will demonstrate uploading a CSV file to the Polaris engine, creating a new table, and ingesting the data from the file into the new table.

Remember Covid-19? Great! I used a public CSV file from the WHO as my raw data source. It has 305,019 rows of Covid cases from around the world, cataloged by WHO region and country.
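Before uploading, it's worth taking a quick look at the raw file locally. Here is a minimal Pandas sketch for doing that; the file name and column names (Date_reported, WHO_region, and so on) are assumptions based on the public WHO dataset and may differ in your download.

```python
import pandas as pd

# Quick sanity check of the raw WHO Covid-19 CSV before ingesting it.
# File name and column names are assumptions and may differ in your copy.
df = pd.read_csv("WHO-COVID-19-global-data.csv", parse_dates=["Date_reported"])

print(len(df))                     # roughly 305,019 rows
print(df.dtypes)                   # column names and inferred types
print(df["WHO_region"].unique())   # the WHO regions the cases are cataloged by
```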

Here is the landing page of the Polaris UI:

Landing page of Polaris

Let’s create a new table and call it covid_19_WHO:

Table covid_19_WHO creation menu

Now we need to ingest the data into the new table. The Polaris UI has a great feature that lets a new table without a defined schema build its schema from the ingested data, so we can use the ingestion file and the table will identify each column and its type. In addition, there is an option to edit the schema by adding, removing, or renaming columns.

I already downloaded the Covid dataset to my machine, and now I will upload it to the Polaris engine. Let's press the 'Sources' link and go to that screen:

Sources uploading screen

Let's choose 'Files' and upload the CSV file from our machine. There are many other source types, such as streaming ingestion from Apache Kafka or batch ingestion from an Amazon S3 bucket, but for the sake of this post I will use the CSV file.

I uploaded the CSV file to the Polaris engine, and it now appears in the Sources tab in the menu:

Files uploaded to Polaris engine

I keep the default schema and continue.

The identified schema before ingestion

In Druid tables there is a required column called "__time" that helps the Druid engine create its internal partitions, which is what makes queries against a Druid table so fast.

The __time column is used for rolling the table up into a smaller number of rows and for time-based indexing, which accelerates searches in the database.
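To make that concrete, here is a sketch of the kind of time-bucketed Druid SQL query that benefits from __time partitioning, written as a plain Python string. This is not the exact query used in this post; the table name matches covid_19_WHO, while the column names (New_cases, WHO_region) are assumed from the WHO CSV.

```python
# A sketch of a time-bucketed Druid SQL query (not the post's exact query).
# Filtering and grouping on __time lets Druid prune whole time partitions
# (segments) instead of scanning every row.
# Column names ("New_cases", "WHO_region") are assumed from the WHO CSV.
weekly_cases_sql = """
SELECT
  TIME_FLOOR(__time, 'P1W') AS report_week,
  "WHO_region",
  SUM("New_cases") AS weekly_cases
FROM "covid_19_WHO"
WHERE __time >= TIMESTAMP '2021-01-01'
GROUP BY 1, 2
ORDER BY 1, 2
"""
# The statement can be pasted into the Polaris SQL editor or sent through the API.
```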

We are about to start ingestion!

Everything is ready for ingesting the data into our table. The ingestion process runs as a job, and the data partitioning and division into segments happen while the job is running.

That's it! The table is ready for querying. Let's take a quick look at it:

As we can see, the Polaris engine identified the data types of the columns without any manual configuration, based on the values in the CSV file. The "Date reported" column in the source table was identified as a timestamp and replaced by the "__time" column with the same values.

Awesome! Now I want to query the table and use it for analytics. We can pull the data as a JSON object, or we can query part of it with SQL syntax. That makes the whole business easier for data analysts who are not so familiar with programming but have experience with SQL.
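If you prefer scripting over the dashboard, here is a rough sketch of that path: posting a Druid SQL query to Polaris over HTTP and loading the JSON result into a Pandas DataFrame. The endpoint URL and the bearer-token header are placeholders, and the column names are assumptions from the WHO dataset; check the Polaris documentation for the exact SQL API details for your organization (I will cover this properly in the next post).

```python
import pandas as pd
import requests

# Hypothetical Polaris SQL endpoint and API key -- placeholders only.
# The real URL and auth scheme come from your Polaris organization settings.
POLARIS_SQL_URL = "https://<your-org>.api.imply.io/v1/query/sql"
API_KEY = "<your-api-key>"

sql = """
SELECT "Country", MAX("Cumulative_cases") AS total_cases
FROM "covid_19_WHO"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

resp = requests.post(
    POLARIS_SQL_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"query": sql},
    timeout=30,
)
resp.raise_for_status()

# Druid's SQL API typically returns a JSON array of row objects,
# which drops straight into a DataFrame.
df = pd.DataFrame(resp.json())
print(df)
```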

Here is a simple query that I created in the SQL menu in Polaris dashboard:

SQL query from our new table

And the result:

Query results

In the Polaris SQL dashboard we can see the number of rows in the query result, as well as the execution time of the query:

Execution time in Polaris

By contrast, executing an identical query on the Excel sheet using its query function took approximately 3 seconds, nearly ten times slower than the Apache Druid implementation.

Now I can start transforming my raw data with Pandas or Polars, or use it to create analysis dashboards with the Polaris Data Cube and Pivot services. You are more than welcome to check them out.
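For example, here is a minimal Pandas sketch of a transformation step that could follow the query above: rolling daily rows up into a month-by-region summary. The inline sample data stands in for real query results, and the column names are again assumed from the WHO dataset.

```python
import pandas as pd

# Stand-in for rows pulled from the covid_19_WHO table; in practice `df`
# would come from a query result. Column names are assumed from the WHO data.
df = pd.DataFrame(
    {
        "__time": pd.to_datetime(["2021-01-03", "2021-01-10", "2021-02-07"]),
        "WHO_region": ["EURO", "EURO", "AMRO"],
        "New_cases": [1200, 950, 3100],
    }
)

# Roll the daily rows up into a month x region summary table.
monthly = (
    df.assign(month=df["__time"].dt.to_period("M"))
      .groupby(["month", "WHO_region"], as_index=False)["New_cases"]
      .sum()
)
print(monthly)
```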

In conclusion:

This was a basic demonstration of how to use the Polaris engine for batch ingestion of a CSV file from our local machine. In the next post I will walk through the same process, but with the help of the Polaris API and Python. Stay tuned!
