How to connect Excel to ClickHouse within Kyligence Tiered Storage

Published in

Kyligence

8 min readDec 3, 2021

Modern corporations are increasingly looking for near real-time analytics and insights to make actionable decisions. To help organizations understand more about the benefits of ClickHouse, we will focus on how you can achieve sub-second analysis in Excel with Kyligence Tiered Storage.

Why you will be interested

This article is based on the benchmark dataset of the retail industry, which contains over 100 MILLION rows in the transaction table, I will operate with dimensions and measures in Excel pivot to compare the query performance among Object Storage, Base Index and Tiered Storage.

Why Excel

Microsoft Excel is, undoubtedly, the most popular spreadsheet software in the market. Each iteration of this software gives the users a better tool for their statistical needs.
PivotTable in Excel is a powerful tool to calculate, summarize, and analyze data that allows you to see comparisons, patterns, and trends in your data.

Why ClickHouse

ClickHouse is an open-source, OLAP (Online Analytical Processing) database originally developed by the company Yandex, for the needs of its Metrica solution (similar to Google Analytics). Yandex is the first search engine used in Russia.
ClickHouse was developed with a simple objective: to filter and aggregate as much data as possible and as quickly as possible. Similar to other solutions of the same type (eg. Druid, Pinot), ClickHouse uses a column-oriented model for data storage. It also relies on various parallelization and vectorization mechanisms to take the most advantage of multi-core architectures. Hence, Clickhouse can support data volumes of several petabytes.

Let’s start with Kyligence Cloud

First, I used Kyligence Cloud, if you also like to try Kyligence cloud, you may visit Kyligence to get a free trial. In order to analyze data, it is necessary to import data. I used a 100 million rows dataset which can fully exhibit strong query performance of ClickHouse.

The SSB (Star Schema Benchmark) dataset is designed to measure the performance of database products in support of classical data warehousing applications and is based on the TPC-H benchmark.There are 5 tables in the SSB dataset — table sampling is turned on by default. With sampling I can get statistical information of the entire dataset.

I ran a sample SQL query to check the row amount of SSB.LINEORDER table. As you can see from the screenshot below, there are over 1 million rows in this table.

Auto-Modeling

In Kyligence Cloud, I enabled AI Augmented Mode to build models with SQL.

Created a model automatically from SQL command.

Below is my SQL command which was used for creating the model.

SELECT
    SSB.LINEORDER.LO_ORDERKEY,
    SSB.LINEORDER.LO_LINENUMBER,
    SSB.LINEORDER.LO_CUSTKEY,
    SSB.LINEORDER.LO_PARTKEY,
    SSB.LINEORDER.LO_SUPPKEY,
    SSB.LINEORDER.LO_ORDERDATE,
    SSB.LINEORDER.LO_ORDERPRIOTITY,
    SSB.LINEORDER.LO_SHIPPRIOTITY,
    SSB.LINEORDER.LO_TAX,
    SSB.LINEORDER.LO_COMMITDATE,
    SSB.LINEORDER.LO_SHIPMODE,
    SSB.PART.P_PARTKEY,
    SSB.PART.P_NAME,
    SSB.PART.P_MFGR,
    SSB.PART.P_CATEGORY,
    SSB.PART.P_BRAND,
    SSB.PART.P_COLOR,
    SSB.PART.P_TYPE,
    SSB.PART.P_SIZE,
    SSB.PART.P_CONTAINER,
    SSB.SUPPLIER.S_SUPPKEY,
    SSB.SUPPLIER.S_NAME,
    SSB.SUPPLIER.S_ADDRESS,
    SSB.SUPPLIER.S_CITY,
    SSB.SUPPLIER.S_NATION,
    SSB.SUPPLIER.S_REGION,
    SSB.SUPPLIER.S_PHONE,
    SSB.CUSTOMER.C_CUSTKEY,
    SSB.CUSTOMER.C_NAME,
    SSB.CUSTOMER.C_ADDRESS,
    SSB.CUSTOMER.C_CITY,
    SSB.CUSTOMER.C_NATION,
    SSB.CUSTOMER.C_REGION,
    SSB.CUSTOMER.C_PHONE,
    SSB.CUSTOMER.C_MKTSEGMENT,
    SSB.CALENDAR.D_DATEKEY,
    SSB.CALENDAR.D_DATE,
    SSB.CALENDAR.D_DAYOFWEEK,
    SSB.CALENDAR.D_MONTH,
    SSB.CALENDAR.D_YEAR,
    SSB.CALENDAR.D_YEARMONTHNUM,
    SSB.CALENDAR.D_YEARMONTH,
    SSB.CALENDAR.D_DAYNUMINWEEK,
    SSB.CALENDAR.D_DAYNUMINMONTH,
    SSB.CALENDAR.D_DAYNUMINYEAR,
    SSB.CALENDAR.D_MONTHNUMINYEAR,
    SSB.CALENDAR.D_WEEKNUMINYEAR,
    SSB.CALENDAR.D_SELLINGSEASON,
    SSB.CALENDAR.D_LASTDAYINWEEKFL,
    SSB.CALENDAR.D_LASTDAYINMONTHFL,
    SSB.CALENDAR.D_HOLIDAYFL,
    SSB.CALENDAR.D_WEEKDAYFL,
    SUM(SSB.LINEORDER.LO_REVENUE) AS SUM_REVENUE,
    SUM(SSB.LINEORDER.LO_SUPPLYCOST) AS SUM_COST
FROM
    SSB.LINEORDER
    LEFT JOIN SSB.PART ON SSB.LINEORDER.LO_PARTKEY=SSB.PART.P_PARTKEY
    LEFT JOIN SSB.CUSTOMER ON SSB.LINEORDER.LO_CUSTKEY=SSB.CUSTOMER.C_CUSTKEY
    LEFT JOIN SSB.SUPPLIER ON SSB.LINEORDER.LO_SUPPKEY=SSB.SUPPLIER.S_SUPPKEY
    LEFT JOIN SSB.CALENDAR ON SSB.LINEORDER.LO_ORDERDATE=SSB.CALENDAR.D_DATEKEY
GROUP BY
    SSB.LINEORDER.LO_ORDERKEY,
    SSB.LINEORDER.LO_LINENUMBER,
    SSB.LINEORDER.LO_CUSTKEY,
    SSB.LINEORDER.LO_PARTKEY,
    SSB.LINEORDER.LO_SUPPKEY,
    SSB.LINEORDER.LO_ORDERDATE,
    SSB.LINEORDER.LO_ORDERPRIOTITY,
    SSB.LINEORDER.LO_SHIPPRIOTITY,
    SSB.LINEORDER.LO_TAX,
    SSB.LINEORDER.LO_COMMITDATE,
    SSB.LINEORDER.LO_SHIPMODE,
    SSB.PART.P_PARTKEY,
    SSB.PART.P_NAME,
    SSB.PART.P_MFGR,
    SSB.PART.P_CATEGORY,
    SSB.PART.P_BRAND,
    SSB.PART.P_COLOR,
    SSB.PART.P_TYPE,
    SSB.PART.P_SIZE,
    SSB.PART.P_CONTAINER,
    SSB.SUPPLIER.S_SUPPKEY,
    SSB.SUPPLIER.S_NAME,
    SSB.SUPPLIER.S_ADDRESS,
    SSB.SUPPLIER.S_CITY,
    SSB.SUPPLIER.S_NATION,
    SSB.SUPPLIER.S_REGION,
    SSB.SUPPLIER.S_PHONE,
    SSB.CUSTOMER.C_CUSTKEY,
    SSB.CUSTOMER.C_NAME,
    SSB.CUSTOMER.C_ADDRESS,
    SSB.CUSTOMER.C_CITY,
    SSB.CUSTOMER.C_NATION,
    SSB.CUSTOMER.C_REGION,
    SSB.CUSTOMER.C_PHONE,
    SSB.CUSTOMER.C_MKTSEGMENT,
    SSB.CALENDAR.D_DATEKEY,
    SSB.CALENDAR.D_DATE,
    SSB.CALENDAR.D_DAYOFWEEK,
    SSB.CALENDAR.D_MONTH,
    SSB.CALENDAR.D_YEAR,
    SSB.CALENDAR.D_YEARMONTHNUM,
    SSB.CALENDAR.D_YEARMONTH,
    SSB.CALENDAR.D_DAYNUMINWEEK,
    SSB.CALENDAR.D_DAYNUMINMONTH,
    SSB.CALENDAR.D_DAYNUMINYEAR,
    SSB.CALENDAR.D_MONTHNUMINYEAR,
    SSB.CALENDAR.D_WEEKNUMINYEAR,
    SSB.CALENDAR.D_SELLINGSEASON,
    SSB.CALENDAR.D_LASTDAYINWEEKFL,
    SSB.CALENDAR.D_LASTDAYINMONTHFL,
    SSB.CALENDAR.D_HOLIDAYFL,
    SSB.CALENDAR.D_WEEKDAYFL
;

Before saving the current model, cancel the option of Add Base Indexes.

Finally we have a start schema model for data analysis.

MDX

After building the model, let’s explore MDX in Kyligence Cloud.

I need to set a name for MDX dataset. Here, the dataset was Kyligence Cloud model.

You could even define more complicated relationships among MDX datasets.

When defining Semantics, you can define more objects, like Dimension, Measure, Hierarchy, Calculated Measure, etc. These semantics will be synchronized to uplevel BI tools. In Tableau, PowerBI or Excel, end users can drag & drop MDX objects to analyze the data. Let’s create a Profit measure as an example, which equals subtracting cost from revenue. profit = revenue — cost.

Connect Excel with MDX

Opened Microsoft Excel from Windows 11 , or other version of Windows.

Click Data > Get Data > From Database > From Analysis Services. Note: this service only works in Windows. In MacOS, Analysis Services is not yet compatible.

Kyligence Cloud provides enough clear connection information, which helped me set up a connection between Excel and MDX.

Create Excel pivot table

Query Performance

First, add “YEAR” “CATEGORY” in Rows, “Revenue” in Values.

It took 17s to get all the results.

From Query History, the MDX query had been parsed into different parts of SQL queries. The queries were answered by OBJECT STORAGE, which means they got push down to the original dataset.

Second, add “YEAR” “CATEGORY” in Rows, “Supplycost” in Values.

It took 11s.

Third, add “YEAR” “CATEGORY” “REGION””NATION” in Rows, “Profit” in Values.

Base Index

If there is no base index or when a base index is missing, click + Index to add the base indexes in the drop-down box.

We often treat base index as safety index because it includes all dimensions and measures. Base index can answer the ad hoc queries on the fly, then avoid queries pushing down to object storage. To benefit from the base index, you should build your segments based on the base index.

After finishing building the base index, I found queries could be answered by model instead of object storage. Hence, base indexes can avoid queries pushed downto original datasets.

Here I’m listing the query performance of base table index, with 3 of the same queries in Excel as before.

Improve Query Performance with ClickHouse

Kyligence Cloud 4.5 provides our users with the ClickHouse option. To benefit from ClickHouse, user has to add another node named Tiered Storage Node when installing Wordspace in their clouds.

Also, Tiered Storage needs to be turned on in the “Setting” page.

Chose the base table index, then “Load to the Tiered Storage”.

It will take some time to load data to ClickHouse.

After the data was loaded into ClickHouse, I ran the same 3 queries in Excel pivot as before.

Then, I compared these three engines: “Query Pushdown,” “Base table index” and “Tiered Storage with ClickHouse.” The performance of Tiered Storage with the same queries was really impressive. I tried some other combinations of dimensions and measures, and the performance of ClickHouse was relatively fast even in self-service analysis or ad-hoc.

Kyligence Tiered Storage

Powered by the technology of ClickHouse, Kyligence Cloud 4.5 offers our users the advanced feature of Tiered Storage, which can help users quickly and cold start to query without pre-computation, and can significantly improve the performance of ultra-multi-dimensional flexible analysis and detailed query analysis. At the same time, the two-level storage design based on HDFS/object storage and ClickHouse can fully cover various analysis scenarios, bringing users more possibilities for analysis and exploration based on massive data.

Tiered storage not only supports flexible ad-hoc query scenarios, such as tagging analysis or user behavior analysis, which will bring greater assistance to the enterprise’s refined operation and auxiliary business decision-making; it also can be used with the Kyligence unified model to manage data so as to seamlessly connect with mainstream business BI tools and Excel capabilities to empower business analysts to analyze.

After enabling tiered storage, users do not need to worry about how data is distributed and how to import, they only need to focus on modeling according to business needs. This can greatly reduce the cost of modeling and model optimization. Tiered Storage can make business more agile in the future.

The following figure shows the architecture of tiered storage: