How the right Hadoop stack can solve your BI needs

Aviad Haiventriger
skai engineering blog
7 min read · Jan 17, 2018

Before the era of big data, when we were working with a few terabytes of less-varied data, gathering data in a central location and analyzing it for various needs was really straightforward. All you needed was a machine with a couple of cores and a relational database.

Relational databases provided the perfect solution (well, pretty close): they support indexes, partitions, and the ability to insert and update specific rows. As long as you didn’t overload your database too much, it performed well.

As the amount of data grew, relational databases just couldn’t handle the pressure. Queries that used to take seconds now took much longer. Precalculated aggregations suddenly crashed and needed to be split into several bulk loads. The future was not looking bright.

This is exactly what we faced at Kenshoo when data volume, variety, and velocity exploded.

Simple questions that could once be answered in an instant suddenly became a challenge!

As business users and as developers, you understand the power and necessity of interactive BI. With it you can build a powerful system that turns your organization’s data into intelligence and foresight. Without it, no one will use your application for long.

At Kenshoo we have successfully built a BI solution based on the big data technology stack, one that focuses on interactivity and accessibility. So how can you create an interactive BI environment that supports a huge data set? Read on and find out :-)

Hive on Hadoop

A lot has been said about Hadoop and what it delivers to the BI world. For our purposes, let’s summarize it as a distributed file system that spreads the load across its underlying machines. With a relational DB we had to scale up (add more hardware); with Hadoop we can scale out (add more machines to get more horsepower).

One of the most common layers used on top of Hadoop is Hive. Hive, a Facebook contribution to the open source world, is a framework that provides a familiar SQL-ish interface for writing queries and translating them into MapReduce jobs, which Hadoop understands. It is commonly used because it was the first layer BI tools adopted. The ability to understand SQL and be reached through an ODBC/JDBC connection made it easy to integrate and gave instant access to data on HDFS.
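For a sense of what that access looks like in practice, here is a minimal sketch of querying Hive from Python with the PyHive client. The host, port, database, and table names are placeholders rather than our actual setup.

```python
# Minimal sketch: querying Hive over its Thrift interface with PyHive.
# Host, port, database, and table names below are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="analytics")
cursor = conn.cursor()

# HiveQL reads like plain SQL; under the hood it is compiled into MapReduce jobs.
cursor.execute("""
    SELECT campaign_id, SUM(clicks) AS total_clicks
    FROM ad_events
    GROUP BY campaign_id
""")

for campaign_id, total_clicks in cursor.fetchall():
    print(campaign_id, total_clicks)

cursor.close()
conn.close()
```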

However, even though it was the first commonly used query engine over Hadoop, it is far from being the best fit. MapReduce jobs are slow and take time to complete (sometimes forever…). They scan all the data and are not nearly fast enough to deliver the required response times.

Even though Hive isn’t a great fit for interactive BI usage, it can and should be used for analytics, and can be improved dramatically by preparing the data behind it in a way that is more suitable for a Hadoop framework.

  • Partitions

Hive holds a repository, the metastore, which maps each database to a folder on HDFS, each table to a subfolder, and each partition to a sub-subfolder in which the data files are saved.

When a query is translated to a MapReduce job, the files that need to be read are distributed among the available processing units in your Hadoop cluster. By splitting the files into sub-subfolders (partitions), the number of files that need to be scanned is reduced to only those matching the query’s filters, speeding up the query and reducing resource consumption.

At Kenshoo, our partitioning strategy had a big impact and was a crucial part of our solution. We love partitions.
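To make the idea concrete, here is roughly what a date-partitioned table and a partition-pruned query can look like, issued through the same kind of PyHive connection as above. The schema is invented for illustration, not our actual one.

```python
# Illustrative only: a date-partitioned Hive table and a query that prunes partitions.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="analytics")
cursor = conn.cursor()

# Each distinct event_date value gets its own sub-subfolder (partition) on HDFS.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS ad_events_partitioned (
        campaign_id BIGINT,
        clicks      INT,
        cost        DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
""")

# Filtering on the partition column lets Hive scan only the matching folders
# instead of every file in the table.
cursor.execute("""
    SELECT campaign_id, SUM(clicks)
    FROM ad_events_partitioned
    WHERE event_date = '2018-01-01'
    GROUP BY campaign_id
""")
print(cursor.fetchall())
```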

  • File formats

Since data is stored in files, the file format plays a significant role in performance. Every format has its benefits and some are preferable over others for specific use cases.

Parquet (from Cloudera) and ORC (from Hortonworks) have many key features that make them great for analytics.

  • Columnar storage — Traditional databases store data in rows, which gives you the ability to query a small portion of a data set quickly by sorting (indexing) the data set in advance. Columnar storage systems serialize and store data by column, which lets them scan only the specific columns a query needs, even across large datasets. In some cases, queries that took minutes or hours in a row-based store are completed in seconds when the data is saved in a column-based store. This makes columnar databases a good choice in a query-heavy environment, as long as the queries you run are suited for a columnar database.
  • Compression — Per-column data compression further accelerates performance, since data sent across the network is smaller.
  • Predicate push-down — One of the most expensive parts of reading and writing data is (de)serialization. Parquet supports predicate push-down to target specific columns in your data for filtering and reading, keeping the cost of deserialization to a minimum.
  • Metadata — Both the ORC and Parquet file formats hold very helpful statistics (such as min/max values and counts) in their metadata, allowing greater speed for some queries. When a file needs to be scanned, if the answer is already available in the metadata, it is taken from there, saving a full scan.
  • Splittable — The ability to start reading and processing data at any point within a file. This allows load distribution among the servers in the cluster and takes advantage of the entire cluster’s processing power.

Parquet files are used intensively at Kenshoo and their benefits are felt every day. We managed to enhance performance by a factor of five (and even more in some cases) by using the Parquet format for the task.
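As a small, self-contained illustration of the columnar and predicate push-down points above, here is how reading only selected columns and filtered rows looks with the pyarrow library. The file path and column names are invented for the example, and in a real cluster the files would live on HDFS rather than on a local disk.

```python
# Illustration of columnar reads and predicate push-down with pyarrow.
# File path and column names are invented; real data would live on HDFS.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny Parquet file so the example is runnable end to end.
df = pd.DataFrame({
    "event_date": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "campaign_id": [1, 2, 1],
    "clicks": [10, 5, 7],
})
pq.write_table(pa.Table.from_pandas(df), "ad_events.parquet")

# Read only the columns we need (columnar storage) and push the filter down,
# so rows that don't match are skipped using the file's metadata statistics.
table = pq.read_table(
    "ad_events.parquet",
    columns=["campaign_id", "clicks"],
    filters=[("event_date", "=", "2018-01-01")],
)
print(table.to_pandas())
```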

Impala on Hadoop

A better alternative to Hive for interactive BI needs is Impala. Impala is a massively parallel processing (MPP) engine on top of HDFS. Impala runs daemons on all nodes that cache some of the data, so these daemons can return results quickly without having to go through a whole MapReduce job.

Because Impala is MPP-based and avoids the overhead of a MapReduce job, most of the work happens in memory. Intermediate results never hit the disk, resulting in substantial improvements in runtime.

By using the right formats and partitions for the underlying data and utilizing Impala as our SQL tool on top of HDFS, we are starting to get close to our goals.
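Querying those same Parquet-backed, partitioned tables through Impala can be as simple as switching the client and port. The sketch below uses the impyla package; the host and table names are again placeholders.

```python
# Sketch: running an interactive query against Impala with the impyla package.
# Host, port, and table names are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cursor = conn.cursor()

# Same SQL as before, but executed by Impala's in-memory MPP engine
# instead of being compiled into MapReduce jobs.
cursor.execute("""
    SELECT campaign_id, SUM(clicks) AS total_clicks
    FROM analytics.ad_events_partitioned
    WHERE event_date = '2018-01-01'
    GROUP BY campaign_id
""")
print(cursor.fetchall())

cursor.close()
conn.close()
```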

Advanced BI Tools

Impala delivers the speed of running on a relational DB, while actually analyzing big data stored as files (hopefully Parquet files) in our HDFS. But if you think about it, for every click a query is generated and transferred to Impala, which queries the files on HDFS and returns an answer to the BI tool. These kinds of operations just won’t run fast enough. You need to wait a few seconds between clicks, which is quite irritating. I know many users are already used to this kind of service, but believe me, it won’t be long before those users stop using your application.

Modern BI tools come with better solutions. For example, Tableau has a data extract feature that helps speed things up. A Tableau data extract is a compressed snapshot of data stored on your local BI server disk and loaded into memory to render a Tableau visualization.

So instead of taking a trip to the DB, a quick scan of the local server or memory is all that is needed. Instead of waiting a few seconds between one action and the next, suddenly you have an almost instant response time.
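This is not Tableau’s actual extract engine, of course, but the general idea can be sketched in a few lines of Python: query the cluster once, cache an aggregated result locally in a compact format, and answer dashboard interactions from the cached copy. Every name and path below is hypothetical.

```python
# Rough sketch of the extract idea (not Tableau's mechanism): query the cluster
# once, cache the aggregate locally, and serve dashboard clicks from the cache.
import os

import pandas as pd
from impala.dbapi import connect

EXTRACT_PATH = "daily_clicks_extract.parquet"  # hypothetical local cache file

def refresh_extract():
    """Pull a pre-aggregated result set from the cluster and cache it locally."""
    conn = connect(host="impala-daemon.example.com", port=21050)
    df = pd.read_sql(
        "SELECT event_date, campaign_id, SUM(clicks) AS clicks "
        "FROM analytics.ad_events_partitioned "
        "GROUP BY event_date, campaign_id",
        conn,
    )
    df.to_parquet(EXTRACT_PATH)
    conn.close()

def clicks_for(campaign_id):
    """Answer a dashboard interaction from the local extract, not the cluster."""
    if not os.path.exists(EXTRACT_PATH):
        refresh_extract()
    extract = pd.read_parquet(EXTRACT_PATH)
    return extract[extract["campaign_id"] == campaign_id]
```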

This approach allows interactive dashboarding with the fast response times that answer our needs. Sadly, this approach won’t stand up to the challenges of big data. Since you can’t load all of your data into local storage, either a portion is selected or the data is aggregated to a best fit. Neither option allows full access to all of your data.

Cubes

An OLAP cube is a method of storing data in a multidimensional form in which the different dimensions, measures and hierarchies are pre-calculated.

The cube needs to be calculated before it can be used, but afterwards many calculations are already available. When a BI tool queries the cube, it often just needs to locate the pre-calculated result and visualize it. This cuts query processing time and delivers a fast and interactive way to work with our data.
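As an illustration of the pre-calculation idea, HiveQL can materialize the aggregates for every combination of a few dimensions with its WITH CUBE grouping. The table and column names below are hypothetical, and a real cube would cover many more dimensions and hierarchies.

```python
# Illustration: pre-calculating a small cube with HiveQL's WITH CUBE grouping.
# Table and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="analytics")
cursor = conn.cursor()

# Materialize every combination of the two dimensions once, up front.
# A BI tool can then answer drill-downs by looking up rows in cube_clicks
# instead of re-aggregating the raw events on every click.
cursor.execute("""
    CREATE TABLE cube_clicks STORED AS PARQUET AS
    SELECT event_date, campaign_id, SUM(clicks) AS clicks
    FROM ad_events_partitioned
    GROUP BY event_date, campaign_id WITH CUBE
""")
```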

Cubes became difficult to maintain as data grew and execution times increased.

To maintain performance you needed to scale up and add expensive hardware, until a performance monster was created (a very expensive one indeed).

With the evolution of distributed systems such as Hadoop, another option is to scale out (which is a lot cheaper), adding more machines for computation power and spreading the load across them. This has made the cube more relevant than ever.

Conclusions

The goal of this article was to show that interactive BI is possible in the world of big data. If you play your cards right and leverage the most relevant technologies for your use case, a lot can be achieved. Each of the solutions explained above gives a way to interact with data that is ideal for a specific use case.

The bottom line is that having big data is not an excuse for having a slow or unresponsive BI application. There are many ways to set up an interactive BI solution that encourages high adoption among decision makers and provides a hell of a lot of business value.
