Analytics Challenges — Modern BI Architectures (Part 2)
SQL-on-Hadoop
In the previous part of this series, I have tried to provide alternative options to deliver modern BI Solutions that meets the criterias.
In this article, we will evaluate Hadoop only solution alternatives to meet all BI requirements: high concurrency, high performance (sub-second queries), portability & interoperability, fast ingestion, support & acceptance in the market with skillset availability, low cost, low footprint (no data duplication), no data engineering to maintain solution.
Usually, Hadoop is used for historical data and ETL offloading. It is because of EDW deployments alone on-the-premise can’t scale well beyond 100 TB for the increasing storage requirements and decreasing performance. Besides, traditional EDW solutions may not manage new data types such as JSON, etc.
So, Hadoop is quite a valid option as an on premise option…
Hadoop has several options which might be used to replace EDW functionality (to serve data marts to end users):
- SQL-on-Hadoop — Hive LLAP, Impala, Presto, IBM BigSQL
- Virtualized in-memory File System — Alluxio
- OLAP-on-Hadoop — Druid, AtScale, Jethro
SQL-on-Hadoop
Once the required data model is created on Hadoop usually in star/snowflake schema, it can be served to BI tools (such as Micro Strategy or Tableau) through SQL access which may offer good query response times small data sets and low-to-medium concurrent users.
There are several SQL-on-Hadoop options which may be suitable for end-users for BI purposes such as Presto, Hive LLAP, IBM Big SQL.
- Hive LLAP (in-memory MPP solutions optimized for ORC) uses the Hive execution engine which is based on Map Reduce either on disk/memory with Tez. The latency in the Map Reduce is mostly attributed to YARN container startup costs (which is reduced to a minimum with latest LLAP versions)
- Presto is designed to federate queries over several different data sources (NoSQL, HDFS, DBMS, S3, Kafka, Kudu) using an approach other than Map Reduce.
- IBM Big SQL (paid product), an alternative SQL-on-Hadoop, claims to deliver 3 times better output compared to Presto and Spark with support of concurrency levels around 100 and medium latency in query response times.
Benchmark to compare Hive LLAP vs Presto:
The above benchmark (done by Hortonworks) is used to assess the Star Schema which is the typical schema used in OLAP workloads and interactive business intelligence queries based on TPC-H35 on 50 GB data.
The Hive LLAP 3.0.1 performance is much better than Presto on big data sizes and higher concurrencies.
IBM Big SQL may offer better performance on higher concurrency and bigger data sets.
So why SQL-on-Hadoop engines are not doing good enough and suffering under higher concurrencies?
Most SQL-on-Hadoop engines generally skip Map Reduce, instead of query data directly from HDFS. All of these tools share the same basic architecture, known as MPP full-scan. They have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still some latency overhead (e.g. SerDe time) associated with pulling data from the underlying storage layer into the computation layer.
SQL-on-Hadoop engines with their full-scan architecture will perform redundant scan work by reading the entire column of each query every time. Optimizations such as partitioning on a column(date, city, country) will not improve this as at any time a different column will be filtered/sorted.
Several workarounds exist such as denormalized star schema to prevent joins. But that does not allow incremental/daily data appending and rebuild the entire data set each time. Also, pre-aggregation, multiple data partitioning on replicated data are maintenance intensive and only short term workarounds.
The latest versions of Hive LLAP and Presto can query small data-sets (below 50 GB) with up to a few concurrent users without failures and query response times under some minutes, however as the data set and concurrency gets higher, they will not be suitable for interactive BI queries requiring sub-second response times. IBM BigSQL is a mature but paid product and may deliver 3 times better performance compare to Hive LLAP.
By default, SQL-on-Hadoop is a good fit for ETL and limited ad-hoc/interactive queries run by end-users such as data science workloads.
If SQL-on-Hadoop is evaluated against the criteria set above, it will satisfy most of the criteria except high concurrency, high performance for interactive BI workloads.
Virtualized in-memory File System
In-Memory Virtualized Data Access solutions would be a complementary approach to SQL-on-Hadoop with the help of Virtual File System, such as Alluxio.
Alluxio (paid product) offers the Virtual File System where the data on HDFS, S3 is accessed through Alluxio as shown below. Hive can be configured to read all or partial data through Alluxio’s in-memory cache (determined based on partitions)
Virtual File System can help improve the performance up to 5 times approximately.
When SQL-on-Hadoop & Virtual File System together is evaluated against the criteria set above, it will satisfy most requirements except the fact that the concurrency is expected to still remain below 50+ for sub second level response times on limited data sizes (up to a few Terabytes)
Proper benchmarks should be performed to prove the value of this product where there are no pre-aggregated cubes, indexes and only caching on a common data format is supported.
Conclusion
As a rule of thumb SQL-on-Hadoop is only good for interactive analytics with low concurrent workloads even if it is used together with Virtual in-memory File System (which has limitations to fit data into-memory).
Hence we need to continue our search for other components that might be suitable to serve enterprise BI workloads such as OLAP-on-Hadoop, etc…