Enterprise B.I. in the age of A.I.

SparklineData
6 min read · Nov 2, 2017


A.I. is the talk of the town. Against this backdrop, our young startup has quietly made strides toward addressing one of the biggest challenges with enterprise data: delivering "Business" intelligence at scale.

More and more enterprise data is moving to what are called "Data Lakes". Companies are in the process of leveraging these data lakes for a variety of analyses: from Operational Analytics to Reporting to Business Intelligence/Data Science and everything in between. There is a plethora of scale-out analytic platform vendors ready to help customers in their quest to develop these solutions.

Each tool and platform has its own technical capabilities, purpose-built for a particular kind of analytic solution: for example, a fraud-detection solution's platform needs are not the same as an operational-analytics solution's. On the other hand, there is enough commonality in platform needs to be enticed into using a "favorite platform" for any problem domain, whether it is a fit or not. Experience has shown us that it is easy to start down such a path, but it usually leads to a lot of disappointment.

Given this backdrop, we start by making the case for how a B.I. solution is distinguished from other kinds of analytic solutions, and what its unique needs are. This leads us to explaining the why and how of our SNAP offering.

From our conversations with customers, and from working on a variety of analytics use cases over the past decade, we see the following buckets.

Operational reporting: Time-series trends, streamed events, and a real-time need to report on "what happened". Example: active user count, clicks and impressions over the last 7 days, etc.

Batch/SQL reporting: Charting and dashboarding on various aspects of business metrics; a sales dashboard, for example. Queries are simple projections, filters, and scans.

Enterprise B.I. OLAP: Advanced insights into historical data to find patterns. Traditionally this was done using OLAP cubes with pre-aggregation, designed to answer questions involving business hierarchies: allocating budgets across multiple levels of a hierarchy, campaign attribution, forecasting and planning with scenario analysis, and multi-level, multi-dimensional slice-and-dice analysis.

Data science : Taking a sample of historical data and projecting trends or finding patterns based on feature modeling.
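To make the contrast between these buckets concrete, here is a minimal, hypothetical sketch in pure Python (the fact rows and column names are invented for illustration): a batch/SQL reporting query is a single filter-and-scan, while the OLAP bucket rolls the same facts up a dimension hierarchy so any slice can be answered.

```python
from collections import defaultdict

# Hypothetical fact rows, one per sale.
facts = [
    {"region": "EMEA", "country": "DE", "product": "widget", "revenue": 100.0},
    {"region": "EMEA", "country": "FR", "product": "widget", "revenue": 80.0},
    {"region": "EMEA", "country": "DE", "product": "gadget", "revenue": 50.0},
    {"region": "AMER", "country": "US", "product": "widget", "revenue": 200.0},
]

# Batch/SQL reporting: a simple project-filter-scan,
# e.g. "total widget revenue".
widget_revenue = sum(f["revenue"] for f in facts if f["product"] == "widget")

# Enterprise B.I. / OLAP: pre-aggregate the same facts at every level
# of a geography hierarchy (grand total -> region -> country).
def rollup(rows, hierarchy, measure):
    totals = defaultdict(float)
    for row in rows:
        # Accumulate into every prefix of the hierarchy, including ().
        for depth in range(len(hierarchy) + 1):
            key = tuple(row[level] for level in hierarchy[:depth])
            totals[key] += row[measure]
    return dict(totals)

cube = rollup(facts, ["region", "country"], "revenue")
# cube[()] is the grand total, cube[("EMEA",)] a region slice,
# cube[("EMEA", "DE")] a country slice.
```

Real OLAP engines compute such rollups with pre-aggregation and specialized storage; the point here is only the shape of the question each bucket asks.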

Let us take a look at each bucket and see where it fits and why.

Operational Reporting

Companies like Splunk, and products like Druid and Elasticsearch, have excelled at this kind of reporting. They are easy to set up, track time-series events, and do simple pivots on a limited set of dimensions and metrics.

Operational reporting

The architecture and data-model capabilities are driven by events, as in ad tech or log files. Historical analysis requires combining fact data with multiple dimensions and modeling them for analysis; operational reporting is not set up to answer questions that lead to such insights. Many of these tools also do not provide full SQL support for enterprise transactional data analysis.

Operational reporting has its place in a technology stack but it does not fit what is traditionally called business intelligence.

Reporting and Dashboards

Another set of analytics use cases involves building reports and dashboards on top of the EDW (Enterprise Data Warehouse). This is generally called descriptive analytics. Here the focus is on SQL capabilities and on SLAs around the speed and cost of delivering reports (mostly pre-defined, with a few parameters). Fast SQL is a key platform capability for these applications, which are built on very rich SQL data models developed by EDW data architects.

When we started working on this a couple of years back, running Tableau on Big Data was slow; querying Big Data with Hive or other SQL-on-Hadoop tools was not fast SQL. So we started down the path of combining the fast response times of tools like Druid with the power of a SQL front end on a modern compute platform, Apache Spark.

The open-source Spark-Druid offering of SparklineData is an example of a platform to address this fast SQL reporting need, where OLAP Indexing from Apache Druid is combined with Spark SQL to provide a full-fledged SQL environment with the benefits of very fast slice-and-dice using an OLAP Index.
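As a rough illustration of why an OLAP-style index makes slice-and-dice fast, here is a toy sketch in pure Python (data invented; Druid actually uses compressed bitmap indexes over columnar segments, sketched here with plain sets): each dimension value maps to a set of row ids, so a filtered aggregate intersects small sets instead of scanning every row.

```python
from collections import defaultdict

# Toy fact rows: (country, channel, clicks).
rows = [("US", "mobile", 3), ("US", "web", 5),
        ("DE", "mobile", 2), ("DE", "web", 7)]

# Inverted index: (dimension, value) -> set of row ids.
# A stand-in for Druid's per-dimension bitmap indexes.
index = defaultdict(set)
for i, (country, channel, _) in enumerate(rows):
    index[("country", country)].add(i)
    index[("channel", channel)].add(i)

def filtered_sum(filters):
    """Answer SUM(clicks) under equality filters by intersecting
    posting sets rather than scanning the whole table."""
    ids = set(range(len(rows)))
    for f in filters:
        ids &= index[f]
    return sum(rows[i][2] for i in ids)
```

For example, `filtered_sum([("country", "US"), ("channel", "web")])` touches only the one matching row rather than all four.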

But we found that Spark + Druid did not fully address the needs of typical B.I. solutions such as general-ledger analytics, travel-and-expense analysis, sales forecasting, and campaign performance management, where the requirement is to derive insights rather than just report. Further, companies that want to do exploratory data analysis using tools like Tableau need to work on large datasets with fast, interactive responses.

The federated architecture of Spark + Druid has major drawbacks: the semantics of Druid's inverted index are very different from SQL semantics; index management is tied to Druid and not integrated into SQL; and critically, because of the federated design and because the system does not capture B.I.-level metadata, the scope of optimization is very limited. While Spark + Druid solved a specific need around fast SQL reporting, it was not enough (whether it is Druid, Elasticsearch, or any search-based platform, the issue is the same). Faceted search is not Enterprise B.I.

Enterprise B.I

So that brings us to the key platform elements needed to support Enterprise B.I.: the system must capture B.I. metadata (cubes, dimensions, hierarchies, KPIs), not just the SQL data model; the cube data structure must be integrated into the EDW, not surfaced as a separate system; the optimization layer must leverage both the cube structure and the B.I. metadata; and the runtime must provide both SQL and B.I. building blocks.
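One way to picture this B.I. metadata layer is as declarations layered over the SQL data model. The sketch below is illustrative only; the class and field names are invented here, not SNAP's actual API.

```python
from dataclasses import dataclass

@dataclass
class Hierarchy:
    name: str
    levels: list  # ordered, coarse to fine, e.g. ["year", "quarter", "month"]

@dataclass
class Dimension:
    name: str
    hierarchies: list

@dataclass
class KPI:
    name: str
    expression: str  # an aggregate expression over the fact table

@dataclass
class Cube:
    name: str
    fact_table: str  # the underlying SQL table the cube is built on
    dimensions: list
    kpis: list

# A hypothetical sales cube declared over an EDW star schema.
sales = Cube(
    name="sales_cube",
    fact_table="sales_fact",
    dimensions=[
        Dimension("time", [Hierarchy("calendar", ["year", "quarter", "month"])]),
        Dimension("geo", [Hierarchy("region", ["region", "country"])]),
    ],
    kpis=[KPI("net_revenue", "SUM(amount) - SUM(discount)")],
)
```

With this metadata captured alongside the tables, an optimizer can reason about hierarchies and KPIs instead of seeing only opaque columns.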

This combination of a cube data structure, B.I. metadata, and a unified optimizer and runtime is the winning combination, unmatched by other platform architectures used to support B.I.

How does SNAP address the requirements of fast, true B.I.?

SNAP is a Spark-native B.I. platform: by that we mean we leverage all the goodness of Spark and add B.I. capabilities to it. We enable B.I. metadata (cubes, star schemas, dimensions, hierarchies) to be captured on Spark SQL tables; our cube FileFormat structures data in an OLAP index that provides fast access and partial aggregation on slices of a large multi-dimensional cube; we enhance Spark's Catalyst layer with many optimizations for B.I. query patterns, for example star-join elimination, eager and partial aggregation, and dimension/hierarchy semi-joins; and finally, we have enhanced the Spark runtime with cube operations and optimized access to cube data.
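To give a feel for one of these optimizations, partial aggregation, here is a hypothetical pure-Python sketch (the numbers and grain are invented; in SNAP this rewrite happens inside Catalyst over the cube FileFormat): a query at a coarser grain is rewritten to combine pre-aggregated slices instead of re-scanning raw fact rows.

```python
# Pre-aggregated slices at (year, month) grain, as a cube
# file might store them.
monthly_revenue = {
    (2016, 12): 9.0,
    (2017, 1): 10.0,
    (2017, 2): 12.0,
}

def yearly_revenue(year):
    # Optimizer rewrite: a SUM over raw fact rows becomes a SUM over
    # the partial aggregates whose keys fall in the coarser group.
    return sum(v for (y, _), v in monthly_revenue.items() if y == year)
```

Because SUM is decomposable, combining monthly partials gives exactly the same answer as scanning the raw facts, at a fraction of the cost.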

Components of SNAP

SNAP has 4 components designed to meet the complex needs of reporting and Enterprise B.I. for Big Data, as shown above.

The runtime footprint is Apache Spark, and operationally it is as simple as managing a Spark cluster. It can take data from HDFS/S3 or any Spark data source and expose it to visualization tools like Tableau/OBIEE/Spotfire for on-demand, ad hoc reporting. It can also plug into notebooks (Jupyter/Zeppelin) for data science.

Summary

Different analytics solutions require different platform capabilities. Force-fitting solutions to platforms can lead to high costs and functional issues.

The combination of a cube data structure, B.I. metadata, and a unified optimizer and runtime is the winning combination for B.I. at scale.

SNAP on Spark provides a compelling platform for building B.I. solutions at scale.
