Lightning-Fast Analytics on Big Data Using Tableau

Rahul Goel
Airtel Digital
4 min read · Sep 18, 2023

Data-driven decision-making is key to our customer obsession at Airtel and our constant drive to do better for our customers. At our scale, where we serve almost every third Indian, a large amount of data is generated by customer actions, network nodes, and so on. With every moment of every system getting instrumented for improvement, this data keeps growing, and our business users constantly ask for more agility to slice and dice this humongous data and drive better insights for decision-making. As a data platform, we have standardised on Tableau as the tool of choice for analytical visualisation.

Problem Statement

To provide a data analytics and visualisation tool that can connect to various data sources and slice and dice data as per business needs.

Challenges

Scale: We generate trillions of data records daily, and storing and processing this much data requires special tools. Most big data tools in the market have a specific forte: some are great at efficient storage, some at compute, and those that claim to do it all are often either too operationally painful to use or very expensive at our scale.

Performance: An interactive visualisation tool carries a certain expectation of response time. A database that answers a query in 30 seconds or a minute is considered fast, but a front end that does not respond within a few seconds is considered slow and leads to a poor experience.

Solution

Dashboards are built to deliver insights through analytical storytelling. They usually follow a top-down approach: top-level KPIs first, drilling down to raw data. The raw data is not always required, but when an anomaly or pattern shows up in the top-level KPIs, deeper root-cause analysis needs it. Extracting that raw data calls for special tools; the regular tools available to an end user, such as spreadsheets, are not suited for such heavy lifting.

To enable this, we harness the strengths of both the data layer and the self-serve platform by adopting a Hot-Warm-Cold approach to data connections in Tableau.

· Hot connections: These are the extract connections that power the summary and final KPIs. They respond immediately, and their extract refresh time is also low.

· Warm connections: These extract connections power the aggregates and are comparatively slower. Their extract refresh time is much higher.

· Cold connections: These are live connections to a data accelerator that speeds up retrieval of the raw data and focused raw data sets.

These Hot-Warm-Cold connections are stitched together using cross-data-source filters, enabling a drill-down from the summary view to the required raw data set.
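As a rough illustration, the hot and warm extracts can simply be refreshed on different schedules while the cold connections stay live. The sketch below uses the Tableau Server Client (TSC) Python library; the server URL, credentials, and data source names are placeholders, not our actual setup:

```python
# A minimal sketch using the Tableau Server Client (TSC) library; the server
# URL, credentials, and data source names are placeholders.
import tableauserverclient as TSC

auth = TSC.TableauAuth("analytics_user", "password", site_id="analytics")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

# Hot extracts (final KPIs) are small and refreshed frequently; warm extracts
# (aggregates) are larger and refreshed less often. Cold connections are live
# queries against the data accelerator, so they need no refresh at all.
HOT_SOURCES = {"final_kpis"}
WARM_SOURCES = {"daily_aggregates"}

def refresh(names):
    """Queue an asynchronous extract refresh for the named data sources."""
    with server.auth.sign_in(auth):
        for ds in TSC.Pager(server.datasources):
            if ds.name in names:
                job = server.datasources.refresh(ds)
                print(f"Refresh queued for {ds.name}: job {job.id}")

refresh(HOT_SOURCES)    # e.g. every 15 minutes from a scheduler
# refresh(WARM_SOURCES) # e.g. hourly or daily
```

In production, a personal access token (TSC.PersonalAccessTokenAuth) and a proper scheduler would replace the hard-coded password and the manual calls here.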

Below is the proposed architecture. The data layers it uses are defined next, with a pipeline sketch after the definitions:

Definitions:

1. Raw Data

  1. The base layer of the data set
  2. The most granular data, either ingested directly or prepared by joining multiple data sets
  3. Usually stored in Hive tables
  4. Record count in the millions
  5. Query performance through the Hive layer is slow, so access is enabled through a data accelerator; even then, many queries involve a full scan and remain slow

2. Focused Raw Data

  1. A subset of the raw data, with records and columns filtered to reduce volume
  2. Stored in Hive tables, partitioned on the filters most commonly used by the dashboard use cases
  3. Record count in the few hundred thousands
  4. Optimal query performance through a data accelerator

3. Aggregates

  1. Raw data aggregated on the required dimensions, with records and columns filtered to reduce volume
  2. Stored in both Hive tables and Oracle Exadata
  3. Record count in the few thousands

4. Final KPIs

  1. The final well-defined KPIs tracked in the dashboard; dashboards show these values directly, without further manipulation
  2. Stored in both Hive tables and Oracle Exadata
  3. Record count in the few hundreds
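To make these layers concrete, here is a minimal PySpark sketch of how each layer can be derived from the one below it. The table and column names (usage.raw_records, usage_mb, and so on) are hypothetical stand-ins, not our actual schema:

```python
# A minimal PySpark sketch of the four layers; all table and column names
# below are hypothetical stand-ins.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("kpi-layers")
         .enableHiveSupport().getOrCreate())

# 1. Raw data: millions of records, most granular, queried via an accelerator.
raw = spark.table("usage.raw_records")

# 2. Focused raw data: fewer records and columns, partitioned on the column
#    the dashboards filter on most (here, event_date).
focused = (raw.filter(F.col("record_type") == "usage")
              .select("customer_id", "event_date", "circle", "usage_mb"))
(focused.write.mode("overwrite").partitionBy("event_date")
        .saveAsTable("usage.focused_raw"))

# 3. Aggregates: a few thousand records rolled up to dashboard dimensions.
aggregates = (focused.groupBy("event_date", "circle")
              .agg(F.sum("usage_mb").alias("total_usage_mb"),
                   F.countDistinct("customer_id").alias("active_customers")))
aggregates.write.mode("overwrite").saveAsTable("usage.daily_aggregates")

# 4. Final KPIs: a few hundred numbers the dashboard shows directly.
kpis = (aggregates.groupBy("event_date")
        .agg(F.sum("total_usage_mb").alias("network_usage_mb")))
kpis.write.mode("overwrite").saveAsTable("usage.final_kpis")
```

In this sketch, the Hot connections would extract usage.final_kpis, the Warm ones usage.daily_aggregates, and the Cold live connections would hit usage.focused_raw and usage.raw_records through the data accelerator.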

Results

This approach has been quite successful in enabling self-serve analytics for end users. One of the largest sources of data for any telco is usage records. We have enabled the experience and fraud management teams to dig through tons of anonymised usage records with low latency, in a self-serve way, for system-identified potential frauds. This has directly prevented revenue loss.
