Red Query Tool (redQT)

Abhishek Sukhwal · Published in redbus India Blog · Mar 27, 2024

INTRODUCTION

Big data refers to collections of data that are huge in volume and keep growing exponentially over time. The size and complexity are so large that traditional data management tools cannot store or process them efficiently.

Big data is commonly described by the following characteristics (the 5 Vs):

  1. Volume: the size and amount of data that companies manage and analyse
  2. Value: the most important "V" from the business perspective; the value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships and other clear, quantifiable business benefits
  3. Variety: the diversity and range of data types, including unstructured, semi-structured and raw data
  4. Velocity: the speed at which companies receive, store and manage data, e.g. the number of social media posts or search queries received within a day, hour or other unit of time
  5. Veracity: the "truth" or accuracy of data and information assets, which often determines executive-level confidence

A sixth characteristic, variability, is sometimes added:

Variability: the changing nature of the data companies seek to capture, manage and analyse, e.g. in sentiment or text analytics, changes in the meaning of key words or phrases

Why is there a need for a big data querying tool?

The following are the major challenges when dealing with big data processing:

  1. Capturing
  2. Curation
  3. Storage
  4. Searching
  5. Sharing
  6. Transfer
  7. Analysis
  8. Presentation

To address these challenges, we at redBus developed a big data query tool to extract useful information from our data for different use cases.

SOLUTION

redQT is a service tool that uses HDFS (Hadoop Distributed File System) and Hive for querying the stored data using standard SQL.

The following file formats are supported for storage (a short ingestion sketch follows this list):

  1. Parquet (preferred)
  2. Avro
  3. SequenceFile (e.g. binary images)
  4. CSV
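As a rough illustration of how data in one of these formats might land in HDFS, here is a minimal PySpark sketch that writes a dataset as partitioned Parquet. The paths, schema and column names are hypothetical placeholders, not redQT's actual ones.

```python
# Illustrative PySpark ingestion sketch: all paths and column names
# below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redqt-ingest-sketch").getOrCreate()

# Read raw events from a hypothetical landing directory on HDFS
raw_df = spark.read.json("hdfs:///data/raw/searches/2024-03-27/")

# Write them back in the preferred columnar format, partitioned by date;
# .csv(...) or the spark-avro package could be used for the other formats
(raw_df
    .write
    .mode("append")
    .partitionBy("search_date")
    .parquet("hdfs:///data/curated/searches/"))
```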

The following tools/methods can be used to extract data:

  1. Power BI
  2. Database tools (e.g. DBeaver)
  3. Hive connectors (e.g. a Python Hive connector; see the sketch after this list)
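For option 3, a minimal sketch of what extraction through a Python Hive connector could look like, assuming PyHive and pandas are available; the host, credentials, table and column names are placeholders.

```python
# Illustrative only: connecting to HiveServer2 with PyHive and pulling a
# small result set into pandas. All names below are placeholders.
from pyhive import hive
import pandas as pd

conn = hive.Connection(
    host="hive-server.example.internal",  # hypothetical HiveServer2 host
    port=10000,
    username="analyst",
    database="default",
)

# Standard SQL against a Hive table backed by files in HDFS
df = pd.read_sql(
    "SELECT * FROM customer_searches WHERE search_date = '2024-03-27' LIMIT 100",
    conn,
)
print(df.head())
```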

ARCHITECTURE

Figure: Hive internal architecture

IMPLEMENTATION STAGES

  1. The user provides a requirement to store the required dataset with the desired schema and data frequency. A pipeline is then created in Python or Spark to ingest the data into HDFS. The user can also browse the data in the Hadoop UI.
  2. A table is created in Hive pointing to the location of the data in HDFS, with the necessary partitions. The user can then query the table according to their requirements (a minimal sketch follows this list).
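A rough sketch of what stage 2 could look like, using Spark with Hive support to register the ingested Parquet files as a partitioned external table; the table, columns and HDFS path are hypothetical.

```python
# Illustrative only: registering HDFS data as a partitioned Hive external
# table so it can be queried with standard SQL. Names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("redqt-table-sketch")
         .enableHiveSupport()  # required so spark.sql talks to the Hive metastore
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS customer_searches (
        route_id    STRING,
        source_city STRING,
        dest_city   STRING,
        searched_at TIMESTAMP
    )
    PARTITIONED BY (search_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/curated/searches/'
""")

# Register the partitions already present under the HDFS location
spark.sql("MSCK REPAIR TABLE customer_searches")

# The table can now be queried from Hive, Power BI, DBeaver or any Hive connector
spark.sql("""
    SELECT search_date, COUNT(*) AS searches
    FROM customer_searches
    GROUP BY search_date
""").show()
```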

USE CASE

As of now we have ingested around 4 TB of data, and it is still growing daily. Some of the information or datasets that can be extracted using the tool are as follows:

  1. Customer search data, from which heat maps can be generated and customer behaviour analysed for different regions (an illustrative query follows this list)
  2. Seat-sale data (the number of seats sold on different routes) for main and via routes
  3. Inventory data, which can be analysed to optimise inventory levels, improve route planning and enhance overall operational efficiency
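As an example of the first use case, this is the kind of aggregation one might run to count searches per source/destination pair and feed a heat map; the table and column names are hypothetical.

```python
# Illustrative heat-map style aggregation; table and column names are
# hypothetical placeholders.
from pyhive import hive
import pandas as pd

conn = hive.Connection(host="hive-server.example.internal", port=10000,
                       username="analyst", database="default")

heatmap_df = pd.read_sql("""
    SELECT source_city, dest_city, COUNT(*) AS search_count
    FROM customer_searches
    WHERE search_date BETWEEN '2024-03-01' AND '2024-03-27'
    GROUP BY source_city, dest_city
""", conn)
```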
