This work is based on the Xavient co-dev initiative, in which your engineers can work with our team to contribute to and build your own platform for ingesting any kind of data in real time.
All you need is a running Hadoop cluster with Kafka, Storm, Hive and HBase. You can deploy the application on top of your existing cluster and ingest any kind of data.
DiP High Level Process Workflow
“Data Ingestion Platform utilizes the true power of the latest cutting-edge technologies in the big data ecosystem to achieve near real-time data analytics and visualization.”
DiP scales up to thousands of nodes. It takes in data from multiple sources and in different forms, stores it in multiple platforms, and lets you query the data on the go.
DiP can take data from multiple sources: you can push data manually, upload files, or use a scheduler to automate the workflow.
Multiple File Formats
DiP can process different file formats such as XML, JSON, and CSV, using an implicit data-handling mechanism that is transparent to the client.
Easy to use UI
DiP comes with an easy-to-use, aesthetic user interface for starting data processing.
DiP stores data quickly in multiple structured and unstructured storage platforms.
Visualize data in near real time using different reporting styles such as graphs and charts.
- Input to the application can be fed from a user interface that lets you either enter data manually or upload it as an XML, JSON, or CSV file for bulk processing
- Ingested data is published to the Kafka broker, which streams it to the Kafka spout; the spout acts as the consumer for the topology
- Once the message type is identified, the content of the message is extracted and is sent to different bolts for persistence — HBase bolt or HDFS bolt
- A Hive external table exposes the data stored in HDFS, and Phoenix provides an SQL interface to the HBase tables
- Reporting and visualization of data is done through Zeppelin
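The message-type identification step in the workflow above can be illustrated with a minimal sketch. It assumes the type is guessed from the first non-whitespace character of the payload, which is only one possible heuristic; the class name `MessageFormatDetector` and its `Format` enum are hypothetical and are not taken from the DiP code base.

```java
// Hypothetical helper for illustration; not part of DiP's published code.
final class MessageFormatDetector {

    enum Format { XML, JSON, CSV, UNKNOWN }

    private MessageFormatDetector() {}

    // Guess the format from the first non-whitespace character, falling
    // back to CSV when the first line contains at least one comma.
    static Format detect(String message) {
        if (message == null) {
            return Format.UNKNOWN;
        }
        String trimmed = message.trim();
        if (trimmed.isEmpty()) {
            return Format.UNKNOWN;
        }
        char first = trimmed.charAt(0);
        if (first == '<') {
            return Format.XML;
        }
        if (first == '{' || first == '[') {
            return Format.JSON;
        }
        // \R matches any line break, so only the first line is inspected.
        String firstLine = trimmed.split("\\R", 2)[0];
        if (firstLine.indexOf(',') >= 0) {
            return Format.CSV;
        }
        return Format.UNKNOWN;
    }
}
```

A real deployment would likely validate the payload with a proper parser after this first guess, since a heuristic of this kind cannot tell well-formed input from malformed input.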
Source System — Web Client
Messaging System — Apache Kafka
Target System — HDFS, Apache HBase, Apache Hive
Reporting System — Apache Phoenix, Apache Zeppelin
Topology Builder — Apache Storm
Programming Language — Java
IDE — Eclipse
Build tool — Apache Maven
Operating System — CentOS 7
DiP Front End
Screen 1 — Use message box to feed data to Data Ingestion Platform
Screen 2 — Alternatively, upload files to feed data to Data Ingestion Platform
DiP Execution Flow
Below is a snapshot of the DiP topology, which runs across many worker nodes on different machines. The Kafka spout passes the input stream to a filter bolt, which transforms the incoming data; other bolts then persist the data into various systems.
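The shape of this flow (one stream, one transformation, several persistence targets) can be sketched in plain Java without the Storm API. Everything here is an assumption made for illustration: the class `TopologySketch`, the comma-separated record layout, and the trimming done by `filter` are hypothetical stand-ins, and in-memory consumers take the place of the HBase and HDFS bolts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java stand-in for the spout -> filter bolt -> persistence bolts flow.
// None of these names come from DiP or from the Storm API.
final class TopologySketch {

    private TopologySketch() {}

    // Stand-in for the filter bolt: splits a comma-separated record and
    // trims each field. The limit of -1 keeps trailing empty fields.
    static List<String> filter(String rawRecord) {
        List<String> fields = new ArrayList<>();
        for (String field : rawRecord.split(",", -1)) {
            fields.add(field.trim());
        }
        return fields;
    }

    // Stand-in for the spout loop: each record is transformed once and then
    // handed to every sink, mirroring one stream feeding several bolts.
    static void run(List<String> records, List<Consumer<List<String>>> sinks) {
        for (String record : records) {
            List<String> fields = filter(record);
            for (Consumer<List<String>> sink : sinks) {
                sink.accept(fields);
            }
        }
    }
}
```

In an actual Storm topology the same wiring would be declared through a topology builder, with groupings deciding how tuples are distributed across worker nodes; this sketch only shows the fan-out order on a single thread.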
DiP Data Visualization
Using Apache Zeppelin, data ingested into HBase can be viewed as reports or graphs through the Phoenix interpreter, which provides an SQL-like interface to HBase tables. These graphs can be embedded in other applications using iframes.
Demo (Gautam Marya)