Exploring Talend’s Real-Time Big Data Platform

Real-time data processing is gaining attention these days because it helps businesses make decisions quickly. As a data engineer, I think this is the right time to explore how well ETL tools handle real-time data. Of the ETL tools currently available in the market, I decided to choose the Real-Time Big Data Platform from Talend. Before taking a deep dive into the topic, I spent some time understanding the data integration products Talend offers for big data.

Talend’s Big Data Integration products can be broadly summarized as follows:

· Open Studio for Big Data

· Big Data Platform

· Real-Time Big Data Platform

Understanding Open Studio for Big Data

The Open Studio for Big Data is built on top of Talend’s existing data integration solution (https://www.talend.com/products/data-integration/). It comes with an Eclipse-based graphical workspace where ETL jobs are created. In Talend Open Studio, all you need to do is drag and drop components from the Designer Palette to build an ETL job. The tool has an interactive and user-friendly graphical user interface. Talend Open Studio for Big Data can be considered a superset of Talend’s Data Integration and Big Data features. The tool provides 900+ connectors, including connectors for Google BigQuery, Cassandra, Apache CouchDB, DynamoDB, HBase, HDFS, Hive and MongoDB.

Real-Time Big Data Platform

The in-depth understanding gained from Talend Open Studio for Big Data helped me perceive how real-time data processing is achieved through the Real-Time Big Data Platform. Data volumes are growing bigger day by day, and the storage limitations of conventional systems can devastate an organization’s data strategy and its roadmap.

I chose banking as an example to investigate real-time processing, as the volume of transactions in this domain is comparatively high. Real-time manipulation of the data not only helps detect fraudulent transactions but also provides details about the nature and frequency of events for any given period.
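The per-event logic behind such a fraud check can be sketched in plain Python. This is not Talend-generated code; the threshold rule, account names and amounts below are illustrative assumptions. The idea is simply that a streaming job keeps a running count of events per account and flags unusually frequent activity as it arrives:

```python
from collections import defaultdict

# Illustrative rule: flag an account that produces more than
# MAX_TXN_PER_WINDOW transactions within one processing window.
MAX_TXN_PER_WINDOW = 3

def process_stream(transactions):
    """Consume (account, amount) events one at a time, as a
    streaming job would, and return the flagged accounts."""
    counts = defaultdict(int)
    flagged = set()
    for account, amount in transactions:
        counts[account] += 1
        if counts[account] > MAX_TXN_PER_WINDOW:
            flagged.add(account)
    return flagged

# Hypothetical event stream
stream = [("acc-1", 40.0), ("acc-2", 15.5), ("acc-1", 90.0),
          ("acc-1", 12.0), ("acc-1", 300.0)]
print(process_stream(stream))  # {'acc-1'}
```

In a real deployment the events would arrive from a message broker and the rule would run inside the Spark engine; the sketch only mirrors the per-event decision.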

The Real-Time Big Data Platform efficiently utilizes the big data framework Apache Spark for processing data in real time. It is flexible in supporting both batch and real-time (streaming) processing. Spark’s in-memory, high-speed processing of data is a game changer for data integration, as it overcomes the processing limitations of Hadoop MapReduce. The platform offers Eclipse-based developer tooling and a job designer for designing Spark jobs, and it comes with 100+ drag-and-drop Spark components.

Talend also provides a Real-Time Big Data Sandbox: a ready-to-run virtual environment that includes the Talend Real-Time Big Data Platform along with pre-built, ready-to-run real-time scenarios. As a beginner in using a real-time data integration platform, I found this sandbox very helpful in familiarizing myself with the tool. The following are a few of the real-time use cases found in the sandbox:

· Create a Kafka topic to produce and consume real-time streaming data.

· Create a Spark Recommendation model based on specific user actions.

· Stream live recommendations to Cassandra DB for faster data access for a Web User Interface.

· Utilize the built-in Spark engine in Talend Studio for Data Warehouse Optimization.
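The first scenario above (a Kafka topic with a producer and a consumer) follows a pattern that can be mimicked in plain Python with a thread-safe queue. This is only a rough sketch of the publish/consume pattern under assumed message contents; the sandbox itself uses real Kafka brokers and Talend’s Kafka components:

```python
import queue
import threading

topic = queue.Queue()  # stands in for a Kafka topic partition

def producer(events):
    for e in events:
        topic.put(e)           # like publishing a message to the topic
    topic.put(None)            # sentinel marking the end of the stream

def consumer(out):
    while True:
        msg = topic.get()      # like polling the topic for the next message
        if msg is None:
            break
        out.append(msg.upper())  # trivial per-message transformation

received = []
t1 = threading.Thread(target=producer, args=(["txn-1", "txn-2", "txn-3"],))
t2 = threading.Thread(target=consumer, args=(received,))
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # ['TXN-1', 'TXN-2', 'TXN-3']
```

With a single producer and a single consumer on a FIFO queue, messages are processed in publish order, which is the same ordering guarantee Kafka gives within one partition.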

As an ETL developer, I focused on the Data Warehouse Optimization use case, as it clearly demonstrates methods to optimize a data warehouse. Before pumping data into a data warehouse, analytics is performed on these larger data sets to address issues such as data redundancy, data truncation, missing values and data type mismatches. Three Spark jobs are designed in Talend Studio to achieve this optimization. When the first Spark job is triggered, it reads the data from a set of files present under a directory. The extracted data is then filtered and aggregated, and a report is generated from the cleansed data. The second job compares the report generated by the first job with the report created by last month’s run to identify any anomalies in the data. The final job projects the comparison results through analytical dashboards (Google Chart, Power BI etc.).
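The first two jobs in that flow can be sketched in plain Python. The field names, records and anomaly rule below are my own assumptions, not the sandbox’s actual schema; the point is only the shape of the logic, with filtering and aggregation in one step and a month-over-month comparison in the next:

```python
# "Job 1": filter out malformed records and aggregate amounts per category.
def build_report(records):
    report = {}
    for rec in records:
        if rec.get("amount") is None:      # drop records with missing values
            continue
        cat = rec["category"]
        report[cat] = report.get(cat, 0) + rec["amount"]
    return report

# "Job 2": compare this month's report with last month's to spot anomalies.
def find_anomalies(current, previous, threshold=0.5):
    """Flag categories whose total moved by more than `threshold` (50%)."""
    anomalies = []
    for cat, value in current.items():
        prev = previous.get(cat)
        if prev and abs(value - prev) / prev > threshold:
            anomalies.append(cat)
    return anomalies

raw = [{"category": "loans", "amount": 100},
       {"category": "loans", "amount": 120},
       {"category": "cards", "amount": 80},
       {"category": "cards", "amount": None}]   # malformed record, filtered out
current = build_report(raw)
last_month = {"loans": 100, "cards": 75}
print(current)                              # {'loans': 220, 'cards': 80}
print(find_anomalies(current, last_month))  # ['loans']
```

The third job, which renders the comparison on a dashboard, is presentation work and is left out of the sketch; in Talend these steps would run as distributed Spark transformations rather than in-memory dictionaries.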

As I work in the digital banking domain, I can see how this tool would be useful in processing customers’ operational and transactional data. The Real-Time Big Data Platform can serve as a platform for data analysis and help create unique, tailor-made experiences for each customer.

Conclusion

From an ETL developer’s perspective, I would recommend Talend’s Real-Time Big Data Platform given its efficiency in handling and processing huge volumes of real-time events via the Spark engine. The user-friendly graphical workspace and the vast collection of connectors also make job design much simpler.

References

https://www.talend.com/products/big-data/

https://www.talend.com/blog/2015/11/19/demo-combining-talend-6-spark-for-real-time-big-data-insights/

https://info.talend.com/rs/talend/images/CB_EN_BD_BigData_Insights_RealTime.pdf