Efficient Pre-Processing and ETL offloading Using Apache Spark

Published in Mindboard, May 24, 2018

Objective: Use Informatica to process a series of data governance rules that create a “golden record,” enabling related client-information resources (MDM) to be used across agencies.

Background and Solution: The Department of Human Services for the State of Maryland (MD-DHS) engaged Mindboard to assist in the development of a shared enterprise data repository (EDR). The shared EDR is intended for clients participating in more than one state benefit program across multiple agencies. The existing data collection methods require tedious monitoring of duplicates and data validation workflows, and they do not efficiently leverage cross-departmental communication during acquisition. Due to the large amount of data processed by MD-DHS’s legacy systems, Informatica pre-processing transformations can run for half a workday or longer (in some instances multiple days), depending on the data being processed.

The solution, designed alongside Mindboard’s enterprise data analysts, was to create a Python and Spark-based framework for ETL offloading that addresses the performance gaps. In the PySpark process, Pysql and PySpark DataFrames were used to join data from multiple tables and apply transformations (some exotic) efficiently. On the 5-node Hadoop cluster used for this effort, Informatica processing times dropped from multiple hours (high two digits) to less than one hour.

Downstream processing and transformation in Informatica also saw significant performance gains from loading pre-processed data directly into the MDM database using Spark. Spark’s distributed processing led to efficiency gains and an overall reduction in operational costs. Mindboard continues to work with MD-DHS to investigate other areas of ETL improvement using this and other methods.

The result of this work has been the successful creation of data governance rules, which are being applied to legacy datasets to create the “golden record” (MDM), and the implementation of APIs that let new applications connect to the golden record for enhanced client services. The APIs will be used across agencies under the MD THINK initiative.


Written by kaushik M