Challenges in Data Integration

Sambit Rath
4 min read · Jun 25, 2018


What is Data Integration?

Put simply, data integration is the methodical combination of data residing in different sources. It provides a unified view of otherwise disparate data, making that data more valuable than it was in its separate, scattered forms.

IBM defines data integration as “the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information.”

Integration is not just about moving data from one place to another, or combining data from several sources and pouring it into a single repository. It is about making the data comprehensive and more usable.

Data Integration Types:

1) Data Consolidation

Data consolidation physically brings data together from separate systems and blends it into a single version. It usually aims to reduce the number of data storage locations.

Data consolidation makes extensive use of Extract, Transform, Load (ETL) technology. ETL pulls data out of different databases, transforms it into an intelligible format, and then loads it into another database or warehouse. ETL processes clean, filter and transform the data, and then apply business rules before populating the target with it.
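The extract, transform and load steps can be sketched roughly as follows. This is only a minimal illustration using Python's built-in sqlite3 module; the source records, the table name customer_billing and the function names are all hypothetical, and a real pipeline would read from actual source databases.

```python
import sqlite3

# Hypothetical source records; in practice these would come from separate databases.
CRM_ROWS = [
    {"id": 1, "name": " Alice ", "country": "us"},
    {"id": 2, "name": "Bob", "country": "IN"},
]
BILLING_ROWS = [
    {"customer_id": 1, "amount": "120.50"},
    {"customer_id": 2, "amount": "80"},
    {"customer_id": 3, "amount": "15.00"},  # has no matching CRM record
]

def extract():
    """Extract: pull raw rows out of each source system."""
    return CRM_ROWS, BILLING_ROWS

def transform(crm_rows, billing_rows):
    """Transform: clean, filter and blend the rows into one common shape."""
    customers = {row["id"]: row for row in crm_rows}
    blended = []
    for bill in billing_rows:
        customer = customers.get(bill["customer_id"])
        if customer is None:
            continue  # filter: skip billing rows without a known customer
        blended.append((
            customer["id"],
            customer["name"].strip(),    # clean stray whitespace
            customer["country"].upper(), # normalise country codes
            float(bill["amount"]),       # convert text amounts to numbers
        ))
    return blended

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_billing "
        "(customer_id INTEGER, name TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO customer_billing VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for the target warehouse
crm, billing = extract()
load(transform(crm, billing), conn)
result = conn.execute(
    "SELECT * FROM customer_billing ORDER BY customer_id"
).fetchall()
print(result)
```

Keeping the three steps as separate functions makes it easier to swap a source or add a business rule without touching the rest of the pipeline.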

2) Data Propagation

Data propagation uses applications to copy data from one location to another. The process is event-driven and runs in either synchronous or asynchronous mode.

In synchronous data propagation, the sender and receiver access the data at the same time; it is a two-way data exchange between the source and the target. In asynchronous mode, the sender does not wait for the target to be updated. The process is supported by technologies such as EAI (Enterprise Application Integration) and EDR (Enterprise Data Replication).
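The difference between the two modes can be illustrated with a small sketch. It is not how any particular EAI product works; target_store, propagate_sync and propagate_async are hypothetical names, and an in-process queue stands in for a real message broker.

```python
import queue
import threading

target_store = {}  # stands in for the target system

def apply_change(key, value):
    """Write one propagated change into the target."""
    target_store[key] = value

def propagate_sync(key, value):
    """Synchronous: the sender blocks until the target has been updated."""
    apply_change(key, value)

pending = queue.Queue()  # stands in for a message queue between systems

def propagate_async(key, value):
    """Asynchronous: the sender enqueues the change and returns immediately."""
    pending.put((key, value))

def worker():
    """Background consumer that applies queued changes to the target."""
    while True:
        item = pending.get()
        if item is None:  # sentinel: stop the worker
            pending.task_done()
            break
        apply_change(*item)
        pending.task_done()

consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

propagate_sync("order-1", "paid")      # target is updated before this returns
propagate_async("order-2", "shipped")  # target is updated some time later

pending.join()     # wait until every queued change has been applied
pending.put(None)  # ask the worker to stop
consumer.join()
print(target_store)
```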

EAI integrates systems so that they can exchange transactions and messages, and is used for real-time business transaction processing. Integration platform as a service (iPaaS) is a variant of EAI.

EDR, on the other hand, is used for transferring large volumes of data between databases. It uses triggers and logs to keep track of data exchanges between the source and remote databases.
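One way to picture the trigger-and-log approach is the sketch below, which uses two in-memory SQLite databases. It is an assumption-laden illustration rather than the mechanism of any specific replication product: the accounts and change_log tables, the trigger names and the replicate function are all hypothetical.

```python
import sqlite3

source = sqlite3.connect(":memory:")   # stands in for the source database
replica = sqlite3.connect(":memory:")  # stands in for the remote database

# Triggers on the source table record every insert and update in a change log.
source.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    account_id INTEGER,
    balance REAL
);
CREATE TRIGGER log_insert AFTER INSERT ON accounts
BEGIN
    INSERT INTO change_log (account_id, balance) VALUES (NEW.id, NEW.balance);
END;
CREATE TRIGGER log_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO change_log (account_id, balance) VALUES (NEW.id, NEW.balance);
END;
""")
replica.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

def replicate(source, replica, last_seq):
    """Apply log entries newer than last_seq to the replica, in order."""
    changes = source.execute(
        "SELECT seq, account_id, balance FROM change_log "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, account_id, balance in changes:
        replica.execute(
            "INSERT OR REPLACE INTO accounts (id, balance) VALUES (?, ?)",
            (account_id, balance),
        )
        last_seq = seq
    replica.commit()
    return last_seq  # remember how far replication has progressed

source.execute("INSERT INTO accounts VALUES (1, 100.0)")
source.execute("INSERT INTO accounts VALUES (2, 50.0)")
source.execute("UPDATE accounts SET balance = 75.0 WHERE id = 2")
source.commit()

last_seq = replicate(source, replica, 0)
replica_rows = replica.execute(
    "SELECT id, balance FROM accounts ORDER BY id"
).fetchall()
print(last_seq, replica_rows)
```

Because the function returns the last sequence number it applied, a later call only has to ship the changes made since then instead of copying the whole table again.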

3) Data Virtualization

Virtualization retrieves and interprets data from different sources to provide a unified, real-time view of it. The data can be viewed in one place, but it is not stored in a single location.

4) Data Federation

Federation is a form of virtualization. It makes extensive use of virtual databases and creates a common data model. Data federation is supported by EII (Enterprise Information Integration) technology, which abstracts data from heterogeneous sources and provides a unified view.
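As a rough sketch of the idea, the code below maps two heterogeneous sources onto one common model and answers queries at read time, without copying the data into a new store. The sources, field names and functions are hypothetical, and real EII tooling does considerably more (query push-down, caching, security).

```python
import csv
import io
import sqlite3

# Source A: a relational database with its own column names.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (cust_id INTEGER, full_name TEXT)")
db.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
)

# Source B: a CSV export with different column names.
CSV_TEXT = "id,name\n3,Carol\n4,Dave\n"

def from_db():
    """Map database rows onto the common model."""
    for cust_id, full_name in db.execute(
        "SELECT cust_id, full_name FROM customers"
    ):
        yield {"customer_id": cust_id, "name": full_name, "source": "db"}

def from_csv():
    """Map CSV rows onto the same common model."""
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        yield {"customer_id": int(row["id"]), "name": row["name"], "source": "csv"}

def federated_customers():
    """Unified view: reads both sources on demand, stores nothing."""
    yield from from_db()
    yield from from_csv()

def find_customer(customer_id):
    """Look a customer up through the unified view."""
    return next(
        (c for c in federated_customers() if c["customer_id"] == customer_id),
        None,
    )

names = sorted(c["name"] for c in federated_customers())
print(names)
print(find_customer(3))
```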

Data federation is preferred over data consolidation when consolidation is expensive or raises security and compliance issues.

5) Data Warehousing

Data warehouses are data storage repositories. The term warehousing implies cleansing, formatting, blending and storing data, which is in effect the same set of activities as data integration.

Design and Application Challenges in Data Integration

Design Challenges

1. Good Understanding of Data

Data consistency is of paramount importance in any organization. It is very important to have a team of people who care about, and understand, the organization's data assets and source systems. This team should be able to foresee long-term data integration goals and lead the discussion on them, thus driving data consistency.

2. Clear Understanding of Objectives and Deliverables

One needs a very good understanding of the business requirements and of the reason why the data integration initiative is being pursued. Care must also be taken to understand the deliverables, the gaps between the data and the requirements, the capabilities of the source systems, and so on.

3. Analysis of The Source Systems and Extraction

Good knowledge of data extraction methods is critical. Factors such as the quality of the data in the database, the volume of data to be extracted, and the frequency and extent of extraction all affect the direction and timeline of the project. It is also essential to know about backup schedules and any specific maintenance windows that might affect the process.
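A quick profile of a source table can give an early read on volume and quality before extraction is planned. The sketch below is one simple way to do that with sqlite3; the orders table, its columns and the profile_table helper are hypothetical, and the table and column names passed in are assumed to be trusted, since they are interpolated directly into the SQL.

```python
import sqlite3

def profile_table(conn, table, columns):
    """Return the row count and per-column NULL counts for one table.

    Table and column names are assumed to be trusted identifiers.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    null_counts = {}
    for column in columns:
        null_counts[column] = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
        ).fetchone()[0]
    return {"rows": total, "null_counts": null_counts}

# Hypothetical source table with some missing values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a@example.com", 10.0), (2, None, 5.0), (3, "c@example.com", None)],
)

profile = profile_table(conn, "orders", ["email", "amount"])
print(profile)
```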

Implementation Challenges

Choosing the right data integration toolset from the many available is one of the most challenging tasks a company faces. The decision is never easy given the number of options on the market, so a proper feasibility study is essential to identify the best-suited option.

Sometimes an organization has invested heavily in a tool that is no longer relevant or does not scale. In such cases, even mature organizations need a feasibility study to find the best-suited toolset and to work out what it takes to upgrade the data infrastructure to accommodate the new technology.
