Four Benefits of Data & AI Integration

Sophie Jin
IBM Data Science in Practice


Imagine laptop hard drives lined up across the length of four football fields. That’s how much data the average enterprise will collect this year.

Now, imagine 400 different wires connected to each of those thousands of hard drives. That’s how many sources the data could be coming from. Walking through the field, scouring the thicket of cables for a tiny slice of information, and then trying to interpret its impact across the entirety of the data would be infeasible. It would be far more accurate to take a bird’s-eye view of the entire field. So why haven’t we empowered our data engineers, business analysts, or shadow IT to do the same?

This is where data integration comes in. IBM has defined data integration as “the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information.” Modern enterprises connect to hundreds of specialized sources in order to get the most out of their data, but that creates a problem: the data ends up scattered across distant, siloed systems. This is where a “data fabric” comes into play alongside data integration; a data fabric is the architecture through which data can be accessed, governed, and managed within an organization, regardless of how far apart that data may be.

A data fabric delivers flexible integrations, since different use cases require different architectures. At a time when both the volume of data and the number of sources are compounding annually, a data fabric is more important than ever for ensuring enterprises perform at their best by getting value from integrated data sets.

There are numerous processes in the specialized world of data integration. To help your enterprise take advantage of the possibilities, let’s go over the most common forms of data integration and the data fabric architecture behind them.


1. Consolidation (Warehousing)

Data consolidation, also known as data warehousing, occurs when data is aggregated from separate systems and brought together in one database. Doing this can drastically reduce the number of storage locations, and the data fabric may also include tools for querying all sources in one place. Think of the traditional databases and server rooms where data is stored, retrieved, cleaned, and formatted.
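As a rough illustration, here is a minimal consolidation sketch in Python. It assumes two hypothetical sources, a CSV export and an operational SQLite database, which are landed into a single hypothetical warehouse.db so they can be queried together; the file names, tables, and columns are all placeholders.

```python
# Minimal consolidation (warehousing) sketch.
# Assumes hypothetical sources: crm_orders.csv and operations.db.
import sqlite3
import pandas as pd

warehouse = sqlite3.connect("warehouse.db")

# Source 1: flat-file export (hypothetical path and schema)
crm_orders = pd.read_csv("crm_orders.csv")
crm_orders.to_sql("orders", warehouse, if_exists="append", index=False)

# Source 2: operational database (hypothetical path and schema)
ops = sqlite3.connect("operations.db")
shipments = pd.read_sql_query("SELECT order_id, shipped_at FROM shipments", ops)
shipments.to_sql("shipments", warehouse, if_exists="append", index=False)

# With both sources landed in one place, a single query spans them.
report = pd.read_sql_query(
    """SELECT o.order_id, o.customer_id, s.shipped_at
       FROM orders o LEFT JOIN shipments s ON o.order_id = s.order_id""",
    warehouse,
)
print(report.head())
```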

2. Data Manipulation (Extract, Transform, and Load & Extract, Load, and Transform)

Data manipulation, whether Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT), occurs when specific datasets are extracted from specific sources, transformed or otherwise manipulated, and then loaded into a different storage repository. ETL tools let enterprises choose which data to change in a controlled manner and gain insights from it. ELT follows the same broad pattern, but the data is loaded into the larger system first and transformed later. The two techniques suit different use cases. ETL works better for masking personally identifiable information (PII), because all personal information is removed before a wider audience can view the data. ELT works better, and faster, with unstructured data, because the data is not sent to a separate staging system for restructuring first.
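To make the ETL flow concrete, here is a minimal sketch assuming a hypothetical customers.csv source whose email column is PII. The masking happens in the transform step, before the data ever reaches the shared warehouse, which is the property highlighted above.

```python
# Minimal ETL sketch with PII masking in the transform step.
# The source file, table, and column names are hypothetical.
import hashlib
import sqlite3
import pandas as pd

def mask(value: str) -> str:
    """Replace a PII value with a one-way hash so it can still be joined on."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Extract: pull the dataset from its source
customers = pd.read_csv("customers.csv")

# Transform: mask PII before anything is loaded downstream
customers["email"] = customers["email"].map(mask)

# Load: write the cleaned dataset into the warehouse
warehouse = sqlite3.connect("warehouse.db")
customers.to_sql("customers", warehouse, if_exists="replace", index=False)
```

In an ELT variant of the same sketch, the raw CSV would be loaded into the warehouse first and the masking would run later inside the target system.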

3. Propagation (Change Data Capture, Streaming Integration, and Data Replication)

Data propagation is when data is moved, replicated, or updated from one main source to other systems. Change data capture, or CDC, is propagation that happens with millisecond-level latency: when something changes in the main database, the CDC architecture pushes that change to the systems downstream. Data replication shares a similar goal with CDC in that the same data is maintained in multiple places. CDC uses the broadcasting data integration pattern, wherein a change in one “source” database is reflected in many other tangential databases.
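The sketch below only approximates the CDC idea under simplified assumptions: production CDC tools typically read the database’s transaction log, while this hypothetical example merely polls an updated_at column on an assumed customers table and broadcasts the changes to a replica.

```python
# Simplified change-capture sketch (polling, not log-based CDC).
# Assumes both databases already have a customers table with an
# integer primary key `id` and an ISO-8601 text column `updated_at`.
import sqlite3
import time

source = sqlite3.connect("primary.db")
replica = sqlite3.connect("replica.db")
last_seen = "1970-01-01T00:00:00"

while True:
    # Pick up every row changed since the last checkpoint.
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for row_id, name, updated_at in rows:
        # Apply each change downstream (the one-to-many broadcast pattern).
        replica.execute(
            "INSERT INTO customers (id, name, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
            "updated_at = excluded.updated_at",
            (row_id, name, updated_at),
        )
        last_seen = updated_at
    replica.commit()
    time.sleep(1)  # this sketch polls once a second; true CDC reacts in milliseconds
```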

Other forms of propagation, such as batch-style processing, are not as instantaneous but can be more efficient. Last but not least, streaming integration is the lowest-latency option: it allows near real-time dashboards to be built so that fresh data can inform decision making. All of the aforementioned strategies follow a “one-to-many” blueprint, with one main source feeding many tangential targets. Data replication, by contrast, makes full copies of the original database in different environments to ensure there are backups of the data. Replication can also run as a bi-directional sync, where copies of the data are compared with one another and reconciled into a single result.
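For streaming integration, a minimal sketch might look like the following, with an in-memory queue standing in for a real event stream and hypothetical event fields. Each event is folded into a running aggregate that a near real-time dashboard could read.

```python
# Minimal streaming-integration sketch with a simulated event stream.
import queue

events = queue.Queue()
events.put({"region": "EMEA", "revenue": 120.0})
events.put({"region": "APAC", "revenue": 80.0})

running_totals = {}
while not events.empty():
    event = events.get()
    # Update the aggregate as each event arrives, rather than in a batch later.
    running_totals[event["region"]] = (
        running_totals.get(event["region"], 0.0) + event["revenue"]
    )

# A dashboard reading running_totals would reflect each event almost immediately.
print(running_totals)
```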

4. Virtualization and Data Federation

Data virtualization is similar to data consolidation except that it focuses on accessing and using the data where it already lives instead of moving it to another location before use. This methodology is best suited to small datasets and smaller projects. You can also take advantage of virtualization when working with sensitive data that you don’t want to move to another location and expose through another point of access. Data federation takes this a step further by making the data appear homogeneous and unified, even though it is not actually stored in a single repository. Oftentimes this is used when true consolidation of databases would be cost prohibitive. With data federation, access is provisioned on a need-to-know basis, and not all of the data needs to be synced, meaning unnecessary information can be left out.
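As a small illustration of the federation idea, the sketch below uses SQLite’s ATTACH to run one query across two hypothetical databases in place, so the data appears unified without ever being physically consolidated. The database and table names are placeholders.

```python
# Minimal federation-style sketch: one query over two separate databases.
import sqlite3

conn = sqlite3.connect("sales.db")
conn.execute("ATTACH DATABASE 'support.db' AS support")

# The join spans both databases, but nothing is copied or consolidated.
rows = conn.execute(
    """SELECT o.customer_id, o.total, t.open_tickets
       FROM orders o
       JOIN support.ticket_counts t ON t.customer_id = o.customer_id"""
).fetchall()
print(rows[:5])
```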

Conclusion

Understanding the possible methods, patterns, and architectures of data integration is an important step in helping your organization take advantage of its growing data. Since data integration is one part of a complex system of data management, making sure that your data integration approach works well with the rest of your data fabric is key. Learn more about how you can use a data fabric to facilitate better data integration.

