DATA VIRTUALIZATION: Accessing and Integrating Distributed Data Sources

Oloyede Rasheed
Jul 18, 2023


Data virtualization is a technique that lets you access and integrate data from different sources without moving or copying it. The data remains in its original sources while users access and analyze it virtually through a special middleware layer.
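To make this concrete, here is a minimal sketch in Python (with hypothetical sources and names, not any vendor's API) of what the virtual layer does: it answers a request by joining two live sources at query time, without copying anything into a new repository.

```python
import sqlite3

# Hypothetical "CRM" source: an in-memory SQLite database standing in for
# an operational system. The data stays here; it is never copied elsewhere.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Acme", "EU"), (2, "Globex", "US")])

# Hypothetical "billing" source: a REST API, mocked here as a function.
def billing_api_invoices():
    return [{"customer_id": 1, "amount": 120.0},
            {"customer_id": 2, "amount": 75.5}]

def virtual_invoice_view():
    """The 'virtual layer': joins both sources at query time, in memory.
    Nothing is persisted to a new repository."""
    names = {row[0]: row[1]
             for row in crm.execute("SELECT id, name FROM customers")}
    return [{"customer": names[inv["customer_id"]], "amount": inv["amount"]}
            for inv in billing_api_invoices()]

print(virtual_invoice_view())
# [{'customer': 'Acme', 'amount': 120.0}, {'customer': 'Globex', 'amount': 75.5}]
```

Real products such as Denodo do this declaratively over SQL; this toy version only illustrates the principle of joining live sources in place.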

However, our understanding of data virtualization will not be complete without first knowing what data consolidation is.

Data Consolidation

Data consolidation involves combining disparate data within a single repository, usually a data warehouse, through the Extract, Transform, and Load (ETL) process.

Whether ETL or ELT, the basic concept remains the same: massive volumes of data from many disjointed sources are copied to a new, consolidated system, undergoing transformations somewhere along the way.
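For contrast, here is an equally minimal ETL-style consolidation sketch (again with hypothetical data, and SQLite standing in for the warehouse): rows are physically extracted, transformed, and loaded into a new consolidated table.

```python
import sqlite3

# Hypothetical consolidated warehouse (an in-memory SQLite stand-in).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (customer TEXT, amount_usd REAL)")

# Extract: pull raw rows from a source system (mocked as a list here).
raw_rows = [("Acme", "120.00", "USD"), ("Globex", "68.00", "EUR")]

# Transform: normalize all amounts to USD (assumed fixed rate for the sketch).
EUR_TO_USD = 1.10
transformed = [(name, float(amt) * (EUR_TO_USD if cur == "EUR" else 1.0))
               for name, amt, cur in raw_rows]

# Load: physically copy the transformed rows into the warehouse.
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", transformed)
print(warehouse.execute("SELECT * FROM sales_fact").fetchall())
# [('Acme', 120.0), ('Globex', 74.8)]
```

The key difference from the virtualization sketch above: here the data now lives in two places, and the copy is only as fresh as the last load.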

Data virtualization benefits and limitations

Real-time access

Instead of physically moving data to a new location, data virtualization allows users to access and manipulate data at its source through the virtual layer (middleware). ETL is, in most cases, not necessary.

Information is instantly available for a broad variety of reporting and analysis functions, greatly accelerating and improving the decision-making process.
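A toy illustration of the difference (purely hypothetical values): a warehouse copy reflects the source only as of the last ETL run, while a virtual query sees the source as it is right now.

```python
source = {"status": "open"}        # live operational record
warehouse_copy = dict(source)      # snapshot taken at the last ETL run

source["status"] = "closed"        # the source system changes...

print(warehouse_copy["status"])    # 'open'   -- stale until the next ETL run
print(source["status"])            # 'closed' -- what a virtual query would see
```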

Low cost

Implementing data virtualization requires fewer resources and less investment than building a consolidated data store.

Enhanced Security

Data doesn’t have to be moved anywhere, and access levels can be managed centrally, greatly reducing the risk of data leaks or loss during physical extraction or transformation.

Agility

All enterprise data is available through a single virtual layer to different users and for a variety of use cases.

Designing and running BI analyses and reports can be done easily, without worrying about data formats or where the data resides.

Consistent and Secure Data Governance

Having one data access point, instead of multiple points for each department, delivers simple user and permission management and simplifies GDPR compliance.

KPIs and rules are defined centrally to ensure company-wide understanding and management of critical metrics.

Global metadata improves the governance of high-quality data and delivers a better understanding of enterprise data through data lineage and metadata catalogs (depending on the tool).

Mistakes are detected and resolved more quickly with data virtualization than with other data integration approaches because of the real-time access to data.

Here are some limitations of data virtualization:

Single point of failure

Since the server provides a single access point to all data sources, it becomes a single point of failure: if the server goes down, every consuming system loses its data feed.

Batch Data is not supported

Batch data is a timed or scheduled collection of data. Data virtualization doesn't support this type of bulk data movement, which may be needed in a number of cases; an example is a financial organization that needs to analyze large volumes of data once a week.

So, when is data virtualization a good idea? When should an organization consider using data virtualization instead of ETL?

Data virtualization can be a good alternative to ETL in a number of different situations.

  • Physical data movement, mostly via ETL, is inefficient, difficult, or too expensive.
  • A flexible environment is needed to prototype, test, and implement new initiatives.
  • Data needs to be available in real time or near real time for a range of analytics purposes.
  • Multiple BI tools require access to the same data sources.

When data is delivered for analysis, data virtualization can also help resolve privacy-related problems. Virtualization makes it possible to combine personal data from different sources without physically copying it to another location, while restricting which of the collected variables each consumer can see.
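As a hypothetical sketch of that idea (illustrative names, not a specific product's API): the virtual layer assembles the full personal record at query time, then strips the columns a given role is not allowed to see.

```python
# Full logical record, assembled at query time from several sources.
def combined_record(person_id):
    return {"id": person_id, "name": "Jane Doe", "email": "jane@example.com",
            "salary": 88_000, "diagnosis": "confidential"}

# Column-level permissions per role, enforced in the virtual layer.
ALLOWED = {
    "analyst": {"id", "salary"},          # aggregate analysis only
    "support": {"id", "name", "email"},   # contact data only
}

def virtual_view(person_id, role):
    record = combined_record(person_id)
    return {k: v for k, v in record.items() if k in ALLOWED[role]}

print(virtual_view(1, "analyst"))  # {'id': 1, 'salary': 88000}
print(virtual_view(1, "support"))  # id, name, and email only
```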

The virtualization structure comprises three building blocks, namely:

  • Connection layer — a set of connectors to data sources in real time;
  • Abstraction layer — services to present, manage, and use logical views of the data, sometimes known as the virtual or semantic layer;
  • Consumption layer — a range of consuming tools and applications requesting the abstracted data.
Connection layer (data sources) → Abstraction layer (middleware) → Consumption layer (data consumers)
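These three layers map naturally onto code. Here is a hypothetical skeleton (illustrative names only, not any specific product) showing how a connector, a logical view, and a consumer fit together:

```python
# Connection layer: one connector per source, fetching rows in real time.
connectors = {
    "crm":     lambda: [{"id": 1, "name": "Acme"}],    # e.g. a JDBC/ODBC source
    "billing": lambda: [{"id": 1, "amount": 120.0}],   # e.g. a REST API
}

# Abstraction layer: a logical (virtual/semantic) view over the connectors.
def view_customer_billing():
    names = {r["id"]: r["name"] for r in connectors["crm"]()}
    return [{"name": names[r["id"]], "amount": r["amount"]}
            for r in connectors["billing"]()]

# Consumption layer: a BI tool or app simply requests the abstract view.
print(view_customer_billing())   # [{'name': 'Acme', 'amount': 120.0}]
```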

Some Data Virtualization Tools

  • Denodo
  • IBM Cloud Pak for Data
  • Data Virtuality
  • Informatica PowerCenter
  • TIBCO

I do believe that a smart combination of data virtualization processes and tools alongside the ETL process can deliver data and make it readily available for all data use cases. There is also a lot more to take into consideration, much of which has been touched on in this article: cost, speed of data delivery, how often the data needs to be used, and so on. Thanks for reading… cheers.

Want to learn more about Data Virtualization? Some references:

Data Virtualization: The Complete Overview — Data Virtuality

Data Virtualization: Process, Components, Benefits, and Available Tools — Altexsoft

Data Virtualization for Dummies by Lawrence C. Miller
