Data Integration Principles

Nipun Thilakshan
Sysco LABS Sri Lanka
Sep 15, 2023

How data integration can empower and enable digital transformation


Data integration is the process of combining data to derive meaningful information and business insights. It plays a major role in every organization: the right information at the right time leads to the right decisions, which in turn improve the customer journey. In short, data integration is key to aligning the information system with the business strategy. Let's discuss some basic concepts of this area along with several design patterns and integration mechanisms.

Importance of Data Integration

  1. Seamless Data Flow
    Data integration architecture facilitates the seamless flow of data across various systems, applications, and databases within an organization. It ensures that data is accurately captured, transformed, and delivered to the right stakeholders at the right time, enabling informed decision-making and business operations.
  2. Unified View of Data
    Consolidating data from different sources into a unified view enables a holistic understanding of the organization’s data assets, improves data quality, and helps identify relationships and patterns that might otherwise go unnoticed.
  3. Data Consistency and Accuracy
    By enforcing data validation rules, data cleansing, and transformation processes, the architecture ensures that data is consistent, accurate, and reliable across different systems and databases.
  4. Enhanced Data Governance and Compliance
    Promotes data governance by establishing clear rules, standards, and processes for data management.
  5. Agile and Scalable Data Infrastructure
    Supports various data integration techniques, such as batch processing, real-time streaming, and event-driven architectures, enabling organizations to adapt to evolving data requirements and business needs.
  6. Improved Efficiency and Cost Savings
    By reducing data duplication, streamlining data movement, and optimizing data processing, data integration architecture helps improve overall operational efficiency. It eliminates the need for manual data reconciliation, reduces redundant processes, and minimizes the costs associated with maintaining multiple data silos.
  7. Enable Advanced Analytics and Insights
    By integrating diverse data sources, it enables organizations to derive meaningful insights, discover hidden patterns, and make data-driven predictions, ultimately driving innovation and competitive advantage.

Data Integration Architecture Types

There are various types of data integration architectures, each with its own characteristics and use cases. Here are some commonly used types:

  1. Batch Integration Architecture

Data is collected and processed in batches at scheduled intervals. It involves extracting data from multiple sources, transforming it, and loading it into a target system or data warehouse. Batch integration architecture is suitable for scenarios where near-real-time data processing is not required and data can be processed at regular intervals, such as overnight or at specific times; a minimal sketch of such a job follows.

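Below is a minimal sketch of such a scheduled batch job, with SQLite standing in for both the source system and the target warehouse; the table and column names are hypothetical. In practice a scheduler (cron, Airflow, etc.) would trigger it at the agreed interval.

```python
import sqlite3
from datetime import date, timedelta

# Minimal batch ETL sketch: SQLite stands in for the source system and the
# target warehouse, and the table/column names are hypothetical.
def run_nightly_batch(source_path: str, target_path: str) -> None:
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    run_day = (date.today() - timedelta(days=1)).isoformat()

    # Extract: pull yesterday's orders from the source system.
    rows = source.execute(
        "SELECT customer_id, amount FROM orders WHERE order_date = ?",
        (run_day,),
    ).fetchall()

    # Transform: aggregate order amounts per customer in memory.
    totals = {}
    for customer_id, amount in rows:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount

    # Load: write the aggregates into the target warehouse table.
    target.execute(
        "CREATE TABLE IF NOT EXISTS orders_summary "
        "(customer_id TEXT, total REAL, load_date TEXT)"
    )
    target.executemany(
        "INSERT INTO orders_summary VALUES (?, ?, ?)",
        [(cid, total, run_day) for cid, total in totals.items()],
    )
    target.commit()
    source.close()
    target.close()
```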

2. Real-time Integration Architecture

Real-time integration architecture enables the continuous flow of data in near-real-time or real-time from source systems to target systems. It involves capturing data changes as they occur and immediately propagating them to the target systems. This architecture is commonly used in scenarios where immediate data availability and responsiveness are critical, such as online transaction processing (OLTP) systems or real-time analytics.

3. Message-Oriented Middleware (MOM) Architecture

MOM leverages message queues or message brokers to enable data integration. Data is sent as messages from source systems to target systems through the middleware. This architecture provides asynchronous and reliable communication between systems, allowing decoupling and scalability. MOM architecture is commonly used in distributed and event-driven systems.

4. Extract, Load, Transform (ELT) Architecture

ELT involves extracting data from source systems, loading it into a target system or data lake, and then performing transformations directly within the target system. ELT leverages the processing power and scalability of the target system to perform complex transformations, as opposed to traditional extract, transform, load (ETL) architectures where transformations happen before loading the data. ELT architecture is well-suited for scenarios with large volumes of data and where the target system can handle the transformation processes efficiently.
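A compact ELT sketch in Python, with SQLite standing in for a warehouse engine such as BigQuery, Snowflake, or Redshift; the file, table, and column names are hypothetical. Note how the raw rows are loaded untouched and the transformation then runs as SQL inside the target.

```python
import csv
import sqlite3

# ELT sketch: raw rows are loaded as-is into a staging table, and the
# transformation runs as SQL inside the target engine. SQLite stands in
# for a warehouse; file and table names are hypothetical.
def load_then_transform(csv_path: str, warehouse_path: str) -> None:
    wh = sqlite3.connect(warehouse_path)

    # Load: copy the raw file into a staging table without reshaping it.
    wh.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (user_id TEXT, event_type TEXT, amount TEXT)"
    )
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed present)
        wh.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", reader)

    # Transform: let the target engine do the heavy lifting with SQL.
    wh.execute(
        """
        CREATE TABLE IF NOT EXISTS purchases AS
        SELECT user_id, CAST(amount AS REAL) AS amount
        FROM raw_events
        WHERE event_type = 'purchase'
        """
    )
    wh.commit()
    wh.close()
```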

5. Federated Integration Architecture

Data remains distributed across multiple systems, and integration occurs virtually by querying and accessing the data in real time without physically moving or replicating it. This architecture provides a unified view of data without the need for centralizing or consolidating it. Federated integration architecture is useful in scenarios where data resides in multiple autonomous systems or when data sovereignty and privacy regulations need to be adhered to.

6. Hybrid Integration Architecture

Hybrid integration architecture combines multiple integration approaches to cater to different data integration needs within an organization. It involves a mix of batch, real-time, and other integration patterns to accommodate a wide range of data sources, processing requirements, and latency constraints. Hybrid integration architecture provides flexibility and scalability by utilizing the most appropriate integration approach for each specific use case.

It’s important to note that these architectures are not mutually exclusive, and organizations often implement a combination of them based on their specific requirements and data integration scenarios.

Orchestration and Service Composition

There is a set of features commonly used when designing such a system; a small sketch combining several of them follows the list.

  1. Multi-step process: Enables modeling of multi-step business processes that encompass calls to multiple back-end systems within the same transaction or flow.
Ex: a credit card payment process

2. Conditional transitions: This defines transitions between process steps that are configurable on conditions.

3. Looping: Loops over an activity or a set of activities until a stop condition is met.

4. Scheduling: Defines a scheduled process with a configurable wake-up time

5. Wait/Sleep: Performs a temporary configurable suspension of the running call flow.

6. Throttling: Defines a maximum call rate for a backend or an external API.

7. Multicast: Propagates a piece of information to multiple endpoints in parallel.
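The sketch below strings a few of these features together in plain Python: a multi-step payment flow with a conditional transition, a bounded retry loop, a wait/sleep, and a crude throttle. The step functions, limits, and flow itself are hypothetical.

```python
import time

# Hypothetical orchestration sketch: a multi-step payment flow with a
# conditional transition, a bounded retry loop, and a crude throttle.
MAX_CALLS_PER_SECOND = 5          # throttling limit for the external API
_last_call = 0.0

def throttled_call(fn, *args):
    """Ensure the external API is not called more often than the limit."""
    global _last_call
    wait = (1.0 / MAX_CALLS_PER_SECOND) - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)          # wait/sleep: temporary suspension of the flow
    _last_call = time.time()
    return fn(*args)

def process_payment(order, validate, charge_card, send_receipt):
    # Step 1: validate the order (conditional transition on the result).
    if not validate(order):
        return "rejected"

    # Step 2: charge the card, looping with a stop condition (3 attempts).
    for attempt in range(3):
        if throttled_call(charge_card, order):
            break
        time.sleep(2 ** attempt)  # back off before retrying
    else:
        return "payment_failed"

    # Step 3: a multicast-style fan-out could notify several systems here;
    # this sketch just sends a single receipt.
    send_receipt(order)
    return "completed"
```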

Data Storages

In data integration architecture, data storage plays a crucial role in ensuring efficient and effective data integration processes. Here are some common data storage components used in data integration.

  1. Database — The simplest and most familiar way to store data. Relational (SQL) and non-relational (NoSQL) databases are available.
  2. Data Warehouse — Adds a dimensional level to the data structure to show how data types relate to one another. This often requires a transformation step to prepare the data for use in an analytics system; the added dimensionality makes the data more organized and better suited for analysis.
  3. Object Storage — This approach stores vast quantities of unstructured data, including sensor data, audio and video files, photos, and more, in their original formats. The data is kept in self-contained repositories that encompass the data itself, its associated metadata, and a unique identifier. With the aid of this metadata and identifier, applications can effectively locate and access the desired data (see the sketch after this list).
  4. Data Lake — Data lakes serve as centralized storage systems, typically leveraging object storage, where raw and unstructured data is accumulated for future utilization and transformation. These reservoirs house immense volumes of diverse data types, facilitating the processing of big data and enabling the application of machine learning and AI techniques.
  5. Data Lakehouse — The concept of a “Data Lakehouse” involves creating a unified platform that integrates the capabilities of both data warehousing and data lakes. By incorporating data warehousing structures and management functionalities into data lakes, this approach helps reduce storage costs, minimize data redundancy and movement, and streamline administration, resulting in more efficient data management and utilization.
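The sketch referenced in the Object Storage item above, using boto3; the bucket name, object key, and metadata are hypothetical, and configured AWS credentials are assumed.

```python
import json
import boto3

# Hypothetical object-storage sketch using boto3; the bucket and key are
# placeholders and AWS credentials are assumed to be configured already.
s3 = boto3.client("s3")

# Store an unstructured payload together with its own metadata.
s3.put_object(
    Bucket="example-sensor-archive",
    Key="devices/device-42/2023-09-15.json",
    Body=json.dumps({"temperature": 21.4, "unit": "C"}).encode("utf-8"),
    Metadata={"device-id": "device-42", "schema": "sensor-v1"},
)

# Later, an application locates the object by its key and reads it back.
response = s3.get_object(
    Bucket="example-sensor-archive",
    Key="devices/device-42/2023-09-15.json",
)
payload = json.loads(response["Body"].read())
```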

Data Transformation

The simplest definition is the conversion of data from one format to another. The complexity varies from basic to intermediate to complex transformations. I’m not going to explain these levels in detail here, but let’s look at some examples of their usage.

  1. Basic/simple data transformation
    data type conversion, data format conversion, basic operations
  2. Intermediate data transformation
    lookup, aggregations, sorting, deterministic search etc
  3. Complex data transformation
    probabilistic search, custom functions etc
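Here is a small Python illustration of the first two levels, using hypothetical order records: a basic type and date-format conversion, followed by an intermediate lookup against a reference table and an aggregation.

```python
from datetime import datetime

# Hypothetical input records illustrating the transformation levels above.
orders = [
    {"order_id": "1001", "amount": "19.99", "country": "LK", "placed": "15/09/2023"},
    {"order_id": "1002", "amount": "5.50",  "country": "US", "placed": "16/09/2023"},
]
country_names = {"LK": "Sri Lanka", "US": "United States"}   # lookup table

# Basic: data type and format conversion.
for o in orders:
    o["amount"] = float(o["amount"])                                  # string -> float
    o["placed"] = datetime.strptime(o["placed"], "%d/%m/%Y").date()   # reformat the date

# Intermediate: lookup against a reference table, then aggregate per country.
totals = {}
for o in orders:
    country = country_names[o["country"]]
    totals[country] = totals.get(country, 0.0) + o["amount"]

print(totals)   # {'Sri Lanka': 19.99, 'United States': 5.5}
```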

Data Integration Patterns

Data integration patterns are commonly used approaches or strategies to address specific data integration challenges. These patterns provide guidance on how to design and implement data integration solutions. Here are some common data integration patterns:

Point-to-Point Integration

  • Connects two applications directly; no data integration layer is required.
  • Simple architecture, since there is no need for middleware tools.
  • The main drawbacks are tight coupling, and orchestration can become a nightmare as the number of connections grows.

Hub and Spoke

  • Involves using a central data hub or integration layer that serves as a mediator between multiple source and target systems
  • Data flows from the source systems to the central hub, where transformations, mappings, and validations occur before being distributed to the target systems
  • This pattern simplifies integration complexity and promotes consistency and reusability.
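A toy sketch of the pattern: a hub object that validates and transforms each record once before fanning it out to the registered target handlers. The validation rule and the handlers are hypothetical.

```python
# Toy hub-and-spoke sketch: the hub validates and transforms each record
# before fanning it out to every registered target system (the spokes).
class IntegrationHub:
    def __init__(self):
        self.targets = []

    def register(self, handler):
        """Register a target system's handler (a spoke)."""
        self.targets.append(handler)

    def publish(self, record: dict) -> None:
        # Validation and transformation happen once, centrally, in the hub.
        if "customer_id" not in record:          # hypothetical validation rule
            raise ValueError("record rejected: missing customer_id")
        record = {**record, "customer_id": str(record["customer_id"]).strip()}
        for handler in self.targets:
            handler(record)

hub = IntegrationHub()
hub.register(lambda r: print("CRM received", r))
hub.register(lambda r: print("warehouse received", r))
hub.publish({"customer_id": " 42 ", "amount": 10.0})
```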

ESB

An Enterprise Service Bus implements a communication system between mutually interacting software applications.


Suitable for

  • Real-time integrations
  • Low data volumes
  • High-frequency calls
  • Standard integration patterns
  • Can address many use cases, such as mobile applications, IoT, and B2B integrations

But

  • Not always available with legacy systems
  • There is a one-off cost to build the interfaces and manage the impact on external systems.

ETL

  • Mainly used in batch mode. Generally, these operations are scheduled.
  • Widely used in business intelligence and data warehousing-related use cases.

EDI

Electronic Data Interchange was introduced in the 1970s and is mainly used for the automated exchange of business documents between trading partners (customers, suppliers). Purchase orders, invoices, payments, and shipment information are some examples of the document types exchanged.


Change Data Capture (CDC)

  • Captures and propagates data changes from source systems in near-real-time or real-time.
  • Identifies and captures the changes made to the source data, such as inserts, updates, and deletes, and applies those changes to the target system.
  • Commonly used for real-time integration scenarios.
  • Useful when maintaining synchronized data across multiple systems.
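A simplified sketch of the idea, assuming a hypothetical customer_changes change-log table in the source that records each insert, update, and delete; production CDC tools typically read the database transaction log rather than polling.

```python
import sqlite3
import time

# Simplified CDC sketch: poll a hypothetical change-log table and apply each
# captured insert/update/delete to the target. Real CDC tools usually read
# the database transaction log instead of polling.
def replicate_changes(source_db: str, target_db: str, poll_seconds: int = 5) -> None:
    source = sqlite3.connect(source_db)
    target = sqlite3.connect(target_db)
    target.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")
    last_seen = 0

    while True:
        changes = source.execute(
            "SELECT change_id, op, id, name FROM customer_changes "
            "WHERE change_id > ? ORDER BY change_id",
            (last_seen,),
        ).fetchall()

        for change_id, op, row_id, name in changes:
            if op in ("insert", "update"):
                target.execute(
                    "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                    (row_id, name),
                )
            elif op == "delete":
                target.execute("DELETE FROM customers WHERE id = ?", (row_id,))
            last_seen = change_id

        target.commit()
        time.sleep(poll_seconds)
```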

Virtual Data Integration

  • Provides a virtualized layer that abstracts and integrates data from multiple source systems without physically moving or replicating the data.
  • It allows users and applications to access and query the integrated data as if it were in a centralized location.
  • Virtual data integration helps avoid data duplication and provides a unified view of data without the need for extensive data movement.

Publish-Subscribe

  • Involves the publication of data by a source system and the subscription of interested systems or components to receive the data.
  • When new data is published, all subscribed systems or components receive the data.
  • Facilitates loose coupling between systems and enables event-driven integration scenarios.
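A minimal publish-subscribe sketch using the kafka-python client; the broker address, topic name, and consumer group are placeholders, and a reachable Kafka cluster is assumed.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Publish-subscribe sketch with kafka-python; the broker address and topic
# name are placeholders and a Kafka cluster is assumed to be reachable.

# Publisher: the source system emits an event without knowing who consumes it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("order-events", {"order_id": 1001, "status": "created"})
producer.flush()

# Subscriber: each interested system joins its own consumer group and
# receives every published event independently.
consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print("billing-service received", message.value)
```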

Connectors

A connector is a software layer that links applications to allow the transfer of data. This is one of the major topics in data integration, since it determines a data integration layer's ability to connect to various applications and data sources. A wide variety of connectors is available to handle a multitude of applications and protocols (a minimal database connector sketch follows the list below).

  1. Database Connectors: These connectors are designed to establish connections with relational databases, such as Oracle, MySQL, Microsoft SQL Server, or PostgreSQL. They enable the extraction of data from databases, execute queries, and load data into target systems. Database connectors often use standard protocols like ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity).
  2. Application Connectors: Application connectors are specific to particular software applications or enterprise systems, such as SAP, Salesforce, or SharePoint. They provide integration capabilities to extract data, invoke APIs, and synchronize information between the application and other systems.
  3. File Connectors: File connectors are used to read and write data from various file formats, such as CSV, XML, JSON, or Excel. They facilitate the integration of data stored in files or the exchange of data between systems through file-based interfaces.
  4. Web Service Connectors: Web service connectors enable communication with web services using standards like SOAP (Simple Object Access Protocol) or REST (Representational State Transfer). They allow data exchange and integration with systems that expose web service interfaces.
  5. Messaging Connectors: Messaging connectors facilitate integration with message-oriented middleware or enterprise messaging systems, such as Apache Kafka, RabbitMQ, or IBM MQ. They enable the transfer of data and messages between different systems in an asynchronous and reliable manner.
  6. Cloud Service Connectors: Cloud service connectors are designed to interact with various cloud platforms and services, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. They enable integration with cloud-based storage, databases, messaging, and analytics services.
  7. Legacy System Connectors: Legacy system connectors are specific to older or proprietary systems that may not have standard integration interfaces. They provide connectivity to mainframes, AS/400 systems, or other legacy applications, enabling data integration and interoperability.
  8. Real-time Streaming Connectors: Real-time streaming connectors facilitate the integration of data from streaming sources, such as Apache Kafka, Apache Flink, or Apache Spark Streaming. They enable real-time data ingestion, processing, and analysis.
  9. API Connectors: API connectors allow integration with systems and services that expose RESTful or SOAP APIs. They enable data exchange, authentication, and interaction with third-party applications or web services.
  10. Custom Connectors: Custom connectors are built specifically for unique or specialized systems, databases, or applications that require custom integration interfaces. These connectors are often developed using custom code or specialized integration frameworks.
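To make the database connector idea concrete, here is a minimal sketch using Python's DB-API; SQLite from the standard library stands in for a production database that would normally be reached through an ODBC/JDBC driver, and the table and column names are hypothetical.

```python
import sqlite3

# Minimal database-connector sketch using Python's DB-API. SQLite (standard
# library) stands in for a production database reached via ODBC/JDBC; the
# table and column names are hypothetical.
def extract_customers(db_path: str):
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.execute(
            "SELECT id, name, email FROM customers WHERE active = 1"
        )
        for row in cursor:
            yield {"id": row[0], "name": row[1], "email": row[2]}
    finally:
        connection.close()

# Downstream, an integration flow would transform these records and load
# them into a target system.
for record in extract_customers("crm.db"):
    print(record)
```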

Data Integration Tools

Tools will help get the job done. Even though there are many modern tech stacks to consider, it's up to you to select the optimum one for your requirements. Let's look at some of the well-known ones.

Pipeline workflow management and orchestration

  • Apache Airflow, Dagster, Prefect — open-source pipeline workflow management
  • Jenkins, GitLab, CircleCI, Argo — DevOps orchestration tools
  • Kafka, Beam, Flink — streaming systems
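As a taste of the first group, here is a minimal Apache Airflow DAG sketch; the DAG id, schedule, and task callables are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Minimal Airflow DAG sketch: two placeholder tasks run daily, extract
# before load. The callables and schedule are hypothetical.
def extract():
    print("pulling data from the source system")

def load():
    print("loading data into the warehouse")

with DAG(
    dag_id="nightly_integration",
    start_date=datetime(2023, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task   # extract must finish before load starts
```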

Compute tools (Big data analytics)

  • Apache Spark — open-source distributed data processing system
  • Google Dataflow — managed data streaming system built on Apache Beam
  • DBT — SQL-based data transformation tool

Data ingestion tools

  • Fivetran — managed enterprise data integration tool
  • Singer ETL — JSON-based open-source data integration tool
  • Meltano ELT — open-source data integration tool
  • Airbyte ELT — managed mid-market data integration tool

Data storage

  • S3, Azure Blob Storage, Google Drive, or GCS — cloud-based data lake

Operations Management

Once the system is live, we need to set up a support model. The following questions need to be addressed.

  • How to manage monitoring and alerts?
  • How to set up incident management?
  • How to maintain the environment over time?
  • Internal team? Support from a software vendor or outsourced support team?

Monitoring and Alerts

  • Identify the critical thresholds that should trigger alerts (e.g., CPU, memory, storage).
  • Ensure that alerts are triggered and reach the responsible contacts.
  • Ensure continuous monitoring is implemented (check that data flows are being executed at the right performance).
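A small sketch of the threshold checks above using the psutil library; the threshold values are hypothetical, and the alert is simply printed instead of being routed to an on-call channel.

```python
import psutil

# Hypothetical threshold-based monitoring sketch using psutil. In practice
# the alert would go to a paging or chat channel rather than stdout.
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 80.0}

def check_host() -> list[str]:
    metrics = {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }
    return [
        f"{name} at {value:.1f}% exceeds threshold {THRESHOLDS[name]:.1f}%"
        for name, value in metrics.items()
        if value > THRESHOLDS[name]
    ]

for alert in check_host():
    print("ALERT:", alert)   # stand-in for notifying the responsible contact
```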

Incident Management

  • Define end-to-end process for incident management (include actors, stakeholders and activities)
  • Prioritize incidents and adjust the process accordingly.
  • Define SLA (Service Level Agreement) to set up resolution target timelines aligned with business priorities and outcomes.
  • Track KPIs to measure operation performance.
  • Communicate to the right stakeholders.

Environment Maintenance

  • Schedule the maintenance (identify and communicate the downtimes).
  • New service deployments, new patches, or software installations need to be scheduled.
  • Clean up the environments (remove old installation files and log files, deactivate users who are no longer part of the organization, archive old data, etc.).

Conclusion

Implementing data integration flows shouldn't be too difficult. The most important part is the design, so make sure to select the best tools for the task. Always try to use existing tools and cloud services; do not reinvent the wheel. There are many open-source and proprietary tools from vendors such as OpenESB, Talend, WSO2, TIBCO, Boomi, etc.
