Data Integration Architecture

Eric Long
Data Weekly by Jumpmind
3 min read · Jan 10, 2018

Not long ago, data grew slowly and was processed overnight in batch jobs. Today, the demand for data has exploded and rapid integration is viewed as a business requirement. Using open source software, we can build a data integration architecture that handles storage, movement, and analysis of data.

A data integration strategy must work across multiple platforms, accommodate a wide variety of data sources, and perform real-time processing of critical data. Let’s lay out an architecture using open source software and then walk through how it works together.

An Open Source Data Integration Architecture

Data commonly resides in operational databases that support applications. Open source relational database management systems (RDBMS) can meet a variety of general-purpose use cases. These databases organize data into tables, columns, and rows, with access provided by the Structured Query Language (SQL). SQLite is a library that embeds inside an application for fast and efficient access to a file-based database, making it ideal for small-footprint, single-user applications such as mobile apps. MySQL is the most popular open source database, known for its functionality and ease of use. As a standalone server, it provides more security, operational features, and data types than an embedded library; its read access is very fast, and it is often used as the back-end database for websites. PostgreSQL is the most advanced of these databases, with comprehensive standards compliance and reliable concurrency and transaction support. It can handle the most complex database designs and maintain data integrity under heavy load.
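
To illustrate the embedded model, here is a minimal Python sketch using the standard library's sqlite3 module. The database file and the orders table are made-up names for this example.

```python
import sqlite3

# SQLite embeds in the application process: no server, just a local file.
# "app.db" and the "orders" table are hypothetical names for this sketch.
conn = sqlite3.connect("app.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id INTEGER PRIMARY KEY,
        item TEXT NOT NULL,
        quantity INTEGER NOT NULL
    )
""")
conn.execute("INSERT INTO orders (item, quantity) VALUES (?, ?)", ("widget", 3))
conn.commit()

# Data is organized into tables, columns, and rows, queried with SQL.
for row in conn.execute("SELECT id, item, quantity FROM orders"):
    print(row)
conn.close()
```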

The enterprise data warehouse is a specialized database used for business intelligence, including reporting and data analysis. These central repositories integrate data from disparate sources, storing current and historical data from multiple areas of the business. Greenplum is an open source data warehouse, derived from the PostgreSQL database, that adds massively parallel processing, column-oriented storage, and handling of large datasets.
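
Because Greenplum keeps PostgreSQL's wire protocol, a standard PostgreSQL driver can run analytical queries against it. The sketch below uses psycopg2; the connection details and the sales table are assumptions for the example.

```python
import psycopg2

# Greenplum speaks the PostgreSQL protocol, so psycopg2 works unchanged.
# The host, credentials, and "sales" table are invented for this sketch.
conn = psycopg2.connect(host="gp-master", dbname="warehouse",
                        user="analyst", password="secret")
with conn.cursor() as cur:
    # A typical warehouse query: aggregate historical data across the business.
    cur.execute("""
        SELECT region, date_trunc('month', sold_at) AS month, SUM(amount)
        FROM sales
        GROUP BY region, month
        ORDER BY month, region
    """)
    for region, month, total in cur.fetchall():
        print(region, month, total)
conn.close()
```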

The movement and integration of data is handled by two types of tools: Change Data Capture (CDC) and Extract, Transform, Load (ETL). SymmetricDS is a cross-platform database replication server that continuously captures changes from source databases and loads them in near real-time to target databases. Its strengths are support for multiple databases, resilience over slow or unreliable network connections, and flexible configuration. A multi-database tool like SymmetricDS makes it possible to use the right database for each use case, such as deploying SQLite on mobile devices, MySQL at the branch office, and PostgreSQL at the central office, all kept in sync. Metl is a data integration server that uses ETL to clean and transform data into its final form for analysis. It can extract data from multiple sources, send the data through a transformation workflow, and then load the result to a destination. Sources and destinations can be files, databases, queues, or web services.
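
SymmetricDS implements this with database triggers, routing rules, and its own runtime, but the core idea of change data capture can be sketched in a few lines of Python. Everything below, including the changelog table, its columns, and the apply loop, is a simplified illustration of the concept, not SymmetricDS's actual implementation.

```python
import sqlite3

# A toy version of change data capture: read pending rows from a changelog
# on the source and replay them on the target in order. SymmetricDS
# automates this with triggers, routing, and retry over unreliable
# networks; the "changelog" table and its columns are invented here.
source = sqlite3.connect("branch.db")   # stand-in for the branch database
target = sqlite3.connect("central.db")  # stand-in for the central database

last_applied = 0  # in practice, persisted so replication can resume
rows = source.execute(
    "SELECT event_id, sql_text FROM changelog WHERE event_id > ? ORDER BY event_id",
    (last_applied,),
).fetchall()

for event_id, sql_text in rows:
    target.execute(sql_text)  # apply each captured change in order
    last_applied = event_id
target.commit()
```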

The data integration system immediately publishes events to an open source Big Data software stack for large-scale analysis in near real-time. A popular Big Data stack is SMACK, an acronym for Spark, Mesos, Akka, Cassandra, and Kafka. The stack runs on a cluster of commodity hardware: Mesos is the resource manager that distributes processing efficiently across the cluster. Kafka is a distributed messaging system that receives incoming change events. Akka is the runtime for building highly concurrent, message-driven applications. Spark is a fast, general-purpose engine for performing parallel data analysis across the cluster. Cassandra is a distributed NoSQL database that is highly available and handles large amounts of data. Together, the SMACK stack forms a near real-time data analysis pipeline that scales across the whole cluster in a fault-tolerant way.
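
To feed the pipeline, change events can be published to Kafka with any client library. The sketch below uses the kafka-python package; the broker address, topic name, and event shape are assumptions for the example.

```python
import json
from kafka import KafkaProducer

# Publish captured change events into the SMACK pipeline through Kafka.
# The broker address, topic name, and event structure are assumptions.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"table": "orders", "op": "INSERT", "row": {"id": 42, "item": "widget"}}
producer.send("change-events", value=event)  # consumed downstream by Spark
producer.flush()
producer.close()
```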

Open source software offers a best-of-breed approach to a data integration architecture that can handle the sheer volume and complexity of today's data. In the architecture presented, we accommodated the multiple data sources found in the enterprise and combined traditional data integration with a Big Data pipeline. We selected each database based on which one was best suited to the application. We processed data with bulk movement or event processing based on its importance and timeliness to the business. Open source software not only serves all our needs for data integration, but it's also pushing the industry forward with innovative technology and techniques.

