On-Premises Data Integration

Published in SnapLogic Engineering, Aug 20, 2021

An overview of the options for integrating data located On-Premises, behind a customer firewall, using an Integration Platform As a Service (iPaaS).

Introduction

An earlier post discussed integration use cases in a modern enterprise and how an Integration Platform As a Service (iPaaS) can solve enterprise integration challenges. This post drills down into the approaches iPaaS vendors use to support On-Premises data and the pros and cons of each. On-Premises in this case includes data residing in the customer’s own data centers and also endpoints on IaaS services like AWS/Azure/GCP that are not accessible over the public internet.

On-Premises Data Sources

With the rise of SaaS applications, many application endpoints are cloud hosted, meaning they are available over the public internet. Traditionally, customer-owned data centers were the main source of On-Premises data, which is still the case for the largest enterprises. Applications and databases on IaaS services like AWS/Azure/GCP are hosted in the cloud, but some of them are equivalent to an On-Premises data source since they are not reachable over the public internet due to security considerations. The On-Premises data sources include:

Applications/Databases in customer data centers: These are reachable from within the data center only. Opening up access to the public internet is not possible or is highly restricted.
Applications/Databases in customer IaaS platforms: These are reachable from within the customer’s VPC network only. Opening them to the public internet might be possible but is generally not recommended unless the application is designed to be open.
Endpoints with special authentication/authorization requirements: For some endpoints, the origin host sending the request is significant. For example, an S3 bucket or RDS database could be set up with IAM rules to restrict requests to specific hosts or networks.

For all the above, access to the data would be easier when the read/write requests come from within the same network as the data source.
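The origin-based restriction in the last item can be pictured as a check of the caller's address against a list of allowed networks. The following is a minimal sketch using Python's standard `ipaddress` module. The CIDR ranges and the `is_allowed` helper are hypothetical and only illustrate the idea; they do not reflect how any particular IAM service or firewall evaluates its rules.

```python
import ipaddress

# Hypothetical allowlist: a private VPC subnet and a single on-premises host.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.168.5.10/32"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True if the request's source address is inside an allowed network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    # A request from inside the VPC passes; one from a public address does not.
    print(is_allowed("10.20.3.7"))     # True
    print(is_allowed("203.0.113.25"))  # False
```

A request issued from a data plane node outside these ranges would be rejected, which is why processing from within the same network avoids extra access configuration.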

Various Approaches to On-Premises Data

iPaaS platforms allow customers to build integrations easily, using the SaaS model to minimise the installation and maintenance overhead. The general architecture is to have a central control plane where the integration metadata is stored, exposing a UI to allow building integrations through a graphical interface.

The actual data processing happens on a data plane. The data plane fetches the data from the source endpoints, processes the data and then writes to the target endpoints.
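The split between the two planes can be illustrated with a small sketch. This is a simplified model under stated assumptions, not any vendor's implementation: the class names `ControlPlane`, `DataPlane` and `PipelineDefinition` are made up for illustration, and in-memory lists stand in for the source and target endpoints.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class PipelineDefinition:
    """Integration metadata, as it might be stored by the control plane."""
    name: str
    transform: Callable[[Record], Record]


class ControlPlane:
    """Stores pipeline metadata; it never touches the records themselves."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, PipelineDefinition] = {}

    def register(self, pipeline: PipelineDefinition) -> None:
        self._pipelines[pipeline.name] = pipeline

    def get(self, name: str) -> PipelineDefinition:
        return self._pipelines[name]


class DataPlane:
    """Fetches from the source, processes, and writes to the target."""

    def run(self, pipeline: PipelineDefinition,
            source: Iterable[Record], target: List[Record]) -> int:
        written = 0
        for record in source:
            target.append(pipeline.transform(record))
            written += 1
        return written


if __name__ == "__main__":
    control = ControlPlane()
    control.register(PipelineDefinition(
        name="upper_case_names",
        transform=lambda r: {**r, "name": str(r["name"]).upper()},
    ))
    source = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
    target: List[Record] = []
    count = DataPlane().run(control.get("upper_case_names"), source, target)
    print(count, target)
```

The point of the separation is that where `DataPlane.run` executes is independent of where the metadata lives, which is what makes the different on-premises approaches possible.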

iPaaS vendors take different approaches to handling on-premises data. Some vendors do not support on-premises data at all, which limits the use cases such products can support. Among the vendors that do support on-premises data, the approaches include requiring customers to open firewall holes, setting up a VPN tunnel, installing a custom networking agent, or allowing the data plane to run on the customer network.

Firewall Hole Approach: This approach requires the customer to open a firewall hole to allow access to the customer’s on-premises endpoint from the iPaaS provider’s network. The allowlist might cover a wide CIDR range containing all the data plane nodes hosted by the vendor, across tenants. For larger accounts, the allowlist might be a smaller range specific to the particular customer’s data plane nodes.
VPN Tunnel Approach: This approach requires the customer to set up an IPsec VPN tunnel on their network. The customer’s network operations team has to set up the tunnel using the functionality exposed by the networking device used on-premises. Once the tunnel is configured, endpoints within the customer network are accessible from the iPaaS provider’s network. There can be limitations, such as DNS not being supported, which requires the use of static IP addresses.
Custom Networking Agent Approach: This approach involves the customer installing custom networking agent software on-premises. The agent makes outbound connections to the iPaaS provider using WebSockets or HTTP long-polling. The outbound connections are used to proxy requests for data on endpoints within the customer’s network. The endpoint data is passed for processing to the data plane nodes at the iPaaS provider.
On-Premises Processing Agent Approach: This approach involves the customer installing custom agent software on-premises, with the agent doing the actual data processing. The agent makes outbound connections to the iPaaS provider using WebSockets or HTTP long-polling. The outbound connections are used for passing control requests to the agent. The agent talks directly to the on-premises source endpoints, does the data processing and then talks to the target endpoints to write the data.

The various approaches can be classified at a high level as Data Processing at the iPaaS provider (the first three approaches above) and Data Processing On-Premises (the last approach).
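The last approach in the list above, the on-premises processing agent, can be sketched as a loop that receives control requests and runs them locally. This is a rough illustration only: a local `queue.Queue` stands in for the outbound WebSocket or long-polling connection, and the request fields, endpoint lists and function names are all hypothetical.

```python
import queue
from typing import Dict, List

Record = Dict[str, int]

# Stand-in for the outbound channel to the control plane. A real agent would
# hold a WebSocket or long-polling HTTP connection instead of a local queue.
control_requests: "queue.Queue[dict]" = queue.Queue()

# Hypothetical on-premises endpoints, reachable only from the agent's network.
ON_PREM_SOURCE: List[Record] = [
    {"order_id": 1, "amount": 40},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 900},
]
ON_PREM_TARGET: List[Record] = []


def handle_request(request: dict) -> dict:
    """Run one control request locally and return only a status summary."""
    threshold = request["min_amount"]
    selected = [r for r in ON_PREM_SOURCE if r["amount"] >= threshold]
    ON_PREM_TARGET.extend(selected)
    # Only metadata goes back over the outbound connection, not the records.
    return {"pipeline": request["pipeline"], "records_written": len(selected)}


def drain_requests() -> List[dict]:
    """Process every pending control request and collect the status replies."""
    replies = []
    while True:
        try:
            request = control_requests.get_nowait()
        except queue.Empty:
            return replies
        replies.append(handle_request(request))


if __name__ == "__main__":
    control_requests.put({"pipeline": "large_orders", "min_amount": 100})
    print(drain_requests())
    print(ON_PREM_TARGET)
```

In this model the records move only between `ON_PREM_SOURCE` and `ON_PREM_TARGET`. A networking agent, by contrast, would send the selected records themselves back over the connection for processing at the provider.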

Data Processing at the iPaaS provider

All data is pulled into the iPaaS provider’s network before being processed. The firewall hole approach, the VPN tunnel approach and the custom networking agent approach are in this category.

Pros

Limited infrastructure provisioning is required on-premises: only the VPN or networking agent needs to be provisioned. No data processing nodes are provisioned locally.

Cons

No data locality: Data leaves the customer’s network before it is processed. This has severe performance and security implications.
Complex networking setup: The VPN tunnel approach can have complex configuration requirements, and each new on-premises location has to repeat this setup. This approach does not work when the customer needs to send the request through an on-premises HTTP proxy.
Complex connector configuration: With the networking agent approach, the endpoint connection configuration can be complex because the requests have to be proxied through the agent. Depending on the protocol used to talk to the endpoint, such as HTTP, JDBC or SFTP, only a limited set of endpoints might be accessible through the agent.
Endpoints with special authentication/authorization requirements, such as IAM-based access, cannot be accessed without non-trivial configuration.

Data Processing On-Premises

The data processing happens on the customer’s on-premises instances. Data is not sent out of the customer’s network and data locality is maintained by doing the processing closer to the data source.

The on-premises processing agent approach is in this category.

Pros

Data locality means processing happens close to the data source. This has performance and security advantages.
Once agent software is installed, all on-premises endpoints can be accessed without any custom network settings.
Endpoints with special authentication/authorization requirements, such as IAM-based access controls, work as usual.

Cons

Infrastructure provisioning is required on-premises for data processing nodes.
Agent software installation and software upgrades are required on-premises.

Data Locality

There are several advantages to doing the data processing close to the location of the data:

Performance: For use cases where data has to be filtered, doing the processing remotely would require all the data to be pushed to the remote location before the processing can be done. Not all data sources support push-down optimizations. The performance difference can be even more significant when the source and target endpoints are co-located.
Security: Doing the processing locally means that the customer data does not leave their premises, reducing the risk of security issues and making audits easier.
Isolation: Doing the processing on-premises has the advantage that processing can continue if there is a network disruption causing communication issues with the iPaaS control plane.
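The performance point can be made concrete with a small count of how many records have to cross the network in each case. The dataset below is invented purely for illustration (1,000 records, of which every 50th is an error), and the two helper functions are hypothetical. Real transfer costs also depend on record size, compression and whether the source supports push-down filtering.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def is_error(record: Record) -> bool:
    """Filter used by the example pipeline: keep only error records."""
    return record["status"] == "error"


def transferred_when_remote(records: List[Record],
                            keep: Callable[[Record], bool]) -> Tuple[int, int]:
    """Remote processing: every record leaves the network, then is filtered."""
    shipped = list(records)
    kept = [r for r in shipped if keep(r)]
    return len(shipped), len(kept)


def transferred_when_local(records: List[Record],
                           keep: Callable[[Record], bool]) -> Tuple[int, int]:
    """Local processing: records are filtered first, only matches move on."""
    kept = [r for r in records if keep(r)]
    return len(kept), len(kept)


if __name__ == "__main__":
    # Illustrative data: every 50th record is an error, the rest are fine.
    records = [
        {"id": i, "status": "error" if i % 50 == 0 else "ok"}
        for i in range(1000)
    ]
    print(transferred_when_remote(records, is_error))  # (1000, 20)
    print(transferred_when_local(records, is_error))   # (20, 20)
```

Both variants produce the same 20 result records, but in this made-up case the remote variant moves 50 times as many records over the network to get there.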

SnapLogic Solution for On-Premises Data

SnapLogic uses the on-premises data processing agent approach. For customers who do not have any on-prem sources, all processing can happen on data-plane nodes managed by SnapLogic. For customers with on-prem use cases, the SnapLogic agent software can be installed on the customer’s infrastructure. This provides all the data locality advantages listed above. To address the downsides of having to install agent software on-premises, the agent is designed with various features like:

Lightweight installation: Stateless agent software means the installation is lightweight, with no need to configure persistent storage.
Resource management options: Support for various installation configurations including VMs, standalone Docker, Kubernetes and Mesosphere
WebSocket-based networking: The processing agents use WebSockets to communicate with the control plane, enabling bidirectional communication through the customer’s firewall. Using the standards-compliant WebSocket protocol ensures that the communication can go through proxies without any changes required on the customer’s network.
Automatic upgrades: The customer controls when the agent software gets upgraded. Customers can take a hands-on approach, deciding when the upgrades happen in each environment, or a hands-off approach, with SnapLogic automatically applying the upgrades.

With this approach, SnapLogic supports enterprise customers who have data processing spread across multiple data centers and IaaS services around the world. The control plane provides a single interface to manage and monitor all the integrations, with the actual data processing being done at the most suitable location. Use cases requiring low-latency access to endpoints like Kafka queues are possible with SnapLogic, whereas they are not feasible when the data processing is done at the iPaaS provider.

Written by Ajay Kidave