What are Integration Runtimes?

The Azure services Data Factory, Purview, and Synapse Workspace all require a Self-Hosted Integration Runtime to be installed in your network. What is it, and why must I install it locally?

Carl Follows
Version 1
4 min read · Aug 4, 2022


Red pipes disappearing into the cloud. Photo by JJ Ying on Unsplash

Why do I need an Integration Runtime?

When deploying services in the cloud that require data from an existing on-premise network, you need to either move that data to the cloud or make it available to the new cloud service from its current location.

Any network will have a firewall to restrict connectivity from external services, and those that are allowed access to the network must be trusted by the network’s identity service if they are to be authorised to move data.

In practice, administrators have far greater control, and governance is stronger, if the data movement happens within the network and authenticates using an identity controlled by the organisation.

This is what the integration runtime provides: compute power, running within your network, that is granted appropriate access to data sources and has a trusting relationship with the cloud service that controls it.

The cloud service passes instructions to the integration runtime, but it’s the integration runtime that connects, authenticates, and does the data manipulation & movement.
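
As a rough sketch of how that trusting relationship is established: the runtime is first registered against the cloud service, which issues an authentication key that you enter when installing the runtime software on a machine inside your network. The example below uses the azure-mgmt-datafactory and azure-identity Python packages; the subscription, resource group, factory and runtime names are placeholders rather than anything from a real environment.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Placeholder names - substitute your own subscription, resource group and factory.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-data-platform"
FACTORY_NAME = "adf-demo"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Register a Self-Hosted Integration Runtime definition in the data factory.
client.integration_runtimes.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "OnPremSHIR",
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(
            description="Runtime hosted on a VM inside the on-premise network"
        )
    ),
)

# Retrieve the authentication keys; one of these is entered when the runtime
# software is installed on the on-premise machine, establishing the trust
# between that machine and the data factory that controls it.
keys = client.integration_runtimes.list_auth_keys(
    RESOURCE_GROUP, FACTORY_NAME, "OnPremSHIR"
)
print(keys.auth_key1)
```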

As well as solving network and identity management challenges, locating this compute within the on-premise network can also improve performance and ensures compliance with any geographic data governance policies.

How is it used?

In Azure, Data Factory is the most commonly used data movement service. It is made up of pipelines that run activities against your data. The name pipeline may make you visualise data flowing through it, but in truth a pipeline is a sequence of orchestrated activities that run across a number of Integration Runtimes.

The data factory sends instructions out to the Integration Runtimes, which are responsible for the data manipulation & movement.

Software architecture diagram: the firewall and the Integration Runtime moving the data
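
How does the data factory know which runtime a connection should go through? Each linked service (the connection definition) can name the Integration Runtime it is dispatched to. Below is a minimal sketch using the same Python SDK and the client created above; the server, database, credential and runtime names are hypothetical.

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    SqlServerLinkedService,
    SecureString,
    IntegrationRuntimeReference,
)

# A connection to an on-premise SQL Server, dispatched via the
# Self-Hosted Integration Runtime rather than Azure's own compute.
linked_service = LinkedServiceResource(
    properties=SqlServerLinkedService(
        connection_string="Server=onprem-sql01;Database=Sales;User ID=adf_reader;",
        password=SecureString(value="<secret>"),
        connect_via=IntegrationRuntimeReference(reference_name="OnPremSHIR"),
    )
)

client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "OnPremSalesDb", linked_service
)
```

The password supplied here is exactly the kind of credential discussed next.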

An important security consideration is how the credentials for each data source are entered and where they are stored. As we can expect from Microsoft, this is well handled: the credentials are encrypted using the Windows Data Protection Application Programming Interface (DPAPI) when the pipelines are published, then saved locally on the virtual machine hosting the Integration Runtime.

Data Factory is not the only service using the Integration Runtime.
Purview is a data catalogue that needs to gather metadata about each data source. It can be thought of as a specialist implementation of a data factory where the only data being moved is metadata. Since this scanning is standard for each type of data source, no further configuration is required beyond setting up the integration runtime.

How many do I need?

Generally, the answer is the same as the number of networks you have. More accurately, consider that most data movement is between networks, so each Integration Runtime will need access to both the source and the sink for the tasks it's responsible for.

This is a lot easier to explain with some visuals, so let’s assume you are moving data out of your on-premise network and into a new cloud environment. As with all data movements you might need to do a bit of transformation or validation en route. For this let’s use Azure Synapse Analytics; Synapse is several Azure services (including Data Factory) combined into a platform for moving and manipulating data.

Software architecture diagram: Synapse Workspace highlighting Data Factory and the Integration Runtimes

Here you can see that there is an Integration Runtime installed within:

  1. The source network, to extract data.
  2. The sink network, to move data to its destination.
  3. The Synapse Analytics network, for any transformation activities.
    This one is called auto-resolved since it doesn’t need to be managed separately.

Even though the sink is in the cloud, it will be in a discrete network behind a firewall, so it still requires an Integration Runtime for Data Factory to access it.
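
As a hedged sketch of what the Data Factory part of that setup looks like in the same Python SDK: each dataset below is assumed to be bound to a linked service that names its own Integration Runtime, and the runtime that ends up executing the copy must be able to reach both the source and the sink. All names are illustrative.

```python
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    SqlServerSource,
    AzureSqlSink,
)

# Copy data from a table behind the on-premise runtime to a table in the
# cloud network. Data Factory dispatches the activity to a self-hosted
# runtime, which needs network access to both ends of the copy.
copy_sales = CopyActivity(
    name="CopySalesToCloud",
    inputs=[DatasetReference(reference_name="OnPremSalesTable")],
    outputs=[DatasetReference(reference_name="CloudSalesTable")],
    source=SqlServerSource(),
    sink=AzureSqlSink(),
)

client.pipelines.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "MoveSalesData",
    PipelineResource(activities=[copy_sales]),
)
```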

Can it connect to anything?

There is a huge list of data sources & services supported by the Integration Runtime, but some of them require additional software on the virtual machine where it’s installed. For example:

  • Oracle Data Access Components (ODAC)
    if you need to interact with an Oracle database
  • Java Runtime Environment (JRE)
    if you need to read/write parquet file format
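
The runtime won’t warn you about these prerequisites up front, so a quick check on the host VM can save a failed pipeline run later. The snippet below is purely illustrative and not part of the Integration Runtime; it just assumes Python is available on that machine.

```python
import shutil
import subprocess

# Illustrative pre-flight check: confirm a Java runtime is visible on the
# host VM before attempting to read or write parquet through the runtime.
java = shutil.which("java")
if java is None:
    print("No JRE found on PATH - parquet activities are likely to fail.")
else:
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # 'java -version' writes its report to stderr.
    print(result.stderr.strip())
```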

What if we need scale?

If there’s a lot of data to move, or there is a time-critical requirement, then you may need to increase the computing power available to the Integration Runtime. There are a few options for this:

  • Scale out the Integration Runtime by deploying it across up to 4 machines (nodes) to share the load.
  • Scale up the host virtual machine’s CPU or RAM.
  • Scale up the Integration Runtime itself by allowing more jobs to run concurrently.

The choice depends on the volume of data, number of sources, and frequency of movements, but you must also consider the capacity of the data sources to provide the data. If the data source is constrained, then increasing the capacity of the Integration Runtime will provide little return.
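
The last option, raising the number of concurrent jobs, can be changed without touching the host machine at all. Here is a sketch using the same management SDK as earlier, assuming a runtime named OnPremSHIR with a registered node called Node_1 (both hypothetical names).

```python
from azure.mgmt.datafactory.models import UpdateIntegrationRuntimeNodeRequest

# Raise the number of jobs a single node of the runtime will run at once.
# The node name comes from the machines registered against the runtime.
client.integration_runtime_nodes.update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "OnPremSHIR",
    "Node_1",
    UpdateIntegrationRuntimeNodeRequest(concurrent_jobs_limit=8),
)
```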

About the Author

Carl Follows is a Data Analytics Solution Architect at Version 1. Follow our Medium Publication for more Data blogs, or visit www.version1.com to find out more about our services.
