Hybrid Cloud/On-premise Deployment — Using Controller/Agent Architecture to Enable Communication with On-premise Resources

Jakub Moravec
MANTA Engineering Blog
6 min read · Aug 1, 2022

Transforming an on-premise software product into a cloud-native SaaS application brings many challenges. This article focuses on the most immediate one: how to preserve interaction with on-premise resources.

MANTA’s data lineage platform provides customers with lineage information generated from the actual code of the source systems. Our scanners connect to various parts of customers’ environments, automatically gather all metadata, and reconstruct complete lineage. The source systems are most typically databases, reporting tools, and ETL tools, some of which are SaaS offerings (e.g., Snowflake, Tableau, StreamSets). However, there are still a lot of enterprise on-premise solutions that our customers also need lineage support for (e.g., Teradata, Cognos, Informatica PowerCenter).

Once we deploy a MANTA instance into the cloud as a SaaS solution, our challenge is to support extraction from the systems listed above and similar ones, which are typically well-hidden from the world outside the customer’s infrastructure.

Problem Complexity

Several challenges have to be considered to appreciate the complexity of maintaining interaction with on-premise resources.

The first one is communication protocols. The technologies listed above nicely illustrate the diversity of the communication protocols that have to be supported by the solution. We connect to Teradata using JDBC, Cognos metadata can be extracted via a REST (over HTTP) API, and PowerCenter can only be accessed by a client application (executed via shell) installed on the same machine as PowerCenter. These are just three examples of the fifty-plus technologies we support, and they already cover a solid portion of the communication protocols the solution has to handle.
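To make this concrete, here is a minimal sketch of how such protocol diversity can be hidden behind a single extractor abstraction. The interface and class names are illustrative assumptions, not MANTA’s actual API:

```java
import java.nio.file.Path;

/** One abstraction over all extraction protocols (names are illustrative). */
interface MetadataExtractor {
    /** Connects to the source system and writes raw metadata to outputDir. */
    void extract(Path outputDir) throws Exception;
}

/** Talks to databases such as Teradata over JDBC. */
class JdbcExtractor implements MetadataExtractor {
    public void extract(Path outputDir) throws Exception {
        // open a java.sql.Connection, run dictionary queries, dump the results
    }
}

/** Calls REST APIs over HTTP, e.g., the Cognos metadata endpoints. */
class RestExtractor implements MetadataExtractor {
    public void extract(Path outputDir) throws Exception {
        // issue java.net.http.HttpClient requests, persist the JSON/XML responses
    }
}

/** Invokes a locally installed client tool via shell, e.g., PowerCenter utilities. */
class ShellExtractor implements MetadataExtractor {
    public void extract(Path outputDir) throws Exception {
        // launch the tool with ProcessBuilder and capture its output files
    }
}
```

Each implementation then owns its protocol-specific details, while the rest of the agent can treat all source systems uniformly.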

Another challenge is connectivity and security. We need to be able to connect from the cloud to source systems hidden in customers’ intranets, and the connection has to be established in a manner that complies with customers’ security requirements. Asking customers to configure their firewalls to allow inbound traffic from the internet to the required ports clearly doesn’t satisfy the latter condition.

Even if we manage to establish the connection with the source system, the job is still not done. The extraction process itself is performance-sensitive. The details are technology-specific, but regardless of whether the extraction is done over JDBC or REST, it typically consists of a lot of requests to the source system. Communication latency and the geographical distance between the source system and the extractor significantly affect the overall extraction time. The output of the extraction can also be a significant amount of data that has to be transmitted to the SaaS application efficiently and reliably.
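To illustrate with assumed numbers: an extraction issuing 100,000 small dictionary queries over a link with 40 ms round-trip latency spends 100,000 × 0.04 s ≈ 67 minutes on network round-trips alone, before any query even executes; the same chatter inside the customer’s network, at sub-millisecond latency, costs under two minutes. Keeping the extractor close to the source also preserves optimization options, such as batching rows per round-trip. A hedged JDBC sketch of one such optimization (the query is a Teradata dictionary view; the fetch size and connection details are illustrative, not MANTA’s actual extractor code):

```java
import java.sql.*;

/** Sketch only: cutting round-trips during a chatty JDBC extraction. */
class FetchSizeExample {
    static void dumpTables(String jdbcUrl, String user, String password) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement()) {
            // Each fetch batch costs one network round-trip; a larger batch
            // trades client memory for far fewer trips over a slow link.
            stmt.setFetchSize(10_000);
            try (ResultSet rs = stmt.executeQuery("SELECT TableName FROM DBC.TablesV")) {
                while (rs.next()) {
                    System.out.println(rs.getString("TableName"));
                }
            }
        }
    }
}
```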

The last, but definitely not least, challenge is the cost of ownership of any component installed in a customer’s environment. One of the reasons why SaaS deployment is profitable for software companies is that customers no longer need heavy infrastructure and dedicated IT teams to operate on-premise applications. Any solution must therefore not reintroduce such needs, because if it does, it undermines the benefits of SaaS deployment in the first place.

Choosing the Best Solution

At the moment, there are no clearly defined best practices to solve this problem.

Some solutions (e.g., AWS Direct Connect, Azure Logic Apps Integration Service Environment) utilize networking concepts and technologies such as VPNs and VLANs. The main limitation of this approach for our use case is the network round-trip between the extractor and the source system, which impacts the overall performance of the process and leaves us without optimization possibilities. The same applies to thin-client solutions (such as Hybrid Data Pipeline), but these have another disadvantage: such a client needs to explicitly support the technology we connect to, or at least the protocol (in the case of JDBC).

With neither of these approaches feasible for our use case, we decided to design our own solution: a thick client that actually performs the whole extraction and only sends the outputs to the rest of the MANTA platform. This design implies that essentially the whole solution has to be engineered in-house, so the idea of using an existing third-party solution no longer makes sense. So, let’s see what design choices we have.

The first one is the architecture and deployment of the application (let’s call it an agent) itself. Some vendors have decided to deploy these on-premise agents on top of a Kubernetes cluster. While this approach has clear benefits, such as being able to operate the agent in high-availability mode, there are also drawbacks. A Kubernetes cluster has to either already be installed on the host or be shipped with the agent deliverables.

The resources needed to run the cluster are also not negligible, and should any manual intervention from a customer administrator be needed during debugging, cross your fingers and hope that the administrator is a Kubernetes guru. Sure, typical IT departments have such people on their teams, but remember that the customer wanted a SaaS service precisely so they wouldn’t need a large number of highly skilled IT administrators. Which factors are more important? That’s for every vendor to decide. For us, lightness and simplicity are more important at the moment, especially as we are able to easily restart the whole extraction process and spawn new instances on the same host if needed, even without Kubernetes, as the sketch below shows.
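For illustration, respawning a lightweight single-process agent does not need an orchestrator; a plain supervisor loop is enough. A minimal sketch, assuming the agent ships as an executable JAR (the file name "agent.jar" is made up):

```java
import java.io.IOException;

/** Sketch: supervising a single-process agent without Kubernetes. */
class AgentSupervisor {
    public static void main(String[] args) throws IOException, InterruptedException {
        while (true) {
            Process agent = new ProcessBuilder("java", "-jar", "agent.jar")
                    .inheritIO()  // share the supervisor's stdout/stderr for easy debugging
                    .start();
            int exitCode = agent.waitFor();
            if (exitCode == 0) break;  // clean shutdown; any other exit restarts the agent
        }
    }
}
```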

Once we know how the agent will be deployed, the second big design choice is how it will communicate with the rest of the platform. Let’s review the requirements. The agent needs to perform the extraction in reaction to requests received from the SaaS application. At the same time, there can be no inbound communication to the agent, as it would be stopped by firewalls and other networking and security measures. Another aspect is the potentially high volume of the outputs that have to be shared with the SaaS application. Various technologies can be used; we specifically considered reactive streams, gRPC, and messaging.

While all of these technologies are well equipped to cope with the data volumes and performance expectations we have, messaging utilizing a publish-subscribe pattern elegantly copes with the restriction on inbound communication. The messaging client, whether acting as a producer or a consumer, is the one that initiates the connection to the messaging broker. As a result, the client itself does not have to expose any port publicly. That way, it can receive commands from the SaaS application while there is no inbound communication to the agent, and no workarounds such as polling are necessary. Also, because MANTA already has a messaging broker well-established in our platform, using it for this use case was an easy decision. Another big benefit is that this will enable easier scaling of the solution.
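A minimal agent-side sketch of this pattern, written against the generic JMS API (the broker choice, queue names, and message handling are assumptions, not MANTA’s actual implementation):

```java
import javax.jms.*;

/** Sketch: the agent opens the connection to the broker, so no inbound
 *  port is ever exposed in the customer's network. */
public class AgentMessagingLoop {
    public void run(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();  // outbound connection only
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        MessageConsumer commands = session.createConsumer(session.createQueue("agent.commands"));
        MessageProducer results  = session.createProducer(session.createQueue("agent.results"));
        connection.start();

        while (true) {
            Message command = commands.receive();            // blocks until the SaaS side publishes
            TextMessage reply = session.createTextMessage(execute(command));
            results.send(reply);                             // extraction outputs flow back the same way
        }
    }

    private String execute(Message command) {
        return "extraction output";  // placeholder for running the actual extraction
    }
}
```

The key detail is that both the consumer and the producer sit on top of a connection the agent itself opened, so from the firewall’s perspective all traffic is outbound.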

With these two questions answered, we already have a good understanding of how the high-level design of the solution works.

Controller/Agent Architecture

If we take a look back at the design and try to describe it in software engineering terminology, the one architectural pattern that we can match it with is the master/slave architecture.

As you know, there has been a lot of discussion about the name of this pattern during the past few years, and a lot of alternative naming conventions have been introduced. I want to use this opportunity to think about how much sense that makes from the technical perspective.

Let’s take a look at the first two alternative names listed on Wikipedia, for example:

  • Controller/Agent
  • Primary/Secondary

It is obvious that the two terminologies describe very different types of software systems, with very different non-functional requirements. Primary/Secondary suggests that the system utilizes some sort of (data) redundancy and/or a fallback mechanism, while Controller/Agent makes me think of workload distribution to geographically distributed nodes (because an agent is used, as opposed to the more general worker). This paints a picture of two completely different systems, and that is, in the end, one of the reasons why we create standards and patterns: to be able to imagine how an unknown system works just from the name of the architecture, pattern, or standard.

It’s probably a good thing that we (software engineers) have started re-evaluating some of these terms that are now half a century old. Maybe the driver was external, but it seems that the outcome might be new standards in the making that better reflect the current state of the software world.
