Unifying On-Premise Databases into a Federated Cloud GraphQL API

Published in

Yonder TechBlog

7 min readDec 26, 2023

A black and white diagram that displays a cloud hosted graphQL API that aggregates multiple databases. An user consumes this API — How can you access multiple on-premise data stores at the same time?

The need of hybrid cloud

In today’s data-driven landscape, some companies often find themselves in the situation of managing and querying multiple databases hosted on-premise. The reasons can vary from business requirements, multiple software acquisitions, legacy apps integrations and many more. While the on-premise model offers control and security (cough), concurrently interacting with diverse databases introduces a lot of challenges that organisations must handle correctly. In some cases it is pretty easy to migrate data to the cloud using a lift-and-shift approach but sometimes, and probably in the most of the cases it is business-wise to keep your data and infrastructure untouched. But let’s imagine a hybrid cloud scenario where you want to integrate these databases with a cloud hosted application or you built a mobile client that must consume on-premise data. In this article will explore a possible solution of solving this kind of problems by unifying multiple on-premise hosted database in a single cloud-deployed API.

Challenges

Let’s be honest, organisations vary in their technical proficiency, and not all of them achieve perfection in this regard. There are a lot use cases of applications that are still not in the cloud and there many reasons not to do it: migration costs, compliance, data residency requirements, good ol’ desktop applications and many more. But this doesn’t mean that your product is obsolete and being on-premise is a showstopper for you in terms of innovation. You can still build new functionalities or even companion apps that may extend your product lifecycle using powerful cloud services without the need of migrating the entire software product ecosystem in the cloud. That’s why the hybrid cloud has emerged as a versatile solution for these types of problems, yet it comes with its own set of challenges.

Hybrid Connection

For a hybrid cloud to work well, the connection between the cloud and the on-premise environment is the key and without it you cannot achieve a successful integration. For example large enterprise companies requires corporate networks that may interfere with the maintaining a reliable hybrid connection so there is need of additional complex configurations to achieve that. Some companies opt for site-to-site VPNs or public and internal firewalls but all these things require a lot of network and security configurations that might be costly and delay the integration.

Data Unification

Another challenge is how we can aggregate multiple data sources from multiple physical locations into a single API. Even if you create a secure connection between the cloud and on-premise environment you still have to implement a façade that should manage all the on-premise connections and to unify all the data in a single API, so the consumer can get data concurrently from different on-premise data stores in a single request. You can achieve that by integrating an ETL system that will extract and uniform all on-premise databases in cloud but this can rise different synchronisation problems that must be addressed. You can also build an API Gateway that is fetching from each service but development-time can be costly if you want to have something generic. In recent years, GraphQL, together with Apollo Federation emerged as a solution (called Federated Architecture) for this kind of problems. By using Apollo it is require that each on-premise database is exposing a GraphQL API. Now this can be a challenge because it requires additional on-premise components that should be developed and maintained.

In this article, I will present a method for establishing a secure and reliable hybrid connection that is easily configurable. The proposed solution involves implementing a GraphQL Federated Architecture, enabling direct communication with the database without the requirement for on-premise supplementary API layer.

System Design

Requirements

Before getting into technical details, I will illustrate a concrete use case to facilitate the explanation. Let’s imagine the following scenario:

Let’s suppose we have a tech conglomerate with a vision for innovation an growth called TechHub Enterprises, where their main mission is acquisition of various small and medium digital business. Among these acquisitions are “Northwind”, a thriving e-commerce platform and “Chinook”, a well-known vinyl store. This technological conglomerate acknowledge the need of a centralised application that acts as an unified hub for employee data and it be easily integrated with the recently bought companies databases. The goal for this app is to streamline HR processes, analyse employee data and apply data-driven insights to enhance organisational efficiency and decision-making.
As TechHub continues to evolve and integrate these business entities, a common challenge arrises — Each acquired company maintains its own on-premise database infrastructure, making it a complex task to get valuable insights from employee data and ensuring that employee records are up-to-date and accessible in real-time.

Based on the scenario we can extract some non-functional requirements:

Integrability: Integration with on-premise acquired companies data
Interoperability: Support different types of databases
Concurrency: Support concurrent data retrieval from multiple data sources
Ease of Configuration: Provide a solution that is highly configurable enabling rapid deployment and flexibility
Plug-and-Play: Enabling the addition of a new database without requiring extensive manual configuration or coding.

Proposed solution

Technologies

ngrok: Manages a worldwide network of servers called ngrok edge that accepts traffic to upstream clients from internet. The nice thing about it does not forward to IP addresses like a traditional reverse-proxy. In our case, you have just to run inside on-premise network, a lightweight software (ngrok agent) alongside with the database which connects to edge servers using a secure outbound persistent TLS connection. When the client sends traffic to an edge endpoint, it is redirected to the agent via TLS connection and the agent will communicate with your internal database using a TCP connection. This technology can be configured to support additional security features like IP whitelisting, mTLS, integration with OpenID, OAuth and many more.
Hasura: It is a GraphQL query engine that makes your data accessible via GraphQL API, just by providing a database connection string. You don’t have to implement the API by yourself. For this application, Hasura also helps to federate multiple data sources into a single API, so you can fetch data from different servers in a single request. Beside that you can configure security policies at data level, validate data & extend the API with business logic using the actions module or integrate with an existing authentication provider. You can see Hasura as a GraphQL oriented Backend-as-a-Service (BaaS) that supports connecting to a bunch of database types and existing APIs.

Implementation

In this section, I will present a high-level step-by-step guideline to get the technologies that I’ve described above, up and running. If you want to run in production, please study the documentations by yourself to configure and to apply the right security settings correctly.

Hybrid Connection using ngrok

Create an ngrok account
Access your authentication token

Can be accessed through ngrok dashboard user interface

3. SSH into the on-premise machine (where the database is hosted)

ssh myuser@onprem

4. Install & run agent using your preferred method (e.g. Docker)

Dashboard can offer you multiple ways to install the agent by choosing the platform you want

docker run --net=host -it -e NGROK_AUTHTOKEN=<YOUR AUTHTOKEN> ngrok/ngrok:latest tcp <DB HOST>:<DB PORT>

5. Verify the connection

Under “Agents” section, you can look over all active agents & endpoints.

Data Federation using Hasura

Connect Hasura to databases using ngrok TCP endpoints

Under “Data” tab, select the database type and connect to it by using a connection string

2. Track required data tables & relationship

For the current requirements, only employee data & relationship is required

3. Repeat the same steps for other databases

4. Test a GraphQL query

Fetching employee data from two on-premise databases

Further improvements:

Even though the setup was simple and we resolved the problem, there are still missing parts that must be addressed when deploying a production-ready solution. The hybrid connection must be configured to support authentication and authorization by integrating it with an authentication provider. Another nice-to-have feature is enabling mTLS to establish a zero-trust infrastructure, ensuring that even ngrok cannot see the payload the client sends. For Hasura, we can enhance observability by monitoring incoming and outgoing traffic, providing an analytical view of the system. At the data level, we can configure different rights and security mechanisms to ensure that data is not altered or viewed by unauthorized individuals. If functionalities require data validations, Hasura custom actions can be employed to achieve this.

Conclusion

In this article, I present the challenges and solutions to unify multiple databases hosted on-premise into a single API. I used ngrok to facilitate a hybrid connection and Hasura to federate multiple data sources into a GraphQL API. The solution presented is easy to configure and adapt, requiring no code to develop the API or integrate the databases.

What are your thoughts regarding the proposed solution? How do you typically approach addressing this type of problem?