What does it mean to ingest data securely into your cloud?

Raghotham Murthy
Datacoral
4 min read · Jan 13, 2020

A common challenge for companies is centralizing data from many systems, such as databases, sales tools, HR systems, finance systems, marketing tools, file systems, and even event streams, into a data warehouse for analysis. Each of these sources carries a different level of data privacy and security concern. Several of them, like production databases, finance systems, and HR systems, hold highly sensitive data: personally identifiable information (PII), protected health information (PHI), salary information, and financial transactions of the company's employees or customers. With data privacy and security regulations like GDPR and CCPA, it is even more important for companies to have fine-grained control over, and visibility into, who gets to access what kind of data.

Several ingest tools are available for centralizing data into data warehouses. Each of these tools involves trade-offs among ease of use, robustness and scalability, and data security.

Working with SaaS ingest tools

Ingest tools that are provided as managed services (SaaS) are easy to use: there is no need to deploy and manage software or provision hardware. These tools are built as multi-tenant architectures that run inside the vendors' own cloud accounts. As a result, customers must open up their network firewalls so that the tools can connect to systems that sit behind those firewalls.

Figure 1 illustrates a potential network topology of an ingest pipeline using a SaaS tool. The picture assumes that companies have some network-level separation between their production environments (Production VPC) and their analytics environments (Analytics VPC), and that they use VPC peering to communicate between the two VPCs. This type of setup is both common and encouraged as a way to separate access to production environments as well as to data.

Figure 1: Network architecture of SaaS ingest tools

Once the network firewalls are opened up, customers give the tool the credentials for the data sources and the data warehouse. The credentials themselves are stored within the SaaS tool in a credentials database.

Companies end up taking a big risk storing credentials of their applications, production databases, and data warehouses in a third party vendor’s environment where they have no control over who gets to see that information or the kind of security measures that are in place for the information.

The ingest tool reads the credentials, reaches into the data sources, fetches the data from the sources, and reaches into the data warehouse to copy the data into it. And in most cases, the tool also stores the data within its own cloud account (albeit temporarily) in order to provide robustness and data replay capabilities.
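The flow described above can be sketched in a few lines of Python. This is only an illustrative model, not any vendor's implementation: the `IngestTool` class, its `creds_db` and `staging` fields, the `orders_db` source, and the placeholder secret are all assumptions made up for the example. The point it shows is where the sensitive pieces end up, since both the credentials and a copy of the data live inside the tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class IngestTool:
    """Hypothetical model of a multi-tenant SaaS ingest tool."""

    # Customer credentials, held in the vendor's environment.
    creds_db: Dict[str, str]
    # Batches kept (temporarily) in the vendor's account to allow replay.
    staging: List[List[Row]] = field(default_factory=list)

    def run(
        self,
        source_name: str,
        fetch: Callable[[str], Iterable[Row]],
        load: Callable[[List[Row]], None],
    ) -> int:
        secret = self.creds_db[source_name]   # 1. read the stored credentials
        batch = list(fetch(secret))           # 2. reach into the data source
        self.staging.append(batch)            # 3. keep a copy for replay
        load(batch)                           # 4. copy into the warehouse
        return len(batch)

    def replay(self, load: Callable[[List[Row]], None]) -> int:
        """Reload every staged batch without touching the source again."""
        for batch in self.staging:
            load(batch)
        return sum(len(batch) for batch in self.staging)


def fetch_orders(secret: str) -> Iterable[Row]:
    """Stand-in for a production database that checks the credential."""
    if secret != "placeholder-secret":
        raise PermissionError("bad credentials")
    return [{"order_id": 1, "email": "a@example.com"},
            {"order_id": 2, "email": "b@example.com"}]


warehouse: List[Row] = []
tool = IngestTool(creds_db={"orders_db": "placeholder-secret"})
loaded = tool.run("orders_db", fetch_orders, warehouse.extend)
```

After `run` returns, `tool.creds_db` and `tool.staging` both hold customer-sensitive material, which is exactly the exposure the preceding paragraphs describe.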

Furthermore, to comply with regulations like GDPR, the SaaS ingest vendors themselves have to spin up their entire platform in different geographical regions, a significant effort, the cost of which is passed along to the customer.

Companies typically blacklist data sources that hold sensitive data and prevent those sources from being exposed to the internet. When companies want to analyze such data in the warehouse, they have few options other than building the ingest pipelines themselves, either from scratch or with open source tools as a starting point, and then deploying, monitoring, and maintaining them over time.

A secure, robust, compliant, and cost effective ingest solution

With the advent of serverless technologies, there is very little benefit in building multi-tenant architectures. Instead, SaaS vendors could spin up their services directly within the corporate firewalls of customers. One can then imagine a secure SaaS ingest pipeline within the customer network by leveraging a serverless ‘multiple single-tenant’ architecture. The network topology for a secure ingest pipeline would look like Figure 2:

Figure 2: Network architecture of Secure SaaS Ingest tools

In this architecture, the ingest pipeline runs within the analytics VPC, and VPC peering routes traffic directly between the two VPCs without any need to punch holes in firewalls and allow internet traffic. The Secure SaaS Ingest tool would be built as a serverless

This architecture is secure by design: it does not expose the analytics infrastructure or, more importantly, the production infrastructure to internet traffic, so neither becomes susceptible to attacks from the internet.
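The "no internet exposure" property can be made checkable. The sketch below is a minimal illustration using Python's standard `ipaddress` module; the function name `is_internal_only` and the CIDR ranges for the two VPCs are hypothetical values chosen for the example, not taken from any real deployment. It treats a set of ingress rules as safe only if every allowed source range falls inside one of the peered VPC ranges.

```python
import ipaddress
from typing import Iterable


def is_internal_only(ingress_cidrs: Iterable[str],
                     vpc_cidrs: Iterable[str]) -> bool:
    """Return True if every ingress range lies inside one of the VPC ranges."""
    allowed = [ipaddress.ip_network(cidr) for cidr in vpc_cidrs]
    for cidr in ingress_cidrs:
        net = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
        if not any(net.version == vpc.version and net.subnet_of(vpc)
                   for vpc in allowed):
            return False
    return True


# Hypothetical address ranges for the two peered VPCs.
PRODUCTION_VPC = "10.0.0.0/16"
ANALYTICS_VPC = "10.1.0.0/16"
PEERED = [PRODUCTION_VPC, ANALYTICS_VPC]

# Pipeline inside the analytics VPC: the source only admits peered traffic.
peered_rules_ok = is_internal_only(["10.1.0.0/16"], PEERED)

# Multi-tenant SaaS tool: the firewall has been opened to the internet.
open_rules_ok = is_internal_only(["10.1.0.0/16", "0.0.0.0/0"], PEERED)
```

With the peered setup, the production database's ingress rules never need a range outside the two VPCs, so the check passes; any rule that admits `0.0.0.0/0` fails it.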

At Datacoral, we are all-in on data security. Our patent-pending serverless architecture has made it really easy for us to provide a fully managed data pipeline within your Virtual Private Cloud (VPC). Our customers are able to freely leverage all of the data for analytics, irrespective of whether the data sources are within or outside of their network and without having to worry about exposing their PII and PHI data to the internet. No data leaves their corporate firewalls, and all data is encrypted using their keys. In addition, all credentials for the data sources are also encrypted and stored within the customer VPC, providing unprecedented security that enterprises need. Finally, Datacoral’s architecture naturally fits into the regulatory requirements of GDPR to confine data within geographic regions. Our customers benefit from this architecture at no additional cost.

Talk to us at hello@datacoral.co to learn more about how to get your own secure pipeline to ingest your PII, financial, and/or PHI data from your production systems into your warehouse in a matter of hours.

Raghotham Murthy
Datacoral

Entrepreneur, Ex-CEO of Datacoral (Acquired by Cloudera), Ex-Facebook, Ex-Yahoo