Building our Data Platform: Rejecting an All-In-One Data Integration Tool

Iván Gómez Arnedo
10 min read · Dec 5, 2022


A detailed analysis of why we decided not to use these tools.

Originally published at https://blog.denexus.io.

Created with Dall-E — Image by author

Introduction

This is the third article of a series that started with Building our Data Platform: Why we have chosen Databricks over Snowflake and continued with Building our Data Platform: Why we have chosen FluentD.

This article is different from the previous ones: instead of describing why we chose a tool or technology, it describes why we discarded one. We are not trying to sell any tool or technology, nor is this a detailed tutorial on how we implemented a solution. It is a summary of the analysis we performed on the available ETL (Extract, Transform, and Load) tools to see whether adopting one made sense given our use cases.

ETL tools

Extract, transform, load tools are software packages that facilitate performing ETL tasks. There is a lot of hype in the data engineering world about these kinds of tools, but are they really going to replace most of the data engineering work?

Some good reasons why using them could be a good idea:

  • Using these tools makes a lot of sense in terms of DRY and productivity: if more than X people have the same need (e.g., extracting data from Google Analytics) and each of them implements a different solution (e.g., a Python script), then X-1 of those efforts have been wasted.
  • Benefits of scale + specialization.
  • It's hard to find great data engineers, so it can be a good idea to pay for tools that already solve your problems and dedicate your data engineers to work that brings more value.
  • API providers change their specs or standards, sometimes with notice and sometimes without. These tools maintain compatibility with a large number of popular data sources (and destinations) and apply the needed fixes when those source systems change.

At the end of the day, what's better? Paying a third party to free up your data engineers' time so they can devote themselves to more value-added tasks? Or hiring more data engineers (a more expensive option, even if you don't pay a third party) and keeping custom pipelines where everything is configurable because your team owns the code?

But what if our use cases are not based on well-known sources (connectors)? Does it still make sense to use these solutions? Does it make sense to adopt their standard and all of their overhead when their main advantages (pre-defined connectors maintained by the community, etc.) are not going to be used?

Too soon for that; let's first jump to our requirements.

DeNexus requirements

We follow the ELT (Extract, Load, Transform) paradigm; in other words, we do not transform the data when we first store it in our data lake. We do not modify the RAW data; instead, we create different versions of it according to the company's different use cases. This means we are not going to use these ETL tools to process data (that happens at a higher level of our data platform), so the data-processing functionality of these tools is of no interest to us.
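As a rough illustration of this raw-first approach (a minimal sketch in Python; the bucket name, paths, and source names are hypothetical, not our actual pipeline), the load step stores whatever the extraction produced, byte for byte:

import datetime

import boto3

s3 = boto3.client("s3")

def land_raw(payload: bytes, source: str) -> None:
    # Load step of ELT: store the extracted payload exactly as received, with no transformation.
    key = f"raw/{source}/{datetime.date.today().isoformat()}/extract.json"
    s3.put_object(Bucket="denexus-datalake", Key=key, Body=payload)  # hypothetical bucket

Any transformation happens later, on versions derived from this raw layer.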

The requirements of our Data Platform are very specific, as we mix data extracted from each client (inside-out) with data taken from public sources on the Internet (outside-in):

  • Decentralized extraction of sensitive data. Data can't leave the customer environment without being encrypted, so the extraction process has to run in the customer environment.
  • Data stored in the same region in which it was produced. If the data has not yet been anonymized, it cannot leave the region in which it was produced.
  • Minimum deployment size for connectors. As some data extraction processes must run in customer environments, the size of these deployments must be as small as possible.
  • Centralized extraction of non-sensitive data. If data is not customer-related (e.g., public sources), it can be extracted using centralized tools and can leave the region in which it was produced.
  • A common structure for all data extraction processes. It speeds up the development of new connectors, helps with the onboarding of new hires, and helps us follow the DRY principle (see the sketch after this list).
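A minimal sketch of what such a common structure could look like (the class and method names here are hypothetical, not our actual code):

from abc import ABC, abstractmethod
from typing import Iterable

class BaseConnector(ABC):
    # Skeleton that every in-house extraction process follows.

    @abstractmethod
    def extract(self) -> Iterable[dict]:
        # Pull raw records from the source system.
        ...

    @abstractmethod
    def load(self, records: Iterable[dict]) -> None:
        # Write the raw records, untransformed, to the data lake.
        ...

    def run(self) -> None:
        # Single entry point shared by all connectors: extract, then load raw.
        self.load(self.extract())

With a base class like this, a new connector only has to implement extract() and load(), which keeps every pipeline recognizable to the rest of the team.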

More information about other requirements at different levels of our data platform can be found in the other articles of the series: Building our Data Platform: Why we have chosen Databricks over Snowflake and Building our Data Platform: Why we have chosen FluentD.

Tool review

In the ETL Tools section, we saw why it can be a clever idea to use data integration / ETL tools, and there are several comparisons between the existing options on the market.
As discussing their differences is out of this article's scope, we will focus, for the rest of the article, on the tool we consider the best option available: Airbyte.

Why?

  • Open source.
  • Fast-growing.
  • A lot of already-developed connectors (and the number keeps growing): https://airbyte.com/connectors.
  • Easy to onboard new custom connectors (it enforces a common structure).
  • Monitoring + a nice UI.
  • It could remove the need for data engineers in the early stages of your data platform, or when your use cases are common ones.
Airbyte architecture — Image by Airbyte

Installation

First, we are not able to use Airbyte Cloud because it doesn't meet our security requirement: "inside-out extractions have to be performed in the customer environments".

So, we could only use their Open Source Version.

As described in their steps to install Airbyte Open Source, a minimal installation needs more than five Docker containers:

  • Web App
  • Server
  • Scheduler
  • Database
  • DBT (optional)
  • 1…X workers

To check the number of Docker containers installed and their size, use the following command:

sudo docker ps --size --format "table {{.Image}}\t{{.Names}}\t{{.Size}}"

The “size” information shows the amount of data (on disk) that is used for the writable layer of each container

The “virtual size” is the total amount of disk-space used for the read-only image data used by the container and the writable layer.

Altogether, these Docker containers have a total deployment size of 2092 MB:

Docker containers (and their size) on a normal Airbyte installation. Image by author.

What our Data Platform could look like using Airbyte:

A hypothetical DeNexus data platform using Airbyte — Image by author

As described in our requirements, inside-out data will be extracted from our customers' environments (decentralized extraction) and outside-in data will be extracted from the Internet (centralized extraction). In this scenario, we would be using Airbyte to manage all our data extractions, so we would need:

  • The same number of Airbyte deployments as we have customers.
  • Another Airbyte deployment to manage the centralized data extractions (outside-in).

Custom connectors

When using Airbyte, to add a new connector it is necessary to follow the structure defined in its CDK (Connector Development Kit). That is, even if our connector covers a very simple use case, it will always be necessary to structure it in the way defined in the previous link.

Generating an empty connector using the code generator. Image by author.

So, these templates set out a number of good practices to follow and also define the structure to be used when adding a new connector to the tool.

Whether this is a good practice that simplifies the development or whether it introduces unnecessary overhead is up to the reader to decide.

A thing is well designed if it adapts to the people who use it. For code, that means it must adapt by changing. So we believe in the ETC principle: Easier to Change. ETC. That’s it.

First, given that you’re not sure what form change will take, you can always fall back on the ultimate “easy to change” path: try to make what you write replaceable. That way, whatever happens in the future, this chunk of code won’t be a roadblock. As far as we can tell, every design principle out there is a special case of ETC.

The Pragmatic Programmer

According to its official documentation:

Typically for REST APIs each stream corresponds to a resource in the API. For example, if the API contains the endpoints:
- GET v1/customers
- GET v1/employees

then you should have three classes:
`class YYYStream(HttpStream, ABC)` which is the current class
`class Customers(YYYStream)` contains behavior to pull data for customers using v1/customers
`class Employees(YYYStream)` contains behavior to pull data for employees using v1/employees

All “streams” must be defined as classes, and we must represent the structure of their data with a JSON schema. What happens if a single connector has more than X streams with a changing structure? We will have to modify the code every time that structure changes (if we use the dynamic schema definition) or the JSON files in which those schemas are defined (if we use the static schema definition).

If a stream has more than 150 fields and we want all of that data available in our Data Lake (for example, because we do not yet know the use cases we may have for it in the future), we will have to define a schema for that stream, and if one of those fields changes, our schema will also have to change.
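To make that overhead concrete, this is roughly what a single stream looks like with the CDK (a minimal sketch based on the template quoted above; the API name, URL, and fields are hypothetical, and the exact CDK interface may differ between versions):

from abc import ABC
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream

class ExampleApiStream(HttpStream, ABC):
    # Base class shared by every stream of this hypothetical connector.
    url_base = "https://api.example.com/v1/"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # no pagination in this sketch

class Customers(ExampleApiStream):
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "customers"

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json()

    # A customers.json schema file (or a get_json_schema() override) must also list every
    # field of the stream, and it has to be updated whenever the source changes.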

Why we decided not to use Airbyte

The approach we follow is different because, as described above, we implement the ELT paradigm: we extract all the data available by default and make it accessible in our Data Lake. By doing so, we can later define processes that read and transform (T) ONLY the fields we are interested in, and we will NOT have to modify our pipelines or processes unless one of those fields changes.

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]

Spark's schema inference is powerful because it allows us to delegate that work. Of course, we can run into type conflicts, but we do not have to deal with and resolve them every time we load data into our Data Lake; we can solve them only when they really affect our processes or use cases, in other words, when we really use the data.
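For example (a minimal sketch; the path and field names are hypothetical), the schema is inferred at read time and each process selects only the fields it needs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-inference-sketch").getOrCreate()

# The schema is inferred from the JSON files themselves; nothing is declared up front.
events = spark.read.json("s3://denexus-datalake/raw/events/")
events.printSchema()

# A downstream process picks only the fields it cares about; the remaining fields
# can change in the source without forcing a pipeline update.
subset = events.select("event_id", "severity", "detected_at")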

In the previous sections we reviewed Airbyte and said that we consider it the best data integration tool on the market; that is why some of the following points are specific to Airbyte, although they also apply to other data integration / ETL tools:

  • Airbyte does not solve all data extraction use cases (e.g., the one covered in Building our Data Platform: Why we have chosen FluentD). Of course, we could create a connector with that functionality, but why reinvent the wheel when there are specific tools to do so?
  • An Airbyte installation needs 2092 MB of disk. Every time we onboard a new customer, we would need 2 GB and more than five containers in that customer's environment just to be able to start extracting data (e.g., to extract data from a simple API).
  • Strict rules to add new connectors. It is not clear how to add non-core data pipelines (e.g., how to add a web scraping process as a connector).
  • We depend on a third party, as we are using their suite.
  • They could change their pricing model and stop maintaining the open-source version.
  • A lot of work for the DevOps team maintaining, deploying, and managing Airbyte. Lots of components that our current model does not use but that we would have to deploy and maintain. Updates could be a nightmare due to the number of interrelations between components. More resources means more possible points of failure and more code to maintain.
  • Current version: v0.39.32-alpha.
  • What happens when a connector is no longer valid because the source API (or equivalent way of extracting data) has changed? Do we have to wait until they implement the changes and release a version with the fix? Or will we have to develop our own connector in the meantime? They do not offer any kind of SLA.

Conclusions

“The success formula: solve your own problems and freely share the solutions.”
Naval Ravikant

Throughout this article we have described the benefits of using data integration / ETL tools to standardize, centralize, and avoid reinventing the wheel in data pipelines; but we have also described why these tools may not be a good fit for every use case.

When to use a data integration/ETL tool?

  • Your data sources are standard and well-known.
  • You don’t expect scalability to be a key factor in your data pipelines.
  • You have a centralized server from which to execute your data pipelines.
  • Your team has few Data Engineers.

When not to use a data integration/ETL tool?

  • Your data sources are not so standard.
  • Scalability is a key factor in your data pipelines.
  • Data pipelines are executed from different environments.
  • You have enough Data Engineers in your team.

Due to our extensive knowledge of cyber threats (thanks to the DeNexus Knowledge Center), we know the sensitivity of the data that we extract from our customers, so we apply highly restrictive requirements before using that data in our Data Lake (to improve our models, to calculate our customers' cyber risk, or to offer them a list of suggested mitigations that would decrease that risk):

  • All customer data is encrypted before leaving their site (a sketch of this follows the list).
  • Each customer has its own encryption key, created and managed with AWS KMS.
  • All customer data is de-identified before making it available in our Data Lake.
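As a rough illustration of the first two points (a minimal sketch; the key alias, region, and payload are hypothetical, and a real pipeline would typically use envelope encryption for payloads larger than the 4 KB KMS limit):

import boto3

kms = boto3.client("kms", region_name="eu-west-1")  # same region as the customer site

def encrypt_before_leaving(record: bytes, customer_key_id: str) -> bytes:
    # Each customer has its own KMS key; data is encrypted before leaving their environment.
    response = kms.encrypt(KeyId=customer_key_id, Plaintext=record)
    return response["CiphertextBlob"]

ciphertext = encrypt_before_leaving(b'{"site_id": "plant-01"}', "alias/customer-acme")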
Want to Connect?
@data_cyborg

https://www.linkedin.com/in/ivan-gomez-arnedo/

