Hand-Picked Tools for Building an Open-Source Data Platform

Anjan Banerjee
HCLTech-Starschema Blog
5 min read · Jan 5, 2023

Open-source alternatives to market-leading tools and technologies can help you build and optimize a flexible, performant end-to-end data platform. This is a good option for organizations that want to achieve greater control and significant cost savings and can afford to put in some time and effort in exchange for these advantages. While this approach may not be for everyone, it certainly can sing for those it is right for. If your organization is looking to optimize the cost of its data platform and ensure lean but easily scalable operations, this may be for you. Let’s dig in.

For most use cases, a purpose-built open-source platform will include a data management system, a data integration tool and a data visualization tool; other technologies might also be needed for more specialized requirements. In this post, I’ll discuss each of these categories, with recommendations based on the experience of Starschema’s data engineering teams.

The basic building blocks of a data platform

Data Management System

The right choice for a data management system (DMS) depends on the type of source data you have. If you are dealing mostly with relational (structured) or semi-structured data, the self-hosted CockroachDB Core is a good choice. It’s compatible with ANSI SQL and also provides native support for JSON and geospatial data, combining the strengths of common open-source DMS platforms like MySQL/PostgreSQL and MongoDB in a single system.
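To illustrate what this looks like in practice, here is a minimal sketch of mixing relational columns with a JSON document in one table. It assumes a locally running, insecure CockroachDB node on the default port and uses the standard psycopg2 driver, which works because CockroachDB speaks the PostgreSQL wire protocol; the table and values are made up for the example.

```python
# Minimal sketch: connect to a local CockroachDB node with psycopg2
# (PostgreSQL wire protocol). Host, port, database and table are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=26257, user="root", dbname="defaultdb", sslmode="disable"
)
conn.autocommit = True

with conn.cursor() as cur:
    # Relational columns and a JSONB column side by side
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
            source STRING NOT NULL,
            payload JSONB
        )
    """)
    cur.execute(
        "INSERT INTO events (source, payload) VALUES (%s, %s)",
        ("web", '{"user": "alice", "action": "login"}'),
    )
    # Standard PostgreSQL JSON operators work for querying into the document
    cur.execute("SELECT id, payload->>'user' FROM events WHERE source = %s", ("web",))
    print(cur.fetchall())
```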

Of course, if your platform needs are less complex, then MySQL and PostgreSQL will probably work perfectly, as they are very easy to manage and administer. In fact, you can even consider using SQLite.

I prefer to install these tools on Kubernetes or in Docker containers. Follow this link to learn how to install CockroachDB using Helm.

Extracting, Transforming and Loading

The next big question is how to bring the data into your DMS of choice. This decision goes hand in hand with DMS selection, as you need to consider which source systems you want to pull data from before loading it into the DMS.

Airbyte performs well as your open-source data integration platform. It has connectors for the most common data sources and also lets you write your own. It supports change data capture (CDC) as well, which helps to further optimize resources, one of the main things we’re trying to achieve when going with open-source tools over more comprehensive product suites.
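As a rough sketch, a sync for an existing Airbyte connection can be triggered from code via Airbyte’s configuration API. The host, port, credentials and connection ID below are placeholders, and the default basic-auth credentials vary by deployment, so treat this as an assumption-laden illustration rather than a reference.

```python
# Hedged sketch: trigger a sync for an existing Airbyte connection through the
# configuration API. URL, credentials and connection ID are placeholders.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical connection

resp = requests.post(
    f"{AIRBYTE_API}/connections/sync",
    json={"connectionId": CONNECTION_ID},
    auth=("airbyte", "password"),  # default basic auth on some self-hosted setups
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["job"]["status"])  # e.g. "running"
```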

Once the raw data lands in the DMS, it needs to be modeled and transformed from its raw format into an aggregated, usable state.

The stages within the DMS

dbt Core is a great tool for transformation. It doesn’t require a dedicated service or separate installation, as it can be installed and used as a plain Python package.
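For example, with dbt-core 1.5 or later you can invoke dbt programmatically from Python. This is a minimal sketch; the model name and project/profiles paths are placeholders.

```python
# Minimal sketch of programmatic dbt invocation (requires dbt-core >= 1.5).
# The model name and directories are placeholders for illustration.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()
cli_args = [
    "run",
    "--select", "stg_orders",               # hypothetical model
    "--project-dir", "/opt/dbt/analytics",  # placeholder path
    "--profiles-dir", "/opt/dbt",           # placeholder path
]
res: dbtRunnerResult = dbt.invoke(cli_args)

# Print the status of each executed node
for r in res.result:
    print(f"{r.node.name}: {r.status}")
```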

The next step is to create and implement a data model that defines how the data will be organized and accessed within the data platform. There are three common data modeling techniques you can use (a small dimensional-model sketch follows the list):

· 3NF data model;

· dimensional data model (star schema and snowflake schema);

· Data Vault 2.0 model.
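To make the second option concrete, here is a hedged sketch of a tiny star schema, one fact table referencing one dimension table, created through the same PostgreSQL-compatible connection used earlier; the table and column names are made up for the example.

```python
# Hedged sketch of a minimal star schema: one dimension, one fact table.
# Connection details and all names are illustrative placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=26257, user="root", dbname="defaultdb", sslmode="disable"
)
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS dim_customer (
            customer_key INT PRIMARY KEY,
            customer_name STRING,
            country STRING
        )
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fct_sales (
            sale_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
            customer_key INT REFERENCES dim_customer (customer_key),
            sale_date DATE,
            amount DECIMAL(10, 2)
        )
    """)
```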

The most common operations, data cleaning and data quality monitoring, take place when moving data from one stage to the next. A good option for this purpose is Great Expectations, an open-source data quality framework that helps you control data quality on your data platform at every stage.
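Here is a minimal sketch of declaring and validating expectations, assuming the classic pandas-backed Great Expectations API (0.x); the file path and column names are placeholders, and newer 1.x releases use a different interface.

```python
# Hedged sketch using Great Expectations' classic pandas-backed API (0.x).
# The CSV path and column names are placeholders.
import great_expectations as ge

df = ge.read_csv("/data/raw/orders.csv")  # placeholder path

# Declare expectations about the data as it moves between stages
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

# Validate the whole suite and check the overall outcome
results = df.validate()
print(results.success)
```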

Multiple tools are needed at this stage. To coordinate the processes, you should consider using an orchestration tool that makes sure all dependencies, as well as the observability of the data flow, are maintained. Dagster works brilliantly for this. It can integrate with all the above-mentioned technologies and enables you to create a directed acyclic graph (DAG) to make sure all dependencies between objects and stages are met.
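As a rough sketch of what this looks like with Dagster’s software-defined assets (Dagster 1.x), where the dependency between two assets is derived from the function arguments; the asset names and logic are made up:

```python
# Hedged sketch of a tiny Dagster asset graph (Dagster 1.x).
# Asset names and logic are placeholders; each asset becomes a node in the DAG.
from dagster import Definitions, asset


@asset
def raw_orders() -> list[dict]:
    # In practice this step might trigger an Airbyte sync or read from the DMS
    return [{"order_id": 1, "amount": 42.0}]


@asset
def cleaned_orders(raw_orders: list[dict]) -> list[dict]:
    # Depends on raw_orders; Dagster infers the edge from the argument name
    return [o for o in raw_orders if o["amount"] > 0]


defs = Definitions(assets=[raw_orders, cleaned_orders])
```

Pointing dagster dev at a module containing these definitions renders the asset graph in Dagster’s UI, where each run can be observed end to end.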

A data catalog and lineage are two additional assets to consider if you want to enable self-service operations and drive data democratization within your organization, which you should, as they can contribute greatly to more streamlined and effective operations. For this, look to OpenMetadata and DataHub. They not only help with cataloging your data but also integrate well with your other tools and build lineage that helps you better understand the flow of data between systems and stages.
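As a heavily hedged sketch of what lineage registration can look like with DataHub’s Python SDK (the acryl-datahub package): the GMS endpoint, platform and dataset names below are placeholders, and the exact module paths can vary between SDK versions.

```python
# Hedged sketch: emit table-level lineage to DataHub via its REST emitter.
# Endpoint, platform and dataset names are placeholders; module paths may
# differ across acryl-datahub versions.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    DatasetLineageType,
    Upstream,
    UpstreamLineage,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS endpoint

upstream = Upstream(
    dataset=builder.make_dataset_urn("postgres", "raw.orders"),  # hypothetical
    type=DatasetLineageType.TRANSFORMED,
)
lineage_mcp = MetadataChangeProposalWrapper(
    entityUrn=builder.make_dataset_urn("postgres", "analytics.fct_sales"),  # hypothetical
    aspect=UpstreamLineage(upstreams=[upstream]),
)
emitter.emit_mcp(lineage_mcp)
```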

Click the following links to find out how to use Helm to install Airbyte, Dagster, OpenMetadata and DataHub.

Data Visualization

Lastly, let’s tackle the visualization layer of an open-source data platform. For very basic requirements, Metabase should suffice — it has a sleek, modern monotone look and makes it very easy to create dashboards. And if you’re looking for a bit more color and visual flair, Superset is a worthwhile alternative.

Again, follow the links to get a Helm installation tutorial for Metabase and Superset.

The Bottom Line

Although building a data platform with open-source products can require a significant investment of time and effort, it can provide significant benefits in terms of cost savings, flexibility and control. Rather than high up-front licensing fees, the cost of building a data platform with open-source tools largely manifests in the form of hosting and operational team expenses — which are still significantly lower overall than what you would pay for major products and platforms.

And better yet, you can even reduce the effort spent on installing and hosting these tools if you go with the SaaS offerings that are available for most of them. Look forward to a follow-up to this piece on building a data platform with SaaS products.

Contact us if you need support in using IaC to streamline the deployment of your new modern data stack components.

About the author

Anjan Banerjee is the Field CTO of Starschema. He has extensive experience in building data orchestration pipelines, designing multiple cloud-native solutions and solving business-critical problems for multinational companies. Anjan applies the concept of infrastructure as code as a means to increase the speed, consistency, and accuracy of cloud deployments. Connect with Anjan on LinkedIn.
