Modern Data Stack for Startups

cyber-venom003
Published in Nybles
3 min read · Jan 8, 2022
Image Source: https://www.datafold.com/blog/the-modern-data-stack-open-source-edition

“Use the right tool for the job!”

This saying sounds simple at first, but it is much more complex to put into practice.

Early-stage startups find it difficult to choose among the variety of tools available in the ecosystem, because it is hard to predict how their data is going to evolve.

The Need for a Modern Data Stack

In the last 10 years, the software industry has seen a rise in:

  • Computational Power: Public cloud providers like AWS and Google Cloud offer enormous computing power at standard market prices.
  • Data Sources: The rise of the IoT ecosystem and smart devices has caused a steep increase in the amount of data generated per day. By one widely cited estimate, each person on the planet generated about 1.7 MB of data per second in 2020.
  • Data Literacy among business stakeholders: In the early software industry, analysts used to dig through Excel spreadsheets manually to extract insights from data. Today, many BI tools make it easier to put data to work and surface those insights, which has raised data literacy among business stakeholders.
  • Open Source Adoption in Data Projects: The industry has seen a tremendous rise in open source communities in the last 10 years. Many popular data tools (e.g. Apache Airflow, DBT, Metabase) have grown and evolved in open source communities.

The shift from Legacy ETL to Modern ELT

Now that most companies rely on data-driven solutions, we are seeing a consistent shift from the legacy ETL (Extract, Transform, Load) architecture to the ELT (Extract, Load, Transform) architecture.

Source: https://images.app.goo.gl/KNBfjhMqPnhAzWC59

Modern ELT processing has become popular over legacy ETL for the following reasons:

  1. Cloud storage and analytics services are cheap and efficient.
  2. Legacy ETL pipelines are not flexible enough to adapt easily to rapid data growth.
  3. Modern ELT loads data faster than legacy ETL, because no strict transformation stage runs before the data is loaded into the warehouse.
  4. ELT tools are good at plugging source data into the target system with minimal manual work from the user, since user-defined transformations are not required at load time.
  5. Analysts can transform the data in the warehouse as needed, with tools like DBT, without having to decide on the required insights and data types beforehand.
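The difference in ordering can be made concrete with a small sketch. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, and the event data, the table names (`raw_events`, `orders_per_user`), and the function names are all hypothetical. The point it shows is that raw data is loaded untyped first, and typing and aggregation happen later, inside the warehouse.

```python
import sqlite3

# Hypothetical raw events, as they might arrive from an app or backend service.
RAW_EVENTS = [
    {"user_id": 1, "event": "order_placed", "amount": "120.50"},
    {"user_id": 2, "event": "order_placed", "amount": "80"},
    {"user_id": 1, "event": "page_view"},
]


def load_raw(conn, events):
    """Load step: store every event as plain text, with no typing or cleaning."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (user_id TEXT, event TEXT, amount TEXT)"
    )
    rows = [(str(e["user_id"]), e["event"], e.get("amount")) for e in events]
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform step: build a model inside the warehouse, after loading."""
    conn.execute("DROP TABLE IF EXISTS orders_per_user")
    conn.execute(
        """
        CREATE TABLE orders_per_user AS
        SELECT user_id, SUM(CAST(amount AS REAL)) AS total_amount
        FROM raw_events
        WHERE event = 'order_placed'
        GROUP BY user_id
        """
    )


def run_elt(events):
    """Run load, then transform, and return the modelled rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    load_raw(conn, events)
    transform(conn)
    return conn.execute(
        "SELECT user_id, total_amount FROM orders_per_user ORDER BY user_id"
    ).fetchall()
```

Calling `run_elt(RAW_EVENTS)` returns one row per user with the summed order amount. Because the raw table is kept, a new model can be added later by writing another query, without re-extracting anything from the sources.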

Adoption Strategy for startups

As mentioned earlier in this blog, it is difficult for startups to anticipate how the data they are going to deal with will evolve.

So, early-stage startups should consider the following things while choosing tools for their data stack:

  1. High adoption and awareness among other startups and customers.
  2. A good fit with the ELT model of the data stack.
  3. Database paradigms (e.g. structured, geospatial, entity-relations, search engine) that fit the requirements for storing and querying the data produced by their domain and market.
  4. The availability of equivalent open source alternatives to paid SaaS tools.

Extraction and Loading

Collecting data from all event sources, such as web, app, and backend services, and sending it to the data warehouse.

Paid SaaS Tools: Stitch, Fivetran, HevoData

Free and Open Source Alternatives: Singer, Meltano, Airbyte
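To give a feel for what an extract-and-load tool passes between a source and a warehouse, here is a minimal sketch loosely modelled on the line-delimited JSON messages (SCHEMA, RECORD, STATE) used by Singer. It does not use the Singer library; the function names, the `signups` stream, and the contents of the STATE value are assumptions made for illustration.

```python
import io
import json


def emit_stream(stream, schema, records, out):
    """Toy 'tap': write one SCHEMA line, one RECORD line per row, then a STATE line."""
    schema_msg = {
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": ["id"],
    }
    out.write(json.dumps(schema_msg) + "\n")
    for record in records:
        record_msg = {"type": "RECORD", "stream": stream, "record": record}
        out.write(json.dumps(record_msg) + "\n")
    # The STATE payload is free-form; a row count is used here as a placeholder bookmark.
    out.write(json.dumps({"type": "STATE", "value": {stream: len(records)}}) + "\n")


def load_records(lines):
    """Toy 'target': collect RECORD payloads per stream, ignoring other messages."""
    tables = {}
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            tables.setdefault(message["stream"], []).append(message["record"])
    return tables


def demo():
    """Pipe the toy tap into the toy target through an in-memory buffer."""
    buffer = io.StringIO()
    schema = {"properties": {"id": {"type": "integer"}, "source": {"type": "string"}}}
    rows = [{"id": 1, "source": "web"}, {"id": 2, "source": "app"}]
    emit_stream("signups", schema, rows, buffer)
    return load_records(buffer.getvalue().splitlines())
```

In a real setup the two halves would be separate programs connected by a pipe, and the target would write to the warehouse instead of a dictionary.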

Data Warehouse

A structured, non-volatile store that acts as the single source of truth for the organization, where all data can be stored and queried.

Paid: AWS Redshift, Google BigQuery, Snowflake

Free and Open Source Alternatives: Apache Druid

Transformation and Modelling

Creating models out of raw data with documentation for better consumption.

Paid: Dataform, DBT Cloud

Free and Open Source Alternatives: DBT (the open source core), Talend Open Studio, Apache NiFi

Orchestration

Software for scheduling and orchestrating the jobs that handle data flow.

Paid: Prefect.io

Free and Open Source Alternatives: Apache Airflow, Dagster
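At its core, an orchestrator runs jobs in an order that respects their dependencies. The sketch below shows that idea using only Python's standard library (`graphlib`, available from Python 3.9). It is not the API of Airflow, Dagster, or Prefect, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task once, only after every task it depends on has finished."""
    executed = []
    # static_order() yields task names so that predecessors always come first.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# A shared log so the effect of each hypothetical task is visible.
log = []

TASKS = {
    "extract_load": lambda: log.append("extract_load"),
    "transform": lambda: log.append("transform"),
    "refresh_dashboard": lambda: log.append("refresh_dashboard"),
}

# Maps each task to the set of tasks that must run before it.
DEPENDENCIES = {
    "transform": {"extract_load"},
    "refresh_dashboard": {"transform"},
}
```

Calling `run_pipeline(TASKS, DEPENDENCIES)` runs the three tasks in dependency order. Real orchestrators add scheduling, retries, logging, and monitoring on top of this ordering.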

Visualization and Analytics

For better visibility and interpretation of data from different data sources.

Paid: Tableau, Microsoft PowerBI, Grafana

Free and Open Source Alternatives: Metabase, D3.js, Dygraphs

About Me

I'm Tejas Agrawal, an undergraduate student at the Indian Institute of Information Technology, Allahabad. I'm currently working as a Data Engineering and Analytics Intern at Saveo.in, where I find and develop data-driven solutions to business problems. I love to explore new industry standards and open source technology stacks.

Connect with me on LinkedIn.
