Synergy Part-1: Planning and Evaluation

Ujjwal Gupta
SquadStack Engineering
5 min read · Sep 12, 2023

Welcome back to our blog series on our in-house data warehouse solution! In the previous post, we introduced the solution at a high level. This post delves into the planning and evaluation phase of setting up our in-house pipeline. Let’s get started!

Planning the Overall Process

To set up an in-house data warehousing solution, a well-thought-out planning process is essential. Here is a simplified overview of the steps we took in our planning phase:

  1. Identifying Requirements: We determined the specific data needs and goals of our organization, considering the types of data to collect, data update frequency, desired performance and scalability, and budget constraints.
  2. Defining Data Sources: We carefully documented the various data sources within our organization, including databases, applications, files, and external APIs that would provide the data for processing and storage in our data warehouse.
  3. Choosing the Right Data Warehouse Solution: After identifying the data sources, we evaluated and selected the appropriate database for storing our analytical data. This database would serve as the foundation for efficiently storing and organizing our data.
  4. Planning Data Extraction and Transformation: We meticulously planned the process of extracting data from the sources, transforming it into a suitable format, and loading it into the data warehouse. This involved selecting the appropriate ETL/ELT processes and techniques to ensure smooth and effective data integration.
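The extraction-transformation-loading plan from step 4 can be sketched as three small, composable stages. This is a minimal illustration only, not our production code: the in-memory list and dict stand in for a real source and warehouse, and all names (`extract`, `transform`, `load`, the `orders` table) are hypothetical.

```python
from datetime import datetime, timezone

def extract(source_rows):
    """Pull raw records from a source (stubbed here as an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Normalize records into the warehouse schema: lowercase keys, add a load timestamp."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [{**{k.lower(): v for k, v in row.items()}, "loaded_at": loaded_at} for row in rows]

def load(rows, warehouse):
    """Append transformed rows to the target table (a dict standing in for the warehouse)."""
    warehouse.setdefault("orders", []).extend(rows)
    return len(rows)

warehouse = {}
raw = [{"Order_ID": 1, "Amount": 250}, {"Order_ID": 2, "Amount": 75}]
count = load(transform(extract(raw)), warehouse)
print(count)  # 2
```

In a real pipeline each stage would talk to an actual connector and destination, but keeping the three stages separate like this is what makes swapping between ETL and ELT orderings straightforward.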

Data Warehouse Evaluation

A data warehouse serves as a centralized and integrated repository for structured, historical, and voluminous data collected from diverse sources within an organization. During our evaluation of data warehouse solutions, we carefully assessed them based on the following criteria:

  1. High Performance: The data warehouse should demonstrate high performance, enabling quick loading and retrieval of analytical data. It should efficiently handle large volumes of data, ensuring fast query execution and response times.
  2. Cost Effectiveness: The data warehouse solution should be cost-effective, and capable of efficiently managing and storing the organization’s data without incurring excessive expenses. It should offer scalability options that align with the organization’s data growth and workload requirements.
  3. Data Security and Privacy: Data security and privacy are critical considerations. The data warehouse should adhere to data compliance regulations and industry standards. It should provide robust security measures such as role-based access control (RBAC), encryption of data at rest and in transit, and PII masking and scrubbing features.
  4. SQL Compatibility: The data warehouse should support standard SQL, enabling easy and efficient data analysis for data analysts and business users, who can leverage their existing SQL skills and SQL-based tools and applications.
  5. Managed Service: The chosen data warehouse solution should provide a highly available and robust managed service. It should offer features such as automated backups, disaster recovery options, and proactive monitoring to ensure uninterrupted availability and reliability.
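To make the PII masking requirement from criterion 3 concrete, here is a toy sketch of field-level masking applied before data reaches analysts. This is an assumption-laden illustration, not the warehouse's built-in masking feature: the `mask_pii` function, its field names, and the masking rules are all hypothetical.

```python
import re

def mask_pii(record, pii_fields=("email", "phone")):
    """Return a copy of the record with PII fields masked (hypothetical rules)."""
    masked = dict(record)
    for field in pii_fields:
        value = masked.get(field)
        if not value:
            continue
        if field == "email":
            # Keep the first character of the local part and the full domain.
            local, _, domain = value.partition("@")
            masked[field] = local[0] + "***@" + domain
        else:
            # Mask every digit except the last four.
            masked[field] = re.sub(r"\d(?=\d{4})", "*", value)
    return masked

print(mask_pii({"name": "Asha", "email": "asha@example.com", "phone": "9876543210"}))
# {'name': 'Asha', 'email': 'a***@example.com', 'phone': '******3210'}
```

Warehouses like Snowflake offer this kind of protection natively (e.g. masking policies tied to roles), which is part of why it scored well on this criterion.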
Data Warehouse Decision Matrix

After a thorough evaluation, our final choice for a data warehouse solution was Snowflake. Snowflake stood out due to its elasticity, separation of storage and compute, robust data security measures, cost efficiency, and SQL compatibility, making it the ideal fit for our organization’s needs.

ETL/ELT Evaluation

ELT (Extract, Load, Transform) and ETL (Extract, Transform, Load) are data integration processes commonly used in data warehousing and analytics. The key difference lies in the sequence of the transformation step.
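The ordering difference can be sketched in a few lines. In ETL the transformation happens in the pipeline before loading; in ELT the raw data lands in the warehouse first and is transformed there (typically with SQL). The dicts and the `normalize` function below are hypothetical stand-ins:

```python
def run_etl(raw, warehouse, transform):
    """ETL: transform in the pipeline, then load only the cleaned rows."""
    warehouse["clean"] = [transform(r) for r in raw]

def run_elt(raw, warehouse, transform):
    """ELT: load raw rows first, then transform inside the warehouse (e.g. via SQL)."""
    warehouse["raw"] = list(raw)
    warehouse["clean"] = [transform(r) for r in warehouse["raw"]]

normalize = lambda r: {"amount_usd": round(r["amount_cents"] / 100, 2)}
etl_wh, elt_wh = {}, {}
run_etl([{"amount_cents": 1999}], etl_wh, normalize)
run_elt([{"amount_cents": 1999}], elt_wh, normalize)
print("raw" in etl_wh, "raw" in elt_wh)  # False True
```

The practical consequence is that ELT preserves the untransformed source data in the warehouse, which suits elastic engines like Snowflake where compute for transformations is cheap to scale.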

During the evaluation, we specifically considered the data sources from which we needed to extract information. The primary data source in our case was Postgres, which served as the foundational data repository for our organization. In addition to Postgres, we identified other important data sources such as Trello, HubSpot, and Google Sheets.

Here are the evaluation factors we considered for assessing the ETL/ELT tools:

  1. Low Replication Lag: A low replication lag ensures that the data warehouse's data is up-to-date and reflects the most recent changes.
  2. Data Security and Privacy: We looked for an ETL tool that offers robust security measures to protect sensitive information during data extraction, transformation, and loading. This includes encryption mechanisms, access controls, and compliance with data protection regulations.
  3. Multi-Source Connector Support: Given that we had multiple data sources such as Postgres, Trello, HubSpot, and Google Sheets, we needed an ETL tool that could seamlessly connect and extract data from these diverse sources.
  4. Cost Effectiveness: Cost-effectiveness was a key factor in selecting an ETL tool. We considered the licensing costs, maintenance expenses, and scalability options to ensure that the chosen tool aligns with our budget constraints while providing the necessary functionalities.
  5. Support for Snowflake: Since we opted for Snowflake as our data warehouse solution, it was essential to choose an ETL tool that could seamlessly integrate with Snowflake.
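The replication-lag criterion above can be checked with a simple comparison of timestamps: the newest committed row in the source versus the newest row visible in the warehouse. This is a hedged sketch; the timestamps, the `replication_lag` helper, and the 5-minute SLA are illustrative assumptions, not our actual SLA.

```python
from datetime import datetime, timezone, timedelta

def replication_lag(source_latest, warehouse_latest):
    """Lag = how far the warehouse copy trails the source's newest committed row."""
    return source_latest - warehouse_latest

# Hypothetical timestamps for the latest row in each system.
source_ts = datetime(2023, 9, 12, 10, 30, tzinfo=timezone.utc)
wh_ts = datetime(2023, 9, 12, 10, 27, tzinfo=timezone.utc)

lag = replication_lag(source_ts, wh_ts)
print(lag)                               # 0:03:00
print(lag <= timedelta(minutes=5))       # True: within the hypothetical SLA
```

A monitoring job running a check like this against each source/destination pair is one straightforward way to turn the "low replication lag" requirement into an alertable metric.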
Data Pipeline Decision Matrix

After evaluating numerous tools, we discovered that none of them met our specific requirements. Some exceeded our budget, others didn’t meet our SLA standards, and some lacked essential security features. This left us in a dilemma, unsure of how to proceed. After extensive discussion, we concluded that the best course of action was to build our own cost-effective and SLA-compliant in-house data pipeline.

In addition, we came across Airbyte, a highly effective open-source tool designed to handle data loading from secondary data sources such as HubSpot, Trello, and more. Recognizing its value, we made the decision to self-host Airbyte, allowing us to have greater control and flexibility in managing our data integration processes.

Summary

  1. The blog post discusses the planning and evaluation phase of setting up an in-house data warehouse solution.
  2. Key steps in the planning process include identifying requirements, defining data sources, and choosing the appropriate data warehouse solution.
  3. Evaluation criteria for data warehouse solutions include high performance, cost-effectiveness, data security and privacy, SQL compatibility, and managed service.
  4. Snowflake is chosen as the data warehouse solution due to its elasticity, separation of storage and compute, robust security measures, cost efficiency, and SQL compatibility.
  5. The evaluation criteria for ETL/ELT tools include low replication lag, data security and privacy, multi-source connector support, cost-effectiveness, and support for Snowflake as a data warehouse destination.
  6. Despite evaluating various tools, none met the organization’s specific requirements, leading to the decision to build an in-house data pipeline. Airbyte, an open-source tool, is selected for data loading from secondary sources like HubSpot and Trello.

In our next blog post, we will provide a detailed technical discussion on the development of our in-house data pipeline. Stay tuned for more and show your support by giving us a few claps if you enjoyed this post. 🤗
