What is data warehousing?

Our product runs on OLTP databases, Postgres and MySQL, which do not lend themselves to analytics. Warehousing is the process of moving data from different OLTP (online transactional processing) sources into an OLAP (online analytical processing) store, which is Amazon Redshift in our case. This ETL (Extraction, Transformation and Loading) process creates a layer optimised for and dedicated to analytics.

Benefits of data warehousing

  1. Ensures consistency across the organisation, as the warehouse becomes the single source of truth.
  2. Gives historical business trends and snapshots, facilitating decision-making.
  3. Enables ad-hoc queries, canned reports, and dashboards via analytical tools (we use Tableau).
  4. Acts as a central place for data from various sources, making it easier to join datasets for further drill-down.

ETL process

We warehouse the supply- and demand-side data of our product at a granularity of one day into Amazon Redshift, which is a columnar database. The schema is a galaxy schema (also called a fact constellation), consisting of dimension and fact tables. We also build accumulation tables over the fact tables for faster processing.
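For illustration, here is a minimal PySpark sketch of how an accumulation table can be rolled up from a fact table and a dimension table. The table and column names (fact_listing_events, dim_city, and so on) are hypothetical, not our actual schema.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-accumulation").getOrCreate()

    # Hypothetical fact and dimension tables; names are illustrative only.
    fact_listings = spark.table("fact_listing_events")   # one row per listing event
    dim_city = spark.table("dim_city")                    # city dimension

    # Accumulation table: a daily roll-up over the fact table joined with a
    # dimension, so dashboards read pre-aggregated rows instead of raw facts.
    daily_acc = (
        fact_listings
        .join(dim_city, "city_id")
        .groupBy("event_date", "city_name")
        .agg(
            F.count("*").alias("total_events"),
            F.countDistinct("listing_id").alias("active_listings"),
        )
    )

    daily_acc.write.mode("overwrite").saveAsTable("acc_listings_daily")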

Tech stack used

Apache Spark is the backbone of our ETL process, with Redshift as the warehouse.

Apache Spark is an open-source cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. We use Spark standalone mode with one master node (which schedules jobs and manages the cluster) and two workers (which run the executors that operate on the data).
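A minimal sketch of how a job is pointed at a standalone cluster; the master URL, host name and resource settings below are placeholders rather than our actual configuration.

    from pyspark.sql import SparkSession

    # Point the ETL job at the standalone master; URL, host and resource
    # sizes are placeholders, not our production configuration.
    spark = (
        SparkSession.builder
        .appName("warehouse-etl")
        .master("spark://spark-master.internal:7077")  # standalone master URL
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )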

Why we chose Spark

  • It can process large amounts of data efficiently (with data parallelism) using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • It powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

Why we chose Redshift

  • It is a columnar database that can perform aggregates quickly, and it follows the SQL standard, so it was easy to adopt.
  • Managed by AWS, it is on-demand and scalable, like the rest of our infra.

Data Flow

Extraction: Data from the different sources (OLTP databases, Google Analytics) is collected based on the edits/additions made in the last day.
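A sketch of an incremental extraction from one OLTP source over JDBC, assuming a hypothetical listings table with an updated_at column; the connection details are placeholders.

    from datetime import date, timedelta
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    yesterday = (date.today() - timedelta(days=1)).isoformat()

    # Pull only the rows edited/added in the last day (table and column
    # names such as listings/updated_at are hypothetical).
    incremental_query = (
        f"(SELECT * FROM listings WHERE updated_at >= '{yesterday}') AS src"
    )

    listings_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://oltp-host:5432/app_db")  # placeholder
        .option("dbtable", incremental_query)
        .option("user", "etl_user")
        .option("password", "****")
        .option("driver", "org.postgresql.Driver")
        .load()
    )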

Transformation: This data is cleaned, formatted, validated, reorganised, summarised, and supplemented with data from many other sources during Spark processing.
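A sketch of the kind of cleaning, validation and summarisation done in Spark, continuing with the hypothetical listings_df from the extraction sketch; the columns used are illustrative.

    from pyspark.sql import functions as F

    # listings_df is the DataFrame produced by the extraction sketch above.
    transformed = (
        listings_df
        .dropDuplicates(["listing_id"])                              # de-duplicate
        .filter(F.col("price").isNotNull() & (F.col("price") > 0))   # validate
        .withColumn("city_name", F.initcap(F.trim("city_name")))     # format
        .withColumn("event_date", F.to_date("updated_at"))           # reorganise
        .groupBy("event_date", "city_name")                          # summarise
        .agg(
            F.count("*").alias("listings_updated"),
            F.avg("price").alias("avg_price"),
        )
    )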

Load: The spark-redshift connector loads the data into Redshift using temporary S3 buckets.
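A sketch of the load step using the spark-redshift connector, which stages the DataFrame in a temporary S3 location and then issues a Redshift COPY; the JDBC URL, bucket and IAM role are placeholders.

    # Write the transformed DataFrame (from the transformation sketch above)
    # to Redshift; URL, bucket and IAM role are placeholders.
    (
        transformed.write
        .format("com.databricks.spark.redshift")
        .option("url", "jdbc:redshift://redshift-host:5439/warehouse"
                       "?user=etl_user&password=****")
        .option("dbtable", "acc_listings_daily")
        .option("tempdir", "s3a://etl-temp-bucket/redshift-staging/")
        .option("aws_iam_role", "arn:aws:iam::<account-id>:role/redshift-copy")
        .mode("append")
        .save()
    )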

Analytical Tools (Tableau)

The fact, dimension and accumulation tables from Redshift are also integrated into Tableau in live/extract mode for data visualisation. Its drag-and-drop interface helps us create interactive visualisations within minutes. The Tableau reports are also embedded in our internal CRM tools for our sales team.

A Few Sample Tableau Dashboards

Here are a few sample Tableau dashboards based on Redshift.

Written by Sakshi Jindal
Engineering @ Housing/Proptiger/Makaan