Data Warehousing and ETLs

What is data warehousing?

Our product runs on OLTP databases, PostgreSQL and MySQL, which do not lend themselves to analytics. Warehousing is the process of moving data from different OLTP (online transactional processing) sources to an OLAP (online analytical processing) store, which is Amazon Redshift in our case. This ETL (Extraction, Transformation and Loading) process creates a layer optimised for and dedicated to analytics.
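To make the ETL idea concrete, here is a minimal, illustrative sketch in plain Python (the table and field names are hypothetical, not our actual pipeline code): extract rows changed since yesterday, transform them, and load them into a warehouse stand-in.

```python
from datetime import date, timedelta

# Extract: pretend these rows came from an OLTP source table.
source_rows = [
    {"order_id": 1, "amount": "120.50", "updated_on": date.today()},
    {"order_id": 2, "amount": "80.00",  "updated_on": date.today() - timedelta(days=3)},
]

def extract(rows, since):
    """Keep only rows edited on/after `since` (incremental extraction)."""
    return [r for r in rows if r["updated_on"] >= since]

def transform(rows):
    """Clean and reshape: parse string amounts into numbers."""
    return [{"order_id": r["order_id"], "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    """Append transformed rows to the warehouse stand-in."""
    target.extend(rows)

warehouse = []
yesterday = date.today() - timedelta(days=1)
load(transform(extract(source_rows, yesterday)), warehouse)
print(warehouse)  # only the row updated within the last day survives
```

In the real pipeline each of these steps is distributed (Spark jobs rather than list comprehensions), but the shape of the flow is the same.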

Benefits of data warehousing

  1. Ensures consistency across the organisation, as the warehouse becomes the sole source of truth.
  2. Provides historical business trends and snapshots, facilitating decision making.
  3. Enables ad-hoc queries, canned reports, and dashboards via analytical tools (we use Tableau).
  4. Acts as a central place for data from various sources, making joins for further drill-down easier.

ETL process

We warehouse supply- and demand-side data of our product at a daily granularity to Amazon Redshift, which is a columnar database. The schema is a galaxy schema (also called a fact constellation), consisting of dimension and fact tables. We also build accumulation tables over the fact tables for faster processing.
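A tiny sketch of what this looks like (all table and field names here are hypothetical, not our actual schema): two fact tables share a date dimension, and an accumulation table pre-aggregates one of them so dashboards do not have to rescan the fact rows.

```python
dim_date = {1: "2017-06-01", 2: "2017-06-02"}   # shared date dimension

fact_supply = [                                  # supply-side fact table
    {"date_key": 1, "listings_added": 40},
    {"date_key": 2, "listings_added": 55},
]
fact_demand = [                                  # demand-side fact table
    {"date_key": 1, "enquiries": 300},
    {"date_key": 2, "enquiries": 420},
]

# Accumulation table: running total of enquiries by date,
# precomputed once instead of re-aggregated per query.
running, total = [], 0
for row in sorted(fact_demand, key=lambda r: r["date_key"]):
    total += row["enquiries"]
    running.append({"date": dim_date[row["date_key"]],
                    "enquiries_to_date": total})

print(running)
# [{'date': '2017-06-01', 'enquiries_to_date': 300},
#  {'date': '2017-06-02', 'enquiries_to_date': 720}]
```

Because both fact tables reference the same date dimension, supply and demand metrics can be joined on `date_key` for drill-downs.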

Tech stack used

Apache Spark is the backbone of our ETL process, with Redshift serving as the warehouse.

Apache Spark is an open-source cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. We used Spark Standalone Mode with one master node (responsible for scheduling tasks across the cluster) and two workers (which run the executors that perform the operations on the data).

Why we chose Spark?

  • It can handle/process large amounts of data efficiently (with data parallelism) using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • It powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

Why we chose Redshift?

  • It is a columnar database that can perform aggregates quickly, and it follows the SQL standard, so it was easy to adopt.
  • Managed by Amazon AWS, it is on-demand and scalable, like the rest of our infra.

Data Flow

Extraction: Data from the different sources (OLTP databases, GA) is collected based on the edits/additions done in the last day.
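A sketch of how a "changed in the last day" extraction query might be built from a watermark column (the table and column names are hypothetical, and production code should use parameterized queries rather than string formatting):

```python
from datetime import datetime, timedelta

def extraction_query(table, watermark_col, now):
    """Build a query for rows edited in the 24 hours before `now`."""
    since = (now - timedelta(days=1)).strftime("%Y-%m-%d %H:%M:%S")
    return (f"SELECT * FROM {table} "
            f"WHERE {watermark_col} >= '{since}'")

q = extraction_query("orders", "updated_at", datetime(2017, 6, 2, 0, 0, 0))
print(q)  # SELECT * FROM orders WHERE updated_at >= '2017-06-01 00:00:00'
```

The same watermark idea applies per source: each nightly run only pulls what changed since the previous run, keeping extraction incremental.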

Transformation: This data is cleaned, formatted, validated, reorganized, summarized, and supplemented with data from many other sources in spark processing.
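An illustrative transformation step (the fields and values are hypothetical): clean inconsistent source values, drop rows that fail validation, and summarize the rest into a daily rollup.

```python
from collections import Counter

raw_events = [
    {"city": " Mumbai ", "count": "3"},
    {"city": "mumbai",   "count": "2"},
    {"city": "Pune",     "count": "x"},   # invalid: dropped by validation
]

def clean(event):
    """Normalise whitespace and casing so duplicates collapse."""
    return {"city": event["city"].strip().title(), "count": event["count"]}

def valid(event):
    """Keep only rows whose count parses as a number."""
    return event["count"].isdigit()

summary = Counter()
for event in map(clean, raw_events):
    if valid(event):
        summary[event["city"]] += int(event["count"])

print(dict(summary))  # {'Mumbai': 5}
```

In the actual pipeline these are Spark DataFrame operations over much larger inputs, but clean → validate → summarize is the same sequence.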

Load: The spark-redshift connector loads the data into Redshift using temporary S3 buckets.

Analytical Tools (Tableau)

The fact, dimension and accumulation tables from Redshift are also integrated into Tableau in live/extract format for data visualisation. Its drag-and-drop interface helps us create interactive visualisations within minutes. The Tableau reports are also embedded in our internal CRM tools for our sales team.

A Few Sample Tableau Dashboards

Here are a few sample Tableau dashboards based on Redshift.