How TravelTriangle(TT) Infrastructure empowers faster data-driven decisions: Old Vs New Architecture

Rohan Agrawal
Published in TravelTriangle · Feb 24, 2020 · 7 min read

I am writing this to help the relevant audience understand how TT built a new analytics infrastructure that is scalable, user-friendly and cost-effective, and that places TT several steps ahead of its competitors in terms of data-driven decisions.

TravelTriangle(TT) is an online holiday marketplace that connects travellers to multiple local travel agents to help them create a customized and memorable holiday.

In my ongoing stint as an analytics enthusiast, I have come across various architectures with some common problems and challenges: data silos, analytics ETL runtime, real-time data flow and reports, ease of use of the reporting tool, and limited or no predictive analytics. These are faced by analytics teams, data science teams and data consumers alike. This article will take you through the challenges we faced and how TT solved them.

Challenges & Key Problems in Old Infra

Data Silos or limited data: Data is locked up inside individual tools, which makes cross-platform analytics challenging and deprives analysts and stakeholders of the ability to take the right decision at the right time.

We were not able to measure campaign effectiveness, i.e. ROAS (Return on Advertising Spend), because the spend data sits in Google AdWords accounts while the revenue data sits in the MySQL database.

We were not able to trace a customer's footprint across the entire journey from discovery > onboarding > successful transaction.

ETL Runtime: ETLs were not written in an incremental fashion, or ETL execution time increased linearly with data size because the wrong tool was chosen or the implementation was incorrect.

Data models in the Qlik Sense visualisation tool became difficult to handle as data size grew, with repercussions on both cost and time, which slowed down decision making.

Real-time data flow & reports: The business needs to monitor its health in real time, which cannot be done without real-time data flow and reports built on top of it.

We started with Amazon DMS for real-time data flow, but this AWS data pipeline failed to meet our expectations due to frequent issues, the customization needed, and the heavy day-to-day maintenance required.

We even tried FlyData for a short while, but it was getting quite expensive as data size grew month on month, and problems with duplicate entries in the database led to inconsistencies in some of the financial reports.

We also required a powerful visualisation and query tool that could be used over Redshift to cater to real-time reports.

No Predictive Analytics / ML: Limited data in the analytics layer and no real-time data flow make it difficult to build and implement data models.

If points 1, 2 and 3 are not solved, EDA (Exploratory Data Analysis), which is a prerequisite exercise, becomes a challenge, and the resulting models are highly likely to miss important features.

A model is of no use if it is not implemented correctly in terms of architecture.

Below is an image showcasing the older TT analytics architecture, with the problems from points 1 to 4 above highlighted.

Old TT Infrastructure

Journey towards New Infra

New TT Infrastructure

Amazon Redshift and Segment (a single platform that collects, stores, and routes your user data to hundreds of tools with the flick of a switch) are the two biggest pieces that helped us achieve our goals, along with customization as per TT's needs.

Data Silos or limited data: All our data now resides in Redshift, which enables us to do cross-platform analytics and ML work.

Three types of data are pushed from different sources to the destination (Redshift) using in-house data pipelines, RTMS and Segment.

  1. MySQL to Redshift: The transactional database is moved to Redshift through an in-house data pipeline (earlier we had tried Amazon DMS and FlyData for a similar job).
  2. External tools data: Data present in the respective tools is moved to Redshift through APIs with the help of Segment.
  3. Event data: Data produced from TT websites and apps is routed to Redshift with the help of Segment (a minimal sketch of this route follows the image below).
  4. Next steps: Segment has become too costly for our volume, hence we are replacing it with our in-house real-time messaging stream (“RTMS”) to relay events from any source to any destination.
Sources & Destinations
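
To make the event route (item 3 above) concrete, here is a minimal sketch using Segment's analytics-python library. The write key, user id, event names and properties are hypothetical placeholders; Segment then syncs these events to the configured Redshift destination.

```python
# Minimal sketch of sending an event to Segment from a TT service.
# Write key, user id and event names are placeholders.
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # placeholder

# Identify the traveller once so traits are attached to their profile
analytics.identify("user_123", {
    "email": "traveller@example.com",
    "plan": "holiday-package",
})

# Track a single event from the website/app; Segment delivers it to
# Redshift (and any other enabled destination) asynchronously.
analytics.track("user_123", "Trip Requested", {
    "destination": "Bali",
    "budget": 50000,
})

# Flush pending messages before the process exits
analytics.flush()
```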

Real-time data flow: The in-house data pipeline and the RTMS engine enabled the analytics team to have real-time data flow into Redshift and reports on top of it.

The in-house data pipeline is capable of handling (a simplified sketch of the sync loop follows the list):

  1. near real-time data flow with a maximum lag of 5–10 minutes
  2. duplicate entries in a particular table
  3. data-type changes required from MySQL to Redshift
  4. new columns added to existing tables
  5. full and partial re-sync for a particular table
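
The pipeline itself is proprietary, but the core idea looks roughly like the sketch below. Hosts, credentials and the `id`/`updated_at` columns are assumptions for illustration; the real pipeline also handles schema changes, type mapping and re-syncs, and stages data far more efficiently.

```python
# Minimal sketch of an incremental MySQL -> Redshift sync (illustrative only).
# Hosts, credentials and the id/updated_at columns are assumptions; a production
# pipeline would typically stage data via S3 + COPY rather than row inserts.
import pymysql
import psycopg2

def sync_table(table, watermark):
    """Copy rows updated after `watermark` from MySQL into Redshift without duplicates."""
    mysql = pymysql.connect(host="mysql-host", user="reader",
                            password="***", database="tt")
    redshift = psycopg2.connect(host="redshift-host", port=5439,
                                user="loader", password="***", dbname="analytics")

    # 1. Pull only the rows that changed since the last run
    with mysql.cursor() as cur:
        cur.execute(f"SELECT * FROM {table} WHERE updated_at > %s", (watermark,))
        rows = cur.fetchall()
        cols = [c[0] for c in cur.description]
    mysql.close()

    if not rows:
        return watermark

    # 2. Load into a staging table, then delete-and-insert into the target
    #    so re-delivered rows never create duplicates.
    placeholders = ", ".join(["%s"] * len(cols))
    with redshift, redshift.cursor() as cur:
        cur.execute(f"CREATE TEMP TABLE stage (LIKE {table});")
        cur.executemany(
            f"INSERT INTO stage ({', '.join(cols)}) VALUES ({placeholders})", rows)
        cur.execute(f"DELETE FROM {table} USING stage WHERE {table}.id = stage.id;")
        cur.execute(f"INSERT INTO {table} SELECT * FROM stage;")
    redshift.close()

    # 3. Advance the watermark to the newest change we just copied
    return max(row[cols.index("updated_at")] for row in rows)
```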

Now that all our data is in the Redshift warehouse, the next step involves processing this data into the analytics ETL layer.

ETL Runtime: As the name suggests, this involves reading data from Redshift, transforming it with business logic, and then loading it back into Redshift for building reports and insights.

There were two jumps in ETL time reduction that we achieved over a period of ~1.5 years.

  1. Moving to Redshift: Writing the ETLs (which involve joining around 50 tables with complex business rules; the final processed table has close to 300 columns) on Redshift enabled us to execute our ETL code in 1/5 of the time compared to the Qlik data model, at the same cost.
  2. Loading the ETL in an incremental fashion: As time passed and data grew roughly 3x year on year, our ETLs started taking longer in Redshift too because of the full reload we were performing. We later shifted to an incremental approach, i.e. processing only the records updated in the last day, which further reduced our ETL time to 1/10 of the previous execution time (a sketch of the pattern follows this list).
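
To make the incremental idea concrete, here is a rough sketch of the kind of reload an incremental ETL can run inside Redshift. The table names (`orders`, `fact_orders`) and the `updated_at` column are placeholders, not the actual TT schema, and the real business logic is far more involved.

```python
# Rough sketch of an incremental ETL reload in Redshift (illustrative only).
# Table names (orders, fact_orders) and the updated_at column are assumed.
import psycopg2

STATEMENTS = [
    # 1. Rebuild only the rows whose source records changed in the last day
    """
    CREATE TEMP TABLE fact_orders_delta AS
    SELECT o.id, o.customer_id, o.amount, o.status, o.updated_at
    FROM orders o
    WHERE o.updated_at >= DATEADD(day, -1, GETDATE());
    """,
    # 2. Delete-and-insert so re-processed rows never create duplicates
    """
    DELETE FROM fact_orders
    USING fact_orders_delta d
    WHERE fact_orders.id = d.id;
    """,
    "INSERT INTO fact_orders SELECT * FROM fact_orders_delta;",
]

def run_incremental_etl():
    conn = psycopg2.connect(host="redshift-host", port=5439,
                            user="etl", password="***", dbname="analytics")
    with conn, conn.cursor() as cur:   # commits on success, rolls back on error
        for stmt in STATEMENTS:
            cur.execute(stmt)
    conn.close()
```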

Our ETLs are capable of handling duplicate entries, and they alert the user whenever there is an ETL failure because of syntax issues, server problems, unknowns etc., which is integrated with the help of Opsgenie.
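
The alerting itself can be as simple as a call to Opsgenie's Alert API from the ETL wrapper. The API key, team name and messages below are placeholders, not our actual configuration.

```python
# Minimal sketch: raise an Opsgenie alert when an ETL step fails.
# The API key, team and message details are placeholders.
import requests

OPSGENIE_URL = "https://api.opsgenie.com/v2/alerts"
OPSGENIE_KEY = "YOUR_OPSGENIE_API_KEY"  # placeholder

def notify_etl_failure(etl_name, error):
    payload = {
        "message": f"ETL failed: {etl_name}",
        "description": str(error),
        "priority": "P2",
        "responders": [{"name": "analytics-team", "type": "team"}],
    }
    requests.post(
        OPSGENIE_URL,
        json=payload,
        headers={"Authorization": f"GenieKey {OPSGENIE_KEY}"},
        timeout=10,
    )

# Usage inside the ETL wrapper
try:
    run_incremental_etl()
except Exception as exc:
    notify_etl_failure("fact_orders_incremental", exc)
    raise
```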

Next steps: We are working on data lake storage on S3 and exploring Dremio to further reduce ETL time.

A data lake is a repository for structured, unstructured, and semi-structured data. Data lakes are quite different from data warehouses (such as Redshift), since they allow data to stay in its rawest form without needing to be converted and analyzed first.

Dremio delivers lightning-fast queries and a self-service semantic layer directly on your data lake storage.

We recently did a POC where we stored all the required raw tables on S3 in Parquet format and, using the Dremio connector (with a one-node server), tried running the same ETL code. We were shocked to see the results.

The ETL executed in 1–2 minutes, which is 100x faster than the current incremental loading in Redshift.

Parquet stores nested data structures in a flat columnar format, as opposed to the traditional approach where data is stored in a row-oriented way.
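
For reference, dumping a raw table to S3 as Parquet can be as simple as the snippet below. The bucket, prefix and table names are placeholders, and at TT's scale this would more likely be done with Redshift's UNLOAD ... FORMAT AS PARQUET or a Spark job.

```python
# Minimal sketch: export a raw table to S3 as Parquet so Dremio can query it.
# Bucket, prefix and table names are placeholders. Requires pyarrow and s3fs.
import pandas as pd
import pymysql

conn = pymysql.connect(host="mysql-host", user="reader",
                       password="***", database="tt")

# Read the raw table (a real export would chunk or partition this)
df = pd.read_sql("SELECT * FROM orders", conn)

# Write a compressed, columnar Parquet file directly to the S3 data lake
df.to_parquet(
    "s3://tt-data-lake/raw/orders/orders.parquet",
    engine="pyarrow",
    compression="snappy",
    index=False,
)
```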

Now that all our ETL output is in the Redshift warehouse, the next step involves connecting it to reporting tools to draw insights.

Visualisation Tools: We started with Qlik Sense five years back, but because of the existing problems, such as complex data models, data model querying time, querying on Qlik Sense itself, and the user experience, we decided to revise/replace the entire architecture keeping our vision in mind.

One of our initial mottos was democratizing data across the organisation so that data-informed decisions can be made at each and every level. We were looking for a tool that is user-friendly and fast at analysing data. After extensive research and POCs, we finalised Superset and Redash as our visualisation tools.

Superset is a data exploration and visualization platform designed to be visual, intuitive and interactive. Users can directly visualize data from tables stored in a variety of databases, including Presto, Hive, Impala, Spark SQL, MySQL, Postgres, Oracle, Redshift, and SQL Server.

Redash is an open-source data collaboration platform that enables you to connect to any data source, visualize the data and share it.

Both of these tools can be connected directly to Redshift, and power users can visualize data, draw insights and do RCA by downloading the raw data. You must be wondering: why two different tools?

  1. Superset is good for visuals and presentation, and requires minimal knowledge of SQL.
  2. Redash is good for raw data downloads for RCA purposes, for writing queries, and for alerts on Slack.

Since both tools are open source, the question of security pops up: we enabled authentication through Google (Gmail) accounts for all TT employees, which addresses our concerns around data leaks and fraud.
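
For Superset, this kind of sign-in goes through Flask-AppBuilder's OAuth support. The sketch below shows the general shape of such a config; the exact keys vary across Superset versions, and the client id/secret are placeholders, not our settings.

```python
# superset_config.py (sketch) -- Google OAuth login via Flask-AppBuilder.
# Exact keys vary by Superset version; client id/secret are placeholders.
from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH
AUTH_USER_REGISTRATION = True          # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = "Gamma"  # default (read-mostly) role

OAUTH_PROVIDERS = [
    {
        "name": "google",
        "icon": "fa-google",
        "token_key": "access_token",
        "remote_app": {
            "client_id": "GOOGLE_CLIENT_ID",          # placeholder
            "client_secret": "GOOGLE_CLIENT_SECRET",  # placeholder
            "api_base_url": "https://www.googleapis.com/oauth2/v2/",
            "client_kwargs": {"scope": "email profile"},
            "request_token_url": None,
            "access_token_url": "https://accounts.google.com/o/oauth2/token",
            "authorize_url": "https://accounts.google.com/o/oauth2/auth",
        },
    }
]
```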

When we initially launched Superset, users reported slowness. When we looked at the supporting metrics, they showed that only 60% of queries were executing within 5 seconds, which was a serious concern. By enabling caching and domain sharding we were able to reach 95% (queries executed within 5 secs), with daily requests averaging around 15k.
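
The caching part is a standard Flask-Caching setup in superset_config.py; the Redis host, DB index and timeout below are illustrative values, not our production settings.

```python
# superset_config.py (sketch) -- cache chart/query results in Redis.
# Host, DB index and timeout are illustrative values only.
CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_DEFAULT_TIMEOUT": 60 * 60,        # keep cached results for 1 hour
    "CACHE_KEY_PREFIX": "superset_results_",
    "CACHE_REDIS_URL": "redis://redis-host:6379/0",
}
```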

Please share your thoughts and feedback around this. Cheers !!!
