Breaking into Big Data at SnapTravel

Nim Wijetunga
Super.com
Published in SnapTravel · Apr 29, 2019 · 9 min read

Big Data. The Two Favourite Words of Every Data Engineer!

SnapTravel is a startup headquartered in Toronto with a strong data-driven culture. Aside from business decisions, data plays a large role within the organization from hiring decisions all the way down to the frequency of in-person standups. As an Engineering Intern at SnapTravel, I was immersed in this culture starting from Day 1.

The Project

My project, which, unbeknownst to me, would take three months to complete, seemed rather simple: design and implement SnapTravel’s Supply ETLs:

  • Supply/Demand: Measure consumer demand for hotel properties and the corresponding supply provided by SnapTravel
  • Supplier Availability: Measure SnapTravel supplier responsiveness to consumer search requests
  • Expired Rates: Measure the number of expired rates* for hotel properties, on a per supplier basis

Three ETLs, how bad could it be? Oh, how wrong I was.

*Expired Rate: A rate (for a hotel room) whose price the supplier has since increased beyond what it was originally being sold for

The Design Doc

To be honest, I had no idea where to begin, as I was not clear on the exact requirements or the business use cases for this data. In fact, it was quite surprising to discover that there was no business owner for this project, considering it would be our BizOps team consuming this data. As any good engineer would, I started by creating a design specification that outlined the requirements and key results we expected from these ETLs, along with the technical details of how we would engineer them.

Below, I explain my thought process, both for the technical design decisions behind these ETLs and for critically analyzing why SnapTravel chose to build its data infrastructure the way it did.

Hadoop v SQL

These ETLs would need to consume user search request data and the corresponding rates provided by SnapTravel. The raw data exceeded 500 GB in our data warehouse, Snowflake, requiring intelligent solutions for aggregating, transforming, and extracting useful information.

If you ask any Data Engineer at the Big-N companies, they may jump straight to Apache Hadoop and its signature MapReduce programming model. In our case, MapReduce could be used in the following manner:

MapReduce takes a large dataset and “maps” (splits) it across several nodes (or servers) for parallel computation. These partial results are then “reduced” (combined) into a single data source.
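As a toy illustration of the map/reduce idea in plain Python (not Hadoop itself, and not the actual SnapTravel workload), here is how search counts per hotel property could be computed in parallel across partitions and then merged:

```python
from multiprocessing import Pool
from collections import Counter
from functools import reduce

def map_partition(rows):
    """'Map' step: each worker counts searches per property in its partition."""
    counts = Counter()
    for row in rows:
        counts[row["property_id"]] += 1
    return counts

def reduce_counts(left, right):
    """'Reduce' step: merge two partial counts into one."""
    left.update(right)
    return left

if __name__ == "__main__":
    # Hypothetical partitions of raw search-request rows.
    partitions = [
        [{"property_id": "hotel_1"}, {"property_id": "hotel_2"}],
        [{"property_id": "hotel_1"}, {"property_id": "hotel_1"}],
    ]
    with Pool(processes=2) as pool:
        partials = pool.map(map_partition, partitions)  # parallel "map"
    totals = reduce(reduce_counts, partials, Counter())  # final "reduce"
    print(totals)  # Counter({'hotel_1': 3, 'hotel_2': 1})
```

Hadoop applies the same pattern, but across many machines backed by a distributed filesystem rather than local processes.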

So, why didn’t we end up using Hadoop? Well, to understand that, let’s take a look at the pros and cons of this widely used big data technology in relation to SnapTravel’s needs at the time.

Pros

  • Scalability: MapReduce allows for parallel compute utilizing resources across several nodes, greatly reducing compute time. This allows for a quick pipeline between the aggregation of data to the presentation of results. In addition, you can work within a framework instead of dealing with raw SQL strings!
  • Availability: Since compute is distributed across several nodes, if a single node (server) were to go down, the client would still get data from the available servers and receive the remaining data later (low wait times!)

Cons

  • Increased Infrastructure Maintenance: With a data team of two full-time engineers and an intern, creating and maintaining a Hadoop infrastructure would be a strain on resources. We already had our hands full maintaining our relatively new data warehouse (Snowflake), and Hadoop would have stretched the team quite thin (even Uber didn’t use Hadoop at first)
  • Cost: While Hadoop is cost-effective for organizations with very large datasets (terabytes to petabytes), the cost of operating multiple servers for three moderately sized ETLs did not yet make sense, especially since we were already spending a large amount of money on external services like AWS.
  • Hadoop’s Model: While Hadoop’s model is excellent for ingesting continuously updating data feeds (thanks to its high availability), we were dealing with much more static data (data that was updated once daily). SQL is better at guaranteeing data consistency than availability, and consistency was what we needed for these ETLs.

Airflow

Great, so now I could write the queries, but the next hurdle was automation: how would these tasks run daily with minimal supervision?

There were a few options for scheduling tools; however, the two that stood out were Luigi (a data pipelining tool built by Spotify) and Airflow (a workflow orchestration tool from the Apache Software Foundation). So why did SnapTravel choose Airflow? Let’s take a look below:

Task Management

  • Airflow is a tool whose philosophy is more or less “set it and let it be”. Airflow works using a DAG (Directed Acyclic Graph) data structure. Each Airflow job is specified as a DAG, constructed from multiple tasks (each task can declare upstream, downstream, or parallel tasks). Airflow has a built-in scheduler which uses the DAG definition, in addition to the current state of running tasks (in its metadata database), to decide which task to run next.
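To make this concrete, here is a minimal sketch of what one of these jobs might look like as an Airflow DAG (the DAG id, task names, and callables are illustrative placeholders, not SnapTravel’s actual code; the import path follows the Airflow 1.x API that was current at the time):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

default_args = {
    "owner": "data-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# One DAG per ETL; the scheduler reads this definition plus task state
# from the metadata database to decide what to run next.
dag = DAG(
    dag_id="supplier_availability_etl",  # hypothetical name
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",          # run once per day
    catchup=False,
)

def extract(**context):
    pass  # e.g. pull the previous day's raw search/rate data from Snowflake

def transform(**context):
    pass  # e.g. aggregate responsiveness per supplier

def load(**context):
    pass  # e.g. write the aggregated results to a reporting table

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

# Upstream/downstream dependencies form the DAG: extract -> transform -> load.
extract_task >> transform_task >> load_task
```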

Usability/Maintainability

  • Running three ETLs may not seem like a big deal, but when making a decision about integrating with 3rd party software/tools you want to think about usability and maintainability as it can affect the speed at which an organization scales
  • Airflow has a very powerful UI that allows you to visualize, modify, and execute DAGs while providing helpful tools to debug and identify issues/failures
  • Cool Fact: Google’s Cloud Composer is backed by Airflow!

Distributed Execution

  • Perhaps Airflow’s most significant advantage over Luigi is the fact that Airflow supports distributed execution (see the architecture diagram in Sicara’s write-up: https://blog.sicara.com/using-airflow-with-celery-workers-54cb5212d405)
  • The Airflow scheduler sends a task to the Celery executor. The executor then sends this task to a broker (such as Redis or RabbitMQ). Finally, the task is executed on a Celery worker (with the return value logged in a metadata database)
  • Having an executor like Celery completes the automated end-to-end process (Airflow scheduler to Celery worker), leading to efficient data pipelines and ultimately a shorter development cycle; a rough configuration sketch follows below
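For reference, switching Airflow onto the Celery executor is mostly a configuration change. The sketch below shows the relevant settings as Airflow’s standard environment-variable overrides (normally set in airflow.cfg or the shell); the broker and result-backend URLs are placeholders, not SnapTravel’s actual endpoints:

```python
import os

# Standard Airflow overrides of airflow.cfg keys via AIRFLOW__<SECTION>__<KEY>.
# Values here are placeholders for illustration only.

# Hand tasks off to Celery workers instead of running them locally.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "CeleryExecutor"

# Broker the scheduler pushes tasks to (Redis here; RabbitMQ also works).
os.environ["AIRFLOW__CELERY__BROKER_URL"] = "redis://localhost:6379/0"

# Backend where task state and results are recorded.
os.environ["AIRFLOW__CELERY__RESULT_BACKEND"] = "db+postgresql://airflow:airflow@localhost/airflow"
```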

Optimizations

Finally, we had all the tools needed to design and execute these ETLs, but I soon realized that the task was far from complete. The first issue we came across was the ETL runtimes. As a growing organization, our data volume was constantly increasing. What started off as 50 MB/day of raw data in November 2018 easily scaled to 1 GB/day by the end of January 2019, with no signs of slowing down.

Our initial idea was to run all three ETLs in parallel with one another. Since each ETL was independent of the others, this seemed feasible at first. What we soon realized was that as the data increased in size, so did our querying time. Since all the queries for these ETLs were executed on Snowflake (our data warehouse), the ETL runtimes became so long that Data Analysts were not able to run their own queries (due to limited computing resources).

Since Hadoop was not a feasible option at the moment, we had to find another way to optimize these queries.

Phase 1: The Bernoulli Sample

Out of the three ETLs, the one taking the longest time was Supplier Availability, with a daily run time averaging 5 hours. Our goal was to get this ETL to run in 1 hour or less.

This is where we went back to the drawing board with our BizOps team to really flesh out the use cases for this data. As it turned out, the main purpose of Supplier Availability was to compare the responsiveness of SnapTravel suppliers relative to one another. This meant that we did not need to have “exact numbers” for each supplier and instead could create a sample population to make these comparisons.

We knew that the type of sampling we wanted was Bernoulli sampling, in which each element of the population has the same chance of being included in the sample, independently of the others. And luckily for us, the Snowflake engine supports the Bernoulli sampling method.
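In Snowflake, this is essentially a one-line change to a query’s FROM clause. Here is a rough sketch using the Snowflake Python connector (the table and column names are hypothetical, not our actual schema):

```python
import snowflake.connector

# Illustrative only: connection details and schema are placeholders.
conn = snowflake.connector.connect(
    user="...", password="...", account="...",  # credentials elided
    warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
)

query = """
    SELECT supplier_id, search_id, responded_at
    FROM supplier_rates SAMPLE BERNOULLI (10)  -- each row kept independently with 10% probability
"""

rows = conn.cursor().execute(query).fetchall()
```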

Modified Supplier Availability ETL Pipeline

Introducing a 10% Bernoulli sample reduced the Supplier Availability ETL runtime by 90% (bringing the average runtime down to around 30 min). While this exceeded our initial goal (< 1 hr runtime), it was not yet time to celebrate: we had made a huge oversight in our new sampled pipeline, causing some of our data to become inaccurate and thus unusable by our BizOps team.

Phase 2: Downsampling??

Since sampling seemed like a success, we wanted to use the same sampled data for the other two ETLs (Supply/Demand and Expired Rates), as they would soon run into the same problems as Supplier Availability. Shortly afterwards, our business team pointed out some inconsistencies in the Expired Rates data. The root cause? Our sampling method.

A part of the Expired Rates ETL used a query to find the lowest rate provided by each supplier for a given search request. This rate was then classified as either Valid or Expired. As you may recall, we used Bernoulli sampling on the rates table to select 10% of the raw data. This meant that the lowest rate per supplier would not always be included in the 10% sample we extracted. Since this minimum rate was used to directly identify whether a supplier provided an expired or valid rate, we needed to modify our original sampling solution.

One option to fix our sampling issue was downsampling. Downsampling is essentially reducing the rate at which data is sampled. In our case, we would log our search requests (and the corresponding rates provided) at larger time intervals, meaning that the sampled data would be reduced in size while maintaining accuracy. The second option was to modify our existing solution to take the “minimum rate” case into account. We decided not to move forward with downsampling for one important reason: the Single Source of Truth (SSOT) principle.

Downsampling would mean that the primary data for these ETLs (in Snowflake) would be a sampled source of all search requests and rates. While this would work effectively for the three ETLs, it violates the Single Source of Truth principle: every data element should be stored exactly once (and any references to that data element should lead back to a single source).

MSOT (Left) vs SSOT (Right) Architecture
  • With an MSOT (Multiple Sources of Truth) architecture, the logged data can live in either the sampled or raw data DB meaning that any references to this logged data can lead back to multiple sources
  • With an SSOT architecture, the logged data lives in the Raw Data DB (RDDB) exclusively, and any references to this logged data all lead back to the single RDDB instance

Having an SSOT architecture allows for data accountability/traceability as our Data Warehouse scales.

Since downsampling was not the right solution, we decided to modify our existing sampling process as follows:

Modified Sampling Solution

Using the solution above, we were able to maintain accuracy across all ETLs while still keeping the runtimes reasonable as our data scaled.
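For the curious, a hypothetical Snowflake query implementing this kind of modification (again with placeholder table and column names, not our real schema) might look like the following: always retain each supplier’s minimum rate per search, and Bernoulli-sample the rest.

```python
# Hypothetical sketch: keep every (search, supplier) minimum rate, sample the rest.
MODIFIED_SAMPLE_QUERY = """
WITH min_rates AS (
    -- Always keep the cheapest rate each supplier returned for each search,
    -- since the Expired Rates classification depends on it.
    SELECT *
    FROM supplier_rates
    QUALIFY rate = MIN(rate) OVER (PARTITION BY search_id, supplier_id)
),
sampled_rest AS (
    -- Bernoulli-sample the remaining rows to keep runtimes manageable.
    SELECT *
    FROM supplier_rates SAMPLE BERNOULLI (10)
)
SELECT * FROM min_rates
UNION            -- UNION (not UNION ALL) removes rows that appear in both branches
SELECT * FROM sampled_rest
"""
```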

And Finally, The End Was In Sight!

Working on this project allowed me to gain a deeper understanding of the complexities and nuances associated with big data. In addition, it gave me the creative freedom to explore alternatives to industry standards and critically analyze the delicate balance between technical ease, maintainability, and sustaining resources.

Above all, I learned that a Data Engineer’s work is not finished once data is in the warehouse (otherwise I would have been done in two weeks)! Working with our BizOps team to QA the data, working with our Engineering Manager and Senior Data Engineer to optimize infrastructure, and documenting the resulting data (yes, column by column) was a long (and sometimes painful) but important process to ensure these ETLs were accurate and met business requirements.

While my time at SnapTravel may be coming to an end, I hope that the data produced by these ETLs will continue to be of use long after I am gone!

Thanks to Nehil Jain (SnapTravel’s Data Team Lead) and Ryan Fox (SnapTravel’s Engineering Manager) for reviewing this article!
