AWS & Snowflake vs GCP: how do they stack up when building a data platform?

Stefano Solimito · Published in Unboxing Photobox · 15 min read · Jan 7, 2020


When we talk about data, the number of technologies available on the market is overwhelming and staying up to date is a key challenge both for businesses and for engineers.

One of the reasons why I recently joined Photobox was to be in a data-driven company with the challenge of building a new data platform using some of the most cutting-edge technologies available — AWS (Amazon Web Services), Snowflake and Looker.

I spent the last four years of my career mainly working on GCP (Google Cloud Platform), leading the development of The Telegraph data platform. After that I wanted to get out of my comfort zone, take on a new challenge and learn a new stack of technologies.

With this article, I would like to share the joys and pains of this learning journey and offer an unbiased comparison of the GCP and AWS offerings.

Please keep in mind that this is from the point of view of a data engineer who has only recently started playing with AWS and Snowflake. For this reason it would be impossible for me to give a full comparison of the services offered by Amazon and Google, so I will limit this article to the areas that I touched while building and extending the new data platform in Photobox.

The aim of this article is not to discuss in detail Photobox’s data platform architecture, but I will try to give enough context to justify the decisions that we took during our journey.

This article is divided into three main sections that cover the flow of the data in our platform from Ingestion to Warehouse:

  • Event collection.
  • Data pipeline orchestration and execution.
  • Data warehouse technologies comparison.

In each section I will describe, at a high level, what we are building in Photobox and, to provide a comparison, I will sketch how the same process could be designed on the GCP side.

Event collection

One of the first things you have to figure out when building a new data platform is how you will ingest the data.

In Photobox we built a self-service platform that can ingest both real-time events and batch data. Multiple internal and external clients can push events to the platform. If an event conforms to the expected schema it is ingested; otherwise it is discarded.

The event ingestion process in Photobox mainly uses the following AWS services:

  • API Gateway
  • SNS (Simple Notification Service)
  • Kinesis Stream
  • Kinesis Firehose
  • AWS Lambda functions
  • S3 (Simple Storage Service)

External events are sent through API Gateway and validated. If the payload is accepted, the event is published to a Kinesis stream, which propagates it to Kinesis Firehose. Firehose, through a Lambda function, anonymises any PII data that might be present in the payload (based on JSON naming conventions) and stores the anonymised records in S3, from where Snowflake ingests the data.
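
To make the anonymisation step more concrete, here is a minimal sketch of what a Firehose transformation Lambda of this kind could look like. It is not Photobox's actual code: the PII field names and the hashing choice are purely illustrative.

```python
import base64
import hashlib
import json

# Field names treated as PII. The real platform derives these from its JSON
# naming conventions; this hard-coded list is purely illustrative.
PII_FIELDS = {"email", "first_name", "last_name", "phone"}


def anonymise(value):
    """Replace a PII value with a deterministic, non-reversible hash."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def handler(event, context):
    """Entry point for a Kinesis Firehose transformation Lambda."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        for field in PII_FIELDS & payload.keys():
            payload[field] = anonymise(payload[field])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```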

Internal events are acquired in a similar manner, but using SNS topics that our data platform subscribes to.

Fig-1 Photobox events collection process

To design the equivalent process using GCP I would probably use the following services:

  • Cloud Endpoints
  • Cloud Functions (GKE or App Engine are valid options as well)
  • Pub/Sub
  • Dataflow
  • Cloud Storage

In GCP, external events are handled first by Cloud Endpoints and an application deployed in Cloud Functions. This application publishes every event to a Pub/Sub topic. Dataflow subscribes to the topic and consumes the data, applying an anonymisation function to each record, then stores the data in micro-batches in Cloud Storage.

Internal events are acquired in a similar manner: events are published to multiple Pub/Sub topics, then anonymised using Dataflow and stored in Cloud Storage.
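
As a sketch of the Dataflow side, the pipeline below reads from Pub/Sub, applies a placeholder anonymisation function and writes windowed micro-batches to Cloud Storage. The project, topic and bucket names are hypothetical, and the real anonymisation logic would be driven by the same JSON naming conventions mentioned above.

```python
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def anonymise(message):
    """Placeholder anonymisation: real logic would mask PII fields by name."""
    event = json.loads(message.decode("utf-8"))
    if "email" in event:  # illustrative field name only
        event["email"] = "REDACTED"
    return json.dumps(event)


def run():
    # Streaming mode; project/region/runner flags omitted for brevity.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")
            | "Anonymise" >> beam.Map(anonymise)
            | "MicroBatch" >> beam.WindowInto(FixedWindows(60))
            | "WriteToGCS" >> fileio.WriteToFiles(
                path="gs://my-raw-events-bucket/events/",
                sink=lambda dest: fileio.TextSink())
        )


if __name__ == "__main__":
    run()
```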

This is obviously only one solution to the problem. I'm sure that, discussing with AWS or GCP experts, you would find the same result can be achieved in several different ways.

Fig-2 Photobox events collection process as it would look using GCP

If we start to compare the two solutions from the “external events ingestion” branch, we can see that on one side we have Amazon API Gateway, while on the other side we are using two components, Cloud Endpoints and Cloud Functions, to perform the same task.

For those not familiar with AWS, Amazon API Gateway is a service for creating, publishing, maintaining, monitoring and securing REST and WebSocket APIs at any scale. Routing requests from an API gateway into Kinesis is easy and doesn’t require additional components to connect the two services.

Fig-3 AWS API Gateway POST /event endpoint used to collect external events in Photobox data platform.

Looking at API Gateway limits we can see that by default it can manage a stunning amount of throughput — 10,000 requests per second (RPS) with a maximum payload size of 10 MB (see here for details on soft and hard limits of the service). During peak time at Photobox the number of external events per second that we collect might exceed that quota in the future but, luckily, this is a soft limit that can be increased if needed.

On the GCP side, Cloud Endpoints is a distributed, NGINX-based proxy-as-a-service that provides access control for your APIs and validates REST requests. This layer can be easily integrated with an application deployed in Cloud Functions (or another GCP service, like GKE) to efficiently route events into Pub/Sub.

Google Cloud Functions is a serverless execution environment for single-purpose functions that are attached to events emitted from GCP cloud infrastructure and services. It plays a role similar to Lambda functions in AWS. Cloud Functions can handle up to 100,000,000 requests every 100 seconds with a maximum payload size of 10 MB (see here for detailed limits and quotas).
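
As an illustration of how thin this layer can be, here is a minimal HTTP-triggered Cloud Function that validates a payload and forwards it to Pub/Sub. The project and topic names, and the validation rule, are hypothetical.

```python
import json

from google.cloud import pubsub_v1

# Hypothetical project and topic names.
PROJECT_ID = "my-project"
TOPIC_ID = "external-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def collect_event(request):
    """HTTP-triggered Cloud Function: validate the payload and publish it."""
    event = request.get_json(silent=True)
    if not event or "event_type" not in event:  # minimal validation only
        return ("Invalid payload", 400)
    future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
    future.result()  # block until Pub/Sub acknowledges the message
    return ("Accepted", 202)
```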

Both the GCP and AWS services for event collection are serverless, fully scalable and able to handle a high volume of events. The main difference that I can see between the two solutions is the number of components used to achieve the same result — one on AWS vs two on GCP. I've always been a fan of simple designs, since having fewer components to maintain and monitor generally makes life easier.

Comparing the streaming and anonymisation part in Fig-1 and Fig-2 we can see that in AWS, Kinesis Stream and Kinesis Firehose (with a Lambda function) are used while in GCP, Pub/Sub and Dataflow perform the same task.

AWS Kinesis Data Streams is a scalable and durable real-time data streaming service. In our platform it is used in front of Kinesis Firehose to provide seven days of data retention and read throttling (if needed). Data records are managed in shards, and it is possible to scale the number of shards up and down based on need. Because resharding (increasing or decreasing the number of shards) is not automatic, Kinesis requires constant monitoring through CloudWatch to keep the number of shards tuned to the volume of data.

Resharding supports two operations: two shards can be merged into one, or a shard can be split in two. Each shard can handle up to 1,000 PUT operations per second with a maximum data blob size of 1 MB.
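
In practice this means wiring CloudWatch alarms to a small piece of automation (or a runbook) that calls the resharding API. A minimal boto3 sketch, with a hypothetical stream name:

```python
import boto3

kinesis = boto3.client("kinesis")


def open_shard_count(stream_name):
    """Return the number of currently open shards for a stream."""
    summary = kinesis.describe_stream_summary(StreamName=stream_name)
    return summary["StreamDescriptionSummary"]["OpenShardCount"]


def scale_stream(stream_name, target_shards):
    """Reshard a stream; Kinesis performs the splits/merges for us.

    Note that UpdateShardCount only allows scaling within certain bounds
    (e.g. at most doubling or halving the shard count per call).
    """
    kinesis.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target_shards,
        ScalingType="UNIFORM_SCALING",
    )


# Example: double the shards of a hypothetical "external-events" stream.
scale_stream("external-events", open_shard_count("external-events") * 2)
```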

Kinesis Firehose, on the other hand, is fully managed and scales automatically. In our platform it applies the anonymisation Lambda function to each batch of records and then automatically writes the results into S3 buckets. This ensures that no PII data is ingested into Photobox's data platform, keeping us compliant with GDPR.

On GCP, Pub/Sub plays the role of Kinesis Data Streams. It is a scalable, durable event ingestion and delivery system used for stream analytics pipelines. It provides many-to-many asynchronous messaging that decouples senders and receivers. It doesn't require any management and can handle 1,000 MB/s of publishing throughput and 2,000 MB/s of subscription throughput with a maximum message size of 10 MB. Detailed quotas and limits are available here.

Dataflow (Apache Beam) plays the same role as Kinesis Firehose with an anonymisation Lambda. It consumes messages from Pub/Sub, applies the anonymisation logic and writes micro-batches into a Cloud Storage bucket.

In this case, GCP uses fewer components than AWS to achieve the same result. Also, Pub/Sub (unlike Kinesis Stream) doesn’t require any management to pre-allocate shards or to scale the process based on a monitoring system.

On the other side, Kinesis Firehose handles micro-batch partitioning into S3 automatically and supports an inline transformation function defined in a Lambda. This seems more convenient than deploying a Dataflow pipeline in GCP and writing your own logic to micro-batch records into Cloud Storage.

In the internal events ingestion branch on the AWS side, Photobox internal clients rely on SNS topics. Records are published to the Kinesis stream through a Lambda function, and lastly Firehose applies the anonymisation function and writes to S3.

On the GCP side the same process would use two components: one or more clients publish to Pub/Sub topic(s), and a Dataflow pipeline consumes, anonymises and writes the records into Cloud Storage.

This second approach has fewer moving parts to be monitored, therefore it seems simpler to maintain.

Data pipeline orchestration and execution

Another key point when you design a data platform is which tool(s) you use to transform the data and how you orchestrate multiple data pipelines.

The transformation tool I'm really enthusiastic about is Data Build Tool (DBT), and all the ELT (Extract, Load, Transform) in Photobox's data platform relies on it. If you want to know more about my previous experience with DBT, have a look at this article I wrote about DBT at The Telegraph.

Fig-4 How DBT pipelines are orchestrated in Photobox data platform.

As you can see from Fig-4, Apache Airflow is the scheduler of choice in Photobox, and it is used to orchestrate all our data pipelines. Airflow is deployed in Amazon ECS using multiple Fargate workers. As part of the Airflow deployment, AWS RDS stores the scheduler metadata and ElastiCache backs the queueing service.

Airflow has a pool of lightweight workers that don't perform any computationally intensive jobs; their only purpose is launching ECS task definitions, which are the actual data pipelines. The transformation logic is written in DBT and wrapped in a multi-purpose Python application that allows us to run any operations that might not be supported in DBT.

A DBT pipeline defines transformation steps as models (SQL queries) which are linked in DAGs (Directed Acyclic Graphs — not to be confused with Airflow DAGs). A DBT DAG ensures that a set of queries is performed in the correct sequence to transform the data as desired. Once the Fargate worker that runs DBT has executed the task, it is released, freeing up unneeded computational resources. This ensures that the only application constantly running in our ECS cluster is Airflow, while all the data pipelines run on ephemeral workers that are alive just long enough to perform the task.
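
To give a flavour of the hand-off between Airflow and ECS, here is a minimal DAG sketch that launches a DBT run as a Fargate task. The cluster, task definition, container and subnet identifiers are placeholders, and the operator's name and import path vary with the Airflow/provider version (older releases call it ECSOperator).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="dbt_daily_transform",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dbt = EcsRunTaskOperator(
        task_id="run_dbt_models",
        cluster="data-platform",          # hypothetical ECS cluster
        task_definition="dbt-runner",     # the containerised data pipeline
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "dbt-runner", "command": ["dbt", "run"]},
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {"subnets": ["subnet-aaaabbbb"]},
        },
    )
```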

Fig-5 How the same solution can be designed in GCP.

If we design the same process using GCP, then Cloud Composer can be used as a replacement for a custom Airflow deployment.

Cloud Composer is nothing more than a managed version of Airflow, hosted by Google, that normally requires minimal tuning to be usable for most common use cases. On the other hand, if you are used to having full control of your Airflow deployment, this product may feel too restrictive.

To run data pipelines, instead of ECS, Google Kubernetes Engine (GKE) can be used and Kubernetes Jobs can take the place of ECS Task definitions.
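
From Cloud Composer, one common way to express the same ephemeral-worker pattern is the KubernetesPodOperator, which launches a pod per run on the GKE cluster (submitting a Kubernetes Job manifest is an equivalent approach). The namespace and image below are hypothetical, and the import path depends on the Airflow/provider version.

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Defined inside a DAG context, exactly like the ECS example above.
run_dbt_on_gke = KubernetesPodOperator(
    task_id="run_dbt_models",
    namespace="data-pipelines",                       # hypothetical namespace
    name="dbt-runner",
    image="eu.gcr.io/my-project/dbt-runner:latest",   # hypothetical image
    cmds=["dbt", "run"],
    is_delete_operator_pod=True,  # clean up the pod once the run finishes
)
```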

(Let me open a brief aside on this. Google normally suggests running your data transformations using Dataflow. I personally find Dataflow really useful with streams of data whereas, if your team speaks SQL, I'm of the opinion that DBT is a better tool to run batch transformations. DBT can fully unleash the power of your cloud DW and is designed to handle dependencies among models, running your logic in a reliable way.)

Comparing the two solutions, we can see that AWS does not offer a managed version of Airflow while GCP does. This is not a deal-breaker, but personally I don't like to invest data engineers' time in managing and deploying infrastructure unless strictly needed. I think it is important to know how to do data ops, but it's even more important to know when to avoid doing it. This can free up resources that can be used to deliver real value to the business.

On the other hand, I was pleasantly surprised by ECS. Provisioning and de-provisioning Fargate workers is fast. It takes roughly one minute to have a worker ready to run your data pipeline and another minute to kill it. You might hit an AWS quota limit running multiple Fargate workers in parallel but this is a soft limit that can be circumvented by asking AWS for a quota increase.

On the GCP side, in my experience, if a node in the GKE cluster can allocate the desired resources then creating a Kubernetes Job is really fast, but if the GKE cluster doesn't have a node available it will have to auto-scale. This operation normally requires a few minutes, delaying the data pipeline execution.

Depending on the usage that you are expecting and the time constraints that you might have when running your data pipelines, one technology might suit your needs better than the other.

I think, in this specific context, you rarely see the difference until you have a data pipeline that has a short, predetermined time slot to run and requires high computational resources. In that case you might have to pre-allocate a node in your GKE cluster to be sure the resources are going to be ready when requested, or have a process that will force your cluster to auto-scale a few minutes before your pipeline is supposed to run.

Using ECS you can specify the hardware requirements for the task in the task definition and the Fargate worker provisioning time is likely to be similar.

Data warehouses comparison

This section could easily become an article itself. For this reason I will try to be concise, highlighting only the main characteristics of Snowflake (the DW that we are using in Photobox) and BigQuery, its GCP equivalent.

Snowflake doesn’t sit under the large AWS umbrella but it seemed to us the best option as a first building block for our platform.

The reason why we adopted Snowflake was mostly to overcome limitations that you might find in other products like Athena and Redshift. Athena is an interactive query service that, at the time of our testing, didn't support any updates on existing tables. An essential feature like CTAS (Create Table As Select) was only introduced at the end of 2018, and the allowed number of concurrent queries was not a fit for the number of tasks that we run in parallel in our data platform. With Athena we also struggled to handle high volumes of nested JSON records.

Redshift itself doesn’t support schema-on-read.

You can access the Glue catalog from Redshift or use Spectrum to access data stored on S3; nevertheless, joins between Spectrum or Athena tables and internal Redshift tables happen in Redshift, so query performance depends on the size of the Redshift cluster.

For all the reasons above we decided to opt for Snowflake, which gives us better decoupling between storage and compute.

Snowflake is Software-as-a-Service (SaaS) and uses a new SQL database engine with a unique architecture designed specifically for the cloud, which allows it to process petabyte-scale data at remarkable speed. It has no limit on concurrent queries, assuming you have enough resources available to handle them, and requires minimal knowledge of the infrastructure under the hood.

Fig-6 Snowflake UI

The DW provides a web UI (Fig-6) and everything, from permission management to scaling up a virtual warehouse, can be handled with SQL statements. Snowflake supports schema-on-read managed through views and stages, which allows smooth JSON schema changes in the ingestion layer. With Snowflake, raw data can be stored in S3 and accessed through external tables.
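
As a small illustration of this SQL-first approach, the sketch below scripts warehouse sizing, an external stage over S3 and schema-on-read over raw JSON. All the names, sizes and the S3 location are hypothetical, and the stage's credentials or storage integration are omitted.

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    role="SYSADMIN", database="RAW", schema="EVENTS",
)

statements = [
    # Resize a virtual warehouse and suspend it after 5 idle minutes.
    "ALTER WAREHOUSE transform_wh SET warehouse_size = 'LARGE' auto_suspend = 300",
    # External stage over the anonymised events bucket (credentials omitted).
    """CREATE STAGE IF NOT EXISTS raw_events_stage
         URL = 's3://my-anonymised-events-bucket/'
         FILE_FORMAT = (TYPE = 'JSON')""",
    # External table exposing the raw JSON as a VARIANT column...
    """CREATE EXTERNAL TABLE IF NOT EXISTS raw_events_ext
         WITH LOCATION = @raw_events_stage
         FILE_FORMAT = (TYPE = 'JSON')""",
    # ...and a schema-on-read view on top of it.
    """CREATE OR REPLACE VIEW raw_events AS
         SELECT value:event_type::string AS event_type,
                value:payload            AS payload
         FROM raw_events_ext""",
]

cur = conn.cursor()
for statement in statements:
    cur.execute(statement)
```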

On the GCP side, BigQuery is also Software-as-a-Service (SaaS) and doesn't require any infrastructure management. Like Snowflake, it can be accessed through a web UI, and Cloud Storage can be used to host raw data, with BigQuery also using external tables to access it.
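
The equivalent on BigQuery, using the google-cloud-bigquery client with hypothetical project, dataset and bucket names, could look roughly like this:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Point an external table at the raw JSON files landed in Cloud Storage.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://my-raw-events-bucket/events/*"]
external_config.autodetect = True

table = bigquery.Table("my-project.raw.events_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```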

Speed-wise, it's not easy to judge. It seems to me that Snowflake is faster than BigQuery in head-to-head tests, and this impression seems supported by several benchmarks.

On the other hand, you might find engineers mentioning that in some situations BigQuery outperforms Snowflake.

The truth is that in different contexts you might get different performances and it really depends on:

  • Size and structure of the data.
  • Type of queries that you are running and usage pattern.

Regarding pricing models, with Snowflake you pay credits per hour for each virtual warehouse, plus the data storage cost, which is normally negligible and aligned with your cloud provider's prices. Basically, you pay only when you have machines up and running executing your queries, so the total cost mostly depends on your usage pattern and on whether your virtual warehouses are suspended when not in use.

Luckily, Snowflake provides an auto-suspension option to switch off a virtual warehouse when no queries are executed for a certain amount of time.

With BigQuery you pay for the amount of data you read during your queries ($5/TB), plus the cost of the storage (currently $0.02/GB/month). When possible, the UI provides an estimation of how much data you are going to query, so the pricing is usually fairly transparent.

Google also offers a flat-rate pricing plan as an alternative to the pay-per-query model.
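
To make the two pricing mechanics concrete, here is a back-of-the-envelope sketch. The $5/TB and $0.02/GB/month figures are the ones quoted above; the credits-per-hour rate, the credit price and all the usage numbers are assumptions that vary with warehouse size, Snowflake edition, region and workload.

```python
# Snowflake: you pay for warehouse uptime (assumed Medium warehouse at
# 4 credits/hour and an assumed $3 per credit).
credits_per_hour = 4
credit_price_usd = 3.0
hours_running_per_day = 6          # e.g. nightly ELT plus some ad-hoc queries
snowflake_monthly = credits_per_hour * credit_price_usd * hours_running_per_day * 30

# BigQuery: you pay per TB scanned plus storage ($5/TB, $0.02/GB/month).
tb_scanned_per_day = 2
storage_gb = 5_000
bigquery_monthly = tb_scanned_per_day * 5 * 30 + storage_gb * 0.02

print(f"Snowflake ~ ${snowflake_monthly:,.0f}/month, BigQuery ~ ${bigquery_monthly:,.0f}/month")
# -> Snowflake ~ $2,160/month vs BigQuery ~ $400/month under these assumptions;
#    flip the usage pattern and the comparison flips with it.
```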

I can see the pros and cons of both models, and the right choice really depends on the usage patterns you expect on your data warehouse.

Snowflake could become really expensive if you accidentally leave a virtual warehouse on, or need to have one constantly available. On the other hand, with BigQuery, a team of analysts querying high volumes of data could run up a big bill pretty fast.

In both cases, education is the key. If you don’t want to lose control of your costs it is important to educate engineers, analysts and any other consumer to use these cloud data warehouse technologies in the proper way. Data has to be wisely modelled and optimised for consumption, and in some cases role-based restrictions might come in handy. This can help to keep control of who is accessing your DW (and how) and prevent data from being consumed in the wrong way.

Aside from performance and price, where I think Snowflake shines compared to BigQuery is the learning curve and how friendly (SQL-wise) it is.

When I started using BigQuery a few years ago I had to learn BigQuery (Legacy) SQL and, shortly after, BigQuery Standard SQL. In both cases I had to dig into the documentation to understand how to perform a simple cast or query a partition. I was simply unable to start to use the product without doing the proper reading. Google had to write guides on how to migrate codebases from Legacy SQL to Standard SQL since direct equivalents of some functions were not available.

The first time I got access to Snowflake I started writing SQL statements without reading any documentation. I didn’t know the proper syntax so I wrote it as though querying a MySQL database. I knew that my query would probably raise dozens of errors (and actually I did get a few) but fixing my syntax was trivial and it took me only a few moments to have a running query.

I really appreciate the effort that the Snowflake team put into making their data warehouse easily accessible, without the need to take a deep dive into the documentation unless I'm using a Snowflake-specific function. Sometimes I find myself writing queries as I would in Oracle and, surprisingly, they work fine!

Conclusions

After my first four months in Photobox I can say that I deeply appreciate some of the tools that AWS offers, and I totally fell in love with Snowflake. At the same time it's hard not to compare with what GCP offers and, sometimes, I find myself wondering what it would be like to have some of GCP's tools on the AWS side.

It is obviously impossible to have everything and what is really better depends on the use cases that you have and on which tools your engineering team feels more confident using. One tool might seem great in a certain situation and less good in another. For this reason it’s hard to say one cloud provider is overall better than another.

What is certain is that I'm glad of the opportunity to work with the great team of engineers that we have in Photobox and the freedom that we have to try new solutions and learn new technologies. Our mission to build one of the most cutting-edge cloud-based data platforms is only beginning, but it makes me proud every time we add a new piece to the puzzle.

Stefano Solimito is a Principal Data Engineer at Photobox. You can follow him on LinkedIn.
