Written by Rashmina Menon, Big Data Engineer on January 11, 2018
Amazon Redshift is the primary data warehousing solution used at GumGum. Apart from the real time reports which are powered through Druid, Redshift fuels the majority of our reporting capabilities. Being a fully managed solution from Amazon, we do not have to maintain the Redshift cluster, be it the hardware or the redshift engine. Redshift is column oriented and its massively parallel processing architecture (MPP) makes it petabyte scalable. Amazon Redshift also includes Redshift Spectrum which can directly query unstructured data stored in S3.
A simplified picture of our ETL process looks as below:
The raw ad server logs are parsed by the Spark jobs which are loaded into Redshift raw fact tables. The raw tables are then aggregated into tables with less granularity so that the queries can run faster. These tables are queried by our reporting applications. However, at times we have to support ad hoc requests which require us to reparse the ad-server logs. This is because the data in raw tables is only retained for 35 days.
Our Redshift cluster consists of 14 dc1.8xlarge nodes, each which has 2.56 TB SSD storage. Our current data usage is almost 56% of this storage out of which 50% is used to store 35 days of raw tables alone. Hence retaining raw tables for longer duration demands increased storage needs of the cluster which is not always ideal. This data is not queried often. But Redshift architecture before Spectrum forced us to add nodes whenever we needed more storage. Adding more nodes resulted in adding more storage as well as more compute. Since it added both storage and compute, it was more expensive as well. Our real need was to just add more storage at a lower cost.
That is why Spectrum was a big relief to us. Spectrum and Athena make separation of compute with storage possible. It uses S3 as a data store and in S3 you pay for what you use. Compute is charged per query basis. Hence we decided to use Redshift Spectrum to support ad hoc querying which needs data from raw logs for longer durations of time. Our ETL process just had to be modified slightly to store the data in parquet format in S3. Parquet format ensured that the data is compressed, storage and query efficient. For optimal S3 performance, all data files were decided to be around 128 MB. The same data can also be used by Amazon Athena for querying. The modified ETL looks like this now.
Spectrum and Athena works on pay per query model where the charges are $5 per 1TB of data scanned. For efficient use, Spectrum and Athena supports partition keys which provides a lookup to the S3 folders that should be queried. As most of our ETL jobs run hourly, our partition key is a timestamp field with hourly granularity. The structure of the S3 folder should be similar to s3://path/to/the /folder/partitionKeyName=year/month/day/hour to support this partition key. For example, a valid folder name would be s3://parquet-logs/ad_events/partitionKey=2017/12/25, and the partition key also has to be named partitionKey in the schema definition of the external table.
The structure of our external tables in Spectrum looks as below:
We alter the table hourly to add the new partition value to the table. The values of the partition keys, values and the S3 location it refers can be seen in the system table svv_external_partitions. The queries in Spectrum and Athena should use the partition key to limit the data being scanned by the engine.
Amazon Redshift Spectrum external tables do not support statistics. The database engine uses heuristics or simple row counts to determine the join order. By default, when there is a query which joins external table with the Redshift table, external table is assumed to be the bigger one. Otherwise it looks for a table property called numRows to determine the larger table. We set this property every day as a part of ETL for efficient querying.
Based on the basic benchmarking, we found that Redshift Spectrum is 2–7 times slower than Redshift. Athena and Redshift Spectrum results are comparable, however when queried over long range of time, Athena seems to slightly outperform Redshift Spectrum. Redshift Spectrum has an advantage that it supports joins and other query operations with the tables stored in Redshift, while Athena does not.
To conclude, Redshift Spectrum removes hassles of scaling, maintaining servers or provisioning by its serverless architecture. In addition, it separates storage from compute and allows us to scale separately as needed. We are optimistic that Redshift with Redshift Spectrum will further democratize our data and ease the ad hoc reporting needs reducing constraints on the data storage.