Why we didn’t go with Redshift as our data warehouse

Aseem Bansal
Published in towards-infinity · Jul 17, 2018 · 6 min read

I had written earlier about why we chose Snowflake as our data warehouse. Going through it, I noticed that for simplicity I had skipped over some details which might help others understand our pain points better.

How Redshift came into the picture

Our infrastructure is on AWS. When discussions about a data warehouse started, Redshift came into the picture automatically. Going with anything non-AWS would have meant paying data egress costs. This is typical of every cloud provider I know of: putting data in is cheaper than taking data out and moving away.

How data was stored in AWS S3

  • I took 2 event types as examples, but there were more.
  • The data covered around 3 years.
  • None of the files had a fixed schema.
  • Some had schemas that changed over time, e.g. a field would sometimes be a number and other times a boolean (see the small illustration after this list). Sometimes those were mistakes, sometimes they were due to temporary one-off problems in our integrations, but they were there and we needed to deal with them in our data warehouse of choice.
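
To make that concrete, here is a made-up illustration of the kind of drift we saw. The field names are invented for this post, not taken from our real events:

```
# Invented example events, not our real data: the same logical field shows up
# with different types in different files.
event_from_one_file = {"eventType": "EventA", "userId": 42, "isActive": 1}       # flag as a number
event_from_another = {"eventType": "EventA", "userId": "42", "isActive": True}   # flag as a boolean, id as a string
```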

How Redshift was evaluated

This is arranged in a very raw format: I tried this, this happened, this was the problem I hit. Basically I am writing down a log of what I did, as far as I remember it. Hopefully this can help someone who is evaluating Redshift.

I started simple. I took one of the events which had a fixed schema and started exploring ways to put the data in.

Initially I was setting up VPCs and changing network rules in AWS. That was my first time doing that kind of admin work in AWS, so it took me some time.

After that was done I used AWS Glue to put the data in. AWS Glue is AWS’s ETL tool for moving data into various places; it is a wrapper on top of Apache Spark. The workflow was simple: run an AWS Glue crawler on the S3 file, the crawler discovers the schema and creates a metadata table in the AWS Glue catalog, and using that catalog I was able to create the table in AWS Redshift. After the table was created I had to run an AWS Glue job to insert the data. That worked.
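
For anyone who has not used Glue, here is a minimal boto3 sketch of that crawler-then-job flow. The crawler, database, and job names are made up; this is an illustration of the moving parts, not the exact code I ran (most of this went through the console):

```
import time
import boto3

glue = boto3.client("glue")

# Hypothetical names, for illustration only.
CRAWLER = "events-crawler"
DATABASE = "events_catalog"
JOB = "load-events-into-redshift"

# 1. Run the crawler that populates the Glue data catalog from S3.
glue.start_crawler(Name=CRAWLER)
while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
    time.sleep(30)

# 2. Inspect the metadata tables the crawler created.
for table in glue.get_tables(DatabaseName=DATABASE)["TableList"]:
    print(table["Name"])

# 3. Kick off the Glue (Spark) job that inserts the data into Redshift.
glue.start_job_run(JobName=JOB)
```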

Now I picked one of the events which had a variable schema. When I ran a crawler on a single file, AWS Glue was able to create the catalog entry for it. As our data had multiple files, I tried running the crawler on the AWS S3 prefix for the date. It created 470 tables in the AWS Glue catalog for that date. We didn’t have 470 types of events; we had around 20. We did have 470 files under that prefix. I tried using a pattern like data_root/20180101/EventA_* to run the crawler on events of one type and create a single table for that type. Glue didn’t support that. It didn’t give a warning either; it simply created no tables in the AWS Glue catalog.
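
For reference, a crawler target is just an S3 path. A rough boto3 sketch of the configuration (role, bucket, and names are made up) looks like this; pointing Path at the whole date prefix is what produced one catalog table per file:

```
import boto3

glue = boto3.client("glue")

# Hypothetical role and bucket names. Pointing the crawler at the whole date
# prefix gave one catalog table per file (470 of them); a wildcard path like
# .../20180101/EventA_* silently created no tables at all.
glue.create_crawler(
    Name="events-crawler-20180101",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="events_catalog",
    Targets={"S3Targets": [{"Path": "s3://data-root/20180101/"}]},
)
glue.start_crawler(Name="events-crawler-20180101")
```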

When I tried to create a table in Redshift from a single catalog table, it failed. Why? There were 2 fields in our raw data with the same name in different case, e.g. fieldA and fielda. I should have gotten a warning while I was trying to create the table; AWS Glue did not give me any. It started the Glue job to create the table and the job failed. I had to google how to check the failures. There was a system table in Redshift which had the errors, and I found this error through that table. I wrote a jQuery script to pull the field names from AWS Glue’s table creation screen, found the duplicates, and removed them manually from that screen. After that, the table creation was successful.
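
If you end up in the same spot: Redshift keeps recent load failures in system tables, stl_load_errors being the usual one. A rough sketch of pulling errors out of it with psycopg2 follows; the connection details are made up, and depending on how your job fails the error may land in a different system table (e.g. stl_error):

```
import psycopg2

# Made-up connection details; use your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="admin",
    password="...",
)

with conn, conn.cursor() as cur:
    # stl_load_errors holds recent COPY/load failures with the reason and
    # the offending column/file.
    cur.execute(
        """
        SELECT starttime, filename, colname, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 20
        """
    )
    for row in cur.fetchall():
        print(row)
```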

At this point I stopped and thought: I had been able to create a table using the schema from 1 file. What about the other 100 files of that event? This was an event with a varying schema. Am I going to run database migrations in AWS Redshift every time the schema changes? How am I going to do that? AWS Glue is not going to give me a warning when something fails. How am I going to automate this when the schema can keep changing with time?

I then tried editing the generated Glue jobs, based on some examples I found in the Glue samples. I was able to fix some of the incorrect schema types. Now the only problem left was multiple files with varying schema. I tried using AWS Glue’s dev endpoints, but I hit some Glue-related errors which I wasn’t able to figure out in the time I had set aside for this.
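
The type fixes were along the lines of Glue’s resolveChoice and ApplyMapping transforms in the generated script. A minimal sketch, with made-up catalog, connection, and field names, and not the exact job we ran:

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table created by the crawler run.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="events_catalog", table_name="eventa"
)

# Fields the crawler saw with ambiguous/mixed types can be forced to one type
# here instead of letting the load fail later.
dyf = dyf.resolveChoice(specs=[("isActive", "cast:boolean"), ("userId", "cast:long")])

# Rename/retype columns explicitly, e.g. to avoid clashes like fieldA vs fielda.
dyf = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("eventType", "string", "event_type", "string"),
        ("userId", "long", "user_id", "long"),
        ("isActive", "boolean", "is_active", "boolean"),
    ],
)

# Write into Redshift through a pre-configured Glue JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "eventa", "database": "dev"},
    redshift_tmp_dir="s3://data-root/glue-temp/",
)
```

This handles types within one run; it still does nothing about the schema continuing to drift across files and dates, which was the real problem.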

Then I tried the AWS Marketplace and found an ETL solution, Matillion. That was a bit nicer than AWS Glue. But as far as I could tell it was basically a UI on top of the Redshift command line, which I had already tried and hit parsing issues with. So I gave up on that.

I came back to manually editing AWS Glue’s generated job. I thought: if I leave the varying schema aside, would I at least be able to run the same job for inserting the data? AWS Glue by default was creating 1 table per file in the Glue catalog, so if the file name was Event1_20180101_001 then the table was created with the same name. I was able to create the table with the name Event1 instead of Event1_20180101_001, Event1_20180101_002, etc. Now the problem was automating it. I tried using job parameters to pass the date, but AWS Glue was not picking up the parameters. I tried AWS’s forum and Stack Overflow to see what might be the problem. Couldn’t get a solution.
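
For context, passing parameters to a Glue job is supposed to work by reading them with getResolvedOptions inside the job script. A sketch of how that side is meant to look, with a made-up --date parameter and bucket name:

```
# Inside the Glue job script: read the --date parameter.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "date"])
date_prefix = args["date"]                              # e.g. "20180101"
source_path = "s3://data-root/" + date_prefix + "/"     # hypothetical bucket
```

And the caller is supposed to pass the value as a job argument when starting the run:

```
# From the caller: start the job run with the parameter.
import boto3

boto3.client("glue").start_job_run(
    JobName="load-events-into-redshift",
    Arguments={"--date": "20180101"},
)
```

In my case the values never made it through the way I expected, and I couldn’t find out why.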

This takes only a few paragraphs to write. But on top of all that I had read blogs and other people’s experiences, not all of which I have handy right now. It took me 2 weeks to get to this point in the data warehouse investigation. And what was the situation? I did not have a solution for either the varying schema or the automation.

Then I tried Snowflake. After the account was created, it took me a total of 4 hours to read up, get the data in from S3 for 1 complete day prefix across all event types, and start running queries on it. That simple.
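
The details are in the earlier post, but the approach that makes this quick in Snowflake is loading the raw events into a VARIANT column and querying them directly. A rough sketch of that idea (stage, table, credentials, and field names are all made up here, not necessarily our exact setup):

```
import snowflake.connector

# Made-up account and credentials.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="LOADER",
    password="...",
    warehouse="LOAD_WH",
    database="RAW",
    schema="EVENTS",
)
cur = conn.cursor()

# One VARIANT column absorbs the varying per-event schema.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")

# External stage pointing at the S3 data (hypothetical bucket/credentials).
cur.execute("""
    CREATE STAGE IF NOT EXISTS events_stage
    URL = 's3://data-root/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# Load one complete day prefix, all event types, in one shot.
cur.execute("""
    COPY INTO raw_events
    FROM @events_stage/20180101/
    FILE_FORMAT = (TYPE = 'JSON')
""")

# And query straight away, pulling fields out of the VARIANT.
cur.execute("""
    SELECT v:eventType::string AS event_type, COUNT(*) AS events
    FROM raw_events
    GROUP BY 1
    ORDER BY 2 DESC
""")
print(cur.fetchall())
```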

What was the conclusion and why

Based on the above it was clear to me that working with Redshift was going to take work. There were many knobs and controls to take care of. That would be considered normal by many people who have worked with databases, but one of the considerations I had not mentioned in my original post is that we don’t have a large team. Some companies can afford to have someone working full time on this; we didn’t want to. That is how we had been working so far with Apache Spark, and it was not productive. Also, after 2 weeks I didn’t have a solution for the varying schema or for automating things easily, whereas with Snowflake I had been able to get that done in 4 hours. So I shared the pros and cons with the rest of the team and we went with Snowflake.

If you are interested, you can also read about our experience with Snowflake and see how it has been.

Created by Aseem Bansal. If you want to know when I write more articles, or want to send me a message, click here for the details.
