How Amazon Redshift is Opening Up Big Data to Growth Hackers, Marketers, and Other Light Quants

Tristan Handy
Oct 29, 2015

--

On February 15th, 2013, Amazon's launch of Redshift changed analytics forever. Redshift opened big data, previously accessible to only a select few, to anyone with $100 and a SQL editor. In fact, Redshift unlocked big data for me.

I’m a “light quant”: someone with solid data skills who bridges the gap between the business and the data. I ask good questions, I can usually answer them myself with a little time and a SQL editor, and I’m good at understanding how to interpret, communicate, and operationalize the answers. But I’m not about to build the algorithm that Spotify uses to build my Discover Weekly playlist or optimize Uber’s traveling salesman problem.

I also don't work for huge companies. I love companies of 5 people, 20 people, even 100 people, but I've tried working for large companies and it's just not that much fun. Maybe you agree.

I ran analytics at Squarespace in 2009. We were around 20 employees at the time, growing quickly, and looking to fundraise. We did all of our analysis in either Excel or SQL (on MySQL), typically over row counts from 100K to 100M. It was painful: spreadsheets ground to a halt and queries cranked away for minutes at a time. The workflow of waiting indefinite periods for queries to return was incredibly frustrating.

Here's the thing: in 2009 there was no alternative. In 2009, sure, Vertica and Greenplum existed. But there was no way that then-nascent Squarespace could invest in the infrastructure and licenses they required. And I didn't have the time or expertise to build out a Hadoop infrastructure on my own. In 2009, there was no good way for me to analyze large datasets. We made do with what we had, and often that just meant not getting as much value from our data as we would've liked.

All of that changed in 2013 with the release of Redshift. My first major project with Redshift was the RJMetrics Ecommerce Growth Benchmark Report, where I got the opportunity to comb through the anonymized transactions of 200-ish online retailers.

With Redshift, I could finally write queries that ran over billions of records without fear. I still remember the first time I queried a large table — it felt like magic. I am not joking; I remember staring in amazement as the results to large queries returned. I had spent ten years of my life writing SQL on traditional databases, and I had a good sense for how long a given query would take to run. All of a sudden, queries returned in what seemed like no time at all.

I reveled in the opportunity to fire up my SQL terminal every day; for the first time in years I was excited about discovering the truth behind this virgin dataset. The time flew by and I enjoyed every minute of it.

Analysis was fun again.

I was most impressed with the response time for one particular query. I self-joined a 200M row table on a calculated field plus an id. Joins without indexes! Every instinct I had told me that this just wasn’t something you could do on a 200M row table, but I just shrugged and ran it. It returned in about 90 seconds. In that moment, I felt invincible.
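To make the shape of that query concrete, here's a toy version of a self-join on a calculated field plus an id. It uses SQLite and invented table and column names (`orders`, `customer_id`, `order_date`) purely for illustration — the actual schema from the benchmark dataset isn't shown in this post:

```python
import sqlite3

# Hypothetical miniature of the 200M-row table described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 100, '2013-01-05'),
        (2, 100, '2013-01-20'),
        (3, 200, '2013-02-11');
""")

# Self-join on an id (customer_id) plus a calculated field (the month
# bucket of the order date) — no index exists on the calculated field.
rows = conn.execute("""
    SELECT a.id, b.id
    FROM orders a
    JOIN orders b
      ON a.customer_id = b.customer_id
     AND strftime('%m', a.order_date) = strftime('%m', b.order_date)
     AND a.id < b.id
""").fetchall()
print(rows)  # -> [(1, 2)]
```

On a row-store without indexes this join pattern forces a scan-heavy plan, which is why it felt impossible at 200M rows — and why a columnar MPP warehouse like Redshift handling it in 90 seconds felt like magic.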

The report turned out incredibly well. We found some interesting insights, drove a ton of leads, and now rank #1 and #2 for the term "ecommerce growth". We were so happy with the success that we hired a full-time data scientist to push deeper into the dataset.

Big Data for the rest of us

There are a lot of people like me — marketers, growth hackers, analysts-in-training, light quants without deep pockets. In fact, if you look at LinkedIn (another dataset we’ve been analyzing in Redshift after doing quite a lot of scraping), there are 1.1MM light quants in the US, and 365,000 of them work for companies with fewer than 500 employees. That’s a lot of people for whom Redshift has unlocked an entirely new scale of analysis.

(Note that for the purpose of this very back-of-the-envelope calculation, I defined “light quant” as someone who listed SQL as a skill on their LinkedIn profile but didn’t list more technical data skills like Python, R, and machine learning. This data is current as of May 2015.)

It’s still unclear what this power will mean when it’s fully deployed to this huge swath of users. Will sophisticated SMBs now be able to compete more directly with large organizations because of their newfound analytical prowess? Will this power and accessibility inspire thousands more people to learn SQL? One thing we know for sure is that the advent of Redshift has led to a profusion of light, user-centric analytical tools that don’t have to solve the speed problem.

Redshift, of course, isn’t the only game in town any more. Microsoft released its Azure SQL Data Warehouse this year, and ex-Microsoftie Bob Muglia’s Snowflake is also in the running. But Redshift will always have a special place in my heart. Before Redshift, I felt like a toddler on a leash: I wanted to explore, but kept getting reined in. Redshift cut the leash, leaving me free to explore without constraints.

Of course, with Redshift in hand, the primary challenge for business analysts has become consistently loading data into it. That's why we built Pipeline. Pipeline connects to all of your databases and cloud platforms, performs a full historical copy to Redshift, and then continues to stream new data as it comes in. It's free for up to 5M events a month, so give it a try and let us know what you think.
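As a rough illustration of that load pattern — a full historical copy followed by incremental pulls of new rows — here's a minimal sketch in Python against SQLite. The table name, `updated_at` column, and high-water-mark logic are assumptions for illustration, not how Pipeline itself is implemented:

```python
import sqlite3

# Hypothetical source database with a timestamped events table.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, updated_at TEXT);
    INSERT INTO events VALUES (1, '2015-10-01'), (2, '2015-10-02');
""")

warehouse = []   # stand-in for the Redshift copy
last_seen = ""   # high-water mark for incremental loads

def sync():
    """Full copy on the first run, incremental on later runs."""
    global last_seen
    new_rows = source.execute(
        "SELECT id, updated_at FROM events WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    warehouse.extend(new_rows)
    if new_rows:
        last_seen = new_rows[-1][1]

sync()  # initial historical copy: rows 1 and 2
source.execute("INSERT INTO events VALUES (3, '2015-10-03')")
sync()  # incremental: picks up only row 3
print(warehouse)  # -> [(1, '2015-10-01'), (2, '2015-10-02'), (3, '2015-10-03')]
```

The first `sync()` replays all history; every subsequent one pulls only rows newer than the high-water mark, which is the basic trick behind streaming new data continuously into a warehouse.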

RJMetrics Pipeline is now Stitch. To learn more about this change, check out this post from Stitch CEO, Jake Stein. And to sign up for a free, 14-day trial of Stitch, head over to the Stitch website.


Tristan Handy

Founder and CEO of Fishtown Analytics: helping startups implement advanced analytics.