Larry Ellison is Wrong About Amazon Redshift

You know that you’ve made it as a product / company when Larry Ellison throws punches at you. At Oracle OpenWorld in September, he dedicated a whole set of slides during his keynote to Amazon Redshift, calling it “slow” and “20 years behind Oracle” when it comes to functionality. Along with the keynote, Oracle even issued a press release with a benchmark comparison claiming that Oracle Cloud was “up to 105X faster for analytics than Amazon Redshift”.

Larry Ellison dumping on Amazon Redshift

Well, we wanted to find out and hosted an evening with SF Data on the topic “Advanced Data Acquisition with Amazon Redshift”, with three presenters in charge of building data platforms around Redshift at Yelp, Amplitude, and Tenjin.

Yelp

Shahid Chohan is a software engineer on the Business Analytics and Metrics (“BAM”) team at Yelp. Yelp has 96 million monthly active users, over 115 million reviews, and operates in 32 countries. The BAM team provides data infrastructure support for decision makers and anybody within Yelp who cares about data (e.g. the people who build Yelp Trends).

In his talk “Streaming Messages from Kafka into Redshift in Near-Real Time”, Shahid covers how Yelp moved from writing manual ETL to a streaming architecture with a “Redshift Connector”.

Yelp — Streaming Messages from Kafka into Redshift in Near-Real Time

The old system had too many moving parts, its batch jobs introduced latency, and as the company matured and the volume of data and users grew, it wouldn’t scale.

With the new streaming architecture, Yelp is shifting away from moving data via batches and scheduled jobs toward a future where streams connect like building blocks to build more real-time systems. The Redshift connector makes it easy for teams to get data directly into Redshift without writing a single line of code or performing manual schema migrations.
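
Yelp’s actual connector is described in the talk; purely to illustrate the general pattern (not Yelp’s implementation), a micro-batching consumer that moves Kafka messages into Redshift via S3 and COPY might look roughly like this. The topic, bucket, table, and connection details are all placeholders:

```python
import json
import time
import uuid

import boto3
import psycopg2
from kafka import KafkaConsumer

# Placeholder names -- nothing here comes from Yelp's actual pipeline.
TOPIC = "business_events"
BUCKET = "example-redshift-staging"
TABLE = "business_events"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")
conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)

def flush(rows):
    """Write one micro-batch to S3, then load it into Redshift with COPY."""
    key = "staging/{}/{}.json".format(TABLE, uuid.uuid4())
    body = "\n".join(json.dumps(r) for r in rows)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    with conn.cursor() as cur:
        cur.execute(
            "COPY {} FROM 's3://{}/{}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
            "FORMAT AS JSON 'auto';".format(TABLE, BUCKET, key)
        )
    conn.commit()

batch, last_flush = [], time.time()
for message in consumer:
    batch.append(message.value)
    # Flush every 1,000 messages or 60 seconds, whichever comes first.
    if len(batch) >= 1000 or time.time() - last_flush > 60:
        flush(batch)
        batch, last_flush = [], time.time()
```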

In November 2016, Yelp open-sourced their data pipeline; the repositories are available on GitHub.

Amplitude

Jeffrey Wang and Nirmal Utwani work at Amplitude, a product analytics service for mobile and web apps that helps product and growth teams get deep insights into user behavior and drive engagement and retention. Frequently, companies have custom questions they want to answer and custom metrics they want to measure that an out-of-the-box solution can’t handle, which is where Redshift comes in. Because Amplitude is already collecting the event data, it’s easy to ETL the data into Redshift so customers have direct SQL access. Amplitude also provides a “Redshift Playbook” to get customers started with a set of standard SQL queries.
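
To give a flavor of that direct SQL access, here is the kind of ad-hoc query a customer might run once the event data is in Redshift; the cluster, schema, and column names below are hypothetical, not Amplitude’s actual setup or anything from their Playbook:

```python
import psycopg2

# Hypothetical schema: a single events table with user_id and event_time columns.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="app123", user="analyst", password="...",
)

# Daily active users over the last 30 days.
DAU_QUERY = """
    SELECT DATE_TRUNC('day', event_time) AS day,
           COUNT(DISTINCT user_id)       AS daily_active_users
    FROM   events
    WHERE  event_time >= DATEADD(day, -30, GETDATE())
    GROUP  BY 1
    ORDER  BY 1;
"""

with conn.cursor() as cur:
    cur.execute(DAU_QUERY)
    for day, dau in cur.fetchall():
        print(day, dau)
```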

Amplitude — Building a Redshift Pipeline with Dynamic Schemas

Today, Amplitude manages over 100 Redshift clusters, with hourly batch jobs. The pipeline that loads event data has a fully dynamic schema that is optimized for speed and convenience of queries. Before introducing dynamic schemas, the original way of storing event data was to have one huge table per app, with all events. But then queries for specific event types would have to scan all of the data, which is expensive and slow. A solution is to distribute the data into one table per event type, each with its own schema, for each app a customer has. That approach, however, can test Redshift’s limit of 9,900 tables per cluster: some bigger Amplitude customers, particularly game companies, have 10–20 different apps with 500–1,000 event types per app, which works out to as many as 20,000 tables for a single customer.

In their talk, Jeffrey and Nirmal cover how they automated the process of creating a dynamic data schema, and how they worked around Redshift’s limits on the maximum number of tables and columns.
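
As a rough sketch of what a dynamic schema means in practice (the table naming, type mapping, and helper below are invented for illustration and are not Amplitude’s implementation), a loader can create a per-event-type table the first time it sees an event and add columns as new properties appear:

```python
import psycopg2

# Invented mapping of Python property types to Redshift column types.
TYPE_MAP = {str: "VARCHAR(256)", int: "BIGINT",
            float: "DOUBLE PRECISION", bool: "BOOLEAN"}

def ensure_table(cur, app_id, event_type, properties):
    """Create the per-event-type table if needed, then add any new columns."""
    schema, table = "app_{}".format(app_id), "event_{}".format(event_type)
    cur.execute(
        "CREATE TABLE IF NOT EXISTS {}.{} "
        "(user_id BIGINT, event_time TIMESTAMP);".format(schema, table)
    )
    # Look up the columns that already exist for this event type.
    cur.execute(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s;",
        (schema, table),
    )
    existing = {row[0] for row in cur.fetchall()}
    # Add one column for every property we have not seen before.
    for name, value in properties.items():
        if name not in existing:
            col_type = TYPE_MAP.get(type(value), "VARCHAR(256)")
            cur.execute("ALTER TABLE {}.{} ADD COLUMN {} {};".format(
                schema, table, name, col_type))
```

In a real pipeline this same bookkeeping is also where the per-cluster table and column limits have to be enforced, which is the part the talk digs into.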

Tenjin

Tenjin provides mobile attribution, aggregation, and analytics for marketers to analyze source, cost, and LTV at the user level. Amir Manji is Tenjin’s CTO; the Tenjin platform loads different types of data (events, clicks, structured relational data) into Redshift so marketers can perform custom analyses without having to build their own data infrastructure.

Tenjin collects two categories of data:

  1. Insert-heavy attribution & analytics data, mainly event-log-style records, with roughly one billion records per month
  2. Update-heavy marketing campaign data, structured / relational, with over 250,000 records per month

Tenjin — How to Ingest Different Types of Data at Scale Into Redshift

Amir covers how easy it is to get data into Redshift, even for a small team, and how at scale the ideal solution is to stream everything. He touches on potential solutions like Bottled Water (real-time integration of PostgreSQL and Kafka) and Netflix/zeno (a data capture and propagation framework).
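
For the update-heavy campaign data mentioned above, a common Redshift pattern (not necessarily what Tenjin does) is to avoid row-by-row updates entirely: COPY the changed records into a staging table, then merge them into the target table inside a single transaction. A minimal sketch with hypothetical table and bucket names:

```python
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439, dbname="marketing", user="loader", password="...",
)

# All names below are hypothetical; the point is the staging-table merge.
statements = [
    # Load the latest campaign snapshot into an empty staging table.
    "CREATE TEMP TABLE campaigns_staging (LIKE campaigns);",
    """COPY campaigns_staging
       FROM 's3://example-bucket/campaigns/latest.csv'
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
       FORMAT AS CSV;""",
    # Delete the rows being replaced, then append the fresh versions.
    """DELETE FROM campaigns
       USING campaigns_staging
       WHERE campaigns.campaign_id = campaigns_staging.campaign_id;""",
    "INSERT INTO campaigns SELECT * FROM campaigns_staging;",
]

with conn.cursor() as cur:
    for sql in statements:
        cur.execute(sql)
conn.commit()  # psycopg2 runs the statements above in one transaction
```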

Conclusion

A few things stood out that all presenters mentioned:

  • As companies mature, their ETL moves from home-grown, scripted solutions to more robust architectures.
  • Usage typically starts as a data warehouse, but over time more advanced use cases emerge as part of a wider platform with custom apps on top.
  • For high ingress data volume, consider using Amazon Kinesis Streams or Kinesis Firehose to handle loading data into Redshift (see the sketch below).
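
As a minimal sketch of that last point, assuming a Firehose delivery stream (here called events-to-redshift) that has already been configured with a Redshift destination and COPY options, producers only need to put records and Firehose takes care of buffering, staging to S3, and loading:

```python
import json

import boto3

firehose = boto3.client("firehose", region_name="us-west-2")

def send_event(event):
    """Hand one JSON event to Firehose, which buffers it and COPYs it into Redshift."""
    firehose.put_record(
        DeliveryStreamName="events-to-redshift",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"user_id": 42, "event_type": "app_open", "ts": "2016-11-01T12:00:00Z"})
```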

If you want to be invited next time we host an SF Data event (which we typically do once a month), simply follow / like our Facebook page.