Remind’s data engineering team strives to give the whole company access to the data it needs and to empower everyone to make decisions directly. Our data infrastructure currently processes around 100 million events a day.
The evolution of the data engineering ecosystem
The data infrastructure ecosystem for analytics has evolved a lot in the past decade. In the early 2010s, everyone was painfully trying to shoehorn everything into the Hadoop ecosystem regardless of how much data they actually had. Then Redshift came along and became the new standard for web companies operating below a few petabytes of data. Thanks to the plethora of dashboard systems and other third-party tools, it was natural to make Redshift the single source of truth for business and product analytics data.
However, the success of visual, mostly event-centric query tools such as Mixpanel or Interana showed that other paradigms and tools were also key drivers for democratizing access to data. These products are opinionated, designed for fewer use cases than standard databases, and easier for most people to pick up.
For bigger companies that need the flexibility to cover many different use cases, this means your data has to end up in multiple places, versions, and formats, which is a strong incentive to revisit your whole data architecture. Keeping your canonical source-of-truth data in Redshift means spending your time extracting and transforming it before passing it along to other services. The most straightforward alternative is an S3-centric data architecture in which raw event data lives first and foremost on S3, from where it can be streamed or batch-loaded into other systems.
What we started with
Remind initially started with Redshift as its source of truth for data. We then decided to also use Interana for exploratory and product analytics, and were thinking about moving older event data out of Redshift for cost reasons (S3 storage is far cheaper than a Redshift node) while keeping it queryable. It was clear that we needed a more flexible data architecture.
We were lucky enough to already have most of the upstream pipelines in place: we were already using Kinesis Firehose to stream our event data to Redshift, and to S3 for backups. Those backup S3 buckets also fed event data into Interana.
In addition to wanting new ways to access our data, we also had some big pain points with Redshift, the biggest being that large table backfills are very hard to do in Redshift. Say you want to create a new user rollup table (aggregating various actions across platforms and time periods): this very expensive query competes for cluster resources with regular day-to-day queries and table writes. Redshift is also not optimized for updating large amounts of data. Usually the best approach is to re-create the table from scratch so that all rows and columns, updated or not, end up properly sorted. This requires a lot of free disk space, especially for tables with dozens or hundreds of columns. We once tried to deep copy a table of roughly 20 billion rows that used about 7% of our disk space, and even though 65% of the disk was free, Redshift ran out of storage during the deep copy. Running a VACUUM on the table also exhausted disk space. Doubling the cluster size for the duration of the deep copy was the only way to get the table properly sorted.
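The deep-copy recipe above can be sketched as plain statement generation. This follows the CREATE TABLE ... (LIKE ...) variant of Redshift's documented deep-copy pattern; the helper and the temporary table name are illustrative, not our actual code:

```python
def deep_copy_statements(table):
    """Build the usual Redshift deep-copy sequence for `table`.

    Rebuilding via CREATE TABLE ... (LIKE ...) preserves the distribution
    key, sort keys, and compression encodings, and the INSERT ... SELECT
    rewrites every row in sorted order. Note that the cluster needs enough
    free disk to hold a second full copy of the table while both exist.
    """
    tmp = table + "_copy"  # hypothetical temporary table name
    return [
        "CREATE TABLE {0} (LIKE {1});".format(tmp, table),
        "INSERT INTO {0} (SELECT * FROM {1});".format(tmp, table),
        "DROP TABLE {0};".format(table),
        "ALTER TABLE {0} RENAME TO {1};".format(tmp, table),
    ]
```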
It was clear that we needed to invest in a technology able to transform our historical data at scale. Most people use EMR/Hive or Spark, on either permanent or transient clusters. We had some experience using EMR via MrJob for various ad-hoc tasks where Redshift was unable to transform data into the format we wanted, or was too slow.
AWS Athena was released around the time we started prototyping some of our data transformation and backfill jobs with EMR/Hive. Athena is a fully managed Presto service that bills per amount of data scanned. Unlike with EMR and Hive, you don't have to worry about picking instance types or tweaking mapper/reducer settings to get the best performance at the lowest cost. And unlike with Redshift (though that just changed with Redshift Spectrum), you can create a table instantly by pointing at an S3 URL and specifying the schema and format.
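That billing model is simple enough to estimate up front. A minimal sketch, assuming the launch price of $5 per terabyte scanned (check current pricing before relying on this number):

```python
def athena_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Rough cost of one Athena query from the bytes it scans.

    usd_per_tb is an assumption based on the launch price. This pricing
    is also why columnar formats like Parquet matter so much: scanning
    only the columns a query touches shrinks bytes_scanned, and the
    bill shrinks with it.
    """
    return bytes_scanned / 1024 ** 4 * usd_per_tb
```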
Even though Athena is still primarily designed for interactive, handwritten SQL queries, it also lets you issue queries programmatically through a JDBC driver. We decided to test it by backfilling a few large user rollup tables we had just created.
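Programmatic access amounts to a submit-and-poll loop. We went through the JDBC driver, but the same loop can be sketched against a boto3-style Athena client (start_query_execution / get_query_execution); the client is injected here so the loop itself can be exercised without AWS:

```python
import time


class AthenaQueryError(RuntimeError):
    """Raised when a query reaches a terminal non-success state."""


def run_athena_query(client, sql, output_location, poll_seconds=1.0):
    """Submit `sql` and block until Athena finishes executing it.

    `client` is assumed to expose the boto3 Athena API
    (start_query_execution / get_query_execution); any object with
    those two methods works, which makes the loop easy to stub out.
    Returns the query execution id on success.
    """
    qid = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        state = client.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return qid
        if state in ("FAILED", "CANCELLED"):
            raise AthenaQueryError("query %s ended in state %s" % (qid, state))
        time.sleep(poll_seconds)
```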
The first step was converting our Firehose JSON event data to Parquet in order to properly leverage Athena's columnar data support. That turned out to be the hardest part: iterating with transient Hive clusters is a painful process, and a bug in the hive-hcatalog-core JAR that AWS recommends in its guide took a long time to track down.
We already had the Redshift SQL code to generate these tables. Duplicating and porting that code to Athena SQL would be a source of many future errors, as we would have to maintain parity between the two versions forever. Instead, we decided to write a custom Redshift-to-Athena SQL translation module, supporting only the Redshift and Athena functions and SQL features we actually use. This lets us write a SQL table update once and run it seamlessly in either Redshift or Athena. Surprisingly, big queries were 3x to 10x faster in Athena, even though our Redshift tables are highly optimized in terms of distribution keys, sort keys, and compression.
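A translation module like that boils down to a table of rewrite rules. A toy sketch: the three rules below are illustrative examples of real Redshift/Presto dialect differences, not our actual rule set, which covers many more constructs:

```python
import re

# Map a handful of Redshift constructs onto their Presto/Athena
# equivalents. Order matters if rules can overlap.
_RULES = [
    # Redshift GETDATE() -> Presto now()
    (re.compile(r"\bGETDATE\(\)", re.IGNORECASE), "now()"),
    # Redshift NVL(...) -> Presto coalesce(...)
    (re.compile(r"\bNVL\(", re.IGNORECASE), "coalesce("),
    # Redshift DATEADD(day, 1, ts) -> Presto date_add('day', 1, ts)
    (re.compile(r"\bDATEADD\(\s*(\w+)\s*,", re.IGNORECASE), r"date_add('\1',"),
]


def redshift_to_athena(sql):
    """Rewrite a Redshift SQL string into its Athena (Presto) dialect."""
    for pattern, replacement in _RULES:
        sql = pattern.sub(replacement, sql)
    return sql
```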
We now store our historical event data in S3 and query it via Athena, while Redshift holds only the last three months of event data. This allowed us to significantly shrink our Redshift cluster.
Athena is still rough around the edges. The JDBC driver initially had many bugs, most of which were fixed over the last month. You have no control over the format of the result files Athena writes to S3: you get a single CSV file with quoted fields, a format Athena itself doesn't support when loading tables. And because Athena runs on shared clusters, it is not uncommon for capacity to be unavailable for anywhere from a few minutes to half an hour.
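Consuming those quoted-CSV result files from code is at least straightforward with a standard CSV parser. A minimal sketch, assuming the file has already been fetched from S3 as text:

```python
import csv
import io


def parse_athena_result(csv_text):
    """Parse the quoted-CSV result file Athena writes to S3.

    The first row is the header; every value comes back as a string,
    so callers must cast numeric columns themselves. Returns a list
    of dicts keyed by column name.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)
```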
What to look forward to
At a higher level, AWS still has a lot of work to do on integrating its various data products. Each database or system has its own preferred format and, as we experienced with Parquet, converting from one format to another is still a very painful process. We hope things will improve drastically with AWS Glue, which could let us drop EMR/Hive from our architecture and use Spark SQL instead of Athena for various ETL tasks, still without having to manage or tune any clusters.
Overall, we are very impressed by how much the data engineering ecosystem, both open source tools and the online services built on top of them, has evolved. It is now possible to create and maintain an architecture supporting tens of millions of users with very little manpower.