From MongoDB to AWS Redshift: a practical guide (2/2)

This post is part of a series of two posts about our migration from MongoDB to an AWS stack for our analytics feature. Head here to read the introduction post and know why we started this migration.

Here is the full ingestion and aggregation mechanism we will present in this blog post.

Database Design

Redshift can be very powerful if you take the time to properly design your database schema. There is a nice tutorial made by AWS to learn how to “Tune Table Design”. The documentation helps the reader designing a coherent and adapted database schema by explaining important concepts like distribution key and sort key. In order to find the best schema (types, compression, distribution key and sort key) for our data, we made some hypothesis regarding the schema. We exported our production data and inflated it 10 times in order to have a data distribution and quality similar to our production, but with a higher volume. Then, we ran different queries with different columns compression encoding, data distribution, etc. We automated this using a ruby script ran on EC2 and compiled the results into a spreadsheet which helped us choose the best schema.

Do we keep the same “tables” we had on MongoDB?

Initially, our MongoDB instance used to store objects with different meanings (web and native connections, apps downloads, …) in the same collection. Time showed us that it was not the smartest move. In order to ease the filtering process of data aggregation, we split our data into two tables, specified different columns and chose different sort keys for each table (INTERLEAVED SORTKEY and SORTKEY) following results of our previous benchmarks.

Redshift ingestion

AWS recommands using Kinesis Firehose to ingest data into Redshift. It’s advertized as the easiest solution to insert data into AWS so that’s what we used.

Because we wanted to have the ability to perform more actions on the incoming data stream in the future, we jumped to Kinesis Stream and used a lambda developed by Ian Meyers from AWS. We forked the original lambda and made some changes to fit the data format provided by our backend adapter.

To handle the migration, we did what any sensible developer would do: we decided to run the old and new stacks in parallel and compared the data. That way we would have total confidence in our new stack. With that goal in mind, we wrote the backend adapter to insert data inside Redshift. For every data inserted into MongoDB, we inserted it into Redshift as well.

We then wrote a ruby script to compare the data in both stacks. Ran on Jenkins, it takes 6 seconds to complete and publishes the results in a Slack channel every day.

First issue

Duplicates?

The ruby script shed light on a small amount of duplication in our Redshift database. The data were duplicated when Kinesis shards were read. We get around 0.015% row duplication. This is related to the “at least once” mechanism of Kinesis stream. There are several ways to handle this problem:

  • Inside the lambda: Keep a list of processed data (using a UUID) in the Kinesis consumer to skip already processed data
  • In Redshift : Have a periodic processing task removing duplicates
  • Leave duplicates in Redshift but filter them out when querying

To us, the sanest solution would be to have a clean database without any duplicates.

Sanitize incoming data inside Redshift

We chose to deal with duplicates before inserting them into the final table. Thanks to Brent Nash who mentions this solution in this blog post!

Here is our process:

  1. We receive data in Kinesis
  2. They are forwarded to Firehose using ‘lambda stream to firehose’
  3. Firehose buffers incoming data before delivering it to Amazon S3
  4. That creation event triggers a lambda that will run a Redshift COPY command with S3 data into a temporary table temp_table
  5. The same lambda INSERT deduplicated data into the final table using a LEFT JOIN and drop the temp_table.

Aggregates

Redshift isn’t meant to be directly exposed to clients. It’s a common pattern to have another database in front of it when you want to expose data to customers.

At the beginning we started a POC using DynamoDB

It was quite complicated but we wanted to use DynamoDB for easy migrations. But the biggest problem was changing writing provision. As you can see on the schema we wanted to increase the write IO capacity of DynamoDB before inserting aggregated data, then decrease it. It would have helped us cut the cost of DynamoDB a lot. But DynamoDB has an annoying limitation: you can increase ReadCapacityUnits or WriteCapacityUnits of the table as often as necessary but…

no more than four times per table in a single UTC calendar day.

This limitation with all the complexity of the lambdas led us to seek another solution.

Magic dblink

Thanks to Tony Gibbs who wrote this blog post : JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink

We finally used RDS Postgres with foreign-data wrapper and dblink extensions and created materialized views with aggregated logs. It’s much faster and very easy to use.

The table is created with an unique index to accept concurrent refreshes. Refresh stats is as easy as :

REFRESH MATERIALIZED VIEW CONCURRENTLY my_table;

Stats refresher

We created a lambda that runs every 15 minutes to refresh stats. It takes a parameter (the name of the materialized view to refresh), and runs the REFRESH SQL query on the chosen materialized views.

How do we extract data from MongoDB and insert them into Redshift

MongoDB provides two export formats: JSON and CSV. Hopefully, Redshift allows data insertion from raw CSV and JSON files. We wrote a ruby script that uses mongoexport and uploaded the exported CSV to S3.

Redshift implements the COPY operation, which allows us to import data into a Redshift cluster from a CSV file stored on AWS S3. We COPYed our CSV file into a temp table and then inserted data into specific tables. The slowest step was the mongoexport.

Tips: If you are importing data into Redshift that comes from another database hosted on AWS (which is the case for our MongoDB provider), run the export/import script from an EC2 instance to be physically closer to both servers.

How do we copy the production data into our new stack?

Before our migration, the analytics data were inserted into MongoDB with Sidekiq workers. In order to copy data from one stack to the other, we stopped the Sidekiq queue that was queueing all the insertion jobs, ran the copying script on EC2 and once the export/import operation was done, let Sidekiq process the queued jobs.

Conclusion

This migration leads us to a cheaper and more efficient way to generate analytics for our customers. Using the postgres extensions db_link and postgres_fdw, it’s very easy to create new aggregated stats from Redshift. Lambdas offer a cheap and modular way to consume Kinesis streams and run database tasks. We are confident about the future and the growth of our analytics stack.

We also enjoy the dashboard that each AWS component offers, especially Redshift’s, which enables us to dig into slow queries and verify database health status. The AWS CLI helped us a lot to build and deploy lambdas, a real time saver.

We’ve contacted the support for issues related to Redshift and it was also very helpful.

As these technologies are quite new, there are still a few drawbacks that we hope will be addressed in the future:

  • Kinesis : it offers a robust mechanism that avoids missing data but it can’t prevent duplicated data. We should be able to more easily validate the uniqueness of the Kinesis output or Kinesis consumer input. Our duplicates removal mechanism costs us money as we have to run a dedicated lambda for that..
  • Lambda timeouts: We still have random timeout on our lambdas without knowing why. The amount of failures is low, around 0.0045%, but should be equal to 0.
  • Cloudwatch alarm: actually it’s not possible to get alarms that combine multiple metrics. In our case we would like to report all alarm for a group of lambda (per environment).
  • Lambda errors reporting: You have to build your own system if you want to be notified when a lambda fails. It still misses a dedicated errors reporting system to have the complete stacktrace of a failing lambda. For now, we only report errors in Slack using Zapier and Cloudwatch Metric filters alarms. Also it should split the errors into two groups (timeouts and errors) that occurred inside the lambda. Right now, lambda errors are all grouped together, which is not ideal.
  • Redshift: Some useful SQL commands are missing from the syntax (DISTINCT ON, GENERATE_SERIES()..).

Now it takes around 6s to make our aggregate tasks. It’s much easier to change and add new aggregated stats and it costs around 30% less.

Thanks a lot to Ian Meyers, Brent Nash, Tony Gibbs, Vitaly Tomilov (pg-promise) and AWS Support.

This article was written by Appaloosa’s dev team:
Benoît Tigeot, Christophe Valentin, Alexandre Ignjatovic

Want to be part of Appaloosa? Head to Welcome to the jungle.