At Appaloosa, we have been providing stats on apps downloads and usage for more than two years. When we decided to add analytics to our offering, we had to build an architecture from the ground up. Unsure of the future requirements (we had never worked with analytics before), we decided not to use our existing PostgreSQL database. We were seduced by the flexibility offered by some NoSQL databases: no migration needed for the schema evolution, a welcomed revolution compared to our PostgreSQL database. That’s why we decided to use MongoDB, a Document Oriented database, to store our analytics data. Like many startups do during their lifetime, we made a bad architecture choice, and we are still paying the cost today.
Disclaimer: This is not a MongoDB bashing article. We did a bad choice, and we are pretty sure that MongoDB is perfect for other use cases.
When an admin accesses an app’s details page, we display important usage data to him (unique downloads, downloads per version of an app, downloads for all versions, etc).
At first, we implemented the feature naively: when the page loaded, we triggered asynchronous JS requests to our Rails backend, which directly request our MongoDB data. It worked flawlessly for a year.
But as our MongoDB table got bigger and bigger, some calls started to timeout (Heroku’s infamous 30 seconds), and quickly, all of them. To solve that, we decided to aggregate the stats every hour into a proxy column in a PostgreSQL table. It bought us some time, but I’m sure you get where this is going… :)
A few months ago, we realized that our aggregation jobs were starting to take more and more time. We knew it would be a matter of months before they would take more time than our stats refresh rate (1 hour). We launched a long running migration project to completely replace our flawed architecture by one that would scale!
This article is the first of series of 2 articles. We will present why we chose our new architecture and how we managed our migration.
Our needs for the new architecture
Our experience with MongoDB gave us a good insight on what we needed to take into consideration before building a new analytics architecture:
Our main issue is that we are a small team, so we do not have the chance to count a DBA or a devops. That means we would have to turn to the cloud for our solution. That’s okay, we already heavily rely on it for our assets (AWS S3) and our Rails app hosting (Heroku, itself relying on AWS EC2).
Our database is going to grow a lot and quickly. When we were on MongoDB, our implementations became too slow in less than a year. We hence need a scalable database, ideally without having to specifically say “I need one more instance”.
The solution will have to remain cost effective, so this rules out expensive licences.
Bandwidth and computing power
We do not know how much data will be generated in one year (it really depends on our clients’ apps usage), so the solution has to be able to import data at a high rate. Additionally, as our dataset is going to get bigger, our aggregates’ execution time is going to take more and more time, so the servers will have to provide sufficient computing power.
The chosen solution
After reading articles and documentations, we decided to migrate to AWS Redshift, thinking it answers best to our needs presented above. Redshift is a distributed SQL columnar storage implementing the Massively Parallelized Processes pattern. It also uses a SQL syntax, which makes it a robust, fast and easy to use BI tool.
As it’s using a syntax that is relatively close to PostgreSQL’s, you can connect to a Redshift cluster with a standard PG adapter, and use it (for the most part) like a regular SQL database.
Assuming you are following AWS best practices, you can set up a robust and fast data warehouse with no DBA in a few hours. The only thing to keep in mind is the noticeable performance degradation that occurs if you execute several queries simultaneously: once your data is collected and aggregated, you need to store it in another storage that is able to sustain a high query load. If you want to stay in an AWS environment, you can turn to a RDS or a DynamoDB database.
Redshift fits our need for scalability. As a clustered technology, it allow us to easily update the type of node we are using if needed.
Bandwidth and computing power
Redshift is based on a leader node and compute nodes that run closely for faster communications. When you properly design your tables, queries and data loading will use parallel processing across nodes for better performance. We’ll discuss that in part two of this series.
It’s difficult to tell at this point the exact cost of the switch to Redshift but we estimate it will cost around 40% less than our MongoDB stack. But for the moment the highest cost is the human cost. The migration required one to two engineers to work for several weeks.
A long road ahead
In the upcoming articles of this series, we will tackle the following subjects:
- Prepare data replication from MongoDB to Redshift
- Do we keep the same “tables” as MongoDB?
- How do we extract data from MongoDB and insert them into Redshift
- How do we migrate without downtime?
- Inserting data into Redshift
- How do we insert data from our backend into Redshift?
- Limitations of Firehose
- Saving Redshift aggregates and requesting data
- What database should we choose to save our Redshift aggregates?
- Probably lot’s of surprises along the way
You should subscribe to our blog to be notified whenever a new article is available!
Part 2 of the article is available: From MongoDB to AWS Redshift: a practical guide (2/2)
Want to be part of Appaloosa? Head to Welcome to the jungle.