App Directory: MongoDB to SQL
The Hootsuite App Directory is a collection of extensions and applications which Hootsuite users can add to their Hootsuite dashboard to create a customized experience. Since its launch in 2011, it has been used by millions of Hootsuite customers. In the past 6 years, it has accumulated gigabytes of data, from information about the apps that people can install to information about which apps are installed for each user. As more apps are released into the app directory and more customers install and use apps, it becomes all the more necessary to have a database which is easy to maintain and scale. The requirements for our App Directory database are:
- The ability to handle relational data.
- The ability to easily ensure data integrity.
- The ability to handle many simultaneous read requests.
When the Hootsuite App Directory was introduced in 2011, we did not know how it would develop over the next six years. MongoDB was chosen to hold the app directory data, at the time MongoDB was Hootsuite’s primary database. MongoDB provided us with much needed flexibility during the early stages of the App Directory. We were able to store data in Mongo without focussing strictly on the structure of the data, allowing us to quickly handle rapidly changing requirements as we experimented with various apps and integrations.
We have previously written about our move from a PHP monolith to a microservice architecture. As a part of the larger-scale migration project, we are moving the App Directory logic from the monolith into a Scala microservice. Along with the logic, we are moving the data from the MongoDB Database connected to the PHP monolith, into a new database connected to our microservice. This has given us an opportunity to revisit our database choice and the data’s schema.
Our App Directory has matured since its launch and we have a more stable data model than we had when the project began. This stability allows the model to be represented well by a schema. The model consists of various relationships which can be used to reduce the complexity of the business layer.
We currently have these collections among others in our MongoDB:
- An App collection which contains data related to Apps.
- An InstalledApp collection which contains data related to installed Apps.
- An InstalledAppComponent collection which contains information related to external developer.
By taking the advantage of features built into Mongo like document embedding and document references, we could possibly have used MongoDB much more efficiently. These techniques could be used with one-to-many relationships (for more details refer to this link). An example of this relationship in our database in our database is between App and InstalledApp:
We have denormalized InstalledApp and InstalledAppComponent, and referenced the InstalledApp to App collection.
The complexity of the queries increases dramatically when it comes to more involved relationships such as multi-layer hierarchy or many-to-many relationships. You can see this in the example above. When the document starts growing, performing update or search operations on the data becomes more difficult. Though it is easier to update documents using document referencing, it would require us to use multiple queries to retrieve the related data. In the end this leaves us with both ugly documents and highly convoluted queries.
In newer versions of MongoDB, they have introduced “lookup”, a feature analogous to SQL left outer joins. There are two main reasons why we are reluctant to use lookup. Firstly, lookup is a relatively new feature and would require us to a upgrade our Mongo version. Secondly, it only performs left outer joins, so performing inner joins and full joins would still result in messy, hard to maintain code.
The complex Mongo query above is expressed relatively simply in SQL. Here is the same query, retrieving the total number of app installs for each app developed by a certain app provider:
The above query shows how easily relations can be handled in MySQL. These are the benefits of using a robust query language like SQL. It allows for operations such as joining, filtering and grouping. MySQL is a relational database and our data model consists of complex relations, therefore we feel that MySQL is more suitable database for our use case in the Hootsuite App Directory.
Data integrity is defined as the overall completeness, accuracy, and consistency of the data. This is highly valuable to us, as it increases the maintainability, reusability, stability, and performance of the service.
MongoDB follows the BASE approach, which sacrifices consistency in favor of making the database more partition tolerant. As a result, performing operations on more than one document at once may lead to corrupted data in the database. MongoDB provides two phase commits which allows transactions to be performed similar to transactions in SQL. If there is an error during the transaction a rollback is performed. One important difference from SQL is that a user can still access the intermediate state while the operation is in progress. The Mongo documentation warns:
It is possible for applications to return intermediate data at intermediate points during the two-phase commit or rollback. 
This is not the case with SQL as it adheres to ACID, having the properties of Atomicity, Consistency, Isolation, and Durability. This ensures that the data always remains in a valid state, both consistent and accurate.
Being schema-less and without the presence of referential integrity, MongoDB shifts the burden of maintaining data consistency onto the developers. A side-effect of the lack of strict schema and referential integrity is that bugs in code can result in inconsistencies in the database. These inconsistencies may not surface as errors until something breaks. It is certainly possible to prevent data inconsistencies in database by thoroughly designing and testing the software that reads from and writes to it. However, as we are moving from PHP to Scala, we would not only have to rewrite all the models, but we would also have to write extra code to ensure consistency. We reasoned that this would slow down the migration process as well as adding to the difficulty of maintaining the code. With the relations in our data, we would like to have referential integrity so that we don’t create orphaned data. Implementing referential integrity in MongoDB would require the following steps:
Inserting an InstalledApp
- Insert InstallApp
- Search Mongo for correct App using the appId.
- Insert the installedApp in the installedApp array in App collection
Deleting an App
- Fetch the the right App
- Get all the InstalledApp ids
- Remove all of InstalledApps
- Delete the App from App collection
There are many other scenarios that would have to be covered for our use cases, and there are only 3 collections! To make things worse, if anything fails, we would end up having faulty data.
MySQL requires us to define a schema, declare data types, nullable fields, etc. We can declare foreign keys (referential integrity) while creating the schema itself. The schema does not reside in the business layer, it is part of the database itself. If the data does not agree with the defined schema, it will not be added to the database. This lessens the burden on the developer to implement logic ensuring the consistency and correctness of the data.
Our service receives many simultaneous read requests, and also relies heavily on relational data. We need a database that performs very well under such conditions. General speaking, Mongo outperforms MySQL when the service is exposed to high volume of write requests. This is because things like referential integrity and ACID compliance have a cost. Being horizontally scalable, MongoDB can deal with an even higher volume of requests by taking advantage of the benefits of distributed systems.
When it comes to read requests, especially when dealing with relations, MySQL often outperforms MongoDB. Moreover, with proper indexing, the performance of operations, such as joins, can be improved drastically in MySQL. One reason why MongoDB is slower in these cases is that some of the logic handling the relations resides in the business layer, which is not the case with MySQL. Because we experience a high volume of read requests, allowing for slower write requests in favor of faster read requests is a reasonable trade off.
Although MySQL is generally vertically scalable, there are still ways to make it horizontally scalable. Features like replication, and products like dbShards can be used if needed. With our requirements, replication is a good option as we can balance high read requests between various slave MySQL databases.
Anyone who has done a data migration knows that it is not an easy task. We are dealing with customer data and we want to ensure that the integrity of that data is maintained throughout the entire process. Our strategy is to write to both our MongoDB and MySQL databases, and then compare if the data matches. For historical data, we use a migration script which exports the data from mongo and then imports it into SQL using the new schema. Any mismatches are fixed by the team, either by adding more validation checks in the business layer or by updating the migration script.
This migration project has given us some good insights into our legacy code. This has enabled us to write much more efficient and more maintainable code for our microservice. It is a win-win situation for us, we are storing clean data in the database and we have higher quality code.
Remarks and conclusion
In the end, both MongoDB and MySQL have their strengths and weaknesses. However, the differences between MongoDB and MySQL are lessening with new features that continue to be released. Newer versions of MongoDB can use join-like operations and MySQL now has the ability to store JSON data. Along with that, there are many integrations available which can be used to improve the performance of MySQL, or to handle transactions effectively in MongoDB. It ultimately depends upon the data, and what do you want to do with that data to determine which database is the right choice.
For our App Directory service we have a well defined relational data model. We want to ensure that the principles of data integrity are offered by the database itself, and that it can also handle many simultaneous read requests. These requirements led us to choose MySQL for our new App Directory service database.
Shoutout to Neil, Sim, Jody, Steve and Isha for helping me with the blog post.
Preetkaran Rawal is a Co-op Software Developer on the Developer Products team. He currently attends University of Waterloo for Computer Engineering.