At the beginning of every project, we face some very important choices. One of them is picking the right database, tailored to our needs.
In this article, I will show you how we found the best one for our needs, based on our experience in one of our POC projects (https://en.wikipedia.org/wiki/Proof_of_concept). I will also compare the three databases we tried: MongoDB, CosmosDB and SQL Server.
A simple and easy project, or is it?
We received a simple POC project from our customer. The project was already written and contained three parts: a frontend in React, a backend in .NET, and an ETL application (https://pl.wikipedia.org/wiki/ETL). After some quick research into how the process worked, we found that the heart of our application would be the ETL part, as the application was designed to present complicated, unstructured, high-volume data in a user-friendly way.
At this point, we learned that the ETL downloads very large volumes of data: we would receive multiple files every day, some of them containing hundreds of thousands of records, all stored in XML archives. Those files were then transformed into JSON and stored in the database. The students who wrote the original POC had chosen a NoSQL database: MongoDB.
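The transform step of such an ETL pipeline can be sketched in a few lines; here is a minimal Python example (the element and field names are hypothetical, and the real files were far larger):

```python
import json
import xml.etree.ElementTree as ET

def xml_records_to_json(xml_text: str) -> list[str]:
    """Turn each <record> element into a JSON document (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    docs = []
    for record in root.iter("record"):
        # Flatten the record's child elements into a dict, then serialize.
        doc = {child.tag: child.text for child in record}
        docs.append(json.dumps(doc))
    return docs

sample = "<data><record><id>1</id><name>alpha</name></record></data>"
print(xml_records_to_json(sample))  # ['{"id": "1", "name": "alpha"}']
```

The resulting JSON documents are what gets written to the database, which is why the choice of database drives the whole design.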
Is NoSQL the best way to go?
After seeing such a large amount of data (terabytes), we immediately thought that NoSQL was the right way to go, just as those students did. NoSQL databases are designed to store large amounts of unstructured data, in our case JSON documents. We decided to implement MongoDB in our ETL and backend applications.
Locally, everything worked pretty well. Uploading to the local database was quite fast, and viewing the data from our application was also acceptable. However, we didn't load the full data set locally, as multi-terabyte drives in laptops are not that common; we loaded only a few weeks of data. A good point of this database was the possibility to run it in a Docker container, which was great for local development. Additionally, MongoDB has a really good-looking UI for data browsing, querying and management: MongoDB Compass (https://docs.mongodb.com/compass/master/).
Up to this point, our approach seemed promising; the next step was to move our solution to the cloud. There we encountered some problems.
Our customer decided to use the Azure cloud. Unfortunately, we found out that Azure did not provide a managed service for hosting MongoDB; it has its own NoSQL solutions. We had to think about what we could do in this situation. There were a few possible workarounds for the problem we faced, but we had to ask ourselves some very important questions.
One example of these workarounds was setting up MongoDB in Azure Kubernetes Service, but this solution was not acceptable. Why? With workarounds like this, we lose the maintainability of managed Azure services. For example, we would probably need an administrator to manage the database: setting it up and handling replication, redundancy and backups.
Despite those problems, we tested this solution for a bit and loaded a few weeks of data. Extracting these gigabytes of data and trying to display them on the frontend managed to freeze our browser. Considering these maintenance issues, we decided to try out the NoSQL service provided by Azure.
Pros:
- fast uploading to the local database
- fast reading from the local database
- easy to set up with Docker containers
- nice database UI client: MongoDB Compass
Cons:
- no managed service on Azure
- workarounds require additional database administration
- workarounds do not provide replication, redundancy and backups
Perfect solution to all our problems?
In the Azure Portal we found a NoSQL database which looked like the perfect solution to all our problems: CosmosDB. CosmosDB is a very interesting database: Microsoft provides a database emulator to run on local machines for development purposes, and it also has a UI for data exploration. This service was supposed to be the best match for our project: it offered data replication, redundancy and automatic backups, so it did not require any additional administration. Unfortunately, as we found out, CosmosDB also has some cons.
After a quick change of the data access layers in our applications, we were ready to test our solution in the cloud. After the first tests, we found out that throughput management in this database is crucial. Throughput is measured in RU/s, Request Units per second. We implemented throughput changes from our API to dynamically increase the limits while the ETL application is running and decrease them when it's done, then tested our solution again. The first few files loaded smoothly into the database, but then the ETL process slowed down terribly. Reading from the database was acceptable, as long as the ETL application wasn't running at the same time.
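The idea behind the dynamic throughput switching was simple: raise the RU/s limit before the ETL run and drop it afterwards. A sketch, with a fake client standing in for the real CosmosDB SDK (the `set_throughput` interface and the RU/s values are illustrative assumptions):

```python
# Sketch of the scale-up/scale-down wrapper around an ETL run. The real
# code called the CosmosDB management API from our backend; here `client`
# is any object with a set_throughput() method (hypothetical interface).

ETL_THROUGHPUT = 20_000   # RU/s while the ETL is loading data
IDLE_THROUGHPUT = 4_000   # RU/s for normal read traffic

def run_etl_with_scaling(client, etl_job):
    client.set_throughput(ETL_THROUGHPUT)       # raise the limit before loading
    try:
        etl_job()                               # load the day's files
    finally:
        client.set_throughput(IDLE_THROUGHPUT)  # always scale back down

class FakeClient:
    """Stand-in for the real CosmosDB client; records the RU/s changes."""
    def __init__(self):
        self.history = []
    def set_throughput(self, rus):
        self.history.append(rus)

client = FakeClient()
run_etl_with_scaling(client, lambda: None)
print(client.history)  # [20000, 4000]
```

The `finally` block matters: provisioned throughput is billed whether it is used or not, so a failed ETL run must not leave the limit raised.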
We decided to do some performance optimizations in our ETL application, and we extended the logging system in order to monitor every step. Optimizing the application didn't help. We tested multiple methods of loading data into the database: synchronous, asynchronous, single records, batches. Nothing worked. From our perspective, we couldn't optimize the application any further; the times of the extract and transform steps were already optimal. At this point, we knew that the problem was CosmosDB.
Loading batches of around 10k records exposed additional problems and exceptions, for example massive numbers of Forbidden and Timeout responses. We could see that the data was getting "stuck" more and more the longer the process ran. We estimated that the full ETL process would take a few years, which was not acceptable: this was a daily job and it was supposed to finish in hours.
It was time to check what exactly was going on with Azure, as we had tested our solution with what we expected to be plenty of throughput (20k RU/s), which was supposed to solve the problem. After some investigation, we found out that the more GB or TB we store in the database, the more throughput is required. As the Azure documentation says:
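The standard mitigation for those throttling responses is to retry with exponential back-off, which is one of the things we tried. A sketch (the `insert_batch` callable is hypothetical, and `TimeoutError` stands in for the throttled/timeout responses we received):

```python
import time

def insert_with_retry(insert_batch, batch, max_retries=5, initial_delay=1.0):
    """Retry a batch insert on throttling errors, doubling the wait each
    time. `insert_batch` is a hypothetical callable doing the real insert."""
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return insert_batch(batch)
        except TimeoutError:              # stand-in for throttled/timeout responses
            if attempt == max_retries - 1:
                raise                     # give up after the last attempt
            time.sleep(delay)
            delay *= 2                    # exponential back-off

# Demo: an insert that is throttled twice before succeeding.
attempts = {"n": 0}
def flaky_insert(batch):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("throttled")
    return len(batch)

result = insert_with_retry(flaky_insert, [1, 2, 3], initial_delay=0.01)
print(result)  # 3
```

Retrying kept the process alive, but it could not fix the underlying problem: the throttling itself only got worse as the stored volume grew.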
Provisioned throughput (RU/s) scales relative to the amount of transactional storage at a rate of 10 RU/s for every 1 GB storage. To estimate your storage requirement, use the capacity planner tool, and ensure you’ve provisioned enough throughput to meet your storage needs.
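That rule sets a hard floor on provisioned throughput (and therefore cost) that grows with the data itself. The arithmetic is trivial but worth spelling out:

```python
RU_PER_GB = 10  # from the quoted Azure rule: 10 RU/s per 1 GB of storage

def min_required_rus(storage_gb: float) -> float:
    """Minimum provisioned RU/s implied by stored data alone."""
    return storage_gb * RU_PER_GB

# With multi-terabyte data, the floor alone swallows what we provisioned:
print(min_required_rus(2000))  # 2 TB of storage -> 20000.0 RU/s minimum
```

In other words, at around 2 TB the 20k RU/s we had provisioned was consumed by the storage floor alone, before serving a single request.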
That was a very big issue with CosmosDB. We also saw that indexes were being created in the background, so we disabled them. We used our ETL application to manage the throughput and increased it up to 50k RU/s. None of these efforts helped: after a short speed-up, uploads slowed down again. Uploading data became almost impossible, and reading was quite slow, not at an acceptable level. At the time, CosmosDB didn't yet have a serverless option (https://docs.microsoft.com/pl-pl/azure/cosmos-db/serverless), and dynamic scaling of the throughput was around 3 times more expensive than static throughput. What could we do next?
Pros:
- replication, redundancy and backups
- no excessive administration required
- quick initial upload from an Azure VM; might be good for lower upload volumes
Cons:
- throughput limitations
- hard throughput management for data uploads
- random exceptions from the database
- unexpected RU/s scaling per 1 GB of storage
Good old SQL
Finally, we decided to use classic SQL. We chose the Azure SQL Hyperscale tier. This solution scales very well, and it additionally provides instant backups based on snapshots stored in Azure Blob Storage, plus very quick data restoration (https://docs.microsoft.com/pl-pl/azure/azure-sql/database/service-tier-hyperscale). SQL wasn't our first choice because of the very large volumes of unstructured JSON files. Given these conditions, we had to rethink the database design, as a trivial solution probably wouldn't work in this case.
At first, we decided to create 3 tables. The first table was designed as "write only": it was supposed to let us load data quickly. It had no keys or indexes and very few columns: one holding an entire JSON document, and a few others to identify that JSON for further processing. Communication between this table and the VM performing our ETL went smoothly and very quickly; we no longer saw errors like Forbidden or Timeout. The entire data set loaded into SQL Server very quickly, so the first step was a great success. However, as we now had only one very large table without keys, we weren't able to read from it, as queries took far too long. We decided to add one more step to our ETL process: data transformation. Thanks to this, we could automatically transform the data into further tables in our database, optimized for reading.
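The shape of that write-only table can be sketched with SQLite standing in for SQL Server (the column names are illustrative; the real table lived in Azure SQL Hyperscale):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Azure SQL database

# "Write only" RAW table: no keys, no indexes, just the JSON document and
# a couple of columns to identify it for further processing.
conn.execute("""
    CREATE TABLE RAW (
        SourceFile TEXT,
        LoadedAt   TEXT,
        Payload    TEXT   -- the entire JSON document
    )
""")

# With no indexes to maintain, bulk inserts stay fast at any volume.
records = [("file1.xml", "2021-01-01", json.dumps({"id": i}))
           for i in range(10_000)]
conn.executemany("INSERT INTO RAW VALUES (?, ?, ?)", records)
print(conn.execute("SELECT COUNT(*) FROM RAW").fetchone()[0])  # 10000
```

The design choice is the classic write/read trade-off: this table is useless for queries, and exists only so that ingestion never waits on index maintenance.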
This is what the initial structure looked like. We had a table called "RAW" holding our JSON files. Next, we created a History table, into which we extracted all the data from the JSON files with an SQL query (https://docs.microsoft.com/en-us/sql/relational-databases/json/convert-json-data-to-rows-and-columns-with-openjson-sql-server?view=sql-server-ver15); it also had an additional column storing the original JSON in case it was needed for further processing. The last table was "Current", where we took data from the History table and aggregated it. Both of these tables had keys and clustered and non-clustered indexes (https://docs.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15), which we optimized using query analysis and execution plans in SQL Server Management Studio (https://docs.microsoft.com/en-us/sql/relational-databases/performance/analyze-an-actual-execution-plan?view=sql-server-ver15).
This structure let us upload data to the database quickly, transform it in acceptable time, and read it. Unfortunately, the more data we had, the harder it was to read it through the API, and the data transformation started taking a very long time, which wasn't at an acceptable level. At this point, we had to perform some structural optimizations to make the times acceptable again.
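The RAW-to-History extraction plays the role that OPENJSON plays in T-SQL: JSON fields become real, indexable columns. A sketch in Python over SQLite (the field names are hypothetical):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RAW (Payload TEXT)")
conn.execute("""
    CREATE TABLE History (
        Id      INTEGER,
        Name    TEXT,
        RawJson TEXT   -- keep the original JSON for reprocessing
    )
""")
conn.execute("CREATE INDEX IX_History_Id ON History (Id)")  # read-optimized

conn.executemany("INSERT INTO RAW VALUES (?)",
                 [(json.dumps({"id": 1, "name": "alpha"}),),
                  (json.dumps({"id": 2, "name": "beta"}),)])

# Extract JSON fields into real columns, the job OPENJSON did in T-SQL.
for (payload,) in conn.execute("SELECT Payload FROM RAW").fetchall():
    doc = json.loads(payload)
    conn.execute("INSERT INTO History VALUES (?, ?, ?)",
                 (doc["id"], doc["name"], payload))

print(conn.execute("SELECT Id, Name FROM History ORDER BY Id").fetchall())
# [(1, 'alpha'), (2, 'beta')]
```

In the real system this transformation ran entirely inside SQL Server as a set-based query, which is what made it fast enough to automate.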
This is the second design of our database. In this solution, we have two additional tables: StagingHistory and StagingCurrent. The most important thing about RAW, StagingHistory and StagingCurrent is that they are cleaned after every run of the ETL application, so they no longer hold large amounts of data. The process now inserts the data into the RAW table, then parses the JSON files into StagingHistory. StagingHistory data is inserted into History as-is. Then we aggregate the data from StagingHistory and insert the most recent, updated record into the StagingCurrent table. The key point is that we keep working only on the small chunk of data from a single ETL run. Finally, we upsert the data into the Current table to keep our records updated with the newest data (https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15). This process uploaded and transformed data quickly; it was a huge improvement over the previous structure, as operations on small volumes of data are much faster. The solution worked great for those issues, but our problem with reading data from the database was still quite serious. Further structural improvements were required.
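The StagingCurrent-to-Current step is a classic upsert, done with MERGE in T-SQL. SQLite's ON CONFLICT clause gives the same effect for a sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Current (Id INTEGER PRIMARY KEY, Value TEXT)")
conn.execute("INSERT INTO Current VALUES (1, 'old'), (2, 'unchanged')")

# Upsert the freshly aggregated rows from one ETL run (MERGE analogue):
# update Id 1, insert Id 3, leave Id 2 alone.
staged = [(1, "new"), (3, "brand new")]
conn.executemany("""
    INSERT INTO Current (Id, Value) VALUES (?, ?)
    ON CONFLICT (Id) DO UPDATE SET Value = excluded.Value
""", staged)

print(conn.execute("SELECT * FROM Current ORDER BY Id").fetchall())
# [(1, 'new'), (2, 'unchanged'), (3, 'brand new')]
```

Because only the rows from the current run are staged, the upsert touches a tiny fraction of the Current table each day instead of rewriting it.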
Here we have the final structure of our database. Thanks to the automated transformation of the data via SQL queries, we could introduce a few new tables created by this process. We decided to create additional pre-filtered tables for the most common use cases: tables for active and inactive elements, and for the most recent elements from the last month. We also created helper tables holding distinct values extracted from our data, used by the filters on the frontend side. This solution was the most optimal for our application, for both reading and writing data. The running time of our ETL application dropped from years to hours compared to the previous databases and solutions. That time was perfectly acceptable, and reading became great too, with maximum delays measured in milliseconds.
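The pre-filtered and helper tables are just materialized copies of the most common queries, rebuilt by the ETL process. A sketch with illustrative table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Current (Id INTEGER, Category TEXT, Active INTEGER)")
conn.executemany("INSERT INTO Current VALUES (?, ?, ?)",
                 [(1, "A", 1), (2, "B", 0), (3, "A", 1)])

# Pre-filtered table for a common use case (active elements),
# rebuilt after each ETL run so reads never pay the filtering cost.
conn.execute("CREATE TABLE ActiveCurrent AS "
             "SELECT * FROM Current WHERE Active = 1")

# Helper table of distinct values, feeding the frontend filter dropdowns.
conn.execute("CREATE TABLE CategoryFilter AS "
             "SELECT DISTINCT Category FROM Current")

print(conn.execute("SELECT COUNT(*) FROM ActiveCurrent").fetchone()[0])   # 2
print(conn.execute("SELECT COUNT(*) FROM CategoryFilter").fetchone()[0])  # 2
```

This trades storage and ETL time for read speed, which is exactly the right trade when the data is written once a day but read constantly.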
Pros:
- fast reading of the data
- fast writing of the data
- replication, redundancy and backups
- no excessive administration required
Cons:
- complicated structure
After quite a long journey through the databases mentioned above, we decided to choose SQL Server. This database gave us everything we needed: backups, redundancy and replication, all without additional database administration. And finally, we were able to load and read data quickly at the same time. The complicated structure of the database wasn't much of a problem, as it wasn't expected to change anytime soon. This solution is currently running in production; we have loaded around 3.9 TB of data into the database and it still performs great. Experience is the best teacher, and now we know that the most obvious solution is not necessarily the best one.