Are you sure you need a database? Strategy for a zero-downtime file-based repository

Tomás de Villanueva Romero Clavijo
Published in Clarity AI Tech · 5 min read · May 8, 2024

In today’s fast-paced push for productivity and results, too many applications are designed by assuming all the required tools and frameworks up front. Databases are so ubiquitous that they are included in the architecture axiomatically, before the project even starts.

If you are a backend engineer who has already worked on similar applications and you want a fresh perspective on which data storage option best suits your requirements, this article is for you.

Use case: calculating environmental impact for clients of a bank

Let’s imagine that for a banking application we need to provide information on the environmental impact of every monetary transaction performed by clients. The backend will expose an API that serves this information (for example, total CO2 emissions). The API will be read-only and must be extremely fast, because millions upon millions of bank transactions are expected to be processed daily.

For each transaction, the bank will provide us with the following information:

  • Company (for example, Zara)
  • Category (for example, Retail)
  • Country (for example, Spain)
  • Amount (for example, $10)

The data will be updated periodically upon scheduled data releases. Each data release will generate two datasets:

  • Total CO2 emissions per dollar for every company
  • Total CO2 emissions per dollar for every category and country combination

The logic for the API will be as follows:

  • If the bank provides the company associated with the transaction, the response will be the company’s total CO2 emissions per dollar multiplied by the transaction amount.
  • Otherwise, the response will be the total CO2 emissions per dollar for the given category and country combination, multiplied by the transaction amount.
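The branching above can be sketched in Java. The class and map names here are illustrative (not from the post), and the per-dollar factors are assumed to be keyed by a company ID and by a category/country pair:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the API logic: company factor if known,
// otherwise the category/country fallback, multiplied by the amount.
public class EmissionsCalculator {

    // Total CO2 emissions per dollar, keyed by company ID
    private final Map<String, Double> companyFactors;
    // Total CO2 emissions per dollar, keyed by "category|country"
    private final Map<String, Double> categoryCountryFactors;

    public EmissionsCalculator(Map<String, Double> companyFactors,
                               Map<String, Double> categoryCountryFactors) {
        this.companyFactors = companyFactors;
        this.categoryCountryFactors = categoryCountryFactors;
    }

    public double emissionsFor(Optional<String> companyId, String category,
                               String country, double amount) {
        // Prefer the company-specific factor when the bank provides a company.
        Double factor = companyId.map(companyFactors::get).orElse(null);
        if (factor == null) {
            factor = categoryCountryFactors.get(category + "|" + country);
        }
        if (factor == null) {
            throw new IllegalArgumentException(
                    "Unknown category/country: " + category + "/" + country);
        }
        return factor * amount;
    }
}
```

The fallback order mirrors the two datasets of a data release: the company file is consulted first, the category-and-country file second.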

Common solution: using a database

A database provides a structured and organized approach to data management, using a database management system (DBMS) to facilitate efficient storage, retrieval and manipulation of data.

Advantages

  • Data integrity: consistency and accuracy are ensured through mechanisms such as constraints, relationships and transactions.
  • Scalability: designed to handle large volumes of data efficiently.
  • Concurrency: simultaneous access to data is managed to prevent conflicts.
  • Complex querying: powerful set of tools for analyzing data, enabling users to extract meaningful insights efficiently.

Disadvantages

  • Complexity: complex to set up and maintain.
  • Cost: implementation and management can be costly, particularly for high availability and performance.
  • Overhead: in terms of storage and computational resources.

What a backend engineer would normally do for this application is create two tables in a database: one for company emissions and another for category and country emissions. However, a database can become a performance bottleneck. Normally that is not a problem but, in our scenario, data must be retrieved very fast, so we need to ask ourselves: do we actually need a database? To answer this question, let’s compare it with an alternative and see which one fits our requirements better.

Alternative solution: zero-downtime file-based repository

A file-based repository is a traditional approach to storing data: information is organized and managed in individual files, typically arranged in a hierarchical structure of directories and subdirectories.

Considering the requirements of our application, a file-based repository provides the same advantages as a database:

  • Data integrity: ensured by using read-only files.
  • Scalability: a cloud storage solution (for example, AWS S3) has almost no limit on the amount of data you can store, providing great reading scalability on its own.
  • Concurrency: read-only files support multiple reads at the same time.
  • Complex querying: not needed, because we will use the unique identifier of each data point (a company ID, or a category/country pair) to look up a given company or category and country combination. All the data will be indexed in memory in performance-efficient data structures, such as a Java Map, that allow us to retrieve any data point very quickly.
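
As a sketch of that in-memory index, assuming a simple `id,emissionsPerDollar` line-per-record layout for illustration (the real files could use any format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: load one release file into an immutable in-memory Map.
// The "id,emissionsPerDollar" layout is an assumption for this example.
public class EmissionsIndex {

    public static Map<String, Double> load(Path file) throws IOException {
        Map<String, Double> index = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            if (line.isBlank()) {
                continue; // tolerate trailing blank lines
            }
            String[] parts = line.split(",", 2);
            index.put(parts[0], Double.parseDouble(parts[1]));
        }
        // Immutable view: the files are read-only, so the index is too.
        return Map.copyOf(index);
    }
}
```

Returning an immutable map matches the read-only nature of the files: lookups by identifier are O(1), and there are no concurrent-write concerns.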

Furthermore, the disadvantages of a database are not shared by a file-based repository in this scenario:

  • Complexity: as simple as one file per table, requiring no setup or maintenance and giving full control over how files are organized and structured.
  • Cost: setup and maintenance requires minimal resources, making it cost-effective.
  • Overhead: since no complex querying is needed, reading from data already indexed in memory is much faster than querying a database.

Implementation

For every data release, the two files with the environmental information will be stored in a predefined path (for example, an AWS S3 bucket) accessible to the backend services in both the development and production environments.

Since those two files are already loaded into memory, how will the backend know that a new data release is available? By using a data release identifier. Besides those two files, we will also store, in a predefined path, a separate file listing all (past and ongoing) data release identifiers.

The backend will just need to poll this file periodically to detect that a new data release has occurred, comparing the most recently generated release ID with the one currently in use. As an alternative to polling, the backend could be notified (via whatever technology fits), with the release ID attached.
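
A minimal sketch of the polling approach, assuming the release-identifier file can be reduced to a `readLatestReleaseId` supplier (a hypothetical helper standing in for reading the last entry of that file):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical polling sketch; names are illustrative.
public class ReleasePoller {

    private volatile String currentReleaseId;

    public ReleasePoller(String initialReleaseId) {
        this.currentReleaseId = initialReleaseId;
    }

    // One poll step: fire the callback only when a new release ID appears.
    public boolean checkOnce(Supplier<String> readLatestReleaseId,
                             Consumer<String> onNewRelease) {
        String latest = readLatestReleaseId.get();
        if (latest.equals(currentReleaseId)) {
            return false;            // still on the current release
        }
        onNewRelease.accept(latest); // trigger load-and-swap of the new files
        currentReleaseId = latest;
        return true;
    }

    // Schedule the poll step at a fixed period.
    public ScheduledExecutorService start(Supplier<String> readLatestReleaseId,
                                          Consumer<String> onNewRelease,
                                          long periodSeconds) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> checkOnce(readLatestReleaseId, onNewRelease),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Keeping the comparison step (`checkOnce`) separate from the scheduling makes the detection logic trivial to test without waiting on timers.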

When a new release is detected, the backend simply has to perform the following actions to achieve zero downtime:

  • Load into memory the two new files associated with the completed data release
  • Swap these files atomically in memory with the old ones
  • Remove the old files from memory to free up space
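
One way to sketch the swap step is to bundle both datasets into a single immutable snapshot behind an `AtomicReference` (class and record names are illustrative), so a request never sees a company file from one release and a category/country file from another:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the atomic in-memory swap between data releases.
public class EmissionsRepository {

    // Immutable snapshot of one data release: both files together.
    public record Snapshot(String releaseId,
                           Map<String, Double> companyFactors,
                           Map<String, Double> categoryCountryFactors) {}

    private final AtomicReference<Snapshot> current = new AtomicReference<>();

    // Publish the new release in one atomic step; the previous snapshot
    // becomes garbage-collectable once in-flight reads finish with it.
    public void activate(Snapshot next) {
        current.set(next);
    }

    // Readers grab a consistent snapshot and use it for the whole request.
    public Snapshot snapshot() {
        return current.get();
    }
}
```

Rolling back, as described below, is just `activate` called with the previous release’s snapshot.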

We must be able to hold both the old and the new files in memory at the same time, which requires twice as much memory for a short period. Any abrupt increase in the size of those files could cause the corresponding backend services to be killed for running out of memory. In this regard, it is worth mentioning that binary formats such as Parquet or FlatBuffers are preferred over JSON files because they require less memory.

This swap must be atomic, i.e., there cannot be any temporary inconsistency across the backend. Rolling back is then as easy as activating the previous data release. But how will the backend know which files to read for a given data release identifier? A simple solution is to embed the identifier in the predefined path where those two files are stored.
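
As a sketch, with a hypothetical bucket name and file names (both are assumptions for this example), embedding the release ID in the path could look like:

```java
// Hypothetical path layout: the release ID is part of the prefix, so any
// release identifier deterministically maps to its two files.
public class ReleasePaths {

    private static final String BASE = "s3://emissions-data/releases";

    public static String companyFile(String releaseId) {
        return BASE + "/" + releaseId + "/company_emissions.parquet";
    }

    public static String categoryCountryFile(String releaseId) {
        return BASE + "/" + releaseId + "/category_country_emissions.parquet";
    }
}
```

Because the mapping is purely deterministic, rolling back needs no extra bookkeeping: activating an older release ID immediately points the backend at the right files.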

Lessons learned

  • Question assumptions: databases are often included axiomatically in an application’s architecture. Always question whether a database is truly necessary.
  • Match requirements with a solution: know what your application requires before selecting a data storage solution. Choose a solution that fits your needs better.
  • Explore alternative solutions: a file-based repository, while less common, can offer advantages over other solutions in certain scenarios. Explore less common alternatives that could fit your requirements better.

By applying these lessons, you can make better decisions on which data storage solution should be used, considering factors such as performance, scalability, concurrency, complexity and cost.

Acknowledgements

This post was co-written by the Risk squad at Clarity AI:

  • Juan Herrador Rodríguez (Team Lead)
  • Álvaro Martínez Hernández (Backend Engineer)
  • José Antonio Adame Fernández (Backend Engineer)
  • Sergio Juan Armero (Frontend Engineer)
  • Rita Belo (Data Engineer)
  • Tomás de Villanueva Romero Clavijo (Data Engineer)

Thanks for your insights, collaboration, valuable feedback and support throughout the process.
