The story of the CouchDB mass storage system in English

Chirapath Saelim
11 min read · May 7, 2022


CouchDB logo from: https://www.dimagi.com/blog/what-every-developer-should-know-about-couchdb/

Database: A database system for storing large amounts of data.

Before understanding CouchDB, we must first understand databases, because CouchDB is a data storage tool. A database holds structured data organized into separate files or tables that relate to each other on a common topic. Nobody wants to retrieve information from a disorganized mass of mixed raw data, so a data storage system is needed to help. Early digital data storage relied on traditional database management systems. A traditional database is centralized, with most of the data collected in one place; the data is stable and rarely changes, and queries run fast. The most common organized form is the relational database management system (RDBMS), which arranges data in tables. It uses the SQL query language for storing and retrieving data and remains popular, including in the SQL database services offered by large technology companies such as Microsoft (Azure) and Amazon.

Relational DataBase Management System (RDBMS) from: https://www.youtube.com/watch?v=6BSlwKkgCYU

However, an RDBMS mostly handles structured data. Modern big data brings a large volume of incoming data, most of it raw and difficult to filter, which called for a new kind of database.

Traditional database from: https://www.firstpost.com/tech/news-analysis/blockchains-versus-traditional-databases-understanding-difference-between-the-two-7269251.html
  • NoSQL Database: A new database format for use today.
NoSQL logo from: https://www.blazeclan.com/blog/detailed-outlook-no-sql-databases/

The new database system is non-tabular and stores data differently from relational tables. The name NoSQL stands for "not only SQL": it is not a programming language but a family of databases, some of which also support SQL-like queries. SQL itself is still the most commonly used query language overall. NoSQL databases have features developed in response to the demands of big data storage:

1. Flexible schemas allow faster development and easier iteration. Flexible data models make NoSQL databases well suited to semi-structured and unstructured data.

2. NoSQL databases are usually designed to scale out across distributed clusters of hardware instead of scaling up by adding expensive, high-performance servers. Some cloud providers handle this behind the scenes as a fully managed service.

3. NoSQL databases are optimized for specific data models and access patterns, which enables higher performance than trying to perform similar operations with relational databases.

4. NoSQL databases provide functional APIs and purpose-built data types for each data model, which makes work faster.

5. Most NoSQL databases are open source, so developers can deploy them easily and adapt them to different applications, with a community of developers maintaining them.

from: https://aws.amazon.com/th/nosql/

NoSQL Database benefits from: https://medium.com/@R.Sumangala/nosql-ab44a979a831

There are four types of NoSQL database systems: (1) document databases, (2) key-value stores, (3) wide-column stores, and (4) graph databases. Our focus is the document database.

SQL Database vs NoSQL Database from: https://www.digitalconnectmag.com/sql-versus-nosql-database-design-which-one-to-pick-for-maximum-cloud-data-storage-performance/

Document Storage: online document storage.

CouchDB stores data as a document store, a form of storage that retrieves and manages document-oriented, semi-structured data. Each document is stored in JSON format. CouchDB has a REST architecture, so the database is accessed over the HTTP protocol, including from a web interface. Queries are written in JavaScript as map-reduce functions, which retrieve data quickly and precisely.
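To make the document model more concrete, here is a minimal sketch written in Python rather than CouchDB's own JavaScript. The sample documents, their field names, and the helpers `map_doc` and `reduce_sum` are all hypothetical. The code only imitates the map-reduce idea in memory and does not talk to a CouchDB server.

```python
import json
from collections import defaultdict

# Hypothetical documents; each one is a JSON-style object with an _id.
# Documents in the same database do not need to share the same fields.
docs = [
    {"_id": "book-1", "type": "book", "author": "Ann", "pages": 120},
    {"_id": "book-2", "type": "book", "author": "Bob", "pages": 80},
    {"_id": "book-3", "type": "book", "author": "Ann", "pages": 200},
    {"_id": "note-1", "type": "note", "text": "no author field here"},
]

def map_doc(doc):
    """Emit (key, value) pairs for one document, like the map step of a view."""
    if doc.get("type") == "book":
        yield doc["author"], doc["pages"]

def reduce_sum(pairs):
    """Group emitted values by key and sum them, like a reduce step."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(sorted(totals.items()))

emitted = [pair for doc in docs for pair in map_doc(doc)]
pages_by_author = reduce_sum(emitted)
print(json.dumps(pages_by_author))
```

The map step looks at one document at a time and the reduce step only combines emitted values, which is what lets this style of query be computed incrementally.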

  • ACID Properties: DataBase properties that are suitable for use.
ACID Properties from: https://www.bmc.com/blogs/acid-atomic-consistent-isolated-durable/

ACID is an acronym that refers to a set of four main properties that define a transaction: Atomicity, Consistency, Isolation, and Durability.

Atomicity: Each statement in a transaction (to read, write, update, or delete data) is treated as a single unit. Either the whole unit is executed or none of it is. This prevents data loss and corruption, for example if a streaming data source fails mid-stream.

Consistency: Ensures that transactions change tables only in predefined and predictable ways. Transactional consistency ensures that corruption or errors in your data do not cause unintended consequences for the integrity of your tables.

Isolation: When multiple users read and write the same table at once, transaction isolation ensures that concurrent transactions do not interfere with or affect each other. Each request behaves as if it occurred one at a time, even though the requests actually happened simultaneously.

Durability: It ensures that changes to your data made by successful transactions are recorded even in the event of a system failure.
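Atomicity is easiest to see with a failing transaction. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The `accounts` table, the names in it, and the `transfer` helper are assumptions made up for this illustration, and SQLite is used only because it ships with Python, not because it relates to CouchDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint makes an overdraft fail, which triggers the rollback.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a plain dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

first = transfer(conn, "alice", "bob", 30)    # succeeds
second = transfer(conn, "alice", "bob", 500)  # debit would go negative
print(first, second, balances(conn))
```

In the second call the credit to `bob` is applied first and then undone when the debit fails, so the table never shows a half-finished transfer.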

CAP theorem: a theory of the properties available to big data tools

In computer science, the CAP theorem, also known as Brewer’s theorem after computer scientist Eric Brewer, states that any distributed data store can provide only two of the following three guarantees.

1. Consistency: Every reader sees the most recent write, so the same data is returned no matter which node is asked, even if the change was made on another node.

2. Availability: The system as a whole keeps responding to requests even if some nodes are broken or unable to transmit data.

3. Partition-Tolerance: The system keeps working across multiple nodes even when network communication between them is lost.

When a network failure or network partition occurs, a database system must decide what to do next. There are two main cases.

Case 1: Cancel the operation. This reduces the availability of the system but ensures consistency across every node in the database. The system returns an error or a timeout if it cannot guarantee that the information is current.

Case 2: Continue the operation even though the failed node is unreachable. This keeps the system available but reduces consistency, so retrieved data is not always guaranteed to match across nodes. The system always processes the query and tries to return the latest available version, although it cannot guarantee that the data is current because of the network partition.

Therefore, if a system runs across a network of nodes and must tolerate partitions, the database developer has to balance consistency against availability to suit the intended use. A system built without partitions can have both consistency and availability at the same time, and most traditional databases are like this. The drawback is that if such a system crashes, all of it crashes.

Following these two cases, current distributed database systems fall into two CAP categories.

AP properties Database from: https://cryptographics.info/cryptographics/blockchain/cap-theorem/
  • AP (Availability over Consistency): Every request sent to the network receives a response, although the response cannot be guaranteed to be up to date because of network partitioning (a failed node). Choosing availability over consistency in a globally distributed system results in very high availability, but some of the data returned may be stale, and no one can guarantee that the information returned is current.
CP properties Database from: https://cryptographics.info/cryptographics/blockchain/cap-theorem/
  • CP (Consistency over Availability): The system returns an error or a timeout if it cannot guarantee that the data is current because of network partitioning. Choosing consistency over availability in a globally distributed system results in high accuracy, but the system may be unavailable while a partition lasts.
Databases classified by the CAP theorem from: https://www.researchgate.net/figure/CAP-theorem-with-databases-that-choose-CA-CP-and-AP_fig1_282519669/download
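The difference between the AP and CP choices can be sketched with a toy replica. Everything in this Python example (the `Replica` class, its `mode` and `partitioned` fields, and the values `"v1"` and `"v2"`) is a hypothetical simplification and not how any real database implements either policy.

```python
class Replica:
    """A toy replica that holds one value and knows whether it is cut off."""

    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP"
        self.value = "v1"         # last value this replica received
        self.partitioned = False  # True when the rest of the cluster is unreachable

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise TimeoutError("cannot confirm the value is current")
        # AP, or no partition: answer with the latest locally known value.
        return self.value

def demo(mode):
    """Read from a replica that missed a write ("v2") during a partition."""
    replica = Replica(mode)
    replica.partitioned = True  # the write of "v2" elsewhere never arrives
    try:
        return replica.read()
    except TimeoutError as err:
        return f"error: {err}"

print(demo("AP"))  # answers, but with the old value
print(demo("CP"))  # refuses to answer
```

The AP replica stays useful during the partition at the cost of returning an old value, while the CP replica protects correctness at the cost of not answering.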

Characteristics and Functions: an introduction to the overall features and working principles of CouchDB.

• CouchDB overview

Apache CouchDB is an open-source, document-oriented NoSQL database that uses multiple formats and protocols to store, transfer, and process data. It uses JSON to store data, JavaScript as its query language through MapReduce for fast searches, and HTTP as its API. The project emphasizes ease of use over the web for accessing the database and for updating or adding to the data stored in it. CouchDB was first released in 2005, is written in the Erlang programming language, and became an Apache Software Foundation project in 2008.

• ACID Properties in CouchDB

Consistency: On disk, CouchDB never overwrites committed data or its associated structures, so the database file is always in a consistent state.

Document reads follow a Multi-Version Concurrency Control (MVCC) model, in which each client sees a consistent snapshot of the database from the beginning to the end of its read operation, even if documents are updated in the meantime.

Whenever a document is updated, CouchDB commits in two steps. First, the document data and associated index updates are synchronously flushed to disk. Second, the updated database header is written in two consecutive, identical chunks that make up the first 4 KB of the file, and is then synchronously flushed to disk. If a crash occurs during the first step, the partially flushed updates are simply forgotten on restart.

If a failure occurs while committing the header, a surviving copy of the previous identical headers remains, which keeps all previously committed data coherent. Apart from the header area, no consistency checks or repairs are needed after a crash or power failure.

• CAP theorem in CouchDB

In terms of the CAP theorem, CouchDB focuses on Availability and Partition-Tolerance. It was designed around the concept of “optimistic concurrency”, which shifts the work of managing data consistency to the application. The system does not lock database objects during writes, so conflicts that a locking mechanism would otherwise prevent have to be resolved by the application developer. This puts an additional burden on the development team and adds complexity to the application code. Even so, CouchDB does not neglect consistency altogether.
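Optimistic concurrency can be sketched as "write only if nobody changed the document since I read it". In CouchDB a write must carry the document's current `_rev`, and a stale revision is rejected with an HTTP 409 Conflict. The Python sketch below imitates that rule in memory. The `TinyStore` class and `ConflictError` are hypothetical, and revisions are plain integers here rather than CouchDB's real revision strings.

```python
class ConflictError(Exception):
    """Stands in for the HTTP 409 Conflict response."""

class TinyStore:
    """In-memory store that rejects writes made against a stale revision."""

    def __init__(self):
        self._docs = {}  # _id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return {"_id": doc_id, "_rev": rev, **body}

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev

store = TinyStore()
store.put("doc1", {"count": 1})                  # create: rev becomes 1
a = store.get("doc1")                            # client A reads rev 1
b = store.get("doc1")                            # client B reads rev 1
store.put("doc1", {"count": 2}, rev=a["_rev"])   # A writes first: rev becomes 2

try:
    store.put("doc1", {"count": 5}, rev=b["_rev"])  # B still holds rev 1
    outcome = "written"
except ConflictError:
    # The application, not the database, decides how to recover.
    fresh = store.get("doc1")
    store.put("doc1", {"count": fresh["count"] + 3}, rev=fresh["_rev"])
    outcome = "retried"

print(outcome, store.get("doc1"))
```

No lock is ever taken. The cost is that client B has to notice the conflict and retry, which is the extra application-side work described above.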

• Consistency of CouchDB

  • Local Consistency

Before looking at how CouchDB works in a cluster, it helps to understand how a single CouchDB node works. The CouchDB API is designed as a convenient but thin wrapper around the database core, so examining the structure of the core in detail clarifies the consistency characteristics of a single node.

At the heart of CouchDB is a powerful B-tree storage engine. A B-tree is a sorted data structure that supports search, insertion, and deletion in logarithmic time, O(log n).

In CouchDB, we access documents and view results by key or key range. This maps directly to the underlying operations performed on CouchDB’s B-tree storage engine, as do document inserts and updates. Restricting access to keys brings large performance gains: operations on each node are faster, and the data can be partitioned across multiple nodes without affecting our ability to query each node on its own.
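Key and key-range access can be illustrated with a sorted list and binary search. This is only a stand-in for a B-tree: a real B-tree also keeps insertions cheap and lives on disk. The `key_range` helper and the sample keys are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Keys kept in sorted order, as they would be inside a B-tree.
keys = sorted(["fig", "banana", "apple", "date", "cherry"])

def key_range(sorted_keys, start, end):
    """Return all keys with start <= key <= end using two binary searches."""
    lo = bisect_left(sorted_keys, start)   # first position with key >= start
    hi = bisect_right(sorted_keys, end)    # first position with key > end
    return sorted_keys[lo:hi]

print(key_range(keys, "banana", "date"))
print(key_range(keys, "cherry", "cherry"))
```

Each lookup costs two logarithmic searches plus the size of the result, which is why querying by key or key range stays fast as the data grows.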

• No Locking

A table in a relational database is a single data structure. If you want to modify a table, for example by updating a row, the database system must ensure that no one else tries to update that row and that no one can read from it while it is being updated. The most common way to handle this is a lock. If multiple users want to access the table, the first user gets the lock and everyone who arrives afterwards has to wait. When the first user’s request has been processed, the next user is granted access while the remaining users continue to wait, and so on. Requests are handled one after another even if they arrive in parallel, which wastes a great deal of server processing power. Under high load, a relational database can spend more time figuring out who is allowed to do what, and in what order, than doing any actual work.

To avoid this problem, CouchDB uses Multi-Version Concurrency Control (MVCC) to manage concurrent access to the database. This lets requests run in parallel at full speed all the time, even under high load, so the capacity of the database system is fully used.

Documents in CouchDB are versioned, much as they would be in a normal version control system such as Subversion. To change a value in a document, you create a whole new version of the document and save it over the old one. After doing this, there are two versions of the same document, one old and one new. MVCC improves performance by letting users read at any time, even while the database is being changed. A reader who started earlier keeps getting the old version of the document. The writer appends the new version without waiting for that read to finish, and readers who arrive after the write get the new version instead. This is the key point of CouchDB choosing availability over consistency: users can read data at almost any time without waiting for updates, but a read that starts before a write completes may return different data from a read that starts after it.
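The read-while-writing behaviour can be sketched with an append-only list of versions. The `VersionedDoc` class below is a hypothetical toy in Python. It keeps every version in memory and ignores storage details, so it only shows why readers never have to wait.

```python
class VersionedDoc:
    """A document whose updates append new versions instead of locking."""

    def __init__(self, body):
        self._versions = [dict(body)]  # append-only history

    def snapshot(self):
        """Return a handle to the newest version that exists right now."""
        return len(self._versions) - 1

    def read(self, snapshot):
        """Read the version a snapshot points to; never blocked by writers."""
        return self._versions[snapshot]

    def write(self, body):
        """Append a new version; readers holding old snapshots are unaffected."""
        self._versions.append(dict(body))

doc = VersionedDoc({"title": "draft"})
first_reader = doc.snapshot()    # request 1 starts reading
doc.write({"title": "final"})    # request 2 writes without waiting
late_reader = doc.snapshot()     # request 3 starts after the write

print(doc.read(first_reader))    # still the old version
print(doc.read(late_reader))     # the new version
```

The first reader and the late reader get different answers for the same document, which is the trade-off described above.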

• Validation

CouchDB can validate documents using JavaScript functions similar to those used for MapReduce. Whenever an attempt is made to modify a document, CouchDB passes the validation function a copy of the existing document, a copy of the new document, and additional information such as user authentication details. The validation function then has the opportunity to approve or reject the update.

Letting CouchDB do the validation saves the CPU work that would otherwise be spent serializing object graphs from SQL, converting them into domain objects, and using those objects for application-level validation.
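Real CouchDB validation functions are written in JavaScript and stored in a design document. The sketch below is a Python analogue that only mirrors the shape of the idea: the function receives the new document, the old document, and a user context, and raises an error to reject the update. The rules themselves (a required `title`, author-only edits) and the names `Forbidden` and `try_update` are assumptions for this illustration.

```python
class Forbidden(Exception):
    """Raised by the validation function to reject an update."""

def validate_doc_update(new_doc, old_doc, user_ctx):
    """Approve the update by returning normally, or reject it by raising."""
    if "title" not in new_doc:
        raise Forbidden("documents must have a title")
    if old_doc is not None and old_doc.get("author") != user_ctx.get("name"):
        raise Forbidden("only the author may change this document")

def try_update(new_doc, old_doc, user_ctx):
    """Run the validation and report whether the write would be accepted."""
    try:
        validate_doc_update(new_doc, old_doc, user_ctx)
        return True
    except Forbidden:
        return False

existing = {"title": "Notes", "author": "ann"}
print(try_update({"title": "Notes", "author": "ann"}, None, {"name": "ann"}))
print(try_update({"author": "ann"}, None, {"name": "ann"}))
print(try_update({"title": "Edited", "author": "ann"}, existing, {"name": "bob"}))
```

Because the check runs on raw documents, nothing has to be converted into application objects before it can be validated.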

• Distributed Consistency

Maintaining consistency within a single database node is relatively easy for most databases. The real problems start when you try to maintain consistency between multiple database servers: if a user writes to one node, the other nodes must change accordingly. For traditional relational databases this is a very complex problem, with entire books devoted to solving its many cases. Techniques such as multi-master, master/slave, partitioning, sharding, write-through caches, and other complex approaches are all used to keep a database consistent across multiple nodes while a large number of users modify or add data through any single node.

• Incremental Replication

Because CouchDB operations take place within the context of a single document, nodes do not need to stay in constant communication when more than one database node is used. CouchDB provides eventual consistency between databases through incremental replication, a process in which document changes are periodically copied between servers. This makes it possible to build what is called a shared-nothing cluster of databases, where each node is independent and self-sufficient and there is no single point of contention across the system.

CouchDB’s replication system comes with automatic conflict detection and resolution. When CouchDB detects that a document has changed in both databases, it flags the document as conflicting, just as a normal version control system would. When two versions of a document conflict during replication, the winning version is saved as the latest version in the document’s history, and the losing version is also kept as a previous version in that history so you can access it if you want. This happens automatically and consistently, so both databases make the same choice. How to handle two pieces of data written at the same time is then up to the administrators and application developers: keep the chosen version, revert to the other one, or merge the two and save the result.
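What makes this kind of conflict handling safe is that every node picks the same winner without talking to the others. The Python sketch below shows one such deterministic rule under a stated assumption: revisions look like `"2-abc"`, the higher leading number wins, and ties are broken by comparing the text after the dash. This is a simplification for illustration and CouchDB's actual selection logic may differ in detail. The helpers `parse_rev` and `pick_winner` are hypothetical.

```python
def parse_rev(rev):
    """Split a revision like "2-abc" into (2, "abc") for comparison."""
    generation, _, digest = rev.partition("-")
    return int(generation), digest

def pick_winner(revs):
    """Return (winner, losers) using a rule that ignores arrival order."""
    ordered = sorted(revs, key=parse_rev, reverse=True)
    return ordered[0], ordered[1:]

# Two nodes see the same conflicting revisions in a different order.
node_a = pick_winner(["2-aaa", "2-bbb"])
node_b = pick_winner(["2-bbb", "2-aaa"])

print(node_a)
print(node_a == node_b)
```

The losing revision is returned rather than discarded, mirroring the point above that the other version stays available for the application to inspect or merge.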

Reference:

  1. http://www.student.chula.ac.th/~59370600/Database.html
  2. https://www.techtarget.com/searchdatamanagement/definition/RDBMS-relational-database-management-system
  3. https://aws.amazon.com/th/nosql/
  4. https://databricks.com/glossary/acid-transactions#:~:text=properties%3A%20Atomicity%2C%20Consistency%2C%20Isolation,Consistency%2C%20Isolation%2C%20and%20Durability.
  5. https://cryptographics.info/cryptographics/blockchain/cap-theorem/
  6. Anderson, J. C., Lehnardt, J. and Slater, N. (2010). CouchDB: The Definitive Guide, 1st edition (pp. 1–17). Sebastopol, CA: O’Reilly Media, Inc. E-book link: https://www.oreilly.com/library/view/couchdb-the-definitive/9780596158156/
