NoSQL, MongoDB, HiveQL
One of the problems with tradition SQL -based RDBMS was explosion of data — the size and number of available data sets have grown rapidly. RDBMS have difficulty processing and analyzing big data. The processing and analysis of big data may require “massively parallel software running on tens, hundreds, or even thousands of servers”. What qualifies as “big data” varies depending on the capabilities of those analyzing it and their tools. Furthermore, expanding capabilities make big data a moving target. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”
Traditional RDBMS has struggle to handle big data. At 2010-s the only solution from major vendors, such as Oracle, for handling big data was to buy and manage DB-cluster from them. It was very expensive back them, and pretty soon the cost that you’re paying for this solution didn’t pay off.
Another problem that stroke in 2010s was unstructured data. You can think about logs, that you now want to store in DB. It doesn’t have some fixed structure. It is true, that for the given application/web service you can, after some research, define correct DB Schema. Typically, nevertheless, you will have more than 1 application and again the time that you’re spend to fulfill correct DB schema definition doesn’t worth it. It is far better, to store the data with some pretty rough preprocessing (you can assume, for example, that you do have the timestamp on the logs) and then to define the schema when you will query the data. It requires more efforts when you will write query to fetch the data, but if you’re doing this irregularly (this is usually the case with logs — you’re defining ad hoc select only if you have some problem to investigate), this should be preferable solution. Of course, nothing stop to store the application logs as row data and have specialized ETL process per application to split it apart and to put to the designate DB table with properly defined schema. Again, usually, this will be only the waste of time.
The point with unstructured data and RDBMS, they are exceptionally bad for dealing with it. Typically, you have something like BLOB (Binary Large Object) and CLOB (Characters Large Object) to store it. But then you can easily manipulate it to extract the relevant part of it. Also from storage point of view, if you have big amount of unstructured data they are not efficient for it.
So, NoSQL has emerged.
I will explain what does NoSQL means and provide with 2 specific examples — MongoDB and Apache Hive.
NoSQL (originally referring to “non-SQL” or “non-relational”) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. The name “NoSQL” was coined in the early 21st century. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called “Not only SQL” to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.
Motivations for this approach include simplicity of design, simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases, see DB-cluster description above), finer control over availability and limiting the object-relational impedance mismatch.
The data structures used by NoSQL databases (e.g. key–value pair, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as “more flexible” than relational database tables.
Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages (instead of SQL, for instance), lack of ability to perform ad hoc joins across tables, lack of standardized interfaces, and huge previous investments in existing relational databases. Most NoSQL stores lack true ACID transactions, although a few databases have made them central to their designs.
Instead, most NoSQL databases offer a concept of “eventual consistency”, in which database changes are propagated to all nodes “eventually” (typically within milliseconds, but this is theoretically unbound), so queries for data might not return updated data immediately or might result in reading data that is not accurate, a problem known as stale reads. Additionally, some NoSQL systems may exhibit lost writes and other forms of data loss.
As mentioned above, there are different types of NoSQL databases: key–value store, document store, Graph, etc. I will focus on document type.
Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often negligible due to tools in the systems, conceptually the document-store is designed to offer a richer experience with modern programming techniques.
Document databases contrast strongly with the traditional relational database (RDBMS). Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This eliminates the need for object-relational mapping while loading data into the database.
The central concept of a document-oriented database is the notion of a document. All document-oriented database assumes documents encapsulate and encode data (or information) in some standard format or encoding. Encodings in use include XML, YAML, JSON, as well as binary forms like BSON. I will focus on JSON.
Documents in a document store are roughly equivalent to the programming concept of an object. They are not required to adhere to a standard schema, nor will they have all the same sections, slots, parts or keys. Generally, programs using objects have many different types of objects, and those objects often have many optional fields. Every object, even those of the same class, can look very different. Document stores are similar in that they allow different types of documents in a single store, allow the fields within them to be optional. For example, the following is a document, encoded in JSON:
“Address”: “5 Oak St.”,
Unlike a relational database where every record contains the same fields, leaving unused fields empty; there are no empty ‘fields’ in document (record) in the above example. This approach allows new information to be added to some records without requiring that every other record in the database share the same structure.
Document databases typically provide for additional metadata to be associated with and stored along with the document content. That metadata may be related to facilities the datastore provides for organizing documents, providing security, or other implementation specific features.
Documents are addressed in the database via a unique key that represents that document. This key is a simple identifier (or ID), typically a string, a URI, or a path. The key can be used to retrieve the document from the database. Typically the database retains an index on the key to speed up document retrieval, and in some cases the key is required to create or insert the document into the database.
Another defining characteristic of a document-oriented database is that, beyond the simple key-to-document lookup that can be used to retrieve a document, the database offers an API or query language that allows the user to retrieve documents based on content (or metadata). For example, you may want a query that retrieves all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to another. Likewise, the specific set of indexing options and configuration that are available vary greatly by implementation.
It is here that the document store varies most from the key-value store. In theory, the values in a key-value store are opaque to the store, they are essentially black boxes. They may offer search systems similar to those of a document store, but may have less understanding about the organization of the content. Document stores use the metadata in the document to classify the content, allowing them, for instance, to understand that one series of digits is a phone number, and another is a postal code. This allows them to search on those types of data, for instance, all phone numbers containing 555, which would ignore the zip code 55555.
Document databases typically provide some mechanism for updating or editing the content (or metadata) of a document, either by allowing for replacement of the entire document, or individual structural pieces of the document.
Document database implementations offer a variety of ways of organizing documents, including notions of
- Collections: groups of documents, where depending on implementation, a document may be enforced to live inside one collection, or may be allowed to live in multiple collections
- Tags and non-visible metadata: additional data outside the document content
- Directory hierarchies: groups of documents organized in a tree-like structure, typically based on path or URI
Sometimes these organizational notions vary in how much they are logical vs physical, (e.g. on disk or in memory), representations.
In a relational database, data is first categorized into a number of predefined types, and tables are created to hold individual entries, or records, of each type. The tables define the data within each record’s fields, meaning that every record in the table has the same overall form. The administrator also defines the relationships between the tables, and selects certain fields that they believe will be most commonly used for searching and defines indexes on them. A key concept in the relational design is that any data that may be repeated is normally placed in its own table, and if these instances are related to each other, a column is selected to group them together, the foreign key. This design is known as database normalization.
For example, an address book application will generally need to store the contact name, an optional image, one or more phone numbers, one or more mailing addresses, and one or more email addresses. In a canonical relational database, tables would be created for each of these rows with predefined fields for each bit of data: the CONTACT table might include FIRST_NAME, LAST_NAME and IMAGE columns, while the PHONE_NUMBER table might include COUNTRY_CODE, AREA_CODE, PHONE_NUMBER and TYPE (home, work, etc.). The PHONE_NUMBER table also contains a foreign key column, “CONTACT_ID”, which holds the unique ID number assigned to the contact when it was created. In order to recreate the original contact, the database engine uses the foreign keys to look for the related items across the group of tables and reconstruct the original data.
In contrast, in a document-oriented database there may be no internal structure that maps directly onto the concept of a table, and the fields and relationships generally don’t exist as predefined concepts. Instead, all of the data for an object is placed in a single document, and stored in the database as a single entry. In the address book example, the document would contain the contact’s name, image, and any contact info, all in a single record. That entry is accessed through its key, which allows the database to retrieve and return the document to the application. No additional work is needed to retrieve the related data; all of this is returned in a single object.
A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in any database, and those documents can change in type and form at any time. If one wishes to add a COUNTRY_FLAG to a CONTACT, this field can be added to new documents as they are inserted, this will have no effect on the database or the existing documents already stored. To aid retrieval of information from the database, document-oriented systems generally allow the administrator to provide hints to the database to look for certain types of information. These work in a similar fashion to indexes in the relational case. Most also offer the ability to add additional metadata outside of the content of the document itself, for instance, tagging entries as being part of an address book, which allows the programmer to retrieve related types of information, like “all the address book entries”. This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables).
In the classic normalized relational model, objects in the database are represented as separate rows of data with no inherent structure beyond that given to them as they are retrieved. This leads to problems when trying to translate programming objects to and from their associated database rows, a problem known as object-relational impedance mismatch. Document stores more closely, or in some cases directly, map programming objects into the store. These are often marketed using the term NoSQL.
A document-oriented database is a specialized key-value store, which itself is another NoSQL database category. In a simple key-value store, the document content is opaque. A document-oriented database provides APIs or a query/update language that exposes the ability to query or update based on the internal structure in the document. This difference may be minor for users that do not need richer query, retrieval, or editing APIs that are typically provided by document databases. Modern key-value stores often include features for working with metadata, blurring the lines between document stores.
Some search engine (aka information retrieval) systems like Elasticsearch provide enough of the core operations on documents to fit the definition of a document-oriented database.
Compared to relational databases, collections could be considered analogous to tables and documents analogous to records. But they are different: every record in a table has the same sequence of fields, while documents in a collection may have fields that are completely different.
Since most NoSQL databases lack ability for joins in queries, the database schema generally needs to be designed differently. There are three main techniques for handling relational data in a NoSQL database. (See table Join and ACID Support for NoSQL databases that support joins.)
Instead of retrieving all the data with one query, it is common to do several queries to get the desired data. NoSQL queries are often faster than traditional SQL queries so the cost of additional queries may be acceptable. If an excessive number of queries would be necessary, one of the other two approaches is more appropriate.
Instead of only storing foreign keys, it is common to store actual foreign values along with the model’s data. For example, each blog comment might include the username in addition to a user id, thus providing easy access to the username without requiring another lookup. When a username changes however, this will now need to be changed in many places in the database. Thus this approach works better when reads are much more common than writes.
With document databases like MongoDB it is common to put more data in a smaller number of collections. For example, in a blogging application, one might choose to store comments within the blog post document so that with a single retrieval one gets all the comments. Thus in this approach a single document contains all the data you need for a specific task.
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas.
Fields in a MongoDB document can be indexed with primary and secondary indices or index.
MongoDB provides high availability with replica sets. A replica set consists of two or more copies of the data. Each replica-set member may act in the role of primary or secondary replica at any time. All writes and reads are done on the primary replica by default. Secondary replicas maintain a copy of the data of the primary using built-in replication. When a primary replica fails, the replica set automatically conducts an election process to determine which secondary should become the primary. Secondaries can optionally serve read operations, but that data is only eventually consistent by default.
If the replicated MongoDB deployment only has a single secondary member, a separate daemon called an arbiter must be added to the set. It has a single responsibility, which is to resolve the election of the new primary. As a consequence, an idealized distributed MongoDB deployment requires at least three separate servers, even in the case of just one primary and one secondary.
MongoDB scales horizontally using sharding. The user chooses a shard key, which determines how the data in a collection will be distributed. The data is split into ranges (based on the shard key) and distributed across multiple shards. (A shard is a master with one or more replicas.) Alternatively, the shard key can be hashed to map to a shard — enabling an even data distribution.
MongoDB can run over multiple servers, balancing the load or duplicating data to keep the system up and running in case of hardware failure.
MongoDB can be used as a file system, called GridFS, with load balancing and data replication features over multiple machines for storing files.
This function, called grid file system, is included with MongoDB drivers. MongoDB exposes functions for file manipulation and content to developers. GridFS can be accessed using mongofiles utility or plugins for Nginx] and lighttpd. GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.
MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
Map-reduce can be used for batch processing of data and aggregation operations. But according to MongoDB’s documentation, the Aggregation Pipeline provides better performance for most aggregation operations.
The aggregation framework enables users to obtain the kind of results for which the SQL GROUP BY clause is used. Aggregation operators can be strung together to form a pipeline — analogous to Unix pipes. The aggregation framework includes the $lookup operator which can join documents from multiple collections, as well as statistical operators such as standard deviation.
MongoDB supports fixed-size collections called capped collections. This type of collection maintains insertion order and, once the specified size has been reached, behaves like a circular queue.
MongoDB claims to support multi-document ACID transactions since the 4.0 release in June 2018. This claim was found to not be true as MongoDB violates snapshot isolation.
Up until version 3.3.11, MongoDB could not do collation-based sorting and was limited to byte-wise comparison via memcmp which would not provide correct ordering for many non-English languages when used with a Unicode encoding. The issue was fixed on August 23, 2016.
Prior to MongoDB 4.0, queries against an index were not atomic. Documents which were being updated while the query was running could be missed. The introduction of the snapshot read concern in MongoDB 4.0 eliminated this phenomenon.
Although MongoDB claims in an undated article entitled “MongoDB and Jepsen” that their database passed Distributed Systems Safety Research company Jepsen’s tests, which it called “the industry’s toughest data safety, correctness, and consistency Tests”, Jepsen published an article in May 2020 stating that MongoDB 3.6.4 had in fact failed their tests, and that the newer MongoDB 4.2.6 has more problems including “retrocausal transactions” where a transaction reverses order so that a read can see the result of a future write. Jepsen noted in their report that MongoDB omitted any mention of these findings on MongoDB’s “MongoDB and Jepsen” page.
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select. HiveQL lacked support for transactions and materialized views, and only limited subquery support. Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Comparison with traditional databases
The storage and querying operations of Hive closely resemble those of traditional databases. While Hive is a SQL dialect, there are a lot of differences in structure and working of Hive in comparison to relational databases. The differences are mainly because Hive is built on top of the Hadoop ecosystem, and has to comply with the restrictions of Hadoop and MapReduce.
A schema is applied to a table in traditional databases. In such traditional databases, the table typically enforces the schema when the data is loaded into the table. This enables the database to make sure that the data entered follows the representation of the table as specified by the table definition. This design is called schema on write. In comparison, Hive does not verify the data against the table schema on write. Instead, it subsequently does run time checks when the data is read. This model is called schema on read. The two approaches have their own advantages and drawbacks. Checking data against table schema during the load time adds extra overhead, which is why traditional databases take a longer time to load data. Quality checks are performed against the data at the load time to ensure that the data is not corrupt. Early detection of corrupt data ensures early exception handling. Since the tables are forced to match the schema after/during the data load, it has better query time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load, but with the drawback of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at the load time, but is instead generated later dynamically.
Transactions are key operations in traditional databases. As any typical RDBMS, Hive supports all four properties of transactions (ACID): Atomicity, Consistency, Isolation, and Durability. Transactions in Hive were introduced in Hive 0.13 but were only limited to the partition level. Recent version of Hive 0.14 had these functions fully added to support complete ACID properties. Hive 0.14 and later provides different row level transactions such as INSERT, DELETE and UPDATE. Enabling INSERT, UPDATE, DELETE transactions require setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode.