Alejandro Capecchi, Alex Chung, Katherine Zhang, Shannon Tran
- Explain the concept of a database and the types of database models that exist
- Discuss the practical uses of structuring data into various databases
- Evaluate the advantages and disadvantages of different types of databases and how it applies to a particular data set example
What Is A Database? Why Do We Need Them?
A database is an organized collection of information, ranging from something as simple as a daily journal to something more complex as a business inventory or excel spreadsheets. In either case, data is stored using a collection of rules and can be modeled in a variety of formats, such as relational, graphical, or document-based. As data scientists, our goal is to pick the right tool for the task at hand. What we pick is critical in order to make processing, analyzing, and querying data as efficient and easy as possible.
What Kinds of Database Models Exist?
For starters, a database model is a type of data organization that sets up the logical structure of your data and determines how it is stored, accessed, and updated. There are three popular types of database models that we worked with: relational, document-based, and graphical. Each has their specific nuances.
Database Models: What, How, and Why
A relational database is a collection of tables with predefined relationships — where each table is populated by columns and rows. Each column in the table holds a certain type of data, while a row represents a set of related values of one object or entity. An example of which is a Student table being composed of names, ID, and age. Values in different tables are connected or related through unique identifier columns, also known as keys.
What Are Keys? Are There Different Types?
In every table, there exists (usually) one column that contains a unique set of values or identifiers called the primary key and a set(s) of values that allow it to be related to another table called the foreign key(s). In other words, the primary key of one table can be matched to the foreign key of another table so that the two tables connect or relate to each other through those key values. Of course, there are situations for which multiple keys are required to distinguish a table. In this case, the two or more columns containing the set of values would be called composite keys. Keys are what allow us to make powerful joins to conjoin information into singular tables through relationships. Thus relational databases rely on merging rather than lookups for deriving information.
What Are Examples of Relational Databases?
Popular examples of relational databases include any of the SQL databases — think: MySQL, SQLite, PostgreSQL, as well as MariaDB. The relational database that we worked with was MySQL.
As the name suggests, document databases are designed for storing, managing, and retrieving document-oriented information. Rather than storing data in a tabular, relational format (with rows and columns), document databases store data in a flexible JSON format, where each document is independent of other documents. This means that different documents in the same collection do not need to follow the same schema (i.e. they do not need to have the same amount or type of information).
What Is A Document?
Each record in a document database is known as a document, which typically stores information about one object or entity and all of its related metadata. Information in documents follows a field-value structure (ex. first_name: John), whereby the field labels an attribute and the value describes the attribute of the object. Values can thus take on a wide range of types, including strings, dates, numbers, arrays, and so forth. As an alternative to relational schema, specific information involves lookups rather than joins.
What Are Examples of Document Databases?
Popular document-oriented databases include MongoDB, DynamoDB and CosmosDB. The document-based database that we worked with was MongoDB.
Unlike relational or document databases, graph databases are built to store, depict, and navigate relationships between data objects or items. Each data object is represented as a node and links to other object(s) via edges denoting relationships. You can think of nodes like nouns, and edges like verbs that connect these nouns. Both nodes and edges can have properties which carry information about the data object and the relationship between two connected objects.
A Deeper Dive Into Nodes And Edges
Nodes are entities or objects in the graph database. Each node can be tagged with labels, which represent the various roles within a particular domain. A node, much like a document, can hold any number of field-value (or key-value) pairs.
Edges, or relationships, depict the connection between a target node and a source node. Relationships have one direction and a type to classify the specific connection between the two nodes (a node can have a relationship with itself). They thus combine sets of nodes into complex structures and allow for a graphical visualization of the nodes interconnected and related to one another.
What Are Examples of Graph Databases?
Popular graph databases are Neo4J, OrientDB, Redis Labs and TigerGraph. The graph database that we worked with was Neo4J.
Which Database Is The Best?
This may seem like a cop out answer, but there simply is no single “superior” database type; rather, the optimal database model is the one that best fits your data structure and aligns with the type of data queries you want to make. Each of these three database models have their own advantages and limitations, so choosing the one that fits your data is highly dependent on how the data is stored and interacted with.
Overall Advantages and Disadvantages To Each
A relational database can be used to easily and efficiently store tabular information, and large amounts of data can be filtered and retrieved quickly. However, it is difficult to store temporal data as well as interface and act as a backend. Furthermore, relational systems often require join operations, which are computationally heavy and financially costly, in order to derive information.
Because document systems are a collection of independent documents, the database is very flexible in nature and allows for horizontal scaling of data. Furthermore, the JSON format of documents results in faster lookups; however, this functionality can be difficult to use because of the sheer amount of data that results from having a non-relational structure. The general use cases of document databases have both transactional and analytical purposes. Three examples are payment processing, operational and real-time analytics, and customer data management
Graph databases are best suited for smaller amounts of information as the requirements for storing data in such a format is the most costly of all. However, their visual nature makes them more intuitive than other databases but lends itself to slower relationship querying/traversing if the data is densely connected (a dense graph). In cases where connections are sparse it has significantly faster performance. Advantages of using graph databases come through for use cases such as social networking, recommendation engines, and fraud detection, when it’s necessary to be able to quickly create relationships between nodes and query them.
Evaluating Databases: Through The Lens of Our Data
Here, we’ll explain the data that we created and outline the pros and cons of each database model for our dataset.
Our mock dataset centers around university data that consists of student, course, department, and club-related information. Within the data, there are seven distinct data sources/tables: students, clubs, in-clubs, courses, sections, section(s) taken, and department. Below is a data dictionary describing each variable and the data source that it is attributed to.
Which Database Best Suited Our Data?
From an architectural standpoint, the relational database (MySQL) best suited our data. Given that we created and structured our data as tables in Google Sheets, our dataset was relational by nature and thus most easily populated into MySQL. Furthermore, querying using MySQL performed the best as well.
To best utilize MongoDB some of our information had to be nested inside certain collections; however this added layer of complexity made it difficult to query information. Similarly, Neo4j’s graphical structure became visually messy when trying to model more than a handful of relationship types and also could not handle tabular data with composite keys as elegantly; when trying to store some of the composite keys inside edges, it would often lead to a duplication of information (as nodes or as edges) due to the lack of a primary key.
How Did Our Data Queries Perform For Each?
Relational Database: MySQL
- Using tabular information to populate the database was a piece of cake
- Querying is easy to do and the syntax is relatively intuitive
- When joins are needed, keys allow for easy merges
- Aggregating data can be efficiently done
- Rigid data structure and schema so leaving out values would require additional data cleaning
- Joining tables are necessary to see all information pertaining to ‘primary objects/entities’ (in our case: students)
Document-based Database: MongoDB
- All information regarding one entity can be found within a single document. Example: a student document contained their personal information, as well as the courses they’ve taken.
- Embedding documents within other documents eliminates the need for traditional joins
- Nesting multiple tables into one collection can result in redundant or unrelated information in objects and make querying more complex
- Documents formatted similar to JSON objects can make query outputs more difficult to decipher as compared to a tabular output
Graph Database: Neo4J
- Easily see commonalities between data objects in our dataset (ex. students that shared a feature/characteristic, sections held in the same building, etc.)
- Gave a clear, student-centric view: all variables as relating to the student (i.e. what department they’re in, what class they’ve taken/are taking, what clubs they’re in and their role)
- Storing information inside relationships simplified the data whereby additional information or matching keys became properties
- Aggregation queries are not as easily achieved and become tabular, so might as well use a more intuitive SQL relational database
- No support for joins, so harder to represent complex data; hard to deal with composite keys which result in data duplication
- Visually messy on a large-scale with more than a handful of relationships so must resort to filtering
While we chose MySQL as the ‘best’ of the three types of databases we used, we can’t dismiss the perks that came with using some of the other databases. With that being said, selecting the right database really comes down to understanding your existing data structure and considering the questions you’re trying to answer. It’s also important to remain flexible and be open to adopting a different database model if your first pick does not perform as ideally as you want it to. Remember: databases are meant to make processing, analyzing, and querying data as efficient and easy as possible – not to make your life harder.