Databases - Create, Connect, and Query
I love databases. I love the organization, the structure, the simplicity. I enjoy breaking a complex concept into individual parts which ultimately reveals its foundational core. I find pleasure in identifying clearly defined connections. There’s no room for ambiguity or redundancy. It nurtures my OCD in the best possible way.
It is totally understandable if you aren’t quite on the same page as me. I am, admittedly, inordinately excitable about databases. But they are a complex topic with ever evolving implementations. There is a breadth and depth to databases that is difficult to grasp initially. So I’d like to start the discussion out with a basic question for the ages…
To SQL or NoSQL?
Personally, I started my journey with relational databases using SQL. I only recently learned about the NoSQL option. While I’m certainly biased, there are undeniable benefits to each depending on your use case.
SQL Databases hold relational data sets:
- They are setup as tables so that the data, which is saved as rows in the table, can be reorganized and accessed in a number of different ways.
- The schema is predefined and adhered to strictly.
NoSQL Databases hold non-relational or distributed data sets:
- They are referred to as a collection which is setup as a set of documents with key-value pairs, graph associations, or wide-column stores.
- The schema is dynamic for unstructured data.
So after taking that all in, if you think you have made the decision that is best for you, consider this…what if you use both types of databases on your project? In this case, you can truly leverage the power of both SQL and NoSQL for different tasks that require the strength of each. For example, if you were importing massive amounts of data from an API with a very specific nested data structure, then a NoSQL database would be optimal for import and quick data responses on an indexed field. However, if you also want to provide user search queries on the data based on a variety of fields, then you should probably sync that data into a SQL database with proper relational keys setup for rapid queries across multiple fields. In this case, you can link the two databases. You can utilize the NoSQL database for storing the majority of the data but also using the SQL database for mapping an index back to the NoSQL database with the functionality to quickly query other searchable fields. Both databases working together will actually give you the benefit of both types in a unified and coordinated approach.
So now you are ready to build, right? But HOW do you build and query your database? This ultimately brings up another hot topic…
ORM or Traditional SQL?
An ORM is a programming tool that helps developers interface with the database utilizing a predefined set of commands that are named in a way which makes the functionality clear to anyone who might be working with the code. The acronym means Object Relational Mapping and it creates an interface to facilitate interaction with your database. Use of the ORM minimizes the lines of code necessary to execute basic commands to your database such as find, create, or update with the support of basic operators and qualifiers. It’s limitations stem from its abstraction, as the simplified code can make it difficult to form a deeper understanding of your database. But it’s benefits come from the simplicity as your code base and development team grow.
Traditional data access techniques are the alternative option. That means using raw SQL language. In this case, there might be a few more lines of code required but you have significantly more control over your queries and what is returned as results.
Once this decision is made, there are plenty of online resources available to help you build out your database to the specs you have decided are right for your project. Regardless of your choices, you will need to follow these steps:
- Design the database - Build the schema or collection taking into consideration the needs of your system. Identify each column and its data type, define any default values or auto-update values such as datetime tracking, define primary keys or indexed values that help with speed during lookup queries, and define foreign keys that map tables together for relational databases.
- Create the database - Run an execution script from your command line to actually build the database. You can also run it from within your code as long as you set exceptions to avoid running every time the program starts.
- Populate the database - Use sample data or an initial set to fill the database. This will help you ensure that the schema is setup correctly. It also allows you to start executing and testing queries.
- Backup the database - Setup a process and a schedule to backup your data. This is important to avoid an issue down the road if the database or data is corrupted or lost. It is also helpful during development cycles to allow for code rollbacks in the case of a bug.
- Establish a connection to the database - Originating either directly from your server or through an ORM, the database will need to be initialized within your code when the app starts up. You also need to ensure that your database is started locally on your machine. You may need to pool connections considering your specific setup.
- Build queries - To extract data from the database, you will need to either build queries in raw SQL or reference your ORM documentation to determine the best way to extract your data.
- Optimize! - One way to optimize is to consider time complexity when building raw SQL queries. I’ll leave you with one example of how execution order can improve your queries.
Consider the below:
SELECT count(id), date(created_at, 'weekday 1'), status FROM records WHERE created_at >= '2017-07-01' AND created_at <= '2017-09-01' GROUP BY date(created_at, 'weekday 1'), status;
The above query is perfectly acceptable. As it’s written, WHERE will execute before the GROUP BY so it will execute on every record in the table. That’s fine if you have 20 or 100 records. But as the dataset grows, this can create a load on the system.
Now try another approach:
SELECT count(id), date(created_at, 'weekday 1'), status FROM records GROUP BY status
HAVING date(created_at, 'weekday 1') BETWEEN '2017-07-01' AND '2017-09-01';
In this case, HAVING will execute after the GROUP BY clause. So it will only execute on the post-aggregated records. You will save execution time but end up with the same result. These little time savings can add up as your dataset grows over time. So stay vigilant!
Now get out there and conquer that data!