Just about every popular app and social media platform has an engine under the hood that powers the information and data we absorb. With hundreds of millions of data points to keep track of, there has to be a well-oiled machine to maintain that database. Every developer has their favorite program to use for database management, but most programs utilize the Structured Query Language, or SQL, for short.
[Related Article: Switching Between MySQL, PosgreSQL, and SQLite]
SQL is a great tool for those looking to manage relational databases, which rely on connecting different datatypes based on relationships like interactions such as a user comment on a video. Every interaction is stored and accessed using queries that select exactly what data needs to be displayed on a webpage or otherwise. But SQL is much more than just a database management tool — it has plenty of applications across a number of fields, is an essential tool for cleaning and exploring data, and combining datasets for greater insights.
For the traditional coder who prefers to work in a shell, scripting queries shouldn’t be a problem, but for those who prefer to work in a more visual environment, there are several options available. For client-server applications, MySQL is a good go-to, but for this tutorial, I’ll be using DB Browser for SQLite. While SQLite isn’t client-server, I’ll just be going over simple queries to get you familiar with the syntax. Without further ado, let’s get started!
Once you’ve downloaded the appropriate DB Browser version for your system, we can start looking at data sources. At Open Data Science, we obviously love working with readily available data, and the city of Boston’s data portal has plenty to spare. I downloaded the Crime Incident Reports data to work with — if the data you’re using has a data dictionary, it’s important to study it to know exactly what each column contains, especially if it has vague naming.
To start, click the “create new database,” name your database, and save to a location of your choice.
One of the benefits of using DB Browser is the ability to create tables from comma-separated value files, which happens to be the format our crime data is in.
If you’re curious about the coding syntax involved here, or if you want to create a table from scratch, you can highlight the table and click “modify table” to view the script used to create our table, as shown here.
As you can see, each category is associated with a different data type, which are important to note when creating queries. At this point, you might want to get familiar with the data using the “Browse Data” tab.
Once you’ve looked through your data, we can begin writing simple queries to select specific data we want to highlight. First, click the “Execute SQL” tab to open a scripting shell. The hallmark command for most SQL queries is the SELECT command, which extracts columns and rows from a table depending on certain criteria. If we write “SELECT *”, we can include every column from the table, if necessary. Combined with the WHERE and FROM commands, SELECT allows you to create a new table in just a few lines. Here, I want to isolate incidents of auto thefts; after executing my query, this is what the output looks like:
We can see from our results that 5664 auto thefts have occurred over the last four years in Boston. What if we wanted to get counts of certain things, like which districts have the most reports? We can use the COUNT command to handle this — again, just a few lines of code and we’ll have a pretty nice dataset displaying the number of incidents per district. Note: it is important to use the GROUP BY command to display the count for each distinct district.
According to the Boston Police Department’s district key, Roxbury, Dorchester, and the South End have the most incidents reported over the last four years. As practice, I’ve created a column called Neighborhood and used the UPDATE command to insert the names of neighborhoods as they pertain to their respective police districts.
The last example I’ll demonstrate combines all the commands we’ve practiced so far. I want to get a list of the most common streets where crime scenes involving shootings occur. In addition, I want to create a new table from our query to export for later uses, like a visualization. Here’s the query and output I came up with:
If all goes accordingly, you should have a new table in your database that you can view in the “Browse Data” tab, which we can treat as its own dataset.
[Related Article: SQL Equivalents in R]
Now that you’ve learned the basic commands in SQL, the world is your oyster; these commands, when paired with a framework, power back-end systems that select what data is relevant to what your user sees. For those using SQL to clean data, these commands are useful tools, but with more practice and a better understanding of SQL’s syntax, cleaning, and exporting data should be a breeze, even for non-coders. So whether you’re a journalist, a lab technician, or a big data analyst, give SQL a go — it might just become your favorite tool for wrangling data.