SQL, The Oldest And The Most Underrated Data Skill!
Introduction
We live in a digital age where a ton of data is generated every minute, even as you read this post right now. Millions, if not billions of people are searching for something on Google at this very moment. Millions are purchasing online and many more are doing active bank transactions. With such rapid growth in data, businesses need an effective way to store this information and an effective way to analyze it. SQL was built to serve this exact need. But before we begin on why SQL is an underrated tool, let’s have a look at what exactly SQL is.
What is SQL?
SQL ( structured query language) is a database language that is used to create, maintain and retrieve information from relational databases. Created by IBM researchers Raymond Boyce and Donald Chamberlin in 1970, SQL, also known as SEQUEL, has become an important tool in a data scientist’s toolbox over the years. It helps in communicating with relational databases and play around with the data to get a better understanding of the database. The natural question you’d ask now is “Praneeth if it’s so important, why is it underrated then?” I’m getting just there.
SQL is often overlooked
With all the buzz around the word ‘data’ in recent times, many new-age programming languages capable of not just manipulating the data at will but also creating complex machine learning and deep learning models have emerged. Languages like Python and R have become the go-to choices when dealing with data. I myself am a fan of the capabilities of python, but I feel that SQL is often overlooked in the process. We get so obsessed with complicated models, colorful visualizations, and mind-blowing predictions that python helps us to do that we completely forget the roots of the data. Think about it, where did you get the data from to do all that complicated stuff? Yeah, a database. I blame the way SQL is taught in various online courses, which often guarantee that taking their course will make you a data scientist in a month or two (Haha, really?).
Many data aspirants look at SQL as just a language with which you can write queries. As someone spending most of my time with machine learning, I myself used to criticize SQL, saying “Why would I need SQL when I can do all the manipulations with python and pandas?”. If you think the same way, boy, you need to read this article! I also remember a friend of mine saying “Yeah, SQL. No big deal. It’s all about ‘ select * from table’ isn’t it?” Whoa, that’s a super simple way to put it, but yes. It’s true. But also, it’s not. It’s actually the beauty of SQL that makes it look so simple while actually, you can achieve a hell lot from just SQL and no other fancy tool.
SQL is incredibly powerful
For a language used “just” for querying, SQL is incredibly powerful. You can almost do any data manipulation task you can actually think of. One of my personal favorite things about SQL is the fact that its syntax is quite similar to that of English. Once I figure out how I’m going to get what I’m trying to, it doesn’t take much time for me to write the query, irrespective of the query’s complexity. With that said, SQL has some pretty powerful commands other than just having an easy-to-read syntax. For you to actually understand why SQL is more powerful than you might think, it’s important to know these commands and the scenarios you might use them in.
Absolute Basics (SELECT, DISTINCT, FROM, WHERE, ORDER BY)
These are DML (Data Manipulation Language) commands which are used to perform basic operations on your tables. SELECT is used to fetch records from your table. To fetch only the unique values from the table, you’d have to use DISTINCT along with SELECT. To select the table from which you need the records, you’ll have to use the FROM command. WHERE is a pretty useful command to filter out the data. By default, the SELECT command returns records in ascending order. If you need a different order, you need to use the ORDER BY command along with DESC. These are the most basic commands that you’ll use in almost every query that you write for data analysis. Quite simple! But the actual fun starts shortly!
GROUP BY and summary functions
The GROUP BY command is quite powerful and handy when you need to generate a summary report from your data. GROUP BY is coupled with summary functions like COUNT, SUM, AVG, MAX, etc. One example when you might be using this command is when you need to find how much monthly revenue you made, broken down into category-wise revenue, in the past 6 months. You’d most probably have a transactions table with the transaction date, transaction amount, and the transaction category. You will be using GROUP BY to group the data into months and categories and the Sum summary function to find the revenue from each category in each month. Pretty sweet and neat!
JOINS
I would argue that this is the most important SQL command among all. You could never truly witness the full potential of SQL if you never use SQL joins. Joins are used to merge two or more tables. This helps you to get a full sense of the data, which otherwise would have been lying scattered across the database. Now there are multiple types of joins that you’d want to use based on the problem you are trying to solve. An INNER join will fetch you all the records in the joined tables where there is a match in both the tables. LEFT and RIGHT joins are the same, but they return records that are present in either of the tables. There is another type of join called OUTER join, which returns all the records from both the tables, but you will find yourself rarely using it. One example where you’d be using the JOINS is when you need to find which city does your customers come from. You will most probably be having a users table and a cities table and you will have to join both these to extract the city information for each of your customers. Quite handy!
Subqueries, temp tables, and CTE’s
Some of my other favorite SQL features are subqueries, temp tables, and CTEs. While there are definitely some performance differences between these, they all serve the same purpose. While fetching data for your analysis, you’d often find yourself in a situation where you need the result of one query and use it in another query to further manipulate the data. This would have been a big headache if not for SQL! (Ah, such a lifesaver). SQL has not one but three different methods to achieve this. My personal favorite however is the CTEs. (Comparing these three would be out of scope for this article, so watch out for a separate article I’ll be writing on this). One example where you’d be using these is when you need to get the names of all the schools along with their details like district, students in that school, and average students per school in that particular district. You’d be having just one table here with school details, but to get average students per district you will have no choice but to use a CTE and join two summarized tables. (One with students in a school and another one with average students per school in each district. I’ll talk about CTE’s in detail in my upcoming post). Such a lifesaver!
CASE WHEN
Think of this as an IF-ELSE ladder that you’d typically use in other programming languages. This extremely helps in conditionally assigning values to your records. One example you might find yourself using this is when you need to assign a grade to each student based on his score. You’d have a table with students and their scores and you will be using CASE WHEN to assign a grade to each student based on some preset criteria.
Window functions
These are quite powerful when you need to work with other records in your table relative to your current row. These save a lot of time and make the entire operation quite seamless. There are a few different window functions, LEAD, LAG, RANK, to name a few. LEAD fetches the next record relative to the current row. LAG does the exact opposite, fetching the previous row relative to the current one. RANK assigns a rank to the records based on some specific criteria. Window functions need to be used together with the OVER clause. One example where you might use window functions is when you want to find your top 20 customers from each region. You’d have a table with customer details along with their region and the revenue that they bring. You would then use the RANK function and rank all the customers from each region and finally use the WHERE clause to filter the top 20 customers. Super clean!
Final Thoughts
These are just some of the super-handy features of SQL that I often use while working on my data projects. The flexibility of SQL makes it one of the best tools to use for data analysis. SQL is often perceived as just a query language, which it is, but if you actually know its capabilities, you would never hesitate to use it for analysis. It eliminates the hassle of exporting the data and using python to do all the manipulations. After all, why complicate stuff which you could do so easily?
Hope you found this article to be of some value. Consider sharing this with someone you think should read this!
Originally published at https://praneethvasarla.com on May 9, 2021.