Introduction to SQL for Data Science: A Complete Roadmap To Be A Data Scientist
Last week I wrote a blog on NumPy, “Data Visualization In Python: A Complete Roadmap To Be A Data Scientist”.. if you haven’t checked it out, please do check that out.
And here’s the fifth one..
So till now what we’ve learnt.. We learnt Python, Pandas, NumPy, and we’ve covered data visualization as well! Whooo, lots of stuff.. so proud of you!
Let’s get into the next part, which is SQL, obviously. I mean, you’re gonna be a data scientist or data analyst or data engineer, which means SQL is gonna be the most important part of your daily life. But before that..
Why SQL?
Well before that, let’s learn what SQL is…
SQL or Structured Query Language is basically a set of commands( it’s kind of a different programming language compared to Python/C++/Java ).
SQL is designed for managing data in a relational database management system( RDBMS ). So SQL is a database language, used for database creation, deletion, fetching rows or modifying columns etc.
So what is a database?
Database is where we store our data. Let’s say, you bought a book from Amazon last year, and maybe for some reason you want to check out that order, you can still find it out there. So on a daily basis whatever we do on any application, it is stored somewhere, and that place is the database. The data is stored because, if the user tries to get any info about his/her past activities, they can get it.
Now there are 2 types of database: SQL based( RDBMS )and NoSQL based.
SQL based database or RDBMS is a style of database where our data is stored in a table. Within a database, there will be multiple tables, each table has its rows and columns where the data is stored; also each table has a relation with each other, and that’s why it is called a Relational Database Management System( RDBMS ). And the language which we use to handle this kind of database, is none other than SQL.
Even though it is an old system, still, many companies use RDBMS these days. And that is why SQL is a must to learn, especially when you’re dealing with data.
Now that it is clear why SQL, let’s understand the different parts of SQL. So in SQL there are 4 types of commands.
Well technically there are 5 types, but we can consider DQL within the DML as well.
DDL or Data Definition Language is mostly used to create or delete or modify databases or tables.
DML or Data Manipulation Language is used to manipulate the data and get some insights from it. It is the most used part for any analyst.
DCL or Data Control Language is used to grant and take back authority from any database user. So if you’re working in a group, then you can give permissions to them to access the database or limit the access for them, using this command language.
TCL or Transaction Control Language is used to represent the transactions in query handling in RDBMS.
DQL or Database Query Language is used to query any info from the table or database.
So there are multiple things we can do with SQL, but for us( Data Nerds ), we’ve to mostly deal with the DML( with DQL ) and DDL.
And today we’ll be discussing the first part, the Data Definition Language part. Next week, I’ll be covering the Data Manipulation part.
One more thing to mention before hand is that all the commands I’ll be using here is for MySQL( which is a database management system ) and if you’re using other database management systems( like Microsoft SQL server or Oracle or MS Access ) then the structure and keywords might differ, but the concept will be totally same.
Create:
This command is used to create the database or its objects( like: table, index, function, views, stored procedure and triggers ).
# create a database
CREATE DATABASE employees; # yes, ending with a semicolon is a must.
USE employees;
And this is how we can create a database, it’s that simple.
The CREATE DATABASE command creates a database with the name typed after it.
The USE command lets us use the database. When there are multiple databases or tables, we can use this command along with the database/table name to use any of them.
# create a table\
CREATE DATABASE employee_details;
USE employee_details;
And this is how we can create and use a table.
We can also add a structure to the table, like each table will have rows and columns, and we can modify those rows-columns according to our needs.
CREATE TABLE person_details(
PersonId int AUTO_INCREMENT,
FirstName varchar(30) NOT NULL,
LastName varchar(30) NOT NULL,
EmailId varchar(30) NOT NULL,
City varchar(30),
MobileNo int NOT NULL,
Gender varchar(5),
PRIMARY KEY(PersonId),
UNIQUE(MobileNo, EmailId));CREATE TABLE employee_details (
EmployeeId int AUTO_INCREMENT,
CompanyName varchar(30) NOT NULL,
HybridWork char(5),
CompanyHQ varchar(30),
AccountNo int,
Income int,
UNIQUE(EmployeeID, accountno),
PRIMARY KEY(EmployeeId),
FOREIGN KEY(PersonId) REFERENCES person_details(PersonID));
We created 2 different tables over here with the name: person_details and employee_details respectively.
If we see the person_details table has 7 columns( PersonId, FirstName, LastName, EmailId, City, MobileNo, and Gender ).
Each of the columns has their own datatypes, like PersonId is of int type, some of them has type varchar( which means string in SQL ) with their own length in the parenthesis.
And then there are some constraints added in each of the columns. Constraints are used to specify rules for the data in a table. Like PersonId has AUTO-INCREMENT( which for every entry in the database automatically increments .. so after the 10th entry, if we enter one more row, it automatically will become 11 ), some of the other columns have NOT NULL( which ensures that a column cannot have a NULL value ).
Then in the end of the person_details table we’ve PRIMARY KEY and UNIQUE constraints. Unique ensures that all values in a column are different, whereas Primary Key is a combination of Not Null and Unique — uniquely identifies each row in a table.
So in each table there is something called a Primary Key, which column must have unique values and there shouldn’t be any null values. A table can only have ONE Primary Key and it can consist of single or multiple fields.
Just like Primary Key, there is one more constraint, Foreign Key. A Foreign Key is a field( or a collection of fields ) that refers to the Primary Key of the other table. Which means Primary Key and Foreign Key are the constraints which we use to create relations between two tables. One thing to remember, that a single table can have multiple Foreign Keys. Also, along with Foreign Key we use the REFERENCES command to actually create a relation from one table to the other one.
The table with the foreign key is called the child table, and the table with the primary key is called the parent or referenced table. Here in our example, the person_details is the parent table and the employee_details is the child table, as the employee_details have a Foreign Key which is Referenced to the PersonId of the person_details table.
Drop:
When there is a database/table which we don’t need any more, we can use this command to delete them. And this is a permanent deletion, so better to use it carefully.
# drop a database/table
DROP DATABASE database_1;
DROP TABLE table_1;
Truncate:
This command is used to delete all the data inside a table, while keeping the table intact. So only the data inside the table will be deleted.
# truncate a table
TRUNCATE TABLE table_1;
Alter:
Whenever we need to add/delete/modify columns inside an existing table or to add/delete constraints from that table, we use the Alter command.
# add a new column
ALTER TABLE person_details
ADD Age int;
Thus we can add a new column to an existing table. We just need to use the ADD command along with the column_name and its datatype.
# drop a column
ALTER TABLE person_details
DROP Age;
This is how we can delete a column from an existing table.
# change the datatype of a column
ALTER TABLE person_details
MODIFY COLUMN MobileNo int;
There are many things we can do with the help of Alter table command, and I’ll be showing some of them here.
Rename a column/table:
# rename a column
ALTER TABLE person_details
RENAME COLUMN old_name TO new_name;# rename a table
ALTER TABLE person_details
RENAME TO new_table_name;
Add constraints:
# add a constraint
ALTER TABLE person_details
ADD UNIQUE EmailId;# adding a constraint with a proper name
ALTER TABLE person_details
ADD CONSTRAINT 'constraint_name' PRIMARY KEY(EmailId, FirstName);
Delete constraints:
# deleting a constraint
ALTER TABLE person_details
DROP INDEX 'constraint_name';# deleting a primary key
ALTER TABLE person_details
DROP PRIMARY KEY;# deleting a foreign key
ALTER TABLE person_details
DROP FOREIGN KEY;
The Check constraint:
The CHECK constraint is used as a filter to limit the value range that can be placed in a column.
# adding a check constraint
ALTER TABLE person_details
ADD CHECK(EmployeeId > 0);# adding a check constraint
ALTER TABLE imp_table
ADD CONSTRAINT CHK_Person CHECK(Age>=18 AND City='Sandnes');
Default value:
We can add any default value to any particular column, and if the user gives any other value to the column, it will automatically change to the new value.
# adding a check constraint
ALTER TABLE imp_table
ALTER City SET DEFAULT 'Sandnes';
Now I’ve covered the basics of SQL, and if you want to learn more about it, please check here. Next day, I’ll be covering the Data Manipulation part or the advanced part of SQl, so stay tuned for that! 😉
Hope this helps. I’m going to share more of my journey, which will be totally beginner-friendly. Check out my other blogs on Medium or we can connect on LinkedIn.