SQL FOR DATA SCIENCE

Prerna Nichani
Analytics Vidhya
Published in
6 min readAug 8, 2020

Along with R, Python, having a SQL knowledge is now an need of an hour for all those working/ looking out jobs in the field of Data Science. By 2025, the amount of data generated each day is expected to reach 463 exabytes globally. In order to store this large amount of data it is absolutely necessary for one to have Database knowledge.

“Data is the new oil” is perhaps one of the most popular catchphrases that can describe the fuel that makes our increasingly interconnected world go round.

Let’s look at some of the basics of SQL.

Introduction

SQL stands for Structured Query Language. SQL is used to communicate with a relational database. A relational database is a database that stores and provides access to data points that are related to one another. It is way of representing data in tables.The columns of the table hold attributes of the data making it easy to establish the relationships among data points.

A database most often contains one or more tables. A table is a collection of related data entries and it consists of columns and rows.

Each table is identified by a name (e.g. “Customers” or “Orders”). Tables contain records (rows) with data.

Every column in a Table is called a field. For Example, for a Customer Table : CustomerID, CustomerName, ContactName, Address, City, PostalCode and Country are the fields here.

A record, also called a row, is each individual entry that exists in a table. For example, there are 91 records in the above Customers table.

There exist many different SQL databases such as: SQLite, MySQL, Postgres, Oracle and Microsoft SQL Server.

SQL is not an case sensitive Language

An SQL Table consists a PRIMARY KEY which is a unique identifier and cannot contain null values. This key is used an foreign key in other tables and SQL JOINS can be done on the basis of the keys which is known as referential integrity in SQL.

Let’s Get Started

In this article I will introduce you to some basic SQL commands.

DDL(Data Definition Language) Commands: DDL or Data Definition Language actually consists of the SQL commands that can be used to define the database schema. It is used to create and modify the structure of database objects in the database.

· CREATE — is used to create the database or its objects (like table, index, function, views, store procedure and triggers).

Create table TableName(column1 datatype,column2 dataype,…. );

· DROP — is used to delete tables from the database.

Drop Table TableName;

· ALTER-is used to alter the structure of the database.

ALTER TABLE tablename
ADD column_name datatype;

· TRUNCATE -is used to remove all records from a table, including all spaces allocated for the records are removed.

Truncate Table TableName;

· RENAME -is used to rename an object existing in the database.

RENAME table_name TO new_name;

DQL (Data Query Language) :

DML statements are used for performing queries on the data .

SELECT — is used to retrieve data from the a database.

Select */columnlist from tablename;

(* is used to select all the columns from the Table. In order to select some columns the names of the columns is mentioned separated by a comma(,))

DML(Data Manipulation Language) : These SQL commands deals with the manipulation of data present in the database

· INSERT — is used to insert data into a table.

· UPDATE — is used to update existing data within a table.

· DELETE — is used to delete records from a database table.

DELETE FROM table_name WHERE condition;

DCL(Data Control Language) : DCL includes commands such as GRANT and REVOKE which mainly deals with the rights, permissions and other controls of the database system.

· GRANT-gives user’s access privileges to database.

· REVOKE-withdraw user’s access privileges given by using the GRANT command.

SQL JOINS

SQL Joins are used to combine rows from two or more tables, based on a related column between them.

Different Types of SQL JOINS

  • INNER JOIN: Returns records that have matching values in both tables

SELECT */column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;

  • LEFT JOIN: Returns all records from the left table, and the matched records from the right table

SELECT */column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name = table2.column_name;

  • RIGHT JOIN: Returns all records from the right table, and the matched records from the left table

SELECT */column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;

· FULL JOIN: The FULL OUTER JOIN keyword returns all records when there is a match in left (table1) or right (table2) table records.

SOME IMPORTANT POINTS TO KNOW

The UNION Operator

  • Each SELECT statement within UNION must have the same number of columns and similar data types
  • The columns in each SELECT statement must also be in the same order.
  • The UNION Operator will eliminate duplicates in the output result.

SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;

The UNION ALL Operator:

The UNION operator selects only distinct values by default. To allow duplicate values, use UNION ALL.

SELECT column_name(s) FROM table1
UNION ALL
SELECT column_name(s) FROM table2;

The ORDER BY keyword:

The ORDER BY keyword is used to sort the result-set in ascending or descending order.The ORDER BY keyword sorts the records in ascending order by default. To sort the records in descending order, use the DESC keyword.

SELECT column1, column2, …
FROM table_name
ORDER BY column1, column2, … ASC|DESC;

The SQL GROUP BY Statement

The GROUP BY statement groups rows that have the same values into summary rows.The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s);

The HAVING clause: The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions.

This was a simple introduction on how to get started using SQL to solve Data related problems. In the next article I will be posting a brief article on how to insert an CSV File into SQL Database using Database connectors from Python.

CONCLUSION:

The GROUP BY statement groups rows that have the same values into summary rows.The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.

I hope this helps you , thank you for reading!

--

--