Pandas VS SQL for Data Analysis

Aayushmaan Jain
Analytics Vidhya
Published in
4 min readMay 29, 2021

Often there is a debate about which one is better for Data Analysis, here we will generate random data and analyze it with both pandas as well as SQL to see which one is better for us.

In this case we shall be comparing both of them for ourselves by an example. We shall be analyzing the data provided to us on Pandas first and then we shall be analyzing the same on SQL.

First, let us see the schema of the data provided to us

Schema of the MySQL database

Here we see that this is the data for a restaurant which has 3 tables

People table — This table contains information about the people who visit the restaurant. This table has 4 columns

  1. ID — This column stores the person ID for the person visiting the restaurant. This column is the primary key in the table which uniquely identifies each person
  2. AGE — The age of the person who visits the restaurant
  3. GENDER — The gender of the person who visits the restaurant
  4. CITY — The city where the person lives in

Preferences table — This table contains the information about the food preference of the people. This table has 2 columns

  1. ID — The person ID of the person visiting the restaurant. This column is the primary key in the preferences table which uniquely identifies each customer but this column is also the foreign key in the preference table which references the ID column in the people table. This is done because there cannot be an entry in the preferences table unless there is an entry in the people table. This is because only the customers who have visited the restaurant can answer the survey about their preferred food types
  2. FOOD_TYPE — This column stores the preferred food type of each customer

Job table — This table contains information about the profession of each customer. This table also has 2 columns

  1. ID — The person ID of the person visiting the restaurant. This column is the primary key in the preferences table which uniquely identifies each customer but this column is also the foreign key in the preference table which references the ID column in the people table. This is done because there cannot be an entry in the job table unless there is an entry in the people table. This is because only the customers who have visited the restaurant can answer the survey about their job
  2. JOB — This column stores the profession of each customer

Reading the data into python so that we can compare both SQL and python

Importing the necessary libraries and frameworks

Establishing a connection to MySQL database and reading the data

Manipulating the data for better analysis

We can combine the data from these dataframes so that we can have all the information at one place

From the schema we can see that we can divide the age column into various age groups and we can divide the city column into various states which will aid us in our analysis and then we also create separate dataframes for men and women which also aids us in the analysis

Now we save the combined dataframe to a csv file and we create a table in the SQL database which has all the information from the csv file so that we get the newly created columns also in the new table

We create the new table in the database by using this code

Beginning the analysis

Firstly we analyze the number of men and women in the entire population and their age groups and their population according to their age groups in each state

Python

Age Group wise analysis in python

SQL

Age Group wise analysis in SQL

Then we perform state wise and city wise analysis

Python

City and State wise analysis in python

SQL

City and State wise analysis in SQL

Then we perform job wise analysis

Python

Job wise analysis in python

SQL

Job wise analysis in SQL

Final Comparison

SQL is more efficient in querying data but it has less functions whereas in pandas, there might be lag for large volumes of data but it has more functions which enable us to manipulate data in an effective way. Python also has various data visualization libraries like matplotlib, seaborn, plotly, Altair etc which helps us to visualize hence helping us to analyze data more effectively.

SQL works on a relational model which makes linking tables via keys much easier and regulates the entry of data in other tables in case of a foreign key constraint but Pandas makes data manipulation much easier.

Pandas plotting also makes it easier for us to analyze data with it’s plotting features which provide a quick plot for us to get better insights of the data. The transform function of Pandas also helps us to apply a variety on functions. Pandas join function also helps us to join the data on the index and merge function works like SQL which enables us to join on a particular column present in both the dataframes.

Pandas is better if you intend to manipulate the data or plot it as it does it all in one place whereas in SQL we have to use Tableau for data visualization

Github repository of the code: repository

My website : website

Written by:

Aayushmaan Anvesh Jain

Thank you for reading the blog

--

--

Aayushmaan Jain
Analytics Vidhya

A data science enthusiast currently pursuing a bachelor's degree in data science