EDA with SQL (MySQL)

EDA = Exploratory Data Analysis

Ramya N
Geek Culture
8 min readMar 23, 2021

--

Tools (Shell or Workbench) and Language (SQL) with MySQL RDBMS

Exploratory Data Analysis (aka EDA) is the process of Descriptive Data Analytics that helps to know what has happened based on the given historical data.

This is the initial Descriptive Analysis part of the Data Analysis (aka DA) that involves cleaning and transforming data for the next step of DA that can be of Data Modelling (determining and defining the relationships between the tables).

Why EDA is done?

To get the grasp on organization’s data assets and extract meaningful information from their historical patterns that further helps business team to perform required actions to meet business targets or enhance KPIs (Key Performance Indicators - to measure the performance of the tasks/targets ).

Here I have considered sports dataset on FIFA, to perform common and essential data tasks in EDA especially if you are beginning to analyze data, these are essential steps whether DA is performed with MySQL or Python or R. There could be subtle changes in the syntaxes for other relational databases.

In short, DA helps to extract valuable information from the given dataset.

Dataset: The sample dataset that can be available from Tableau website, under FIFA 18 then click on ‘Sample Data’ at the top, scroll down to ‘download’ button for .xlsx or .csv formats - https://public.tableau.com/en-us/s/resources

Tools & Language: MySQL Workbench (GUI for MySQL Relational Database Management System aka RDBMS) to visualize data in tabular form and MySQL Shell (CMD/Terminal tool i.e., non-GUI) to write & execute SQL statements to get the essential information from the dataset that can be in CSV or Excel or JSON or any other format.

With relational database - we can compute descriptive statistics using aggregation functions such as min(), max(), avg(), sum(), stddev() on a numerical fields/columns. It’s similar to using describe() function from Pandas if you are familiar in Data Analysis with Python.

EDA is required to prepare reports and data visuals (data visualization) to present the story about the analysis performed to answer the questions by the business team, however the flavours of DA comes in 5 categories, such as Descriptive, Diagnostic, Predictive, Prescriptive and Cognitive analytics (these are FYI) and each is performed based on the requirement and the application you are working on. Here we are working on Descriptive Analytics…

What’s in EDA here:

Exploratory Data Analysis, done on multiple scenarios. Ex: analysis of the company’s data assets or it’s used for building Machine Learning predictive models. EDA is performed in and for nearly all business types - to analyze KPIs and for companies adopting data-driven decision making.

If you are beginner to the Data Analysis or in the initial level of exploring data, below are the steps/statements I have explained with SQL statement and code snippets.

Note: Before you begin to write & execute SQL statements whether in MySQL Shell or Workbench, remember to keep the database server turned on, sometimes this step can be missed which happens with me! so as a note to turn, server on…

In the article To Interact with MySQL Database, you can find ways to connect to MySQL server where I have explained with Shell and Workbench tools along with the use-case of essential CRUD operations and about MySQL Server version in more details.

1. With MySQL Workbench:

Once the database server is turned on, you can open the GUI tool i.e., Workbench as seen in the below screenshot.

Image-1: This is landing screen once you open Workbench, next click on ‘Local instance 3306' to connect to local server instance and follow the on-screen instruction to enter password, if you have provided it during installation of MySQL server.

Overview of MySQL Workbench tool with the fifa_sport table in tech_blogs database:

Image-2: MySQL Workbench (GUI) to visualize the data in SQL tables form.

To activate the required database, execute the statements in Workbench as show above on line 1 or also shown in the video below.

In the video below, I have covered both - (i) Uploading CSV file to MySQL database via MySQL Workbench and (ii) To view data in the table form.

Video-1: Upload CSV file to database as a table in MySQL Workbench and view the data in SQL table.

Below are the statements can be executed either in Shell or Workbench. You can use either of these tools depending on your preference.

Tips: Comments in the MySQL are prefixed with consecutive double hyphens (-) as seen below.

-- Below statement to fetch the rows/records from tech_blogs table:mysql> select * from tech_blogs;

2. With MySQL Shell:

To activate the MySQL Shell: Enter the password if you have provided it during the installation of MySQL Server. Next, it outputs the mysql> as seen below which indicates you can start writing and executing SQL statements.

Image-3: To activate the MySQL Shell

After activating the Shell, initial checks I would do, are:

i. Checking for the existing/created databases in the connected instance.

mysql> show databases;

ii. Activating specific/required database.

mysql> use tech_blogs;

iii. Knowing the list of tables in the activated database.

mysql> show tables;
-- OR
mysql> show full tables;
Image-4: SQL statements for initial checks.

In the Image-2 and Video-1, explained for writing and executing SQL statements in the Workbench.

EDA SQL statements can be written and executed in either of these two tools. Below shown code snippets are from the MySQL Shell.

Image-5 & Option-1: To display the content in Shell, shows all columns with * operator in SELECT statement.

Another option to view the content in Shell with specific fields:

Image-6 & Option-2: To display the contents in Shell which is similar to table form, however with the selected columns - ID, Name, Age and Wage.

Below are the statements to learn about the metrics such as number of records/rows, to fetch records and unique values, structure/schema (columns, data type…) of table and the like.

So, without further delay let’s dive in for next steps.

The EDA helps you to know which features (fields) required for the analysis, if there are any outliers, any missing values, unique values, descriptive statistics (average, max and min values, standard deviation) which ofcourse depends on the scenarios - (i) if you are doing EDA to build predictive model (Machine Learning application) or (ii) to extract information for data-driven decisions making, depending on organization’s requirement, here I am explaining EDA for the scenario(ii) i.e., for data-driven decision making!

Image-7: Structure/schema - Displaying fields name, their data type, nullable or not, default value and so on.

Below are some of the questions, that can be answered from these SQL statements. By now you know that, we are working with MySQL (RDBMS) and SQL (language) here.

Q1: How many football players are there in FIFA organization?

Image-8: count(*) displays number of rows > so 222 players are stored in the dataset. One player per row so we will get the number of players.

Q2: How many nationalities are there?

Image-9: Number of nationalities are 43 - ‘distinct’ keyword helps to know unique values.

Q3: How many players from each country - i.e., frequency of nationality?

Image-10: 31 players from Spain, 25 players from Brazil, 25 players from France and so on.

Above statement also provides the information on which country has got more players i.e., highest number of players are from Spain.

So the top five countries with more number of players, can be fetched with the ‘limit’ keyword in descending order:

Image-11: Fetching top-5 countries with more number of players in descending order.

Q4: What is the highest amount of wage paid to the player?

Image-12: Using max() function to fetch highest wage from wage column.

Q5: Which player is getting paid highest and from which country?

Image-13: Player name and their country with highest paid.

Tip: To get highest wage, subquery can be passed to the outer query with ‘where’ clause to get the required answer instead of performing in two different statements to optimize scripting.

Image-14: One statement provides both answers as from Image-12 (Q4) & Image-13 (Q5).

Q6: What is the minimum wage paid?

This is similar to max() function, but use min() function as below:

mysql> select min(wage) as minimum_wage from fifa_sport;-- Output
+--------------+
| minimum_wage |
+--------------+
| 1000 |
+--------------+

Q7: Which player has got overall highest rating and from which club?

Image-15: Player G.Buffon with overall highest rating 88 from the club ‘Paris Saint-Germain’.

Q8: Which are top-3 clubs based on the overall rating?

Image-16: Top-3 clubs are ‘Vissel Kobe’, ‘Manchester City’ & ‘LA Galaxy’. Interestingly ‘FC Barcelona’ also has similar rating to ‘LA Galaxy’.

Q9: How many are, left or right foot players from each nationality and show top-10 records?

Image-17: 20 players are right-foot preferred from Brazil, 20 are right-foot preferred from Spain, 16 are right-foot preferred from Germany and so on.

When the columns are named with dot (.) in between phrases then two possibilities to execute it. Either enclose those names in back ticks (`) or Rename those field names. Here I have enclosed the Preferred.Foot field with back ticks.

Q10: Who are the players registered at ‘Paris Saint-Germain’ club?

Image-18: Above statement fetches the list of players from club ‘Paris Saint-Germain’. Players are G. Buffon, Marquinhos, P. Kimpembe, A. Rabiot, Dani Alves and A. Areola

Q11: In the year 2018, how many players have joined on which date and to which club?

Image-19: 2 players on 2018–01–01 joined GriEmio, 2 players on 2018–07–01 joined Liverpool and so on. If no players have joined, then those dates are not displayed.

Note: For ‘group by’ clause you can pass in either the field name or the sequence number of the field name. Here I have done ‘group by’ for both fields date(joined) i.e., 2 and club i.e., 3. And ‘date()’ function used, to eliminate the timestamp.

Q12: Another statistics from business perspective - how many players have joined every year from 2008 to 2018?

Image-20: 51 players joined in 2018, 40 players joined in 2017, 31 players joined in 2015 and so on.

Tips: The keywords and clauses are to be written sequentially in SQL statements. Example ‘Select’, ‘from’, ‘where’, ‘group by’, ‘order by’, ‘limit’ and so on. By default the records will be displayed in ascending order.

Summary: So depending on the questions, we can analyze the given dataset, thus it can be presented to the respective team with the report & requested information. For representing data in graph visualizations, you can use Tableau or Python with Seaborn/Matplotlib libraries or R language to enhance the report and presentation.

Hoping you enjoyed this article to get the insight into the initial level of descriptive analysis, to answer the above type of questions.

Thank you … !

--

--

Ramya N
Geek Culture

Data Analysis, Web & Full Stack Dev, Tech Writer & ML/DL/NLP Enthusiast | Code Instructor & Mentor | Health & Fitness Influencer