Exploratory Data Analysis in SQL

Shawn
3 min readNov 9, 2022

--

Before we start, if you haven’t read my previous article for SQL, it would be helpful if you read it first.

Explore table size

Let’s have a little recap about SQL:

Now we have a table fortune500Let’s explore it step by step

  1. figure out how many rows are in fortune500

2. Subtract the count of the non-NULL ticker values from the total number of rows; alias the difference as missing.

Join tables

Part of exploring a database is figuring out how tables relate to each other. The company and fortune500 tables don't have a formal relationship between them in the database, but this doesn't prevent you from joining them.

To join the tables, you need to find a column that they have in common where the values are consistent across the tables. Remember: just because two tables have a column with the same name, it doesn’t mean those columns necessarily contain compatible data. If you find more than one pair of columns with similar data, you may need to try joining with each in turn to see if you get the same number of results.

Join.

There’s two ways to use “Join”

  1. SELECT table_column1, table_column2…
    FROM table_name1
    INNER JOIN table_name2
    ON table_name1.column_name=table_name2.column_name;
  2. SELECT table_column1, table_column2…
    FROM table_name1
    INNER JOIN table_name2
    USING (column_name);

Different kind of Join

https://www.w3schools.com/sql/sql_join_inner.asp

Read an entity relationship diagram

What is the most common stackoverflow tag_type? What companies have a tag of that type?

Coalesce

Evaluates the arguments in order and returns the current value of the first expression that did not initially evaluate to NULL.

For example:

SELECT COALESCE(NULL, NULL, ‘third_value’, ‘fourth_value’);

returns the third value because the third value is the first non-null value.

Case

In the fortune500 data, industry contains some missing values. Use coalesce() to use the value of sector as the industry when industry is NULL. Then find the most common industry.

Use coalesce() to select the first non-NULL value from industry, sector, or 'Unknown' as a fallback value.

Data type.

Two methods to assign data type.

SELECT value::new _ type;

SELECT CAST (value AS new _ type);

If you enjoy this article please click the thump up (Keep pressing can send multiple thump up!) and follow me, I’ll keep updating content about data science! Hope you like it.

Reference:

Christina Maimone: EXPLO RATO RY DATA ANALYSIS IN SQL, datacamp

--

--

Shawn

Self taught — Data Analyst | Business Intelligence Specialist | Business Analyst | Data scientist