MySkill Data Analysis Part 1 : SQL For Data Analysis

17 min readNov 22, 2022

SQL stands for Structured Query Language and is pronounced as Sequel. This is the language used in data analysis to communicate with data.

SQL for data analysis leads to the database questioning language’s capability, to communicate with various databases at once. SQL is the most used and flexible language. Since it connects a surprisingly convenient learning curve by a complicated depth. There are different versions SQL such as Big Query, MySQL, and Postgre SQL. But for this article, we will use the Postgre SQL version.

This article will explain the exercises that given during Intensive Bootcamp Data Analysis by MySkill.

The first thing we do is import dataset. How to import dataset?

Make sure you have installed PgAdmin
Import the required table using the sql script that has been provided via 2 ways :
a. (Alternative 1): open Query Tool > open SQL script file in pgadmin > run
b. (Alternative 2): open Query Tool > copy paste the contents of the file ending in .sql to pgadmin > run
Make sure the imported table is correct. Refresh the schema for the table to appear in left browser window.

Dataset

A unicorn company is a private startup company that has a valuation above
$1 billion. This dataset contains companies that were classified as unicorns in early April 2022, so startups that are no longer classified as unicorns due to IPO reasons or mergers such as Gojek are not included in this dataset. List of companies classified as unicorns at this point (November 2022) it might have been different.

unicorn_companies (you can click the link)

unicorn_dates (you can click the link)

unicorn_funding (you can click the link)

unicorn_industries (you can click the link)

Task 1

From that dataset, sort continents based on the highest number of companies. Which continents have most unicorns?
Source table: unicorn_companies

This request aims to find out which continent has the most unicorns and the order of continents based on the highest number of companies.

To answer the question above we will use this query :

select
 uc.continent,
 count(distinct uc.company_id) AS total_per_country
from unicorn_companies uc
 group by uc.continent
 order by total_per_country desc

The SELECT statement is used to select data on a column. In this case select uc.continent coloumn.
The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.
FROM statement means that we’re gonna use a specific table (or tables) on this query. in this case, we use the unicorn_companies table.
GROUP BY statement means the aggregate functions that we use on our query will be grouped by specific columns. We use GROUP BY to group our aggregate function on the uc.continent column
ORDER BY keyword means that we order our output table based on values on specific columns. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

From the output above, it can be answered that continents based on the highest number of companies is North America, Asia, Europe, South America, Oceania, Africa. And the continent that has the most unicorns is North America.

Task 2

Which countries have more than 100 unicorns? (Show amount)
Source table: unicorn_companies

This request aims to find out which countries have more than 100 unicorns.

To answer the question above we will use this query :

select
 uc.country,
 count(distinct uc.company_id) AS total_per_country
from unicorn_companies uc
 group by uc.country
 having count(distinct uc.company_id) >100

he SELECT statement is used to select data on a column. In this case select uc.country coloumn.
The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_companies table.
GROUP BY statement means the aggregate functions that we use on our query will be grouped by specific columns. In this case we use group by uc.country
HAVING clause is similar to the WHERE clause, except this clause can only be used on aggregate functions. HAVING count(distinct uc.company_id)> 100 condition means that we want the column uc.company_id (alias for our COUNT(Distinct product_id) column) to have more than 100.

Output

The output of the query is:

From the output above, it can be answered that the countries that has more than 100 unicorns are China and United States.

Task 3

Which industry is the largest among unicorn companies based on total funding? What is the average valuation?
Source table: unicorn_industries, unicorn_funding

This request aims to find out which industry is the largest among unicorn companies based on total funding and what is the average valuation.

To answer the question above we will use this query :

select
 ui.industry,
 sum(uf.funding) AS total_funding,
 round(avg(uf.valuation),0) AS avg_valuation
from unicorn_industries ui
inner join unicorn_funding uf
 on ui.company_id = uf.company_id
group by 1
order by 2 desc

SELECT the ui.industry column
SELECT the total sum of uf.funding column values (we round the sum results so we don’t see decimal numbers). We name this column “avg_valuation”.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_industries table with alias “ui” as one of our tables in this query.
(INNER) JOIN the unicorn_industries table with the unicorn_funding table with alias “uf” (because we need the gender column from the unicorn_funding table) on the company_id column from the unicorn_industries table with the company_id column from the unicron_funding table.
We use GROUP BY to group our aggregate function
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

From the output above, it can be answered that the largest among unicorn companies based on their total funding is Fintech. And the average valuation is 3937500000.

Task 4

Based on this dataset, for industry answer number 3 what is the number of companies who joins as a unicorn every year in the 2016–2022 range?
Source table: unicorn_companies, unicorn_industries, unicorn_dates

This request aims to find out the number of companies who joins as a unicorn every year in the 2016–2022 range.

To answer the question above we will use this query :

select
 extract(year from ud.date_joined) AS year_joined,
 count(distinct uc.company_id) AS total_company
from unicorn_companies uc
inner join unicorn_industries ui
 on uc.company_id = ui.company_id
inner join unicorn_dates ud
 on uc.company_id = ud.company_id
 and ui.industry = 'Fintech'
 and extract(year from ud.date_joined) between 2016 and 2022
 group by 1
 order by 1 desc

SELECT the ui.industry column
SELECT the total sum of uf.funding column values (we round the sum results so we don’t see decimal numbers). We name this column “avg_valuation”.
SQL EXTRACT returns an exact numeric value. For second, it also includes fractions. EXTRACT can only get single fields. To extract the full date (year, month, day) or time (hour, minute, second) from a timestamp, cast can be used.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_companies table with alias “uc” as one of our tables in this query.
(INNER) JOIN the unicorn_companies table with the unicorn_industries table with alias “ui” (because we need the gender column from the unicorn_industries table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_industries table.
(INNER) JOIN the unicorn_companies table with the unicorn_dates table with alias “ud” (because we need the gender column from the unicorn_dates table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_dates table.
AND use to combines multiple conditions using logical operators
We use GROUP BY to group our aggregate function.
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

From the output above, it can be answered that the number of companies who joins as a unicorn every year in the 2016–2022 range are
Tahun 2022 = 31
Tahun 2021 = 138
Tahun 2020 = 15
Tahun 2019 = 20
Tahun 2018 = 10
Tahun 2017 = 6

Task 5

Show company detail data (company name, city of origin, country and continent of origin) along with the industry and its valuation.
a. Which country is the company with the largest valuation come from and what industry?
b. How about Indonesia? What company has the largest valuation in
Indonesia?
Source table: unicorn_companies, unicorn_industries, unicorn_funding

This request aims to find out This request aims to find out from which country the company with the largest valuation come from and from what industry also the company with the largest valuation in Indonesia.

To answer the question above we will use this query :

select uc.*,
 ui.industry,
 uf.valuation
from unicorn_companies uc
inner join unicorn_industries ui
 on uc.company_id = ui.company_id
inner join unicorn_funding uf
 on uc.company_id = uf.company_id
 order by uf.valuation desc

select uc.*,
 ui.industry,
 uf.valuation
from unicorn_companies uc
inner join unicorn_industries ui
 on uc.company_id = ui.company_id
inner join unicorn_funding uf
 on uc.company_id = uf.company_id
 where country = 'Indonesia'
 order by uf.valuation desc

SELECT the ui.industry column
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_companies table with alias “uc” as one of our tables in this query.
(INNER) JOIN the unicorn_companies table with the unicorn_industries table with alias “ui” (because we need the gender column from the unicorn_industries table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_industries table.
(INNER) JOIN the unicorn_companies table with the unicorn_funding table with alias “uf” (because we need the gender column from the unicorn_funding table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_funding table.
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.
To asnwer the 2nd question we use WHERE. WHERE is a clause to filter records. It is used to extract only those records that match the required conditions that we give. In these WHERE conditions we got : country = ‘Indonesia’. It means we want to select records (or rows) only when the records or rows on the country column are “Indonesia”

Output

The output of the query is:

company has the largest valuation in Indonesia

From the output above, it can be answered that :
a. The company with the largest valuation is Bytedance from China and the industry is Artificial Intellegence.
b. The company that has the largest valuation in Indonesia is J&T Express and the industry is Supply chain, logistics, & delivery

Task 6

How old was the oldest company when it merged into a unicorn company? Which country does the company come from?
Source table: unicorn_companies, unicorn_dates

This request aims to find out the oldest company when it merged into a unicorn company and which country does the company come from.

To answer the question above we will use this query :

select uc.*,
 ud.date_joined,
 ud.year_founded,
 extract(year from ud.date_joined) - ud.year_founded AS company_age
from unicorn_companies uc
inner join unicorn_dates ud
 on uc.company_id = ud.company_id
 order by company_age desc

SELECT the ui.industry column
SQL EXTRACT returns an exact numeric value. For second, it also includes fractions. EXTRACT can only get single fields. To extract the full date (year, month, day) or time (hour, minute, second) from a timestamp, cast can be used.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_companies table with alias “uc” as one of our tables in this query.
(INNER) JOIN the unicorn_companies table with the unicorn_dates table with alias “ud” (because we need the gender column from the unicorn_dates table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_dates table.
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

From the output above, it can be answered that the oldest company when it joined to become a unicorn company was 98 years old and came from Germany.

Task 7

For companies founded between 1960 and 2000 (upper and lower limits
fall into the range), how old was the oldest company when it merged to be a unicorn company (date_joined)? Which country does the company come from?
Source table: unicorn_companies, unicorn_dates

This request aims to find out the oldest company when it merged to be a unicorn company and which country does the company come from

To answer the question above we will use this query :

select uc.*,
 ud.date_joined,
 ud.year_founded,
 extract(year from ud.date_joined) - ud.year_founded AS company_age
from unicorn_companies uc
inner join unicorn_dates ud
 on uc.company_id = ud.company_id
 and ud.year_founded between 1960 and 2000
 order by company_age desc

SELECT the ui.industry column
SQL EXTRACT returns an exact numeric value. For second, it also includes fractions. EXTRACT can only get single fields. To extract the full date (year, month, day) or time (hour, minute, second) from a timestamp, cast can be used.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_companies table with alias “uc” as one of our tables in this query.
(INNER) JOIN the unicorn_companies table with the unicorn_dates table with alias “ud” (because we need the gender column from the unicorn_dates table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_dates table.
AND use to combines multiple conditions using logical operators
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

From the output above, it can be answered that the oldest company when the company merged to become a unicorn company was 37 years old and came from India.

Task 8

a. How many companies are financed by at least one investor with a name on them of ‘ventures’?

b. How many companies are financed by at least one investor with the name:
- Venture
-Capital
- Partners
Source table: unicorn_funding
Hint: Use LIKE and CASE WHEN inside COUNT DISTINCT

This request aims to find out how many companies are financed by at least one investor with the name: Venture, Capital, Partners

To answer the question above we will use this query :

select 
 count(distinct company_id) AS total_company
from unicorn_funding  uf
where lower(select_investors) like '%venture%'

select 
 count(distinct case when lower(select_investors) 
like '%venture%' then company_id end) as investor_venture,
 count(distinct case when lower(select_investors) 
like '%capital%' then company_id end) as investor_capital,
 count(distinct case when lower(select_investors) 
like '%partner%' then company_id end) as investor_partner
from unicorn_funding uf

SELECT the ui.industry column
The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_funding table with alias “uf” as one of our tables in this query.
WHERE is a clause to filter records. It is used to extract only those records that match the required conditions that we give.
% is denotes any character regardless of length.

Output

The output of the query is:

From the output above, it can be answered that :

a. 603 companies financed by at least one investor with the name ‘ventures’.

b. Companies are financed by at least one investor with the name:
- Venture : 603 companies
- Capital : 611 companies
- Partners : 398 companies

Task 9

In Indonesia there are many startups engaged in logistics services. There is
How many logistics startups are unicorns in Asia? How many logistics startups are including unicorns in Indonesia?
Source table: unicorn_companies, unicorn_industries
Hint: Use DISTINCT and CASE WHEN to calculate the total logistics company in
Indonesia

This request aims to find out how many logistics startups are unicorns in Asia and in Indonesia

To answer the question above we will use this query :

select
 count(distinct uc.company_id) AS total_asia,
 count(distinct case when uc.country = 'Indonesia' then uc.company_id end) 
AS total_indonesia
from unicorn_companies uc
inner join unicorn_industries ui
 on uc.company_id = ui.company_id
 where ui.industry = '"Supply chain, logistics, & delivery"' 
and uc.continent = 'Asia'

SELECT the ui.industry column
The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_companies table with alias “uc” as one of our tables in this query.
(INNER) JOIN the unicorn_companies table with the unicorn_industries table with alias “ui” (because we need the gender column from the unicorn_industries table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_industries table.
WHERE is a clause to filter records. It is used to extract only those records that match the required conditions that we give. In these WHERE conditions we got : ui.industries = ‘“Supply chain, logistics, & delivery”’ and uc.continent = ‘Asia’. It means we want to select records (or rows) only when the records or rows on the uc.continent column are “Asia” and rows on the ui.industries coloumn are ‘“Supply chain, logistics, & delivery’”.
AND use to combines multiple conditions using logical operators
We use GROUP BY to group our aggregate function.
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

From the output above, it can be answered that there are 26 logistics startups including unicorns in Asia.
And there are 1 logistics startups that are unicorns in Indonesia.

Task 10 (Bonus)

In Asia there are three countries with the highest number of unicorns. Show amount data unicorns in every industry and country of origin in Asia, with the exception of these three countries. Sort based on industry, number of companies (decreasing), and country of origin.
Source table: unicorn_companies, unicorn_industries
Hint: You can use a combination of CTE/subquery/JOIN/NOT IN

This request aims to find out amount data unicorns in every industry and country of origin in Asia, with the exception of these three countries.

To answer the question above we will use this query :

WITH top_3 AS (
SELECT
uc.country,
COUNT(DISTINCT uc.company_id) AS total_company
FROM unicorn_companies uc
WHERE uc.continent = 'Asia'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 3
)
SELECT
ui.industry,
uc.country,
COUNT(DISTINCT uc.company_id) AS total_company
FROM unicorn_companies uc
INNER JOIN unicorn_industries ui
ON uc.company_id = ui.company_id
LEFT JOIN top_3 t
ON uc.country = t.country
WHERE uc.continent = 'Asia' AND t.country IS NULL
GROUP BY 1,2
ORDER BY 1,3 DESC,2

SELECT the uc.country column
The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_compnies table with alias “uc” as one of our tables in this query.
WHERE is a clause to filter records. It is used to extract only those records that match the required conditions that we give. In these WHERE conditions we got : uc.continent = ‘Asia’ means we want to select records (or rows) only when the records or rows on the uc.continent column are “Asia”.
We use GROUP BY to group our aggregate function.
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.
LIMIT statement restricts how many rows a query returns. You can start a LIMIT statement at a particular row using the offset argument. In this case we use LIMIT 3
SELECT the ui.industry and uc.country column
The COUNT function is used to return the number of rows that we got on the table.
DISTINCT is a statement that if we couple with SELECT (so SELECT DISTINCT) is used to return unique values.
FROM statement means that we’re gonna use a specific table (or tables) on this query. In this case, we use the unicorn_compnies table with alias “uc” as one of our tables in this query.
(INNER) JOIN the unicorn_companies table with the unicorn_industries table with alias “ui” (because we need the gender column from the unicorn_industries table) on the company_id column from the unicorn_companies table with the company_id column from the unicron_industries table.
LEFT (OUTER) JOIN: Returns all rows from the left table, that have matching values on the related columns on the right table
WHERE is a clause to filter records. It is used to extract only those records that match the required conditions that we give. In these WHERE conditions we got : uc.continent = ‘Asia’. It means we want to select records (or rows) only when the records or rows on the uc.continent column are “Asia”
We use GROUP BY to group our aggregate function.
We use ORDER BY to order the output table. We use desc for descending values (big to small order), but If we want ascending values (small to big order) can use asc or don’t put anything after the column because the default of ORDER BY is ascending.

Output

The output of the query is:

This is the end of the explanation regarding the training assignments given by MySkill in Intensive Bootcamp Data Analysis to hone skills in using various queries in PostgreSQL. Thank you, I hope every explanation is useful.

If you want to go to the other parts, go here:

Part 2 : Phyton for Data Analysis
Part 3 : Data Visualitazion with Google Data Studio

Tips : if you want to master your sql skill further, you should check out hackerrank. This website has a lot of quizzes that will hone your skill further. You can also check w3school, geekforgeeks to practice SQL/data analyst. You can also search for other datasets from kaggle.com and make your own data analysis.

MySkill Data Analysis Part 1 : SQL For Data Analysis

Dataset

Task 1

Output

Task 2

Output

Task 3

Output

Task 4

Output

Task 5

Output

Task 6

Output

Task 7

Output

Task 8

Output

Task 9

Output

Task 10 (Bonus)

Output

Written by Restu Ajeng Kinasih