Data Analysis with SQL

Tesa Sekar Ayu Islami
10 min readNov 19, 2022

--

I’m a management college student. During college, mostly I learned about business understanding. Then, I wondered how to make a business decision only based on data. Then I got to know data analysis. How can accurate data become insights that are used for decision-making. This is what got me interested in data analysis.

Of course, there are many hardskills that must be learned to become a Data Analyst. The platforms used in data analysis are SQL, Google BigQuery, Python, Microsoft Excel, Tableau, Power BI, Google Data Studio, etc.

As we know, Excel is a platform for processing data. However, Excel cannot process large amounts of data or called big data. SQL is one of the platforms that we use to process big data. At first, I didn’t know anything about SQL, then I join Intensive Bootcamp Data Analysis by MySkill and learned the concepts and uses of SQL.

SQL (Structured Query Language) is a standardized programming language that is used to manage relational databases and perform various operations on the data in them. Below here some SQL commands and aggregate functions that are most commonly used.

SQL Commands and Function

Let’s start the data analysis!

DATA PREPARATION

I’m using PostgreSQL 15 and the database called unicorn with 4 tables that are unicorn_companies, unicorn_dates, unicorn_funding, and unicorn_industries tables.

Above, I created a database by clicking Databases > Create > Database. Then the image below will appear.

Next to ‘Database’, type ‘unicorn’ then click save. When the database ‘unicorn’ has appeared, we will import the tables. Below, click unicorn > Query Tool.

Click the ‘Open File’ then import the data one by one.

It will appear like this then click the play button above or F5 to run. After all tables are imported, check in the unicorn > schemas > public > tables. It must be appear like this.

DATA ANALYSIS

Question 1

Sort the continents based on the largest number of companies. Which continent has the most unicorns?

Solution :

For this question, I will ‘select’ what columns we want to show from the unicorn_companies table, that are continent and an aggregate function of ‘count’ the company_id, don’t forget the distinct. Grouping by the continent and ordering by the count of company_id descending. The query will be like this :

select continent, count(distinct company_id) as total_companies
from unicorn_companies
group by continent
order by count(distinct company_id) desc

The result

Output of Question 1

— Continent that has the most unicorn companies is ‘North America’ with 589 unicorns.

Question 2

Which countries have more than 100 unicorns? (Show the amount)

Solution :

Same as question 1, we just need to change the select to show country and count of company_id, and using another query statement ‘having’ which is useful for filtering data that contains aggregate functions. So, the query will be like this.

select country, count(distinct company_id) as total_companies
from unicorn_companies
group by country
having count(distinct company_id) > 100
order by count(distinct company_id) desc

The result

Output of Question 2

— The countries that have more than 100 unicorns are ‘United States’ with 562 companies and ‘China’ with 173 companies.

Question 3

Which industry is the biggest among unicorn companies based on total funding? What is the valuation average?

Solution :

In this question, we need to show industry, sum of funding, and valuation average columns. because the data comes from different tables, so we will use the ‘left join’ command. And the query will be like this.

select industry, sum(funding) as total_funding,
round(avg(valuation),0) as valuation_average
from unicorn_funding uf left join on unicorn industries ui on
uf.company_id=ui.company_id
group by industry
order by sum(funding) desc
limit 1

The result

Output of Question 3

— Biggest industry among unicorn companies based on total funding is from ‘Fintech’ with the total funding is 107996000000 and the valuation average is 3937500000.

Question 4

Based on this dataset, for the number 3 answer of industry, how many companies join as unicorns each year in the 2016–2022 range?

Solution :

We want to show the year joined from industry in answer number 3. So, we join the unicorn_companies with unicorn_dates then join again with unicorn_industries. Then filtering to 2016–2022 year range. The query will be like this.

select extract(year from date_joined) year_joined,
count(distinct uc.company_id) as total_companies
from unicorn_companies uc inner join unicorn_dates ud on
uc.company_id = ud.company_id
inner join unicorn_industries ui on
uc.company_id = ui.company_id
where industry = 'Fintech' and
extract(year from date_joined) between 2016 and 2022
group by extract(year from date_joined)
order by extract(year from date_joined) desc

The result

Output of Question 4

— This is the result of 2016–2022 year range, after I checked the data, there were no companies that joined unicorn company for ‘Fintech’ in 2016.

Question 5

Q1 : Show company detail data (name of company, city of origin, country and continent of origin) along with industry and its valuation. Which country is the company with the largest valuation from and what is the industry?

Solution :

We will show all of unicorn_companies columns, industry, and valuation so we joined unicorn_companies, unicorn_industry, and unicorn_funding. The query below.

select uc.*, industry, valuation
from unicorn_companies uc inner join unicorn_industries ui on
uc.company_id = ui.company_id
inner join unicorn_funding uf on
ui.company_id = uf.company_id
order by valuation desc
limit 10

The result

Output of Question 5 (1)

— Company with the largest valuation is from ‘China’ and ‘Artificial Intelligence’ industry.

Q2 : How about Indonesia? Which company has the highest valuation in Indonesia?

Solution :

Same as before, the difference is now we add the filter ‘Indonesia’. So, the query will be like this.

Select uc.*, industry, valuation
From unicorn_companies uc inner join unicorn_industries ui on
uc.company_id = ui.company_id
inner join unicorn_funding uf on
ui.company_id = uf.company_id
where country = 'Indonesia'
order by valuation desc

The result

Output of Question 5 (2)

— Company that has the highest valuation in Indonesia is ‘J&T Express’ from ‘Supply chain, logistics, & delivery’ with the valuation 20000000000.

Question 6

How old was the oldest company when it merged to become a unicorn company? Which country does the company come from?

Solution :

We have to show company, country, year_joined from date_joined, year founded, and company age from year joined — year founded. Then joined unicorn_companies with unicorn_dates. The query will be like this.

select uc.company, country, extract(year from date_joined) year_joined,
year_founded,
extract(year from date_joined) - year_founded as company_age
from unicorn_companies uc left join unicorn_dates ud on
uc.company_id = ud.company_id
order by company_age desc
limit 3

The result

Output of Question 6

— The oldest company from unicorn companies is ‘Otto Bock HealthCare’ from ‘Germany’ with 98 years old.

Question 7

For companies founded between 1960 and 2000 (the upper and lower limits fall into the range), how old was the oldest company when the company merged to become a unicorn company (date_joined)? Which country does the company come from?

Solution :

Almost the same as before, the difference is range year founded is 1960–2000. So, the query will be like this.

select year_founded, date_joined,
extract(year from date_joined) - year_founded as company_age,
company, country
from unicorn_companies uc left join unicorn_dates ud on
uc.company_id = ud.company_id
where year_founded between 1960 and 2000
order by extract(year from date_joined) - year_founded desc
limit 1

The result

— The oldest company that founded in range 1960–2000 is ‘Five Star Business Finance’ from ‘India’.

Question 8

Q1 : How many companies are financed by at least one investor bearing the name ‘venture’?

Solution :

We will use ‘like’ command to find company that bearing the name ‘venture’. So, the query will be like the below.

select count(distinct company_id) as total_companies
from unicorn_funding
where select_investors is not null and
lower(select_investors) like '%venture%'

The result

Output of Question 8 (1)

— Total companies financed by at least one investor bearing the name ‘venture’ is 603 companies.

Q2 : How many companies are financed by at least one investor that contains the names :

  • Venture
  • Capital
  • Partner

Solution :

Same as before, we just add the total company that financed by at least one investor that contains venture, capital, and partner. The query below.

Select count(distinct case when lower(select_investors) like '%venture%'
then company_id end) venture,
count(distinct case when lower(select_investors) like '%capital%'
then company_id end) capital,
count(distinct case when lower(select_investors) like '%partner%'
then company_id end) partner
from unicorn_funding
where select_investors is not null

The result

Output of Question 8 (2)

— Total company financed by at least one investor that contains name venture is 603 companies, capital is 611 companies, and partner is 398 companies.

Question 9

In Indonesia, there are many startups engaged in logistics services. How many logistics startups including unicorns are there in Asia? How many logistics startups are unicorns in Indonesia?

Solution :

We want to know how many logistics startups including unicorns in Asia and Indonesia. So, we need to joined unicorn_companies with unicorn_industries and filtering the continent to ‘Asia’ and industry to ‘Supply chain, logistics, & delivery’. We need count distinct of country = ‘Indonesia’ so we use CASE WHEN. The query will be like this.

select count(distinct uc.company_id) total_asia,
count(distinct case when country = 'Indonesia'
then uc.company_id end) total_indonesia
from unicorn_companies uc left join unicorn_industries ui on
uc.company_id = ui.company_id
where continent = 'Asia' and
industry = '"Supply chain, logistics, & delivery"'

The result

Output of Question 9

— The total startup unicorn companies in Asia is 26 companies and in Indonesia is 1 company.

Question 10

In Asia, there are three countries with the highest number of unicorns. Show data on the number of unicorns in each industry and country of origin in Asia, with the exception of these three countries. Sort by industry, number of companies (decreasing), and country of origin.

Solution :

In this question, we will use WITH __ AS function. This function is used to provide an alias in a query. Why we need to use this function now?

Based on the question, we need to find the country with the most unicorn companies except for the top 3 countries with the most unicorn companies. We don’t have data of the top 3 countries with the most unicorn companies so we have to make some calculations first. Therefore, we alias a query with the name top_3. Now, the first query will be like this.

with top_3 as (
select country, count(distinct company_id) as total_companies
from unicorn_companies
where continent = 'Asia'
group by country
order by count(distinct company_id) desc
limit 3
)

After make this function we will make another query and call out this alias.

select industry, country, count(distinct uc.company_id) as total_companies
from unicorn_companies uc left join unicorn_industries ui on
uc.company_id = ui.company_id
where continent = 'Asia' and
country not in (select country from top_3)
group by industry, country
order by industry, count(distinct uc.company_id) desc

Put it together.

with top_3 as (
select country, count(distinct company_id) as total_companies
from unicorn_companies
where continent = 'Asia'
group by country
order by count(distinct company_id) desc
limit 3
)
select industry, country, count(distinct uc.company_id) as total_companies
from unicorn_companies uc left join unicorn_industries ui on
uc.company_id = ui.company_id
where continent = 'Asia' and
country not in (select country from top_3)
group by industry, country
order by industry, count(distinct uc.company_id) de

The result

Output of Question 10

Question 11

The United States, China and India are the three countries with the most number of unicorns. Is there an industry that doesn’t have unicorns coming from India? What is it?

Solution :

In this question, we will use subquery. Subquery is make another query inside a query. We will make subquery ‘ina’ joined with unicorn_industries. The query will be like this.

select distinct ui.industry, ina
from unicorn_industries ui left join (
select distinct industry
from unicorn_companies uc left join unicorn_industries ui on
uc.company_id = ui.company_id
where country = 'India'
) ina
on ui.industry = ina.industry
where ina.industry is null

The result

Output of Question 11

— In industry_india or ina, doesn’t have ‘Artificial intelligence’, ‘Consumer & retail’, ‘Cybersecurity’, and ‘Hardware’ industries.

Question 12

Find the three industries that have the most unicorns in 2019–2021 and display the number of unicorns and their average valuation (in billions) each year.

Solution :

We will use the WITH __ AS function same like question 11. But, we will make 2 alias. So, the query will be like this.

with top_3 as (
select industry, count(distinct ui.company_id) total_companies
from unicorn_industries ui left join unicorn_dates ud on
ui.company_id = ud.company_id
where extract(year from date_joined) in (2019,2020,2021)
group by industry
order by count(distinct ui.company_id) desc
limit 3
),

yearly_rank as (
select industry, extract(year from date_joined) as year_joined,
count(distinct ui.company_id) as total_companies,
avg(valuation)/1000000000 as avg_val_bil
from unicorn_industries ui inner join unicorn_dates ud on
ui.company_id = ud.company_id
inner join unicorn_funding uf on
uf.company_id = ui.company_id
group by industry, extract(year from date_joined)
)

select yearly_rank.*
from yearly_rank inner join top_3 on
yearly_rank.industry = top_3.industry
where yearly_rank.year_joined in (2019,2020,2021)
order by industry, year_joined

The result

Output of Question 12

Yeay!

All of the questions have been answered. After all, it’s not easy to finish this questions. This is my knowledge about SQL, I still have a long way to learn. So, let’s move to Python for Data Analysis.

References :

Peter Loshin. Definition SQL. Accessed at : https://www.techtarget.com/searchdatamanagement/definition/SQL

--

--