SQL for Data Analysis

Joel Hunter
3 min readNov 22, 2022

--

Introduction

SQL (Structured Query Language) is a computer language that used to interact with database, managing the relation that takes such an important part in data analysis.

Over time, data becomes very important in all aspects and environments, for that, by studying SQL, we are able to process and manage data so that data can be stored properly processed.

Exercise

In the learning and understanding of SQL in this “Intensive Data Analysis Bootcamp” bootcamp, MySkill provides several case studies and questions to hone skills in the use of queries. There are several data sources used in this learning. The data source will be imported or executed to store the data into the database. In this case we already have several tables containing the data obtained from the data source earlier.

Next, we are asked to sort continents based on the highest number of companies and which continent has the most unicorns. Thus, we can use this query.

select continent, count(distinct company_id) as total_companies from unicorn_companies  group by continent order by count(company_id) desc

so we get the result like the picture below

From the result above we can see that North America is the continent with the most unicorns. We use count distinct clause to get total companies in every continent so we need to group the result by name of the continent with group by clause.

Moving on to the next issue where we can see how the having clause is implemented. Having is similar with where clause but more powerful where we can only use having to make an aggregate function as a condition for filtering data result. We asked about which countries have more than 100 unicorns? (Show amount). We have to get the total of the companies in every country and make it to the condition to filtering data result, so we use this query and get the result as the pictures below.

select country, count(distinct company_id) as total_companies 
from unicorn_companies group by 1
having count(distinct company_id) > 100 order by 2 desc

and from the result we can see that countries with more than 100 companies are United States and China with United States has the most companies in total 562 companies.

next is how to implement with clause to simplify such a long complex queries especially with lots of joins. we get that through the next issue, now we asked to show data on the number of unicorns in each industry and country of origin in Asia, with the exception of the three countries with highest number of unicorns. And here’s the quersy we can use and the result.

with top_3 as (select country, count(distinct company_id) as total_unicorn from unicorn_companies
where continent = 'Asia'
group by country order by 2 desc limit 3)

select ui.industry, uc.country, count(distinct uc.company_id) as total_company from unicorn_companies uc
inner join unicorn_industries ui on ui.company_id = uc.company_id
where uc.continent = 'Asia' and uc.country not in (select distinct country from top_3)
group by 1,2 order by 1,3 desc, 2

We use with clause to make an exception for three countries with most unicorns, use that again in our main query, put it after the not in clause so filtering data and requests are matched.

Some of the above case studies are some of the overall exercises provided by MySkill. Thanks to MySkill and the mentor who taught SQL on this bootcamp and if there are any suggestions and inputs towards this SQL exploration, I would be very grateful for it.

--

--