Joins — You Need to Know Them.
You may not like them but they’ll always be around for you.
Where ever you go, data is every where. Look at your car dashboard. It is a dashboard of data. It tells you your speed, mileage and gas in a few simple visuals. Go to a cash register and it tells you how much you have to pay. Go to work and you may be confronted with spreadsheets. Data is everywhere and to appreciate and work with them, you need to understand joins. If you are like me and have worked in this industry, the below may sound familiar to you. :)
Data comes in various sources, and to combine it into something useful requires clear understanding of the various “Joins” between data sets.
For simplicity sake, joins are can be thought of as a collection of common and different information between data sets.
A “connection” is the common key between the data sets. This connection is required among a vast majority of data sets in order to obtain anything insightful between them. The “Name” column between these two data sets acts as a connection between the two. For there to be a “connection”, there has to be a common column or columns between the two data sets.
I hope the below helps to explain this concept a bit further.
I won’t get into the technicalities here, but connecting data sets essentially requires you to understand joins and connections. There are a few things you need to worry about when it comes to connecting data sets — but I’ll write about this in another article.
Let’s get into various join examples.
I have two data sets — one, name and job title of employees in 2007 and another a list of employees in 2020. Each of these have their own rows.
I want to answer to answer 3 questions based on the data sets.
- Who are still working in this company and what jobs are they doing?
- Who are no longer in the company?
- Who are new to the company?
Who are still my working in this company and what jobs are they doing?
Inner join can answer this question. Inner join takes the employees in 2007 and combines it with 2020 using their employee name. The connection here is the “Name” column.
Here is a picture of inner join.
The common rows between them in this case are Peter, Jennifer and Alan.
Peter was a manager in 2007, now he is a CEO. Jennifer was a team lead, now she is a divisional lead. Alan is now a intern in a different sector.
Who are no longer in the company?
This is an example of left anti join. The data set is on the left compared to the right. This is why it is called a “left” anti join. We want to know the employees that do not appear in 2020. Here is an example below.
See those X’s? We can’t find Tom or Lisa in the 2020 list. Looks like Tom and Lisa are no longer in with company.
Who are new to the company?
This is an example of a right anti join. It is a “Right” anti join because the data set we are interested in is on the right hand side. It contains Sarah and Abraham who did not work for the company in 2007. Looks like they are new to the company.
Simple enough right? ☺
There is also the left join — it is the example of the left anti and inner join combined together.
How about the union/append/binding type joins? they are just data sets stacked one on top of another.
Inner Join — Common rows between data sets
Anti Join — Row difference between data sets
Left Join — Common and row difference between data sets together in one view includes only one side
Full Join — Common and row difference between data sets together in one view includes both left and right hand side
Union Join — data sets stacked one on top of another.
TL:DR ? I put this together for you.
In the next article, we will examine some simple ways of doing these joins on Power BI.
Hope you like this article!