The ultimate guide to predicting future trends and outcomes with Python.

Learn to create accurate predictive models with Python that drive business success.

Asish Biswas
AnalyticSoul
4 min readMay 22, 2024

--

We often find ourselves in a situation where we need to combine information from multiple sources to answer queries. Data modeling is the process of connecting different data structures related to a use-case and visually representing them.

Data modeling includes three main phases: (1) the conceptual data model, (2) the logical data model, and (3) the physical data model. In this lesson, we’ll (re)visit the merging operation that’s often used in the physical data modeling.

Joining or Merging

In database management systems, join or merge operations allow us to combine data from multiple tables based on common fields. Let’s assume we have a retail business, where transaction records come from a POS system and customer details are stored in a CRM (Customer Relationship Management) system. Now, we’d like to perform a report where we want to see quarterly revenue per customer. In such scenarios, we must combine these two systems for a complete view. There are a couple of ways to achieve this in Pandas. Merging is one of them. Let’s see how it’s done.

We’ll start by importing necessary libraries, loading two example datasets ( customer and transactions ) and inspecting them.

import pandas as pd
import matplotlib.pyplot as plt

df_transactions = pd.read_csv('data/mock_transactions.csv', header=0)
df_customers = pd.read_csv('data/mock_customers.csv', header=0)
print('Examining the transaction dataframe')
print('Shape of transaction df', df_transactions.shape)
df_transactions.head()
print('Examining the customer dataframe')
print('Shape of customer df', df_customers.shape)
df_customers.head()

We have created two dataframes (df_customers and df_transactions). We see that the df_transactions has 5 columns and the df_customers has 4. Both dataframes have a common column CustomerID (key column). We will join these two dataframes based on this key column.

Join customer and invoice dataframes

In the above illustration, both tables share a common column CustomerID. The merge operation combines columns from both tables based on this matching column.

df_combined = df_transactions.merge(right=df_customers, how='left', on='CustomerID')
print('Shape of transaction df', df_combined.shape, '\n')
df_combined.head()

We use the Pandas merge() method to combine our datasets based on the common key column CustomerID.

We are using the left join here, meaning that the left dataframe (df_transactions) will be the base and wherever we'll find matching keys from the right dataframe (df_customers), we'll bring those rows. Notice that the resulting dataframe contains eight columns and 20 records (same as df_transactions). This means we brought in three new columns from the df_customers dataframe.

Different kinds of merging

Here are the different kinds of joining/merging available in Pandas merge() function:

  • Left: based on the left dataframe key, columns from the right table are imported.
Left join
  • Right: based on the right dataframe key, columns from the left table are imported.
Right join
  • Inner: merging happens based on the common keys between two dataframes only. Other records from both dataframes are ignored.
Inner join
  • Outer: merge based on the union keys from both dataframes.
  • Cross: merge based on the cartesian product of keys from both dataframes.
Cross join

Entity Relationship Diagram (ERD)

Entity Relationship Diagram (ERD) is the visual representation of the entities, their attributes, and the relationships between them. Here we see the relation between customer and transaction tables as an entity relation diagram (ERD).

Entity Relationship Diagram

Besides the merge() function, Pandas offer two other functions that help us combine datasets. They are concat() and join().

We’ve explored the fascinating world of data modeling, from conceptual design to actual code. In our next chapter, we’ll dive into building models with linear regression. Imagine being able to predict housing prices based on features like square footage, number of bedrooms, and location. Buckle up, because we’re about to unleash the power of machine learning!

Join the community

Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!

--

--