Stories by NigarAli on Medium

Encoding Features

NigarAli — Mon, 08 Apr 2024 17:17:32 GMT

Encoding categorical variables is a vital step in preparing data for machine learning tasks. Categorical data consists of non-numeric values such as text labels or categories, so it must be transformed into a numerical format that machine learning algorithms can work with. Several widely used categorical encoding techniques are available, each with its own advantages and drawbacks.

Danny Butvinik

Feature Encoding Tips

Tip 1: Prevent Data Leakage. When converting categorical variables like “cat”, “dog”, and “horse” into numerical ones such as 0, 1, 2, etc., ensure that the encoder is fitted solely on the training data. This prevents data leakage, where information from the test data inadvertently influences the training process. After fitting on the training data, use the same fitted encoder to transform the validation/test data. If the validation/test data contains missing or new categories, handle them by either removing the unseen categories or encoding them as -1 or another arbitrary value.
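
A minimal sketch of this tip, assuming scikit-learn 0.24 or later. The `animal` column and its values are made up for illustration. The encoder is fitted on the training split only, and a category that was not seen during fitting is mapped to -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: "horse" appears only in the test split.
train = pd.DataFrame({"animal": ["cat", "dog", "cat", "dog"]})
test = pd.DataFrame({"animal": ["dog", "horse"]})

# Fit on the training data only; unseen categories are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["animal"]])

train_encoded = encoder.transform(train[["animal"]])
test_encoded = encoder.transform(test[["animal"]])
print(test_encoded.ravel())
```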

Tip 2: Save Your Encoders. After fitting on the training data, save the encoders so they can be reused later to transform validation/test data. A saved encoder also lets you retrieve the learned categories or convert encoded values back to their original categories if necessary, using methods like .inverse_transform.
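
One possible way to do this is with Python's built-in pickle module. The sketch below serializes a fitted encoder to bytes (in practice you would write them to a file), restores it, and uses inverse_transform to recover the original categories. The category values are illustrative.

```python
import pickle
from sklearn.preprocessing import OrdinalEncoder

# Fit on illustrative training categories.
encoder = OrdinalEncoder()
encoder.fit([["cat"], ["dog"], ["horse"]])

# Serialize the fitted encoder; these bytes could be written to disk.
saved = pickle.dumps(encoder)

# Later: restore the encoder and reuse it on new data.
restored = pickle.loads(saved)
codes = restored.transform([["horse"], ["cat"]])
labels = restored.inverse_transform(codes)
print(codes.ravel(), labels.ravel())
```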

1. Label Encoding

LabelEncoder assigns a unique integer to each category. Note that scikit-learn's LabelEncoder is designed for encoding target labels and assigns the integers in sorted (alphabetical) order, so on its own it does not preserve an inherent order or ranking between categories.

For example, take a categorical variable “education_level” with categories like “High School”, “Bachelor’s Degree”, “Master’s Degree”, and “PhD”, where there is a clear order from lowest to highest level of education. LabelEncoder would assign the codes alphabetically (“Bachelor’s Degree” = 0, “High School” = 1, “Master’s Degree” = 2, “PhD” = 3) rather than in order of education. To encode these categories as 0, 1, 2, and 3 from lowest to highest, the order has to be specified explicitly, as in ordinal encoding.

2. Ordinal Encoding

Ordinal encoding is similar to label encoding but allows you to explicitly define the mapping between categories and integer labels. This is especially useful when there is a clear and predefined ordinal relationship. You manually specify the order of categories and map them to integers accordingly.
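
As a sketch of the explicit mapping described above, scikit-learn's OrdinalEncoder accepts a `categories` argument that fixes the order. The DataFrame below is illustrative and reuses the education example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "education_level": ["PhD", "High School", "Master's Degree", "Bachelor's Degree"]
})

# Explicit order, from lowest to highest level of education.
order = ["High School", "Bachelor's Degree", "Master's Degree", "PhD"]
encoder = OrdinalEncoder(categories=[order])
df["education_encoded"] = encoder.fit_transform(df[["education_level"]]).ravel()
print(df)
```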

3. One-Hot/Dummy Encoding

One-hot encoding can be done with OneHotEncoder from the sklearn package or with the pandas get_dummies method. For a categorical feature with many categories or levels, one-hot encoding is not a great choice from a machine learning perspective, mainly because of the large number of dimensions it adds. The increase in dimensionality leads to the curse of dimensionality and can introduce multicollinearity among the dummy columns.
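
A small sketch with pandas get_dummies on an illustrative column. The second call uses drop_first=True, which drops one redundant column and is a common way to avoid perfect multicollinearity in linear models.

```python
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "dog", "horse", "cat"]})

# One binary column per category.
one_hot = pd.get_dummies(df["animal"], prefix="animal")

# Dummy encoding: drop the first category to remove the redundant column.
dummies = pd.get_dummies(df["animal"], prefix="animal", drop_first=True)
print(list(one_hot.columns), list(dummies.columns))
```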

Limit to the x most frequent categories:

One-hot encoding an entire nominal categorical variable with many levels increases the dimensionality. A better choice is to take the top x most frequent categories and create a dummy or one-hot encoding only for them. The less frequent categories are considered less influential and are either left out or grouped into a single category, depending on the specific encoding strategy used. This simplifies the dataset while retaining the most relevant information.
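
The idea can be sketched with pandas as follows. The cutoff `top_x` and the label "Other" are assumptions chosen for illustration.

```python
import pandas as pd

colors = pd.Series(["red", "red", "red", "blue", "blue", "green", "pink"], name="color")

top_x = 2  # hypothetical cutoff
top_categories = colors.value_counts().nlargest(top_x).index

# Keep the most frequent categories and group the rest into "Other".
limited = colors.where(colors.isin(top_categories), "Other")
encoded = pd.get_dummies(limited, prefix="color")
print(list(encoded.columns))
```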

4. Frequency/Count Encoding

Frequency encoding (also called count encoding) transforms a categorical variable into a numerical one by replacing each category with the number of times it appears in the dataset. It can be useful for nominal features and is used heavily in Kaggle competitions. This encoding technique can be useful when there’s a correlation between the frequency of a category and the target variable.

Instead of assigning arbitrary numerical values or creating binary columns like in one-hot encoding, frequency encoding utilizes the actual frequency or count of each category.

Here’s how frequency encoding works:

Calculate Frequencies: For each category in the categorical variable, count how many times it appears in the dataset.
Replace Categories: Replace each category with its frequency count. So, instead of having the original category names, you have numerical values representing how often each category appears.
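
The two steps above can be sketched with pandas. The `city` column is illustrative, the counts are learned from the training data only, and mapping unseen categories to 0 is one possible choice rather than a fixed rule.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Baku", "Baku", "Baku", "Zurich", "Zurich", "Oslo"]})
test = pd.DataFrame({"city": ["Zurich", "Lima"]})

# Step 1: count each category in the training data only.
counts = train["city"].value_counts()

# Step 2: replace each category with its count; unseen categories get 0.
train["city_count"] = train["city"].map(counts)
test["city_count"] = test["city"].map(counts).fillna(0).astype(int)
print(train, test, sep="\n")
```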

Frequency encoding is particularly useful in scenarios where the frequency of categories holds valuable information for the prediction task. It can be beneficial when:

Handling High Cardinality: When dealing with categorical variables that have a large number of unique categories (high cardinality), one-hot encoding may result in too many columns, leading to the curse of dimensionality. Frequency encoding can help reduce dimensionality by replacing categories with their frequencies.

Preserving Information: Frequency encoding preserves the information about the frequency distribution of categories, which can be relevant for certain machine learning algorithms. For example, if the frequency of occurrence of a category correlates with the target variable, frequency encoding can capture this relationship more directly compared to other encoding methods.

It reduces dimensionality compared to one-hot encoding. Count encoding retains the original information about the frequency of each category in the dataset.

Drawbacks:

Loss of Category Label Information: Frequency encoding replaces category labels with numerical counts, which means the original category labels are lost. This may not be suitable if the category labels themselves hold important semantic meaning.

Sensitive to Outliers: Frequency encoding can be sensitive to categories with extremely high (or extremely low) frequencies. Such outlying counts may disproportionately influence the encoded numerical values, potentially affecting model performance.

While count encoding preserves frequency information, it discards any other meaningful information or relationships that may exist between categories. Count encoding can be sensitive to data imbalances.

When to use: This encoding technique can be useful when there’s a correlation between the frequency of a category and the target variable. It is also applicable to categorical features with many categories. The count encoder should be fit only on the training dataset, and the fitted object should then be used to transform the test and out-of-time (OOT) datasets.

5. Target Encoding

Target encoding, also known as mean encoding, involves replacing each category with the mean (or some other statistic) of the target variable for that category. Here’s how target encoding works:

Calculate the mean of the target variable for each category.
Replace the category with its corresponding mean value.

There are two ways to implement target encoding:

Mean Encoding: The encoded values are the mean of the target values for each category, with smoothing applied.

Leave-One-Out Encoding: The encoded values are the mean of the target values for the category, excluding the data point being encoded.
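
Both variants can be sketched with pandas. The data and the smoothing weight `m` are assumptions for illustration, and the smoothing formula shown (shrinking each category mean toward the global mean) is one common choice among several.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 0, 0, 1],
})

global_mean = df["target"].mean()
agg = df.groupby("city")["target"].agg(["sum", "count"])

# Smoothed mean encoding: shrink each category mean toward the global mean.
m = 2  # hypothetical smoothing weight
smoothed = (agg["sum"] + m * global_mean) / (agg["count"] + m)
df["city_mean_enc"] = df["city"].map(smoothed)

# Leave-one-out: category mean computed without the current row's target.
cat_sum = df["city"].map(agg["sum"])
cat_count = df["city"].map(agg["count"])
loo = (cat_sum - df["target"]) / (cat_count - 1)
# Categories with a single row have no other rows, so fall back to the global mean.
df["city_loo_enc"] = loo.fillna(global_mean)
print(df)
```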

Pros:

Supports High Cardinality: Target encoding can be used when there are many different categories, and it works better when there are multiple data samples for each category.

Cons:

Target Leakage: Even with smoothing, this may result in target leakage and overfitting. Leave-one-out encoding and introducing Gaussian noise during encoding can be used to address the overfitting problem.

6. Rare Encoding

In rare encoding, a threshold is chosen (either as a ratio or as a frequency), and categories whose frequency falls below this threshold are grouped together instead of being represented individually. This reduces the number of classes by combining many categories that are poorly represented and unlikely to be observed in the data. All of these infrequent categories are placed in a single group labelled “Rare”, which is why the process is called rare encoding.
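
A minimal pandas sketch of this grouping, with a hypothetical threshold of 15% of the rows:

```python
import pandas as pd

s = pd.Series(["a"] * 6 + ["b"] * 3 + ["c", "d"], name="category")

threshold = 0.15  # hypothetical minimum share of rows
freq = s.value_counts(normalize=True)
rare_labels = freq[freq < threshold].index

# Replace every category below the threshold with the single label "Rare".
rare_encoded = s.where(~s.isin(rare_labels), "Rare")
print(rare_encoded.value_counts())
```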

There are various encoding techniques used for nominal and ordinal variables.

Reference:

Feature Encoding Techniques in Machine Learning with Python Implementation

https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/?source=post_page-----616e87bf8c74--------------------------------

How to handle outliers?

NigarAli — Mon, 08 Apr 2024 13:23:40 GMT

Outliers are data points that deviate significantly from the rest of the data. They can affect the accuracy of predictions and should be treated appropriately, either by removing them or by transforming them with a suitable technique.

There are several reasons to handle outliers in data analysis:

For linear models: Outliers can significantly affect the parameter estimation process in linear regression models. Therefore, it’s often advisable to handle outliers before fitting linear models.
When outliers are due to errors: If outliers are likely to be due to data entry errors or measurement errors, it’s generally a good idea to handle them to prevent these errors from unduly influencing the analysis.
When outliers affect distribution assumptions: If the underlying assumptions of your statistical test or model are violated due to the presence of outliers, it’s important to address them. For example, if you’re using parametric statistical tests that assume normality, outliers can affect the validity of your results.

In some situations, outliers should not be removed:

When they represent genuine data: Outliers may represent genuine variation in the data. In such cases, removing them could lead to biased results. It’s essential to understand the domain and context of the data to determine whether outliers are valid or not.
For robust models: Some models are inherently robust to outliers. For instance, decision trees, random forests, and gradient boosting models are less sensitive to outliers because they partition the feature space and make decisions based on regions rather than the entire dataset.
When data is scarce: In situations where data is limited, removing outliers may lead to loss of valuable information. In such cases, it might be better to use robust statistical techniques that are less influenced by outliers.
When outliers are of interest: Sometimes, outliers themselves are the subject of the analysis. In these cases, removing them would defeat the purpose of the analysis. For example, in fraud detection and anomaly detection tasks, outliers play a critical role in identifying unusual or suspicious behavior. Therefore, removing outliers in fraud detection scenarios is generally not advisable.
In cases of exploratory data analysis (EDA): During EDA, it’s often useful to identify outliers and understand their impact on the data distribution and relationships. Removing outliers prematurely can obscure important insights.

Techniques to detect outliers

There are several techniques that can be used to detect and treat outliers. Single-feature (univariate) methods analyze one feature at a time, while multivariate methods consider interactions between multiple features simultaneously.

Univariate outliers: outliers in data that has only one dimension

1. Visual Inspection: Visual inspection involves plotting the data and identifying any data points that are far away from the rest of the data. Box plots, scatter plots, and histograms can be useful for identifying outliers visually.
2. Z-score Method: The Z-score method involves calculating the standard score for each data point. Data points whose absolute standard score exceeds a threshold value are considered outliers. The threshold is typically set to 3, meaning that any data point with an absolute Z-score greater than 3 is considered an outlier.

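import numpy as np
from scipy import stats

threshold = 3  # typical cutoff for the absolute Z-score
# df is assumed to be a DataFrame with numeric columns only
z_scores = stats.zscore(df.to_numpy(), axis=0)
abs_z_scores = np.abs(z_scores)
# Keep the rows where at least one column exceeds the threshold
outliers = df[(abs_z_scores > threshold).any(axis=1)]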

3. Modified Z-Score Method: Similar to the Z-score method, but uses the median and median absolute deviation (MAD) instead of the mean and standard deviation. It is robust to outliers and suitable for datasets with non-normal distributions or extreme values.
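
A sketch of the modified Z-score with NumPy. The constant 0.6745 and the cutoff of 3.5 are the values commonly quoted for this method, not something fixed by the text above, and the data is illustrative. The function assumes the MAD is not zero.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-scores based on the median and the MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # 0.6745 makes the MAD comparable to the standard deviation for normal data.
    return 0.6745 * (values - median) / mad

data = np.array([10, 12, 11, 13, 12, 11, 95])
scores = modified_z_scores(data)
outliers = data[np.abs(scores) > 3.5]  # commonly cited cutoff
print(outliers)
```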

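4. Interquartile Range (IQR) Method:

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    # Quartiles of the column (by default the 25th and 75th percentiles)
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquartile_range = quartile3 - quartile1
    # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
    up_limit = quartile3 + 1.5 * interquartile_range
    low_limit = quartile1 - 1.5 * interquartile_range
    return low_limit, up_limit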

5. Grubbs’ Test:

Grubbs’ test is a statistical test to detect outliers in univariate datasets based on the maximum deviation from the mean. It provides a formal hypothesis test for identifying outliers, suitable for scenarios where statistical significance is important.
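
A hedged sketch of the two-sided Grubbs' test for a single outlier, using the t-distribution from SciPy for the critical value. The function name `grubbs_test` and the sample data are hypothetical, and the test assumes the data is approximately normal.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for one outlier (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, std = x.mean(), x.std(ddof=1)
    # Test statistic: largest absolute deviation from the mean, in standard deviations.
    idx = np.argmax(np.abs(x - mean))
    g = abs(x[idx] - mean) / std
    # Critical value based on the t-distribution with n - 2 degrees of freedom.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return x[idx], g, g_crit, bool(g > g_crit)

suspect, g, g_crit, is_outlier = grubbs_test([10, 11, 9, 10, 12, 10, 11, 50])
print(suspect, is_outlier)
```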

6. Tukey’s Fences: Similar to the IQR method, but the multiplier k that sets the fences is adjustable: k = 1.5 is the usual choice for flagging outliers, while k = 3 flags only values that are “far out”. Suitable for datasets with skewed distributions or those containing extreme values.

Multivariate Methods:

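1. Density-Based Methods (e.g., DBSCAN)

DBSCAN is a clustering algorithm that can be used for outlier detection. Data points that do not belong to any cluster are considered outliers.

from sklearn.cluster import DBSCAN

# eps and min_samples must be defined beforehand and tuned for the data
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
dbscan.fit(data)
# DBSCAN labels noise points with -1 (data is assumed to be a NumPy array)
outliers = data[dbscan.labels_ == -1]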

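2. Local Outlier Factor (LOF)

LOF measures the local density deviation of a data point with respect to its neighbors. Outliers have significantly lower densities compared to their neighbors.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20)
clf.fit_predict(df)  # returns -1 for outliers and 1 for inliers
df_scores = clf.negative_outlier_factor_
np.sort(df_scores)[0:5]  # the five lowest scores, i.e. the strongest outliers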

3. Isolation Forest: This method isolates observations by recursively partitioning the data with random splits and identifies anomalies as the instances that require fewer splits to isolate.

from sklearn.ensemble import IsolationForest

def isolation_forest_scores(data, n_estimators=100, contamination=0.1):
    # The behaviour argument was removed in scikit-learn 0.24, so it is not passed here
    model = IsolationForest(n_estimators=n_estimators, contamination=contamination)
    model.fit(data)
    # decision_function averages over the trees; negative scores indicate outliers
    anomaly_scores = model.decision_function(data)
    return anomaly_scores

4. Mahalanobis Distance

Applies to multivariate datasets, measuring the distance of each data point from the centroid of the data in multiple dimensions, taking into account the covariance structure.
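
A sketch with NumPy and SciPy on synthetic data with one planted outlier. Comparing the squared distance against a chi-square quantile is a common convention that assumes roughly multivariate normal data; the 0.999 level is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
data = np.vstack([data, [[8.0, -8.0]]])  # planted outlier at row 200

center = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = data - center
# Squared Mahalanobis distance of every row from the centroid.
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag rows whose squared distance exceeds the chi-square cutoff.
cutoff = chi2.ppf(0.999, df=data.shape[1])
outlier_idx = np.where(d2 > cutoff)[0]
print(outlier_idx)
```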

5. Robust Random Cut Forest

Suitable for high-dimensional datasets that contain outliers. It’s robust to outliers and efficient for processing large-scale datasets with high-dimensional features.

6. Cluster-Based Methods: Utilize multivariate data to partition the dataset into clusters, identifying outliers as data points that do not belong to any cluster or form small clusters.

Techniques to handle outliers

1. Removing outliers: Removing outliers involves deleting the data points that are identified as outliers. However, this can result in a loss of data and can also affect the accuracy of the analysis.
2. Winsorization: Winsorization is a technique for handling outliers by replacing extreme values with the nearest values that are not considered outliers. Unlike trimming, which discards the extreme values, winsorization keeps every data point and only caps its value.

from scipy.stats.mstats import winsorize
import numpy as np

# Example dataset
data = np.array([10, 20, 30, 40, 500])

# Winsorization: replace the extreme values with the nearest remaining values
# With only five values, limits of 0.05 would change nothing, so 20% per tail is used here
winsorized_data = winsorize(data, limits=[0.2, 0.2])  # Two-sided: replaces one value in each tail

print("Original data:", data)
print("Winsorized data:", winsorized_data)

3. Data transformation: Transformation involves applying a function to the data, such as a logarithmic or square root transformation, to reduce the impact of outliers on the analysis. Log transformation is particularly useful for highly skewed data.
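
A short sketch of both transformations with NumPy, reusing the small example dataset from the winsorization snippet. log1p is used here as an assumption so that zero values would also be handled.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 500], dtype=float)

# log1p computes log(1 + x), which is also defined at x = 0.
log_data = np.log1p(data)
sqrt_data = np.sqrt(data)

# The extreme value sits much closer to the median after the transformation.
print(data.max() / np.median(data), log_data.max() / np.median(log_data))
```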

4. Robust statistical methods: Robust statistical methods are particularly useful when dealing with datasets containing outliers, heavy-tailed distributions, or other forms of non-normality. They provide more reliable estimates and inferences in such situations and help mitigate the influence of extreme values on statistical analyses and modeling. However, it’s important to note that robust methods may have lower efficiency (greater variance) when the data conform well to the assumptions of traditional statistical methods.

Robust Measures of Central Tendency

Robust Measures of Dispersion

Percentile Bootstrap

Trimmed Mean

M-estimators

RANSAC(Random Sample Consensus)

Huber Loss

Winsorizing

MAD(Median Absolute Deviation)

Tukey’s Fences

Biweight Midvariance
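
A few of the measures listed above can be sketched with NumPy and SciPy (median_abs_deviation requires SciPy 1.5 or later). The data is the same illustrative array used earlier, and the 20% trimming proportion is an arbitrary choice.

```python
import numpy as np
from scipy import stats

data = np.array([10, 20, 30, 40, 500], dtype=float)

mean = data.mean()                      # pulled up by the extreme value
median = np.median(data)                # robust measure of central tendency
trimmed = stats.trim_mean(data, 0.2)    # mean after cutting 20% from each tail
mad = stats.median_abs_deviation(data)  # robust measure of dispersion
print(mean, median, trimmed, mad)
```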

Thanks for reading. You can follow me on LinkedIn.

My Experience with PwC Switzerland’s Job Simulation Program on Forage Platform

NigarAli — Sun, 07 Apr 2024 18:06:45 GMT

I recently completed the PwC Switzerland job simulation program on Forage, an exceptional platform for honing skills and gaining practical experience. Forage offers Virtual Work Experience Programs endorsed by top companies, providing opportunities for everyone, including international students, regardless of visa or work status. These programs comprise tasks and resources crafted to replicate real-world career scenarios.

In essence, Forage’s Virtual Work Experience Programs are primarily aimed at assisting students. By participating in a Forage Virtual Internship, you can gain insights into diverse career paths and develop the skills and confidence necessary for success in transitioning from academia to the professional world.

My journey involved immersing myself in hands-on experience in data analytics and data visualizations using Power BI. Now, I’m excited to share the steps of the program and my insights gained along the way.

First Task: Call Center

The initial task involved developing dashboards tailored for the call center manager, who lacked visibility into current trends. The objective was to create a dashboard enabling the manager to comprehend today’s trends effectively. Several key performance indicators (KPIs) and metrics needed analysis:

Overall customer satisfaction

Overall calls answered/abandoned

Calls by time

Average speed of answer

Agent’s performance quadrant -> average handle time (talk duration) vs calls answered

My Dashboard:

Key Insights:

In the analyzed dataset of call center activities, a total of 5000 calls were recorded, out of which 4054 were answered, leaving 900 unanswered calls. This resulted in 3640 resolved issues and 1354 unresolved issues. It was observed that the distribution of topics among resolved and unresolved calls remained roughly consistent.

Topic Distribution: The topic distribution across resolved and unresolved calls appeared to be similar. However, it was noted that the least unresolved calls were related to admin support, followed by streaming.
Agent Performance:
Highest Rated Agent: Dan received the highest rating among agents, while Diana received the lowest rating.
Most Answered Calls: Jim handled the highest number of answered calls, while Joe and Stewart handled the lowest.
Highest Resolved Calls: Jim and Dan achieved the highest number of resolved calls, whereas Stewart and Joe had the lowest resolved calls.
Active Days: Monday through Saturday emerged as the highest active days of the week, indicating increased call volume and engagement during these days.

Recommendations:

Agent Training and Support: Provide additional training and support to agents with lower performance ratings to improve their effectiveness in handling calls.
Topic-specific Analysis: Conduct a deeper analysis of calls related to admin support and streaming to identify potential areas for improvement or optimization.
Workforce Management: Allocate resources efficiently based on the observed trends in call volume on different days of the week to ensure adequate coverage during peak periods.
Performance Recognition: Implement a recognition program to acknowledge and incentivize agents who consistently achieve high ratings and resolve a significant number of calls.

The Second Task: Customer Retention

The subsequent task entailed designing a dashboard specifically focused on customer retention for the call center manager. This dashboard aimed to provide insights and recommendations concerning customer retention strategies.
The next step involved the following:

Identifying appropriate Key Performance Indicators (KPIs) relevant to customer retention.
Developing a comprehensive dashboard tailored for the retention manager, highlighting the selected KPIs.
Composing a concise email to the engagement partner, outlining the findings from the dashboard analysis and offering recommendations for necessary changes.

My Dashboard:

Key Insights

The influence of streaming services, such as streaming TV and streaming movies, on customer churn rates appears to be minimal. However, there is a significant proportion of customers subscribing to our phone services who are churning. Therefore, it is imperative to investigate our phone service offerings to enhance customer retention. A considerable number of customers have opted for month-to-month contracts with us. Analysis suggests that factors such as marital status and customer dependency have limited impact on customer retention.

Third Task: Diversity and Inclusion

Handling Missing Data

NigarAli — Sun, 07 Apr 2024 15:15:45 GMT

Handling missing data is a crucial step in preparing data for machine learning models.

Types of Missing Data

When evaluating how missing data might affect your findings, it’s crucial to understand why the data is missing. Missing data can generally be categorized into three groups:

Missing Completely at Random (MCAR)

The missing values are randomly scattered in the dataset and are typically far fewer than with other types of missingness. The missingness has no correlation with other variables and can be due to technical or human error during data entry.

Missing at Random (MAR)

Broader than MCAR. The missingness is random only within specific groups of the data. For example, certain students missing their classes more during winter, the elderly leaving the Mobile OS field blank because they don’t know how to use a phone, etc. A clear distinction from MCAR is that MAR will always have some relationship with observed values.

Missing Not at Random (MNAR)

The final and most difficult case of missingness. The values may look randomly scattered in the dataset, but the large number of missing values suggests that some unobserved factor is affecting the missingness. However deep you search, you will not find a relationship with existing features. This type will always have some systematic relationship with unobserved factors, like leaving the IQ field blank because of embarrassment, or leaving the satisfaction score blank because customers could not fit their satisfaction into the given scores.

One possible way to handle this problem is to get rid of the observations that have missing data. However, you risk losing valuable information. A better strategy is to impute the missing values; in other words, to infer those missing values from the existing part of the data.

Deleting observations with missing values:

Assumptions:

Data is Missing at Random (MAR)

Advantages:

Easy to implement
No data manipulation required

Limitations:

Deleted data can be informative
Can lead to the deletion of a large part of the data
Can create a bias in the dataset if a large amount of data is deleted

When to use:

Data is MAR (Missing at Random)
Good for mixed, numerical, and categorical data
Missing data is not more than 5%-6% of the dataset
The deleted data does not contain much information and removing it will not bias the dataset

Imputation Methods

It may not always be logical to apply the same operations to columns with a small number of missing values and columns with a large number of missing values. For example, a more accurate dataset can be obtained by choosing a different treatment for each column depending on its proportion of missing values.

missing_columns_info=missing_percent(df)
missing_columns_info

Columns with more than 70% missing values are dropped.
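
The `missing_percent` helper used above is not defined in this article. The sketch below is a hypothetical version of it, together with the 70% rule, on a small illustrative DataFrame.

```python
import numpy as np
import pandas as pd

def missing_percent(df):
    """Hypothetical helper: percentage of missing values per column, highest first."""
    percent = df.isna().mean() * 100
    return percent[percent > 0].sort_values(ascending=False)

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

missing_columns_info = missing_percent(df)
# Drop the columns with more than 70% missing values.
to_drop = missing_columns_info[missing_columns_info > 70].index
df_reduced = df.drop(columns=to_drop)
print(missing_columns_info, list(df_reduced.columns))
```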

Imputing missing values:

There are many imputation methods for replacing the missing values. You can use different Python libraries such as pandas and scikit-learn to do this. Let’s go through some of the ways of replacing the missing values.

Replacing with an arbitrary value:

If you can make an educated guess about the missing value, then you can replace it with some arbitrary value.

Assumptions:

Data is not Missing At Random
The missing data is imputed with an arbitrary value that is not part of the dataset or Mean/Median/Mode of data

Advantages:

Easy to implement

Disadvantages:

Can distort original variable distribution
Arbitrary values can create outliers
Extra caution required in selecting the Arbitrary value

When to use:

1. When data is not MAR (Missing at Random)

2. Suitable for all data types

Mean/Median Imputation:

Numeric features like ‘Age’ can be imputed with their mean or median using the fillna method. The choice between mean and median depends on the distribution of the data in the column. If the column follows an approximately normal distribution, we can use the mean to fill in missing values. However, if the distribution is not normal, for example skewed or containing outliers, we should opt for the median instead.

Replacing with the mode:

Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’
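
The mean, median, and mode imputations described above can be sketched with fillna. The small DataFrame is illustrative and reuses the ‘Age’ and ‘Gender’ column names from the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 100.0],
    "Gender": ["F", "M", "F", None],
})

# Numeric column: fill with the mean or the median.
df["Age_mean"] = df["Age"].fillna(df["Age"].mean())
df["Age_median"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the mode (most frequent value).
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
print(df)
```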

Backward Fill(or Backfill) and Forward Fill (or Ffill) Imputation:

These methods are often used in time-series data or datasets where observations are ordered in some sequence. Forward fill propagates the last observed value forward, while backward fill uses the next observed value. Both assume that neighboring observations are similar, so careful consideration should be given before applying them, especially when the sequential relationship between observations is not clear or when missing values occur in clusters.
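
A minimal sketch with the pandas ffill and bfill methods on an illustrative daily series. Note that a missing value at the end of the series cannot be backward filled.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = s.ffill()   # carry the last observed value forward
backward = s.bfill()  # use the next observed value
print(forward.tolist(), backward.tolist())
```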

Interpolation:

Estimate missing values based on the values of other data points using interpolation techniques like linear or polynomial interpolation. Pandas’ interpolate method can be used to replace the missing values with different interpolation methods like ‘polynomial,’ ‘linear,’ and ‘quadratic.’ This is more sophisticated than simple imputation but may be sensitive to outliers.
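
A short sketch of linear interpolation on an illustrative series. Methods such as ‘polynomial’ additionally need an `order` argument and SciPy installed, so only the linear case is shown here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Missing values are placed on the straight line between their observed neighbors.
linear = s.interpolate(method="linear")
print(linear.tolist())
```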

Most imputation techniques can cause bias. Simple imputation can result in an underestimation of standard errors. As the amount of missing data increases, simple imputation methods should be avoided.

Univariate approach vs Multivariate approach

We can impute missing values using the scikit-learn library by building a model that predicts the missing values of a variable from other variables; this is known as regression imputation.

In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.

In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values considering the multivariate approach. Using KNNImputer or IterativeImputer classes.
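
As a sketch of the difference, the snippet below fills the same missing value once with SimpleImputer (univariate, column median) and once with KNNImputer (multivariate, nearest rows). The array is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])

# Univariate: the column is filled from its own statistics.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Multivariate: the value is averaged over the rows most similar on the other feature.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_median[1, 1], X_knn[1, 1])
```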

K-Nearest Neighbour(KNN) Imputation

One commonly adopted strategy for addressing missing data is to employ a predictive model to estimate the absent values.This technique entails developing a separate model for each input variable containing missing entries.

The default value of K is 5. Although there is no definitive method for determining the ideal value of K, a commonly used heuristic suggests that the optimal K is often close to the square root of the total number of samples. To identify the most suitable K, an error plot or accuracy plot is commonly used.

With this imputer, the problem is choosing the correct value for K. As you cannot directly use GridSearch to tune the imputer on its own, we can take a visual approach for comparison:

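import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer

# diabetes is assumed to be a numeric DataFrame that was loaded earlier
n_neighbors = [2, 3, 5, 7]

fig, ax = plt.subplots(figsize=(16, 8))
# Plot the original distribution
sns.kdeplot(diabetes.SkinThickness, label="Original Distribution")
for k in n_neighbors:
    knn_imp = KNNImputer(n_neighbors=k)
    # Work on a copy so the original data stays untouched
    diabetes_knn_imputed = diabetes.copy(deep=True)
    diabetes_knn_imputed.loc[:, :] = knn_imp.fit_transform(diabetes)
    sns.kdeplot(diabetes_knn_imputed.SkinThickness, label=f"Imputed Dist with k={k}")

plt.legend();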

The KNN algorithm can work with a variety of distance metrics, for example Chebyshev, cosine similarity, Euclidean, Minkowski, Manhattan, and Hamming. scikit-learn's KNNImputer uses a Euclidean distance that ignores missing values by default.

MICE Imputation: Multiple Imputation by Chained Equations

MICE is an advanced missing data imputation technique that uses multiple iterations of machine learning model training to predict the missing values, using known values from other features in the data as predictors.

How does MICE algorithm work?

You basically take the variable that contains missing values as a response ‘Y’ and other variables as predictors ‘X’
Build a model with rows where Y is not missing
Then predict the missing observations
Do this multiple times by doing random draws of the data and taking the mean of the predictions
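
In scikit-learn, a MICE-style procedure is available through IterativeImputer, which is still marked experimental and must be enabled explicitly. The sketch below is a single imputation on illustrative data; the scikit-learn documentation suggests repeating it with sample_posterior=True and different seeds to obtain multiple imputations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Second column is roughly twice the first; one value is missing.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0], [5.0, 10.0]])

# Each feature with missing values is modeled from the other features.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed[2, 1])
```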

Thanks for reading. You can follow me on LinkedIn.


How to merge dataframes in Pandas

NigarAli — Tue, 27 Feb 2024 22:38:02 GMT

As a data enthusiast and avid Pandas user, I often find myself grappling with the challenge of merging datasets effectively. In this article, I’ll share insights into various techniques, such as merge, join, and concat, which have become integral parts of my data manipulation toolkit.

Merge

Consider a scenario where you have two tables with common columns that you want to combine. The merge function in Pandas is a powerful tool that implements common SQL-style joining operations. There are different types of merges based on the dataset structure:

One-to-one: Joining two DataFrame objects on their indexes, which must contain unique values.

Many-to-one: Joining a unique index to one or more columns in a different DataFrame.

Many-to-many: Joining columns on columns.

The how argument in the merge() function specifies which keys are included in the resulting table. The options for how and their SQL equivalents are as follows:

left: LEFT OUTER JOIN (Use keys from the left frame only)

right: RIGHT OUTER JOIN (Use keys from the right frame only)

outer: FULL OUTER JOIN (Use the union of keys from both frames)

inner: INNER JOIN (Use the intersection of keys from both frames)

cross: CROSS JOIN (Create the Cartesian product of rows of both frames)

Consider two tables, each sharing a common identifier, EmployeeID:

result=pd.merge(df1,df2, on=['EmployeeID'])
result

Output of the code

As we observe, the default option corresponds to an inner join.

Now, I’m going to combine the data using the left join method.

result2 = pd.merge(df1, df2, how='left', on=['EmployeeID'])
result2

Output of the code

When utilizing a left join, the resulting dataset provides a comprehensive view of the values from the left table. In cases where there is no corresponding match in the right table, these unmatched values will be denoted as ‘NaN’ in the merged dataset. This feature ensures that even non-matching records from the left table are retained in the final output, offering valuable insights into the data.

result2=pd.merge(df1,df2,how='right' ,on=['EmployeeID'])
result2

output of the code

When employing a right join, the merged dataset showcases the values from the right table. Any values without a corresponding match in the left table will be indicated as ‘NaN’ in the final output, allowing for a clear representation of the right table’s information in the merged result.

result3=pd.merge(df1,df2,how='outer', on=['EmployeeID'])
result3

output of the code

When opting for an outer join, the merged dataset incorporates values from both the left and right tables. Where a key has no match in the other table, the missing values are filled with ‘NaN,’ offering a comprehensive view that includes information from both datasets.

result4=pd.merge(df1,df2, how='cross')
result4

output of the code

In the case of a cross join, the resulting dataset represents the Cartesian product of both the left and right tables. This means every row from the left table is paired with every row from the right table, creating an exhaustive combination of records. Unlike other join types, there is no consideration of matching criteria, and the merged dataset encompasses all possible combinations.
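
The input tables are not shown in this article, so here is a small self-contained sketch with two hypothetical DataFrames that makes the join types described above concrete.

```python
import pandas as pd

# Hypothetical tables sharing the EmployeeID column.
df1 = pd.DataFrame({"EmployeeID": [1, 2, 3], "Name": ["Ana", "Ben", "Cem"]})
df2 = pd.DataFrame({"EmployeeID": [2, 3, 4], "Department": ["HR", "IT", "Sales"]})

inner = pd.merge(df1, df2, on="EmployeeID")               # keys in both tables
left = pd.merge(df1, df2, how="left", on="EmployeeID")    # all keys from df1
right = pd.merge(df1, df2, how="right", on="EmployeeID")  # all keys from df2
outer = pd.merge(df1, df2, how="outer", on="EmployeeID")  # union of keys
print(left)
```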

JOINS

In the realm of Pandas, join operations typically rely on indexes to align and merge datasets. However, in situations where the dataset lacks a predefined index structure, a practical approach is to employ the set_index method. This method allows for the establishment of a suitable index, paving the way for a smooth integration of datasets through the versatile join functionality. By explicitly defining the join based on the set index, this process ensures a coherent and accurate combination of data from multiple sources.

df3 = df1.set_index('EmployeeID').join(df2.set_index('EmployeeID'), lsuffix='_Left', rsuffix='_Right')

# Reset index to bring 'EmployeeID' back as a regular column
df3 = df3.reset_index()
df3

The aim of this code snippet is to merge two DataFrames, df1 and df2, using the 'EmployeeID' column. The set_index method is used to make 'EmployeeID' the index of both DataFrames, which the subsequent join operation relies on. The lsuffix='_Left' and rsuffix='_Right' parameters distinguish columns with identical names. Finally, the reset_index function is applied to bring 'EmployeeID' back as a regular column in the resulting DataFrame df3.

Output of the code

Similar to the merge method, the join method in Pandas allows for the use of different join types. This feature provides users with the freedom to choose the type of join that best fits their needs, be it a left join, right join, inner join, or outer join. This flexibility ensures a natural and intuitive approach to merging datasets in Pandas, accommodating various scenarios and enhancing the overall adaptability of the data manipulation process.

df3 = df1.set_index('EmployeeID').join(df2.set_index('EmployeeID'), how='inner', lsuffix='_Left', rsuffix='_Right')
df3

Output of the code

Concat method

The concat function combines Pandas objects along a specified axis. It also provides the option for set logic along the other axes, offering flexibility in handling overlapping labels on the concatenation axis. Additionally, it allows the addition of a layer of hierarchical indexing, which is useful for managing data with similar or overlapping labels.

result5=pd.concat([df1,df2])
result5

Output of the code
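
The axis and hierarchical indexing options mentioned above can be sketched as follows, using two small illustrative DataFrames and hypothetical key names.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Stack rows and build a fresh 0..n-1 index.
stacked = pd.concat([a, b], ignore_index=True)

# Add a hierarchical index level that records the source of each row.
keyed = pd.concat([a, b], keys=["first", "second"])

# Concatenate along the columns instead, aligning on the index.
side_by_side = pd.concat([a, b], axis=1)
print(keyed)
```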

In this article, we explored the powerful world of data manipulation using Pandas, uncovering techniques such as merging, joining, and concatenating DataFrames. Happy Data Analyzing!

Hi, I’m Nigar, a data enthusiast passionate about simplifying complex data processes. Connect with me on LinkedIn for more data insights and updates.