Data Preprocessing: The Foundation of Effective Data Analysis

Data Decoded
5 min read · Aug 6, 2023


Previous blog: Understanding Data Collection: A Closer Look at Our Data Analysis Project

Data Preprocessing:

Data preprocessing is a crucial step in data analysis and involves preparing and cleaning the raw data to make it suitable for further processing and analysis. It aims to enhance the quality and consistency of the data, reduce errors, and remove any unnecessary information that might hinder the accuracy of the results. Data preprocessing ensures that the data is in a standardized and usable format, making it easier to apply various analysis techniques and derive meaningful insights.

Why Data Preprocessing:

Data preprocessing is essential for several reasons:

  • Data Quality: Preprocessing improves data quality by eliminating errors, inconsistencies, and missing values, ensuring the data is reliable and accurate.
  • Consistency: It standardizes the data format, making it consistent across all records and sources, facilitating easier analysis.
  • Efficiency: Clean and organized data streamlines the analysis process, saving time and effort.
  • Accuracy: By removing irrelevant information and handling discrepancies, preprocessing enhances the accuracy of the analysis and insights derived.
  • Data Compatibility: It ensures data compatibility across different sources, allowing seamless integration and analysis.

Data Preprocessing in Our Project:

After extracting and segregating the data into “Input” and “Output” folders, we undertook various data preprocessing steps to prepare the data for analysis.

1. File Format Conversion: Some files were in different formats like .xls and .xlsx. To maintain consistency, we converted all .xls files to the .xlsx format using Python.
# xls to xlsx

import os

import pandas as pd
import pyexcel as p

# Output directory for the converted files
directory = "File"

# Parent directory path
parent_dir = "./"

# Path of the output directory
output_path = os.path.join(parent_dir, directory)
os.makedirs(output_path, exist_ok=True)

# Input directory holding the raw Excel files
input_path = "./input"
dir_list = os.listdir(input_path)

for k in dir_list:
    name, ext = os.path.splitext(k)
    if ext.lower() == ".xls":
        # Convert legacy .xls files to .xlsx
        p.save_book_as(file_name=os.path.join(input_path, k),
                       dest_file_name=os.path.join(output_path, name + ".xlsx"))
    elif ext.lower() == ".xlsx":
        # Already .xlsx: copy the file into the output directory as-is
        data = pd.read_excel(os.path.join(input_path, k))
        data.to_excel(os.path.join(output_path, k), index=False)

2. Removing Metadata Rows: The first five rows in the Excel files contained metadata, so we removed them during preprocessing. Additionally, we deleted an extra blank row and the last row with the grand total to clean the data.

# removing metadata rows

import os

import pandas as pd

# Output directory for the cleaned files
directory = "Files_Output"

# Parent directory path
parent_dir = "./"

# Path of the output directory
output_path = os.path.join(parent_dir, directory)
os.makedirs(output_path, exist_ok=True)

# Input directory holding the converted .xlsx files
input_path = r"C:\Users\hp\Desktop./Files"
dir_list = os.listdir(input_path)

for k in dir_list:
    dataframe1 = pd.read_excel(os.path.join(input_path, k))

    # Rows 5-7 hold the multi-row column headers; everything above them is metadata
    head = dataframe1[5:8]
    # Drop the metadata and header rows, keeping only the data records
    reqdata = dataframe1.drop(dataframe1.index[range(8)])

    # Collapse the multi-row header into a single header string per column
    t = []
    for j in range(head.shape[1]):
        s = ""
        for i in head.iloc[:, j]:
            if str(i) != 'nan':
                s = s + " " + str(i)
        t.append(s)

    # Rebuild the DataFrame with the collapsed header names
    data = pd.DataFrame()
    for j in range(len(t)):
        data[t[j]] = reqdata.iloc[:, j].values
    data.to_excel(os.path.join(output_path, k), index=False)

3. Addressing Schema Mismatch in Power BI: While working with Power BI, we faced schema mismatch problems due to varying column structures among the files. To tackle this, we adopted a systematic approach during data preprocessing: we first extracted the column names from every file and organized them in an Excel sheet, then used different colors to mark rows of column names that shared the same schema, and finally separated the rows of each color into individual sheets, which made it easier to process and address the inconsistencies (a short sketch of this grouping step follows the figure below).

Color-coded schemas where each color represents a different schema
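
This grouping step is easy to script. The snippet below is a minimal sketch of the idea, assuming the cleaned files sit in the ./Files_Output folder from the previous step (the schema_groups.xlsx output name is illustrative): it reads only the header row of each file, groups files that share an identical column list, and writes one sheet per distinct schema so the groups can be reviewed and color-coded.

# Sketch: group files that share an identical column schema (illustrative, not the exact project script)

import os

import pandas as pd

input_path = "./Files_Output"          # assumed folder holding the cleaned .xlsx files
schemas = {}                           # tuple of column names -> list of file names

for f in os.listdir(input_path):
    if not f.endswith(".xlsx"):
        continue
    cols = tuple(pd.read_excel(os.path.join(input_path, f), nrows=0).columns)
    schemas.setdefault(cols, []).append(f)

# One sheet per distinct schema: the files that follow it, plus its column list
with pd.ExcelWriter("schema_groups.xlsx") as writer:   # output name is illustrative
    for n, (cols, files) in enumerate(schemas.items(), start=1):
        summary = pd.DataFrame({"file": files})
        summary["columns"] = ", ".join(cols)
        summary.to_excel(writer, sheet_name=f"Schema_{n}", index=False)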

4. Standardizing the Schema using Python: To ensure seamless integration and analysis in Power BI, we needed a standardized schema across all files. We used Python to convert the color-coded groups of files to a common schema: deleting extra columns, renaming columns where required, and swapping column positions to achieve consistency. Once the schema was standardized, we combined the files into the “FinalFile,” which served as the single source of truth for both the “Input” and “Output” datasets.

# standardizing the schema

# importing required modules
import warnings

import pandas as pd
from IPython.display import display

pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
warnings.filterwarnings('ignore', category=FutureWarning)

# Color-coded schema workbooks and input folders for each manufacturer
# schema_ = "C:\\Users\\sahil\\Downloads\\DV_SchemaFix\\Marico\\"
schema_ = "C:\\Users\\sahil\\Downloads\\InputFileSchema\\Godrej\\"
in_ = "C:\\Users\\sahil\\Downloads\\InputFiles\\InputFinal\\Godrej\\"

schema_1 = "C:\\Users\\sahil\\Downloads\\InputFileSchema\\Marico\\"
in_1 = "C:\\Users\\sahil\\Downloads\\InputFiles_1\\"

schema_2 = "C:\\Users\\sahil\\Downloads\\InputFileSchema\\Dabur\\"

# Accumulates every standardized file into one DataFrame
FinalIpFile = pd.DataFrame()


def swap_columns(df1, col1, col2):
    """Swap the positions of two columns in a DataFrame."""
    col_list = list(df1.columns)
    x, y = col_list.index(col1), col_list.index(col2)
    col_list[y], col_list[x] = col_list[x], col_list[y]
    return df1[col_list]


def modify(schema_path, in_path, color_file, change_col_name, change_col_list,
           swap, cols_to_swap, drop, cols_to_drop):
    """Apply the swap/drop/rename steps to every file listed in a color-coded
    schema sheet and append the result to FinalIpFile."""
    global FinalIpFile

    # The last column of the color sheet lists the file names for that schema
    file = pd.read_excel(schema_path + color_file, header=None)
    file_name = file.iloc[0:, -1]
    display(file_name)

    for i in file_name:
        print(i)
        df = pd.read_excel(in_path + i)
        df = df[:-1]          # drop the trailing grand-total row
        display(df)

        # Reorder and/or remove columns to match the common schema
        if swap and drop:
            for j in cols_to_swap:
                df = swap_columns(df, j[0], j[1])
            df.drop(df.columns[cols_to_drop], axis=1, inplace=True)
        elif swap:
            for j in cols_to_swap:
                df = swap_columns(df, j[0], j[1])
        elif drop:
            df.drop(df.columns[cols_to_drop], axis=1, inplace=True)

        # Rename columns to the standardized names
        if change_col_name:
            for c in change_col_list:
                df.rename(columns={c[0]: c[1]}, inplace=True)

        # Append the standardized file to the combined DataFrame
        if FinalIpFile.empty:
            FinalIpFile = df
        else:
            FinalIpFile = pd.concat([FinalIpFile, df], ignore_index=True)

    display(FinalIpFile)

Combining the files into the final file:

# White Marico
kwargs = {"schema_path": schema_1,
          "in_path": in_1,
          "color_file": 'White.xlsx',
          "swap": True,
          "cols_to_swap": [(' [InputCGST Rate]', ' [InputCGSTValue]'),
                           (' [InputSGST Rate]', ' [InputSGSTValue]')],
          "drop": True,
          "cols_to_drop": [14],
          "change_col_name": True,
          "change_col_list": [(' [InputCGSTValue]', ' [InputCGST Value]'),
                              (' [InputSGSTValue]', ' [InputSGST Value]'),
                              (' Purchase/ PurReturn/Service', ' Purchase/PurReturn/Service')]
          }
modify(**kwargs)

# Orange Godrej
kwargs = {"schema_path": schema_,
          "in_path": in_1,
          "color_file": 'Orange.xlsx',
          "swap": True,
          "cols_to_swap": [(' [InputCGST Rate]', ' [InputCGST Value]'),
                           (' [InputSGST Rate]', ' [InputSGST Value]')],
          "drop": True,
          "cols_to_drop": [16],
          "change_col_name": True,
          "change_col_list": [(' Purchase/ PurReturn', ' Purchase/PurReturn/Service')]
          }
modify(**kwargs)
Standardized schema

5. Eliminating Null Values: Columns with more than 60% null values were removed from the dataset. This improved data quality and eliminated unnecessary information.

Null value count in each column
# Eliminating null values
# " Purchase Order Number" exceeded the 60% null threshold, so the column is dropped
Ipfinal = Ipfinal.drop(" Purchase Order Number", axis='columns')
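
More generally, the 60% rule can be scripted rather than applied column by column. The snippet below is a minimal sketch, assuming the combined data is in the Ipfinal DataFrame used above and that 0.6 is the chosen cutoff: it computes the null fraction of every column and drops the columns that exceed the threshold.

# Sketch: drop every column whose null fraction exceeds the 60% threshold
threshold = 0.6

null_fraction = Ipfinal.isna().mean()               # fraction of null values per column
print(null_fraction.sort_values(ascending=False))   # inspect the null share per column, as in the figure above

cols_to_drop = null_fraction[null_fraction > threshold].index
Ipfinal = Ipfinal.drop(columns=cols_to_drop)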

Data Preparation Steps in Power BI:

Apart from the initial preprocessing steps, we also made adjustments to the data types and removed duplicates in Power BI to enhance the data analysis process.

1. Data Type Conversion: Numeric values were converted into appropriate formats, such as whole numbers, percentages, or currency, so that the data is interpreted correctly and displayed consistently, improving the overall data visualization and analysis process.

2. Duplicate Removal: Duplicates were removed from each dimension table to ensure data integrity and accuracy. This step helped prevent data discrepancies and established reliable relationships between tables, leading to more trustworthy insights.
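
Both of these adjustments were made through the Power BI interface, but the equivalent operations are straightforward in pandas as well. The snippet below is purely illustrative (the dimension_df table and its columns are made up for the example): it casts a numeric column stored as text to a whole number and removes duplicate rows from a dimension table.

# Illustrative pandas equivalents of the Power BI preparation steps
import pandas as pd

# Hypothetical dimension table; the column names are invented for the example
dimension_df = pd.DataFrame({
    "Product": ["Soap", "Soap", "Oil"],
    "Rate": ["18", "18", "12"],
})

# 1. Data type conversion: store the rate as a whole number
dimension_df["Rate"] = dimension_df["Rate"].astype(int)

# 2. Duplicate removal: keep one row per distinct record
dimension_df = dimension_df.drop_duplicates().reset_index(drop=True)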

By undertaking these data preprocessing and data preparation steps, we ensured that the dataset used for analysis was of high quality, consistent, and in the most suitable format for meaningful data visualization and insightful analysis.

Next blog: Data Modeling: Streamlining the Data Flow
