Supervised Machine Learning: A Crash Course in Data

Manohar Varma
Published in AI Skunks · 8 min read · Mar 13, 2023

Data Integration and Pre-Processing of Data

A detailed guide to data intuition for any machine learning project, by Krishna Manohar Varma Indukuri and Nik Bear Brown

Abstract

In this study, we will learn the importance of, and the steps involved in, successfully integrating and pre-processing any given data so that it can be used for analytics and machine learning.

We take a real-world example: predicting snowfall on Lake Michigan using data from two different sources. After integrating the scattered data into one dataset, we start to pre-process it.

What Is Data Integration?

Data integration combines various types and formats of data from any source across an organization into a data lake or data warehouse to provide a unified fact base for analytics. Working from this one data set allows businesses to make better decisions, aligns departments to work better together, and drives a better customer experience.

What are data integration tools?

Data Integration tools are software-based tools that ingest, consolidate, transform, and move data from source(s) to destination, performing mappings, transformations, and data cleansing along the way. Ultimately, they integrate the data into a ‘single source of truth’ destination, such as a data lake or data warehouse. This allows consistent, reliable data for use in analytics and business intelligence.

An example of how the human brain works with respect to data integration

How do humans recognize that this little animal is a koala and not a squirrel or a big rat?

There are certain attributes or features that make the koala species different from others. Let's discuss how the human brain concludes that the animal it is seeing is a koala.

Koala nose

Its nose is one main feature that makes the koala stand out from the rest of the similar species. In machine learning language, it is the highest-contributing feature, with a weightage of 0.7 out of 1. The remaining 0.3, or 30%, comes from other features like the ears and eyes.

Koala eyes

Similarly, with business data we need several important attributes to be combined to make sense of the desired output. As shown above, the nose or ears or eyes alone don't tell us much, but combining them into one whole picture (the koala image) gives us the data needed to arrive at the desired output: KOALA.

The same theory applies to business data as well. But that data often becomes fragmented, poorly organized, and stored in several different storage systems. A data science professional has to be clear about the business problem, what has to be achieved from the data, and where to find and collect all the necessary data.

Real-world problem

  • Let's take a real-world scenario: snowfall prediction on Lake Michigan.
  • We have two different data sets coming from two sources. One is from a geospatial satellite, providing latitude and longitude data with the corresponding time. The other is from a NOAA satellite that takes hundreds of images of the Earth's surface, preprocesses them, and returns the data collected by analyzing those images. Let's import the data using pandas and dive deeper into the topic.
import pandas as pd

# Date/time data from the geospatial source
df1 = pd.read_csv('https://raw.githubusercontent.com/Varmai/Data-sets/main/Date_time.csv')

# Cloud and wind data derived from the NOAA satellite imagery
df2 = pd.read_csv('https://raw.githubusercontent.com/Varmai/Data-sets/main/Cloud%20%26%20wind.csv')

Let's integrate the datasets with the pandas merge() method, using the common column named 'id'.

df = pd.merge(df1, df2, on='id')

Now we have successfully joined the two data frames, and we're good to move on to pre-processing.
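A quick sanity check on the merged frame is a good habit before pre-processing; this small sketch just inspects the shape and the first few rows.

# Confirm the join produced the expected number of rows and columns
print(df.shape)
print(df.head())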

What Is Data Preprocessing?

  • Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning models.
  • Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it contain errors and inconsistencies, but it is often incomplete and doesn't have a regular, uniform design.
  • Machines like to process nice and tidy information; they read data as 1s and 0s. So calculating structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.

Data Preprocessing Steps

Let's take a look at the established steps the data needs to go through to make sure it is successfully preprocessed.

  • Data quality assessment
  • Data cleaning
  • Data transformation
  • Data reduction

Let's work through each of the above steps with the help of code.

1. Checking for null values in the data is one of the most important steps in assessing data quality.

# Summary table: unique values, null counts, null percentage, and dtype per column
table = pd.DataFrame({
    'Unique': df.nunique(),
    'Null': df.isnull().sum(),
    'NullPercent': df.isnull().sum() / len(df),
    'NaN': df.isna().sum(),
    'Type': df.dtypes.values,
})
table

With the help of the above code, we can determine the missing values and the number of unique values in each of the features/columns in the dataset. If there are no missing values, we can skip step 2 (imputation) and move to step 3.
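As a quick follow-up (a sketch building on the summary table above), we can list only the columns that actually need imputation:

# Columns that contain at least one missing value
cols_with_nulls = table[table['Null'] > 0].index.tolist()
print(cols_with_nulls)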

2. There are several imputation techniques for filling in missing values, depending on the type of data and its distribution; a small sketch of a few of them follows the list below.

  • Mean or median imputation
  • Regression imputation
  • K-nearest neighbor (KNN) imputation
  • Last observation carried forward (LOCF) imputation
  • Next observation carried backward (NOCB) imputation
  • Maximum likelihood imputation
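Here is a minimal sketch of a few of these techniques using scikit-learn and pandas, assuming the numeric columns of df are the ones with gaps (the column selection and parameters are illustrative, not part of the original pipeline).

from sklearn.impute import SimpleImputer, KNNImputer

numeric_cols = df.select_dtypes(include='number').columns

# Mean imputation: replace each missing value with the column mean
df[numeric_cols] = SimpleImputer(strategy='mean').fit_transform(df[numeric_cols])

# KNN imputation (alternative): estimate each gap from the 5 most similar rows
# df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])

# LOCF / NOCB (for time-ordered data): carry the previous / next observation over the gap
# df[numeric_cols] = df[numeric_cols].ffill()   # LOCF
# df[numeric_cols] = df[numeric_cols].bfill()   # NOCB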

Once imputation of missing values is done, we can move towards data transformation and dimensionality reduction by checking the correlation between features.

3. Data transformation:

We convert categorical features to an integer data type, since computers only understand numbers. Converting categorical data to numerical data can make it easier to analyze using statistical and machine learning algorithms.

from sklearn.preprocessing import LabelEncoder

# Encode every categorical (object) column as integer labels
for column in df.columns:
    if df[column].dtype == 'object':
        encoder = LabelEncoder()
        df[column] = encoder.fit_transform(df[column].values)

4. Data Reduction:

Highly correlated features will be removed, because the model expects only one dependent feature and the rest to be independent. But in the real world, one value is dependent on another in some way or other. So the correlation threshold can be set between 60% and 80%, based on how important the feature is for the prediction.

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the pairwise correlations between all features
plt.subplots(figsize=(30, 20))
sns.heatmap(df.corr(), annot=True)
plt.show()

One brief example:

For calculating the BMI index we need features like weight, height, age, daily calorie intake, physical activity, etc., and the output/dependent feature will be BMI. But if we calculate the correlation between features, height and weight depend on a person's age for age < 18, and calorie intake depends on physical activity. An athlete consumes 4,000–5,000 calories and still stays lean, while a sedentary person consumes 2,500 calories and still gains weight; lifestyle matters. If we observe closely, there is correlation among the independent features themselves, meaning more dependent variables in the data. Still, all the columns are required to calculate an accurate BMI index for a person.
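A tiny illustrative sketch of that point, using a made-up table of people (every number here is invented purely to show how "independent" features can correlate with each other):

# Hypothetical data: height/weight track age in the younger rows, calories track activity
bmi_df = pd.DataFrame({
    'age':      [12, 15, 17, 25, 30, 40],
    'height':   [150, 165, 172, 175, 176, 175],
    'weight':   [40, 55, 63, 70, 74, 80],
    'calories': [1800, 2200, 2500, 2600, 2400, 2300],
    'activity': [5, 5, 6, 4, 2, 1],   # hours of exercise per week
})
print(bmi_df.corr())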

That is where data science comes in. As a data scientist, one has to understand the data before computing on it. Without understanding the data, we will drive the algorithm towards a false and biased model.

Feature reduction

Feature reduction, also known as feature selection or variable selection, is a technique used in machine learning and data analysis to reduce the number of features or variables in a dataset. The goal of feature reduction is to simplify the dataset by removing redundant or irrelevant features, while retaining the most important features that are relevant for analysis or modeling.

import numpy as np

# Absolute pairwise correlations between all features
cor_matrix = df.corr().abs()
print(cor_matrix)

# Keep only the upper triangle so each feature pair is considered once
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
print(upper_tri)

# Columns that are more than 80% correlated with another column
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.80)]
print(to_drop)

df = df.drop(columns=to_drop)
print(df)

All the columns with more than 80% correlation are removed from the dataset by the above code, which completes the data-reduction step of pre-processing.

This covers most of the pre-processing, except for two more important steps. The data often has varying distributions, which need to be taken care of, or else the model will be biased towards one feature/category.

Scaling the data to remove bias

Scaling the data is an important preprocessing step in data analysis and modeling workflows. The main reason for scaling the data is to ensure that all variables or features are on a similar scale, which can help to improve the accuracy and efficiency of machine learning algorithms. Scaling can also help to reduce the impact of outliers.

from sklearn.preprocessing import MinMaxScaler

# Separate the independent features (X) from the target (y)
X = df.drop(columns='LES_snowfall')
y = df['LES_snowfall']

# Rescale every independent feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

The above code scales all values of the independent features to between 0 and 1. This makes sure there's no bias towards any particular feature while the algorithm is being computed.
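A quick optional check (a sketch) that the scaling worked: every column of X should now span roughly 0 to 1.

# fit_transform returns a NumPy array, so inspect the per-column min and max
print(X.min(axis=0), X.max(axis=0))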

Balancing dependent variable

Let's assume 90% of the data shows no snowfall and only 10% shows snowfall in the output feature. This data itself is biased towards the no-snowfall category. The model will then predict even snowfall days as no snowfall in future cases, because the algorithm is trained towards that bias (no snowfall), which gives inaccurate predictions. So, to remove future complexities, we will balance the data to 50–50 weights.

from imblearn.over_sampling import SMOTE

# Oversample the minority class so both classes are equally represented
smote = SMOTE()
X, y = smote.fit_resample(X, y)

# Plot the class counts to confirm the target is now balanced
y.value_counts().plot(kind='bar')
plt.show()

This step concludes the pre-processing of data. Now we’re good to fit the data into any machine learning model.
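As a minimal sketch of that next step (the model choice and split ratio below are illustrative, not part of the original pipeline), one could hold out part of the balanced data and fit a simple baseline classifier:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hold out 20% of the balanced data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a baseline classifier; any supervised model could be used here
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out set
print(accuracy_score(y_test, model.predict(X_test)))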

Conclusion

Data integration is the process of combining data from multiple sources into a single, unified view. The goal of data integration is to create a complete and accurate representation of the data, which can be used for analysis and decision-making.

Data preprocessing is the process of preparing data for analysis by cleaning, transforming, and organising it. The goal of data preprocessing is to improve the quality of the data and to ensure that it is suitable for analysis. Data preprocessing is important because it can help to identify and correct errors in the data, reduce noise, and improve the accuracy of analysis.

A detailed explanation with code is provided on GitHub.

References:

https://streamsets.com/learn/data-integration/

https://monkeylearn.com/blog/data-preprocessing/
