Introduction to Data Preprocessing in Data Mining

Data preprocessing with Weka software

Sajeevan Wickramarathna
Tech x Talent
6 min read · Jun 3, 2021



Data preprocessing is one of the major techniques used in data mining. It transforms raw data into a useful and efficient format. Data in the real world is often incomplete, inconsistent and noisy.

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregated data. For example, if a form has an optional field and the user leaves it blank, the database contains no value for it.

Incomplete data may come from:

  1. “Not applicable” data values at collection time
  2. Problems with humans, software and hardware
  3. Different considerations between the time the data was collected and the time it is analyzed

Inconsistent: containing discrepancies in codes or names in the data set. For example, an application used to record ratings as “1, 2, 3, 4, 5” but now records them as “A, B, C, D, E”.

Inconsistent data may come from:

  1. Combining different data sources
  2. Functional dependency violations

Noisy: containing erroneous values and outliers. For example, a salary cannot be a negative value, so a negative salary in the data set is noise.

Noisy data may come from:

  1. Faulty data collection instruments
  2. Human or computer errors during data entry
  3. Errors in data transmission

Incomplete, inconsistent and noisy data is what we call dirty data. If we do not have quality data, we cannot get a quality mining result. So what can we do to get quality data?

The solution is data preprocessing!

Data preprocessing consists of the following major tasks:

1. Data Cleaning

Data cleaning involves filling in missing data and removing noisy data. Missing data is data that is not available. To handle it, we can ignore the records that have missing values (not an effective approach), treat a missing value as a separate value, or fill missing values with the mean or median of the attribute. To remove noisy data, we can use binning (sort the values and partition them into bins), clustering (detect and remove outliers) or regression (smooth the data by fitting a regression function).
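As a small illustration of the binning idea, here is a minimal Python sketch of smoothing by bin means: the values are sorted, split into bins of a fixed size, and every value is replaced by the mean of its bin. The function name, the `prices` sample and the bin size of 3 are assumptions made for this example only; they do not come from Weka.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Hypothetical sample data, used only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
print(smoothed_prices)
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Each group of three neighbouring values collapses to one representative value, which is how binning dampens small fluctuations in the data.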

2. Data Integration

Data comes from different sources, and we have to combine it into a single consistent data source. When we integrate multiple data sources, redundant data is often generated. Two common causes are object identification, where the same attribute appears under different names in different databases, and derivable data, where an attribute in one table can be derived from attributes in another table. These types of redundant data can be detected using correlation analysis.
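To make the correlation analysis idea concrete, the sketch below computes the Pearson correlation coefficient between pairs of numeric attributes. A coefficient close to 1 or -1 suggests that one attribute may be redundant. The attribute names and values are hypothetical, and how close to 1 counts as "redundant" is a judgment call for your own data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical attributes coming from two integrated sources.
monthly_salary = [1000, 1500, 2000, 2500, 3000]
annual_salary = [12000, 18000, 24000, 30000, 36000]  # derivable: 12 * monthly
age = [25, 41, 33, 52, 29]

r_salary = pearson(monthly_salary, annual_salary)
r_age = pearson(monthly_salary, age)
print(round(r_salary, 3), round(r_age, 3))
```

Here the two salary attributes are perfectly correlated, so keeping both adds no information, while age is only weakly correlated with salary and is worth keeping.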

3. Data Transformation

Data transformation converts data into appropriate forms using smoothing, aggregation, generalization, normalization and feature construction. Each of these tasks has several techniques. For example, to normalize data we can use min-max normalization or z-score normalization.
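The two normalization techniques mentioned above can be sketched in a few lines of Python. Min-max normalization rescales values linearly into a target range (0 to 1 here), and z-score normalization expresses each value as its distance from the mean in units of standard deviation. The `incomes` list is made-up sample data, and using the population standard deviation is an assumption of this sketch.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values are identical, so map them all to new_min.
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical sample data, used only for illustration.
incomes = [12000, 30000, 54000, 73600, 98000]
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardized])
```

Min-max normalization keeps the original shape of the distribution but is sensitive to outliers, because a single extreme value stretches the range. Z-score normalization is usually the better choice when the minimum and maximum are unknown or unreliable.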

4. Data Reduction

Data reduction reduces redundancy in the data and identifies the most important parameters for the decision-making process.

5. Data Discretization

Data discretization partitions the values of an attribute into intervals. Two common binning methods are equal-width partitioning and equal-depth partitioning. Equal-width partitioning divides the range into a given number of bins of equal width. Equal-depth partitioning divides the range into N intervals that each contain approximately the same number of samples.
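The difference between the two partitioning methods is easiest to see side by side. The following is a minimal Python sketch, with function names and sample data chosen only for this example.

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            index = 0
        else:
            # The maximum value falls on the upper edge, so clamp it
            # into the last bin.
            index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


# Hypothetical sample data, used only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
print(width_bins)  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(depth_bins)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

With equal-width bins the intervals are 4 to 14, 14 to 24 and 24 to 34, so the number of samples per bin varies. With equal-depth bins every bin holds three samples, but the interval widths differ.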

Now that we know the major concepts behind data preprocessing, we can apply these techniques to a data set to gain practical knowledge. We use the Weka software for that task.

Install Weka on your machine and then follow the steps below to practice data preprocessing with Weka.

1. Open the Weka GUI and click the “Explorer” button.

2. Click “Open file…” on the Preprocess tab of the Weka Explorer and choose a data set to open. A numerical data set is preferable because it lets you observe the preprocessing steps clearly.

3. On the “Preprocess” tab, click the “Edit…” button to view your data set and its attributes.

4. You can create a missing value by clicking on an attribute value and selecting the empty entry. Besides the empty entry, you can also select any of the other values available for that attribute. This way you can place missing values anywhere in the data set.

Before assigning a missing value
After assigning a missing value

5. You can then fill the missing values by right-clicking the attribute name and selecting “Set missing values to…”. A pop-up window appears on your screen, where you can enter a suitable value for that attribute. This lets you fill missing values manually, one attribute at a time. The same menu has an option called “Get mean”, which shows the mean value of the attribute so that you can use it to replace the missing values.

Pop-up window for setting a missing value

6. Choose a filter by clicking the “Choose” button on the Preprocess tab.

7. You can then select either a “Supervised” or an “Unsupervised” filter. Supervised filters are used when your data set contains a class attribute, and unsupervised filters are used when no class attribute is defined. This example uses unsupervised filters.

Then go to the attribute subdirectory and select “ReplaceWithMissingValue”. This filter randomly generates missing values in your data set. A brief description of the filter appears when you hover the mouse pointer over it.

The selected filter now appears in the bar next to the “Choose” button. Right-click the filter and click “Show properties”.

In the filter's properties there are several fields that you can customize, or you can keep them as they are. For example, you can change the probability with which values in your data set are replaced by missing values. After editing the fields, click “OK”.

8. Your filter is now ready to be applied to your data set. Click the “Apply” button.

The “ReplaceWithMissingValue” filter has now been applied to your data set, and you can see the result by inspecting the data set.
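To get a feel for what such a filter does, here is a rough Python sketch of the idea: every cell is replaced by a missing value with a given probability. This is an illustration only and not Weka's implementation; the function name, the `None` marker for missing values, the seed and the sample rows are all assumptions of this sketch.

```python
import random


def replace_with_missing(rows, probability, seed=1):
    """Return a copy of rows in which each cell is replaced by None
    (our marker for a missing value) with the given probability."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    return [
        [None if rng.random() < probability else value for value in row]
        for row in rows
    ]


# Hypothetical numeric data set, used only for illustration.
rows = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]
with_missing = replace_with_missing(rows, probability=0.3)
for row in with_missing:
    print(row)
```

A higher probability scatters more missing values across the data set, which matches what you observe in Weka when you raise the probability in the filter's properties.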

There are many more filters that you can apply to your data set, but this article covers only two more.

  1. ReplaceMissingValues > This filter fills the missing values in the data set with the mean of the attribute (or with the mode, for nominal attributes).
  2. ReplaceMissingWithUserConstant > This filter lets you choose an attribute and a replacement value for the missing values under that attribute. Note that every missing value under that attribute is replaced with the same value.
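The behaviour of these two filters can be approximated in plain Python as follows. This is a simplified sketch of the idea and not Weka's code: missing values are represented by `None`, a column is treated as numeric when all of its present values are numbers, and the function names and sample columns are made up for this example.

```python
import statistics


def replace_missing_values(column):
    """Fill None entries with the mean (numeric column) or the mode
    (nominal column) of the values that are present."""
    present = [v for v in column if v is not None]
    if not present:
        return list(column)  # nothing to compute a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.fmean(present)
    else:
        fill = statistics.mode(present)
    return [fill if v is None else v for v in column]


def replace_missing_with_constant(column, constant):
    """Fill every None entry of the column with the same user-chosen value."""
    return [constant if v is None else v for v in column]


# Hypothetical columns, used only for illustration.
ages = [20, None, 30, 40, None]
colors = ["red", None, "blue", "red"]

filled_ages = replace_missing_values(ages)
filled_colors = replace_missing_values(colors)
constant_ages = replace_missing_with_constant(ages, 0)
print(filled_ages)    # [20, 30.0, 30, 40, 30.0]
print(filled_colors)  # ['red', 'red', 'blue', 'red']
print(constant_ages)  # [20, 0, 30, 40, 0]
```

Filling with the mean or mode keeps the overall distribution roughly intact, while a user constant is useful when a missing value has a meaning of its own that you want to keep visible.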

Conclusion

This article covered what data preprocessing is, the main data preprocessing techniques and how to perform data preprocessing using Weka. I hope you enjoyed the article 🙂.

Thank you for reading!!!!!


Sajeevan Wickramarathna
Tech x Talent

Final Year Undergraduate at Faculty of Information Technology, University of Moratuwa. Tech enthusiast and Dedicated.