How To Implement Data Preprocessing In Weka?

Manankumar choraria
4 min read · Jun 11, 2022

As described in the previous article, Weka is a vital tool for performing different data mining tasks. We also loaded the Vote dataset in Weka, and now we will perform a preprocessing task on it. Handling missing values is a necessary step when working with datasets in data mining, because leaving them in can produce unusual results. Let’s first discuss the dataset, and then we will apply the ReplaceMissingValues filter to it.

About Vote Dataset

The Vote dataset consists of the 1984 United States Congressional Voting Records. It includes the votes of each U.S. House of Representatives congressman on the 16 key votes identified by the CQA (Congressional Quarterly Almanac). The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to y), voted against, paired against, and announced against (these three simplified to n), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition, i.e., a missing value).
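The simplification of the nine vote types can be sketched as a small lookup table. This is only an illustration of the encoding described above: the dictionary name `VOTE_ENCODING` and the helper `encode_vote` are hypothetical, and `?` is used because it is the marker ARFF files use for a missing value.

```python
# Hypothetical lookup table: the nine CQA vote types and their simplified value.
# "?" is the ARFF marker for a missing value.
VOTE_ENCODING = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote or otherwise make a position known": "?",
}


def encode_vote(vote_type: str) -> str:
    """Map a raw vote type to 'y', 'n', or '?' (missing)."""
    return VOTE_ENCODING[vote_type.strip().lower()]


print(encode_vote("Paired for"))     # y
print(encode_vote("voted present"))  # ?
```

Three of the nine types end up as missing values, which is why the dataset contains so many of them.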

The dataset consists of 17 attributes: 16 features and 1 class. The class has two categories, republican and democrat. The 16 features are handicapped-infants, water-project-cost-sharing, adoption-of-the-budget-resolution, physician-fee-freeze, el-salvador-aid, religious-groups-in-schools, anti-satellite-test-ban, aid-to-nicaraguan-contras, mx-missile, immigration, synfuels-corporation-cutback, education-spending, superfund-right-to-sue, crime, duty-free-exports, and export-administration-act-south-africa. Each feature takes only a Boolean value, i.e., y or n, while the class takes the value republican or democrat. There are 435 instances in total, of which 267 are democrats and 168 are republicans.

Missing Values in Vote Dataset

There are 392 missing values in total across the 17 attributes of the dataset. The number of missing values in each attribute is as follows:

Attribute no: No. of Missing values= [1: 12, 2: 48, 3: 11, 4: 11, 5: 15, 6: 11, 7: 14, 8: 15, 9: 22, 10: 7, 11: 21, 12: 31, 13: 25, 14: 17, 15: 28, 16: 104, 17: 0]
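As a quick sanity check, the per-attribute counts listed above can be summed outside Weka. The sketch below simply copies the numbers from the list; the variable names are illustrative, and the instance and attribute counts (435 and 17) are the ones given earlier in this article.

```python
# Missing-value counts per attribute number, copied from the list above.
missing_per_attribute = {
    1: 12, 2: 48, 3: 11, 4: 11, 5: 15, 6: 11, 7: 14, 8: 15, 9: 22,
    10: 7, 11: 21, 12: 31, 13: 25, 14: 17, 15: 28, 16: 104, 17: 0,
}

total_missing = sum(missing_per_attribute.values())
most_missing = max(missing_per_attribute, key=missing_per_attribute.get)
least_missing = min(missing_per_attribute, key=missing_per_attribute.get)

# 435 instances x 17 attributes gives the total number of cells in the dataset.
total_cells = 435 * 17
share_missing = 100 * total_missing / total_cells

print("Total missing values:", total_missing)             # 392
print("Attribute with most missing values:", most_missing)    # 16
print("Attribute with fewest missing values:", least_missing)  # 17
print(f"Share of cells missing: {share_missing:.1f}%")
```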

The minimum is 0 missing values, in attribute number 17, which is the class attribute containing republican and democrat. The maximum is 104 missing values, in attribute number 16. Weka reports the same counts, as shown below:

Now, we will fill in these missing values using preprocessing in Weka.

Replace Missing Values Through Preprocessing in Weka

The Weka Explorer offers a total of 6 operations, and here we are going to deal with Preprocess. Select the first operation in the operation bar (it is selected by default as soon as you launch the Explorer), as shown below:

Operation Bar in Weka

Click the Choose button in the Filter section, which is just below the Open file option, as shown below:

Filters in Weka

It will open a menu showing a directory-style tree of filters, with filters as the root.

Click the drop-down arrow beside filters and it will open the categories of filters, including All filters, supervised and unsupervised.

Click the drop-down arrow beside unsupervised, which has two subcategories, attribute and instance. Choose attribute, and it will display the different filters that you can apply to your data, including ReplaceMissingValues. Hover the mouse over this filter and it will show details about it, i.e., how missing values will be replaced: nominal attributes are filled with the mode and numeric attributes with the mean. Since every attribute in the Vote dataset is nominal, the mode is used here.
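To make the mean-or-mode idea concrete, here is a minimal sketch of that kind of replacement on a single column. It is not Weka's implementation: the function `replace_missing` and the sample `votes` column are made up for illustration, and how ties between equally frequent values are broken here (first value seen wins) is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # ARFF marker for a missing value


def replace_missing(column, numeric=False):
    """Return a copy of the column with missing entries filled in.

    Numeric columns are filled with the mean of the observed values,
    nominal columns with the mode (the most frequent observed value).
    """
    observed = [v for v in column if v != MISSING]
    if not observed:
        return list(column)  # no observed values to compute a fill from
    if numeric:
        fill = mean(float(v) for v in observed)
        return [fill if v == MISSING else float(v) for v in column]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in column]


# A made-up nominal column in the style of the Vote dataset.
votes = ["y", "n", "?", "y", "?", "n", "y"]
filled = replace_missing(votes)
print(filled)  # ['y', 'n', 'y', 'y', 'y', 'n', 'y']
```

In this sample, y occurs three times and n twice, so both missing entries become y. Note that filling with the mode pushes each attribute further toward its majority value, which is worth keeping in mind for an attribute with many missing values.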

Choose this filter and click the Apply button, which is located to the right of the filter field, as shown below:

Selecting ReplaceMissingValues

Once the filter is applied, you will see that the missing values in each attribute have been filled, as can be seen in the output below:

Conclusion

Now that the preprocessing is done, we will look at classification and the Apriori algorithm using Weka in the next articles. Stay tuned for the updates!
