How I improved my score by 12% on Kaggle’s May 2021 tabular competition by using perceptron
As readers who follow my blog posts will know, I have been working on Kaggle’s May 2021 tabular competition for close to a week now, trying to improve my accuracy and posting about the different techniques that I have used along the way. My most recent post on this subject is:- How I improved accuracy 5% on Kaggle’s May 2021 tabular competition by employing multiple outputs | by Tracyrenee | MLearning.ai | May, 2021 | Medium
Although I said that I was not going to work on this problem any more, the accuracy was so low that I decided to try the perceptron estimator to see if it would improve matters. The perceptron is a simple classification algorithm that is suitable for large scale learning. By default it does not require a learning rate, it is not regularised, and it updates its model only on mistakes.
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
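To make the linear predictor and the "update only on mistakes" behaviour concrete, here is a minimal NumPy sketch of a textbook binary perceptron. It is an illustration only, not sklearn's implementation; the function names `perceptron_fit` and `perceptron_predict` and the toy data are my own assumptions.

```python
import numpy as np

def perceptron_fit(X, y, epochs=10):
    """Train a binary perceptron; y must hold the labels -1 and +1."""
    w = np.zeros(X.shape[1])  # one weight per feature
    b = 0.0                   # bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the current weights misclassify the example.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
    return w, b

def perceptron_predict(X, w, b):
    """Apply the linear predictor and threshold it at zero."""
    return np.where(X @ w + b > 0, 1, -1)

# Toy, linearly separable data (hypothetical).
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w, b = perceptron_fit(X_toy, y_toy)
preds = perceptron_predict(X_toy, w, b)
print(preds)
```

Note that there is no learning rate anywhere in the update: a misclassified example is simply added to (or subtracted from) the weights.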
Over the course of my experimentation I have tried several other approaches, such as SGDClassifier, MLPClassifier, PassiveAggressiveClassifier and PolynomialFeatures, and none of them came close to the results I have achieved with Perceptron.
The program for this competition question has been written in a Jupyter Notebook on my personal Kaggle account. The libraries that I needed are already installed in the Jupyter Notebook, so I only had to import them. I started out with numpy for linear algebra, pandas for dataframe manipulation, and os to get into the directory where the csv files are stored:-
I then read the files that I had found in the input directory. They are train, test and submission:-
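A minimal sketch of this step might look like the following. The helper name `list_input_files` is my own, and `/kaggle/input` is assumed to be the directory where the notebook exposes the competition files.

```python
import os
import numpy as np   # linear algebra
import pandas as pd  # dataframe manipulation

def list_input_files(input_dir="/kaggle/input"):
    """Return the full paths of all files found under input_dir."""
    paths = []
    for dirname, _, filenames in os.walk(input_dir):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)

# Prints nothing if the directory does not exist (for example outside Kaggle).
for path in list_input_files():
    print(path)
```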
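Reading the three files could be sketched as below. The file names `train.csv`, `test.csv` and `sample_submission.csv`, the `DATA_DIR` path and the helper `load_competition_data` are assumptions for illustration; adjust them to match the actual input directory.

```python
import os
import pandas as pd

def load_competition_data(data_dir):
    """Read the train, test and submission csv files from data_dir."""
    train = pd.read_csv(os.path.join(data_dir, "train.csv"))
    test = pd.read_csv(os.path.join(data_dir, "test.csv"))
    submission = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
    return train, test, submission

# Assumed Kaggle location of the competition files.
DATA_DIR = "/kaggle/input/tabular-playground-series-may-2021"
if os.path.isdir(DATA_DIR):
    train, test, submission = load_competition_data(DATA_DIR)
    print(train.shape, test.shape, submission.shape)
```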
I analysed the target and discovered there is a class imbalance. Most of the examples fall under class 2. What is interesting in this competition question, however, is the fact that target is one column but the predictions are one hot encoded. In addition, there appears to be a requirement for a value in each class, which is something that I have never seen before:-
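One simple way to check for class imbalance is to look at the share of each class in the target column. The sketch below uses a tiny hypothetical stand-in for the train dataframe, and the labels `Class_1` to `Class_4` are assumed.

```python
import pandas as pd

# Hypothetical stand-in for the real train dataframe.
train = pd.DataFrame({
    "id": range(6),
    "target": ["Class_2", "Class_2", "Class_2", "Class_1", "Class_3", "Class_4"],
})

# Fraction of rows that belong to each class, largest first.
class_share = train["target"].value_counts(normalize=True)
print(class_share)
```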
Because the submission consists of one column for each class, I decided to one hot encode the target column:-
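One hot encoding can be done in a single call with pandas. This is a sketch on toy data with assumed class labels; the notebook may have used a different encoder.

```python
import pandas as pd

# Hypothetical stand-in for the target column of train.
train = pd.DataFrame({"target": ["Class_2", "Class_1", "Class_2", "Class_4", "Class_3"]})

# One 0/1 column per class; each row has exactly one 1.
y = pd.get_dummies(train["target"]).astype(int)
print(y)
```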
I then dropped the target column from the train dataset because this data will be used elsewhere in the program:-
I combined the train and test set to form the dataframe, combi:-
The id column is not necessary to make predictions, so I dropped this column from the combi dataframe:-
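The last three steps (dropping target, combining train and test, and dropping id) can be sketched as follows. The feature names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the real train and test dataframes.
train = pd.DataFrame({
    "id": [0, 1, 2],
    "feature_0": [1, 0, 3],
    "feature_1": [2, 5, 0],
    "target": ["Class_2", "Class_1", "Class_2"],
})
test = pd.DataFrame({"id": [3, 4], "feature_0": [0, 2], "feature_1": [1, 1]})

# The target has already been one hot encoded, so remove it from train.
train = train.drop(columns=["target"])

# Stack train on top of test so both get identical preprocessing.
combi = pd.concat([train, test], ignore_index=True)

# The id column carries no predictive information.
combi = combi.drop(columns=["id"])
print(combi)
```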
I then normalised combi to give each cell a value between 0 and 1, because linear models such as the perceptron tend to train better when all of the features share a similar scale:-
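One way to scale every column into the range 0 to 1 is sklearn's MinMaxScaler. This is a sketch on toy data; I am assuming min-max scaling here, and the notebook may have normalised the data differently.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the combi dataframe.
combi = pd.DataFrame({"feature_0": [1, 0, 3, 0, 2], "feature_1": [2, 5, 0, 1, 1]})

# Rescale each column so its minimum maps to 0 and its maximum to 1.
scaler = MinMaxScaler()
combi = pd.DataFrame(scaler.fit_transform(combi), columns=combi.columns)
print(combi)
```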
Once the data had been preprocessed, I defined the variables that will be used to make predictions. X is the combi dataframe with rows from 0 to the length of train, X_test is the combi dataframe with rows from the length of train to the end, and y is the one hot encoded target variable:-
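Splitting combi back into its training and test parts is a matter of slicing by row position. The sketch below uses toy data, with `n_train` standing in for the length of train.

```python
import pandas as pd

# Hypothetical stand-ins: 3 training rows followed by 2 test rows.
n_train = 3  # in the notebook this would be len(train)
combi = pd.DataFrame({
    "feature_0": [0.33, 0.0, 1.0, 0.0, 0.67],
    "feature_1": [0.4, 1.0, 0.0, 0.2, 0.2],
})
y = pd.get_dummies(pd.Series(["Class_2", "Class_1", "Class_2"])).astype(int)

X = combi.iloc[:n_train]       # rows that came from train
X_test = combi.iloc[n_train:]  # rows that came from test
print(X.shape, X_test.shape, y.shape)
```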
After the variables had been defined to make predictions on the data, I used sklearn’s train_test_split() function to split the datasets into training and validation sets:-
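A sketch of the split on random toy data is shown below. The `test_size` and `random_state` values are assumptions, not necessarily the ones used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 3 features, 4 one hot encoded classes.
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = np.eye(4, dtype=int)[rng.integers(0, 4, size=20)]

# Hold back 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
```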
I then selected the model. In this instance I decided to have a stab at Perceptron to see if I could achieve a better score in the competition:-
I then predicted on the validation set:-
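Fitting and validating could be sketched as below. sklearn's Perceptron expects a one-dimensional target, so to train on a one hot encoded y I am assuming a MultiOutputClassifier wrapper, which fits one perceptron per class column; the notebook may have handled this differently. The data is random toy data.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy data: the class depends on the first feature.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
labels = (X[:, 0] * 4).astype(int)  # integers 0..3
y = np.eye(4, dtype=int)[labels]    # one hot encoded target

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One perceptron per output column (an assumption, see the text above).
model = MultiOutputClassifier(Perceptron(random_state=42))
model.fit(X_train, y_train)

# Predict on the validation set and measure the share of correct cells.
y_pred = model.predict(X_val)
val_accuracy = (y_pred == y_val).mean()
print(val_accuracy)
```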
Once the model had been fitted and validated, I made predictions on the test set, X_test:-
I prepared the submission by assigning the predictions, one column per class, to the submission dataframe:-
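Filling in the submission could look like the sketch below. The column names `Class_1` to `Class_4` are assumed, and `test_pred` is a hypothetical stand-in for the array returned by predicting on X_test.

```python
import numpy as np
import pandas as pd

class_cols = ["Class_1", "Class_2", "Class_3", "Class_4"]

# Hypothetical stand-ins for the sample submission and the test predictions.
submission = pd.DataFrame({"id": [100, 101, 102], **{c: 0.25 for c in class_cols}})
test_pred = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]])

# Copy each column of the predictions into its matching class column.
for i, col in enumerate(class_cols):
    submission[col] = test_pred[:, i]

# On Kaggle you would pass a file name such as "submission.csv" instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```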
When I submitted the predictions to Kaggle, to my amazement I achieved a score of 5.97.
For reasons I don’t entirely understand, Kaggle wants to have a value in each cell in the dataframe. Although I am not aware of the reasons for this type of competition question, one thing I do know is that programmers have to give the client whatever it is that they want.
The code for this program can be found in its entirety in my personal Kaggle account, the link being here:- Tab — May 2021 perceptron | Kaggle