This post is a short walkthrough of my best model for predicting faulty water pumps in Tanzania. The work uses the Kaggle API, OrdinalEncoder(), StandardScaler(), and RandomForestClassifier(). I will go through every step needed to reproduce the model, beginning with bringing in the data and ending with submitting to Kaggle.
I begin by pulling the Kaggle dataset through the API and loading each file into a pandas DataFrame.
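A minimal sketch of that first step, assuming the Kaggle credentials file (kaggle.json) is already set up; the competition slug and file name are placeholders, and a tiny inline CSV stands in for the real file so the snippet runs on its own:

```python
import io

import pandas as pd

# With kaggle.json in place, the competition files can be pulled from the
# command line (the slug below is a placeholder, not the real competition):
#   kaggle competitions download -c <competition-slug> --unzip

# Reading an unzipped CSV into a DataFrame; this inline text stands in
# for a real train_features.csv.
csv_text = "id,construction_year,longitude\n1,1999,35.2\n2,0,36.1\n"
train = pd.read_csv(io.StringIO(csv_text))
print(train.shape)  # (2, 3)
```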
I found that one of the columns, construction_year, had null values that weren't listed as NaN: missing years were recorded as 0 instead. I compute the column's mode after removing all of the rows with 0, then replace every 0 with that mode, 2010.
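The zero-as-null fix can be sketched like this, with a toy column standing in for the real construction_year data (the toy values are chosen so the mode happens to be 2010, as in the post):

```python
import pandas as pd

# Toy construction_year column where 0 acts as a hidden null.
df = pd.DataFrame({"construction_year": [2010, 0, 1999, 2010, 0, 2004]})

# Compute the mode over the non-zero rows only, so the placeholder 0s
# don't dominate, then fill the zeros with that mode.
nonzero = df.loc[df["construction_year"] != 0, "construction_year"]
mode_year = nonzero.mode()[0]
df["construction_year"] = df["construction_year"].replace(0, mode_year)

print(mode_year)                            # 2010
print((df["construction_year"] == 0).sum())  # 0
```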
I move on to some data wrangling. I combine the train and test datasets so that every edit is applied to both. My first edit is adding columns for binned longitude and latitude. My second is dropping columns of high cardinality. I then apply ordinal encoding to the combined dataset. Encoding the train and test datasets separately could map the same category to different integers in each set, which would have made for faulty predictions from the model.
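The wrangling steps above can be sketched as follows. The column names, bin count, and which column counts as high-cardinality are all illustrative assumptions, not the exact choices from the post:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames standing in for the real feature tables.
train = pd.DataFrame({"funder": ["Gov", "NGO", "Gov"],
                      "longitude": [33.1, 35.6, 37.2],
                      "latitude": [-4.1, -6.3, -2.8],
                      "wpt_name": ["a", "b", "c"]})   # high-cardinality stand-in
test = pd.DataFrame({"funder": ["NGO", "Gov"],
                     "longitude": [34.9, 36.4],
                     "latitude": [-5.0, -3.3],
                     "wpt_name": ["d", "e"]})

# Concatenate so every edit (and the encoding) is applied identically.
combined = pd.concat([train, test], keys=["train", "test"])

# Edit 1: binned longitude/latitude columns (3 bins is arbitrary here).
combined["lon_bin"] = pd.cut(combined["longitude"], bins=3, labels=False)
combined["lat_bin"] = pd.cut(combined["latitude"], bins=3, labels=False)

# Edit 2: drop a high-cardinality column.
combined = combined.drop(columns=["wpt_name"])

# Ordinal-encode on the combined data so train and test share one
# category-to-integer mapping.
combined[["funder"]] = OrdinalEncoder().fit_transform(combined[["funder"]])

# Split back into train and test.
train_enc = combined.loc["train"]
test_enc = combined.loc["test"]
```

Because the encoder is fit once on the combined frame, "Gov" maps to the same integer in both halves, which is exactly the consistency the post is after.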
Content with the data wrangling, I split the training data into train and test sets using train_test_split().
For my model, I use StandardScaler() and RandomForestClassifier(). In plain terms, the 'n_estimators' argument of RandomForestClassifier() sets the number of trees in the forest: the higher it is, the better the predictions tend to be. However, a higher number also means the model takes longer to churn out predictions.
After running the model for about 10 minutes (yes, 10 whole minutes), I check the accuracy score for the prediction on X_test.
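The scaler-plus-forest setup and the accuracy check can be sketched together like this. The synthetic dataset and the n_estimators value of 100 are placeholders, not the data or tree count from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wrangled pump features and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaler and forest chained in one pipeline. n_estimators is the number
# of trees: more trees generally help accuracy but cost fit/predict time.
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=100,
                                             random_state=0))
model.fit(X_train, y_train)

# Accuracy on the held-out split.
print(accuracy_score(y_test, model.predict(X_test)))
```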
The score I get is good enough for the top 10 on the leaderboard, so I go to post it on Kaggle. First I use the model to predict the labels for the test features. Once I have the array of predictions, I turn it into a DataFrame and format it to match the sample submission. I write that DataFrame to a CSV file, download it from Google Colab, and submit the CSV to Kaggle to get my score back.
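The submission step might look like the sketch below. The ids, labels, and column names are hypothetical stand-ins (check them against the competition's actual sample submission file):

```python
import pandas as pd

# Hypothetical predicted labels for the competition's test features,
# and matching row ids.
preds = ["functional", "non functional", "functional"]
test_ids = [101, 102, 103]

# Shape the predictions like the sample submission file.
submission = pd.DataFrame({"id": test_ids, "status_group": preds})
submission.to_csv("submission.csv", index=False)

# In Colab, the file can then be downloaded with:
#   from google.colab import files
#   files.download("submission.csv")
```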
This model was the tenth and last one I arrived at in my week-long hunt for the highest accuracy score. Along the way I tried LogisticRegression(), ridge regression, and many combinations of categorical encoders, scalers, and wrangling steps.
While LogisticRegression() scored above the baseline, it did not come close to RandomForestClassifier()'s performance.