Getting Started with Kaggle Problems

Divyansh Rai · Published in Nybles · 6 min read · Jul 1, 2020

“Have I learnt enough skills to start participating in Kaggle competitions?”

Ever felt daunted by this question?

This is the kind of problem for which the phrase “Until you step into the water, you can’t know how deep it is” was made. So the remedy is simply to start and finish a Kaggle problem yourself.

The prerequisites for this blog are:

  1. Basic terminology of ML
  2. Basic knowledge of the Python libraries used for ML and data visualization

In this blog I explain my approach to the well-known “Housing Prices Competition for Kaggle Learn Users”: https://www.kaggle.com/c/home-data-for-ml-course. By the end you will have a general idea of how to approach such a dataset and how the choice of model is made.

Home Page of the specified Competition

The link to my solution to the problem is: https://www.kaggle.com/raidivyansh/us-housing-221?scriptVersionId=36243473.

Don’t hesitate to look into the notebook; there is no harm in getting a general intuition first. Keep reading the code alongside the blog for a more comprehensive understanding.

If you don’t yet know how to use Kaggle Kernels, it is worth going through an introductory tutorial first.

Getting Started

First, go through the problem description, evaluation method, FAQs and tutorials; they give you a gist of what the problem demands and of any new concepts you may need to use.

1. Observing the data

I am assuming you have imported the required libraries and the datasets (train.csv and test.csv) into the kernel. The strongest opening is to glance at how the data looks, to get a rough idea of things such as the number of columns and what kind of values (categorical or numerical) each column contains.

Then look at things such as the number of NaNs per column, and plot a bar graph of them so that they become easier to interpret. You can also collect the categorical and numerical columns into separate variables, as sketched below.
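
A minimal sketch of this first look (the file paths follow the usual layout of this competition’s Kaggle kernels and may differ in your setup):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the competition files (paths as in this competition's Kaggle kernels)
train = pd.read_csv("../input/home-data-for-ml-course/train.csv")
test = pd.read_csv("../input/home-data-for-ml-course/test.csv")

# First glance: number of rows/columns and what each column contains
print(train.shape)
print(train.head())

# NaNs per column, plotted for the columns that actually have missing values
nan_counts = train.isnull().sum()
nan_counts[nan_counts > 0].sort_values(ascending=False).plot(kind="bar", figsize=(12, 4))
plt.ylabel("Number of NaNs")
plt.show()

# Keep the numerical and categorical column names in separate variables
numerical_cols = train.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = train.select_dtypes(include=["object"]).columns.tolist()
```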

2. Visualizing the data

Analysing the data through plots is one of the most important parts of the solution: it gives you a feel for how varied the dataset is and a rough idea of the boundaries to use for outlier removal. Using the seaborn library, you can make several dist plots and box plots to observe the distribution of each variable, and scatter plots and reg plots to observe the relationship between each variable and the output (i.e., SalePrice here).

A few of the scatter plots (left) and box plots (right)
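
A few illustrative plots; the columns GrLivArea and OverallQual are examples from this dataset, not necessarily the ones I plotted in the notebook:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of the target (histplot with kde is the modern form of distplot)
sns.histplot(train["SalePrice"], kde=True)
plt.show()

# Box plot of a single feature, useful for eyeballing outlier boundaries
sns.boxplot(y=train["GrLivArea"])
plt.show()

# Relationship between features and the output
sns.scatterplot(x="GrLivArea", y="SalePrice", data=train)
plt.show()

sns.regplot(x="OverallQual", y="SalePrice", data=train)
plt.show()
```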

3. Data Cleansing

  1. Remove all the outliers detected through the plots; note that the choice of outlier boundaries depends entirely on your observation and interpretation of the dataset.
  2. Check for highly correlated features among the numerical columns, and drop one column from each pair of correlated features.
  3. Remove features that are almost constant, or almost entirely undefined, across the whole training set, as they have no influence on the prices (being the same for every example, they act as constants).
  4. Check whether other columns should go along with the ones already removed (for example, if the PoolQC variable is removed, the PoolArea variable is no longer needed). A sketch of these steps follows the heatmap below.
Heatmap showing correlation between numerical features (1 where correlation > 0.8)
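
A rough sketch of these cleansing steps; the GrLivArea threshold, the 0.8 correlation cut-off and the 99% "near-constant" cut-off are illustrative choices, not the only reasonable ones:

```python
# 1. Drop outliers spotted in the plots -- the threshold is illustrative,
#    your own boundaries depend on how you read the plots
train = train[train["GrLivArea"] < 4500]

# 2. Find highly correlated numerical feature pairs and mark one of each pair
features = [c for c in numerical_cols if c not in ("Id", "SalePrice")]
corr = train[features].corr()
high_corr_pairs = [(a, b) for a in corr.columns for b in corr.columns
                   if a < b and abs(corr.loc[a, b]) > 0.8]
to_drop = [b for a, b in high_corr_pairs]

# 3. Mark near-constant columns (the same value in almost every row)
near_constant = [c for c in train.columns if c not in ("Id", "SalePrice")
                 and train[c].value_counts(normalize=True, dropna=False).iloc[0] > 0.99]

train = train.drop(columns=list(set(to_drop + near_constant)))
```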

4. Data Engineering

This is considered the most delicate part of the solution; most of the decisions taken here will affect your model’s accuracy in the end. In this part you fill the given NaN values, create new variables, and perform a few more such steps.

4.a. Dealing with NaNs

To fill the NaN values in both the training and test files, you may apply any of these approaches (a sketch of the second one follows the list):

  1. Fill all the NaNs with zeros (or None).
  2. Fill the features that have few NaNs, or that carry considerably low importance, with zeros, and fill the rest in some logical way, e.g. with the median (or mode).
  3. Another way to deal with missing values is to treat them as labels and use the rest of the filled columns to predict them. One major drawback is that if there is no correlation between the attributes with missing data and the other attributes in the dataset, the model predicting the missing values will be biased. For a newbie in ML, this approach may seem difficult.
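
A sketch of the second approach; the column names below are examples from this dataset and are not necessarily the exact ones handled this way in my notebook:

```python
# Columns where NaN really means "feature not present" get a constant label
for df in (train, test):
    for col in ["PoolQC", "Fence", "FireplaceQu", "GarageType"]:
        if col in df.columns:
            df[col] = df[col].fillna("None")

    # A numerical column filled with the median of the training data
    if "LotFrontage" in df.columns:
        df["LotFrontage"] = df["LotFrontage"].fillna(train["LotFrontage"].median())

    # A categorical column filled with its most frequent value (mode)
    if "Electrical" in df.columns:
        df["Electrical"] = df["Electrical"].fillna(train["Electrical"].mode()[0])
```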

4.b. Creating New Variables

You can also form new variables: for instance, a flag named “Has2ndFlr” that records whether or not a house has a second floor. At times you may also combine a few related variables into a single new one.
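
A small sketch, using the Has2ndFlr flag from above plus a hypothetical combined TotalSF feature:

```python
# A new binary flag and a combined size feature (TotalSF is a hypothetical name)
for df in (train, test):
    df["Has2ndFlr"] = (df["2ndFlrSF"] > 0).astype(int)
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
```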

4.c. Datatype Modification

You may also change the datatype of some columns to a more suitable type. For example, the year-based variables have numerical values, but each year really behaves like a category, so their datatype is better set to string.
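
For example (the exact list of year columns converted in the notebook may differ):

```python
# Year-like columns are numerical but behave like categories, so cast to string
year_cols = ["YrSold", "MoSold", "YearBuilt", "YearRemodAdd"]
for df in (train, test):
    df[year_cols] = df[year_cols].astype(str)
```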

4.d. Removing Skewness

Real-life data is generally skewed, but if there is too much skewness, many statistical models don’t work well, because the sparse tail of the distribution ends up being treated as outliers.

Outliers adversely affect a model’s performance, especially for regression-based models, so we apply a transformation to the output (SalePrice) to make it less skewed.
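
A common way to do this (a stand-in for whichever transform you prefer) is a log transform of the target:

```python
import numpy as np

print("Skewness before:", train["SalePrice"].skew())
train["SalePrice"] = np.log1p(train["SalePrice"])   # log(1 + x) transform
print("Skewness after: ", train["SalePrice"].skew())
```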

You have worked hard on your data, and it is now in a much more usable form for this problem. Be ready to run it through several models.

5. Model Selection

If this is your first Kaggle problem, the parts above probably felt intuitive, and you will have picked up a proper idea of working with data. This part won’t go quite as easily: it may be your first time working with this many models, and you now need to read up on models and on encoding functions (for dealing with categorical values).

Divide the training dataset into two parts, a training set (80%) and a held-out validation set (20%), and then apply steps such as OneHotEncoder and SimpleImputer so that all the data becomes processable, roughly as follows.
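
A sketch using scikit-learn; the 80/20 split and the imputation strategies are illustrative choices:

```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X = train.drop(columns=["Id", "SalePrice"])
y = train["SalePrice"]

# 80/20 split of the training file into a training and a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

num_cols = X.select_dtypes(include=["number"]).columns
cat_cols = X.select_dtypes(include=["object"]).columns

# Impute remaining NaNs and one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
```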

We need some knowledge of several models suited to the problem and of proper hyperparameter tuning, since a wrong choice of hyperparameters can make even the right model perform badly. Admittedly, in this notebook the hyperparameter optimisation is not done well; the values are picked directly from several tutorials.

Then calculate the loss for each model (here, the mean absolute error). If there is a significant difference between the loss values, the model with the minimum loss should be chosen to predict the sale prices of the houses in the test set.
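
Continuing the sketch above (reusing preprocess, X_train and so on), here is one way to compare a few candidate models by their validation MAE; the particular models and hyperparameter values are placeholders, not the ones from my notebook:

```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

models = {
    "ridge": Ridge(alpha=10),
    "random_forest": RandomForestRegressor(n_estimators=300, random_state=0),
    "xgboost": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
}

for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    print(name, mean_absolute_error(y_valid, pipe.predict(X_valid)))
```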

The resulting values are small (close to 0), which suggests the chosen model is giving accurate results.

6. Submission

What matters the most is the output. So, last but not least, submit your predictions.
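
A sketch of building the submission file; this competition expects Id and SalePrice columns, and expm1 undoes the earlier log1p transform if you applied one:

```python
import numpy as np
import pandas as pd

# Refit the best-scoring model on the full training data and predict on test.csv
best = Pipeline([("prep", preprocess), ("model", models["xgboost"])])
best.fit(X, y)

test_preds = np.expm1(best.predict(test.drop(columns=["Id"])))  # undo log1p

submission = pd.DataFrame({"Id": test["Id"], "SalePrice": test_preds})
submission.to_csv("submission.csv", index=False)
```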

Congratulations, you have finally made it to the leaderboard. You can make several attempts, steer through with a better approach to the problem, and climb up the leaderboard.

Conclusion

For every beginner in ML, the first data science problem they work on seems a bit challenging and requires a lot of patience. Through this blog, I have tried to provide an approach to solving any such data-related problem.

“YOU NEVER KNOW WHAT YOU CAN DO UNTIL YOU TRY”

The Kaggle community is an awesome place to learn great ideas through public kernels and discussions on the forums, and to get properly cleaned-up datasets easily. Kaggling seems a bit overwhelming at first, but once you are in, trust me, you are going to love it.

So, KEEP CALM and KAGGLE 😉.
