Kaggle Competitions Beginner Walkthrough: Train a Machine Learning Model and Submit its Predictions

Nischala Nagisetty
Women in Technology
8 min read · May 6, 2023
Photo by Giorgio Trovato on Unsplash

Entering your first Kaggle competition can be exciting but also overwhelming, and the prospect of competing with experienced data scientists may feel intimidating to beginners. In this article, I will walk you through submitting to your first Kaggle competition!

The first step is to create a Kaggle account. Once you have created an account and are redirected to the home page, you should see a ‘Competitions’ section in the left-hand sidebar. Click on this.

Your screen should look like below:

On the Competitions page, users can filter competitions by categories such as “Featured”, “Research”, “Getting Started”, and “Playground”. Each competition has a detailed description of the problem statement, evaluation criteria, and rules.

Today we will be working through the House Prices dataset, which describes homes sold in Ames, Iowa. To find it, just search for ‘house prices’.

Click on the result that says House Prices - Advanced Regression Techniques. You should reach that specific competition’s homepage as seen below.

The home page of a Kaggle competition is the main hub for all information related to that particular competition. It serves as a central location for participants to access the dataset, submit their predictions, and engage with the competition community.

First click Join Competition in the banner. Once you have joined, click on the ‘Code’ section, where you will find a button for New Notebook.

Once you click ‘New Notebook’, you should be redirected to a page that looks like below. Feel free to name your notebook whatever you would like.

Delete everything in the default cell except the following:

This imports the os module and walks through the Kaggle file system, printing the path of every input file.

Now run this cell.
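For reference, here is a minimal sketch of what that cell does. Wrapping the loop in a function named list_input_files is my own choice for illustration (the default Kaggle cell uses a bare loop), and /kaggle/input is the folder where Kaggle notebooks usually mount competition data.

```python
import os


def list_input_files(root="/kaggle/input"):
    """Return the full path of every file found under root."""
    paths = []
    # os.walk visits root and every folder below it
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return paths


# Print each path so it can be copied into pd.read_csv later.
# Outside Kaggle the folder does not exist, so nothing is printed.
for path in list_input_files():
    print(path)
```

The printed paths are what you copy when reading train.csv and test.csv in the next cells.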

Make a new cell and import pandas. Then read the training data into a data frame by copying the file path for train.csv from the output of the previous cell. Your code should look like below after you run it.

Do the same with the test data frame. It should look like below after the cell has been run.
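A sketch of these two cells might look like the following. The folder in DATA_DIR is an assumption, so copy the real path printed by the os.walk cell. The helper read_or_stand_in and its made-up fallback rows are hypothetical and exist only so the snippet also runs outside Kaggle; in the notebook a plain pd.read_csv(path) is enough.

```python
import os

import pandas as pd

# Assumed location of the competition files inside a Kaggle notebook
DATA_DIR = "/kaggle/input/house-prices-advanced-regression-techniques"


def read_or_stand_in(filename, stand_in_rows):
    """Read a competition CSV, or build a tiny stand-in frame if it is absent."""
    path = os.path.join(DATA_DIR, filename)
    if os.path.exists(path):
        return pd.read_csv(path)
    return pd.DataFrame(stand_in_rows)


# Made-up rows, not real competition data
train = read_or_stand_in("train.csv", {
    "Id": [1, 2, 3],
    "LotFrontage": [60.0, None, 80.0],
    "LotArea": [8000, 9000, 10000],
    "SalePrice": [150000, 180000, 210000],
})
test = read_or_stand_in("test.csv", {
    "Id": [4, 5],
    "LotFrontage": [65.0, None],
    "LotArea": [8500, 9500],
})

print(train.shape, test.shape)
print(train.head())
```

Note that the test frame has no SalePrice column: that is the value the competition asks you to predict.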

We are going to start by making a really simple model. Instead of a random guess, we are going to predict the average of all the SalePrice values. Compute the average of SalePrice in the training dataset. It should be around 180 thousand.

Now we can make a new column in the test data frame that holds the average we computed from the training dataset. Make sure your code fills every row by using the len() function. Your code should have an output like below. Notice the new SalePrice column and how every value in it is the average we calculated from the training dataset.
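Here is one way the baseline could be written. The two small data frames are made-up stand-ins for the train and test frames loaded earlier, and the variable name mean_price is my own.

```python
import pandas as pd

# Made-up stand-ins for the frames read from train.csv and test.csv
train = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [150000, 180000, 210000]})
test = pd.DataFrame({"Id": [4, 5]})

# Baseline prediction: the average sale price in the training data
mean_price = train["SalePrice"].mean()

# One copy of the average per test row, so the new column matches the frame's length
test["SalePrice"] = [mean_price] * len(test)

print(mean_price)
print(test)
```

Assigning the single value directly (test["SalePrice"] = mean_price) would also fill every row, but the len() version makes the row count explicit.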

We can submit this now! First we need to take a subset of the test data frame. The competition asks for only the Id and SalePrice columns, so we select those two and write them to a csv file.

Once you run this code, you should see the output file in the right-hand panel.
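A minimal sketch of that step follows. The file name submission.csv is an assumption (any name ending in .csv should work), and the test frame here is a made-up stand-in.

```python
import pandas as pd

# Made-up stand-in for the test frame after adding the SalePrice column
test = pd.DataFrame({
    "Id": [4, 5],
    "LotArea": [8500, 9500],
    "SalePrice": [180000.0, 180000.0],
})

# Keep only the two columns the competition asks for
submission = test[["Id", "SalePrice"]]

# index=False stops pandas from writing the row index as an extra column
submission.to_csv("submission.csv", index=False)

print(pd.read_csv("submission.csv"))
```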

Hover over your csv file name and there should be three little dots. Once you click that, you should be able to download your csv file. You can open the file to check your work before you submit. It should look like this.

Since everything looks good, save your notebook by clicking ‘Save Version’. It should bring up an option to Quick Save.

Now let’s go back to the competitions page. Find the House Prices Competition page again. This time click on the black button that says Submit Predictions. It will give you an option to upload your csv file. Upload the csv file we downloaded before.

Click submit!

It will lead you to the Submissions page where you can see your score.

Congrats on your first submission! Now we can build a better model to improve your score. This competition is scored by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price, so a lower score is better.

Click on Code and then on ‘Your Work’. You should find the notebook you just used in there. Open the notebook and click Edit so that we can continue working in it. Run all the cells again before you create a new cell.

This time, our objective is to develop a more accurate machine learning model compared to our previous submission.

For this model, we are going to use the LotFrontage and LotArea values as our inputs to predict the SalePrice. We will make a data frame with these two columns. To be safe, we are going to make a deepcopy so the original data frame will not be affected by the new changes. Check if your code matches the one below.

Our plan is to fit a linear regression on the input columns (LotFrontage and LotArea) that maps to the output column (SalePrice).

To do this, we are going to make a training input matrix. This will be the variable x_train, and we can convert it to a numpy array. Remember that we want everything but the last column (SalePrice), so we have to account for that in our code.

It should look like below now after it is run.

Now it is time to incorporate our output, which we shall call y_train. This is our last column. You can edit the last code cell again to do this. The code should look like below and you can see what y_train looks like.
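The copy and the two arrays could be sketched as below. The frame name train_lr is my own, and the train frame is a made-up stand-in. I use pandas' .copy(deep=True) for the deep copy; copy.deepcopy from the standard library would be another option.

```python
import pandas as pd

# Made-up stand-in for the training frame read from train.csv
train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotFrontage": [60.0, None, 80.0, 70.0],
    "LotArea": [8000, 9000, 10000, 12000],
    "SalePrice": [150000, 180000, 210000, 200000],
})

# Deep copy of the three columns, so later changes do not touch `train`
train_lr = train[["LotFrontage", "LotArea", "SalePrice"]].copy(deep=True)

# Inputs: every column except the last one (SalePrice)
x_train = train_lr.iloc[:, :-1].to_numpy()

# Output: only the last column
y_train = train_lr.iloc[:, -1].to_numpy()

print(x_train)
print(y_train)
```

The missing LotFrontage value shows up as nan in x_train, which matters in the next step.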

We can make a linear regression on the inputs and map to the output.

If you try running a linear regression with the sklearn library, you will get an error.
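To see the failure in isolation, here is a small sketch with made-up numbers. The try/except is only there so the snippet keeps running; in the notebook you would simply see the error traceback.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up inputs with one missing LotFrontage value
x_train = np.array([[60.0, 8000.0], [np.nan, 9000.0], [80.0, 10000.0]])
y_train = np.array([150000.0, 180000.0, 210000.0])

lr = LinearRegression()
fit_error = None
try:
    lr.fit(x_train, y_train)
except ValueError as err:
    # scikit-learn refuses to fit on inputs that contain NaN
    fit_error = err
    print("fit failed:", err)
```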

The error indicates that our inputs contain missing values. We could delete these rows, but dropping rows is not an option for the test set, where the submission needs a prediction for every Id. Instead, we can use mean imputation, which replaces each missing value with the mean of its column.

To do this we are going to add a cell before our x_train and y_train cell. Using fillna(), we are going to replace all missing values with the mean of that column.
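A sketch of the imputation cell, using a made-up stand-in for the copied frame (called train_lr here, a name of my own):

```python
import pandas as pd

# Made-up stand-in for the deep-copied training columns
train_lr = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0, 70.0],
    "LotArea": [8000, 9000, 10000, 12000],
    "SalePrice": [150000, 180000, 210000, 200000],
})

# Count the missing values per column before filling
print(train_lr.isna().sum())

# train_lr.mean() gives one mean per column (missing values are skipped);
# fillna uses the matching column mean for each gap
train_lr = train_lr.fillna(train_lr.mean())

print(train_lr)
```

In this toy frame the gap in LotFrontage is filled with the mean of the three known values.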

Now go back up and run the cells from the deepcopy and below all over again. Your previous error should be gone.

Linear regression fits the equation y = b0 + b1*x1 + b2*x2.

If you type lr.coef_, you should get the b1 and b2 values. From the array we can see that b1 is about 1011.22 and b2 is about 1.42.

To obtain b0, we use lr.intercept_. Here b0 is 95199.76.

x1 is the LotFrontage input and x2 is the LotArea input.
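To make the link between the equation and the fitted model concrete, here is a small sketch on made-up data generated from y = 1000 + 2*x1 + 3*x2. These numbers are not the housing values, so the coefficients differ from the ones quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up inputs; the outputs follow y = 1000 + 2*x1 + 3*x2 exactly
x_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y_train = 1000 + 2 * x_train[:, 0] + 3 * x_train[:, 1]

lr = LinearRegression().fit(x_train, y_train)

b1, b2 = lr.coef_      # one slope per input column
b0 = lr.intercept_     # the constant term

# Plugging the inputs into the equation by hand gives the model's predictions
manual = b0 + b1 * x_train[:, 0] + b2 * x_train[:, 1]
print(b0, b1, b2)
print(np.allclose(manual, lr.predict(x_train)))
```

In other words, lr.predict is just the equation evaluated with the fitted b0, b1 and b2.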

Now it is time to apply this model to our test data frame.

Remember the test data frame will not have the SalePrice column. So your code should look like below:

Since we ran into missing values in the training dataset, we are going to fill the missing values in the test dataset the same way.

Time to make the array with the test data.

We can now make predictions using this array. The predictions are the output of the linear model applied to x_test, which you get by calling lr.predict(x_test).

Time to make a submission! We are going to do what we did earlier and convert to a csv file.
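Putting the test-side steps together, a sketch could look like this. Both frames are made-up stand-ins, and the names train_lr, test_lr and submission.csv are my own assumptions. As in the text, missing test values are filled with the test column means.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-ins for the frames read from train.csv and test.csv
train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotFrontage": [60.0, None, 80.0, 70.0],
    "LotArea": [8000, 9000, 10000, 12000],
    "SalePrice": [150000, 180000, 210000, 200000],
})
test = pd.DataFrame({
    "Id": [5, 6, 7],
    "LotFrontage": [65.0, None, 75.0],
    "LotArea": [8500, 9500, 11000],
})

# Training side: copy, impute, fit
train_lr = train[["LotFrontage", "LotArea", "SalePrice"]].copy(deep=True)
train_lr = train_lr.fillna(train_lr.mean())
lr = LinearRegression().fit(
    train_lr.iloc[:, :-1].to_numpy(), train_lr.iloc[:, -1].to_numpy()
)

# Test side: same two input columns, no SalePrice to drop
test_lr = test[["LotFrontage", "LotArea"]].copy(deep=True)
test_lr = test_lr.fillna(test_lr.mean())
x_test = test_lr.to_numpy()

# Predict and write the two columns the competition asks for
test["SalePrice"] = lr.predict(x_test)
test[["Id", "SalePrice"]].to_csv("submission.csv", index=False)
print(test[["Id", "SalePrice"]])
```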

Click Run all at the top to make sure there are no errors. Now you can download your new csv and submit to the competition again.

If everything was done correctly, your linear model should get a better (lower) score on the leaderboard than the average-only submission.

Congratulations!! You have just gained a ton of knowledge and submitted your first machine learning model. We used a linear model, but there are many other types of models you can try in the future.

Good luck!!
