How to ace your first hackathon — Tutorial in Python

Machine Hack — Doctor’s Fee Prediction Challenge

supreet deshpande
7 min read · Feb 23, 2019

Hello! Are you a newbie to the world of hackathons? Do you feel overwhelmed and intimidated by Kaggle and wonder how to start solving a data science problem? Does it bother you that, with so many impressive and complex models out there in the competition, your knowledge might never be enough? Do you have what I like to call ‘Kaggle-o-phobia’?

Well, fear not. I’ve been there too and as they rightly say — “The best things lie on the other side of fear!”

So, this tutorial on ‘predicting doctor’s consultation fee’, a hackathon hosted on machinehack.com, takes you through each and every step in detail and helps you understand why you’re doing what you’re doing. A prerequisite for this tutorial is to be inquisitive! So, let's get started.

Problem Statement

Source: Machine Hack

We have all been in situations where we go to a doctor in an emergency and find that the consultation fees are too high. But, what if we had important details about a doctor? Could we predict the doctor’s consulting fee? So, let's build a Machine Learning model that helps us predict a doctor’s fee.

Datasets

We will be using two datasets — Train data and Test data (link).

Screenshot of the Training data (5961 rows): Training data refers to that portion of data used to fit a model.

Note that the training data is extremely untidy and we will have to perform several data transformations before we use it to build a model.

Screenshot of the Test data (1987 rows): Test data is used to assess how well the final model might perform on additional data.

Note that the test data is similar to the training dataset, minus the ‘Fees’ column (To be predicted using the model).

Python Coding

Step 1: Import the relevant libraries in Python.
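
The original code appears in screenshots, so here is a minimal sketch of the imports this tutorial relies on (pandas and numpy for data handling, scikit-learn and xgboost for modelling); the exact set in the original may differ:

```python
# Data handling
import numpy as np
import pandas as pd

# Modelling utilities used in the later steps
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
```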

Step 2: Import Train and Test datasets and append them

Appending the datasets makes coding simpler and all the data transformations can be applied to both the datasets at once. After data-cleaning and data-transformations, the appended dataset would be split back to test and train datasets.
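
A sketch of this step, assuming the downloaded files are named train.csv and test.csv (adjust to the actual file names from MachineHack):

```python
# Hypothetical file names; use the actual files provided by MachineHack
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Flag the source of each row so the combined frame can be split back later
train["source"] = "train"
test["source"] = "test"

# Append test below train so every transformation is applied to both at once
data = pd.concat([train, test], ignore_index=True, sort=False)
```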

Step 3: Feature Generation

This is probably the most important step in the entire process, as we will be taking raw, unstructured data and defining features (i.e. variables) for potential use in our ML model. Let’s go through each feature in detail and try to understand the rationale behind those transformations.

Qualification: In the column ‘Qualification’, we can see that there are multiple qualifications for one doctor. Hence, it makes sense to split these qualifications into different columns. As there would be many blank spaces in the newly created columns, we will replace them with “XXX”. (We will know in a while why we didn’t keep them as blanks)

Qualification split into 3 variables (Qual_1, Qual_2, and Qual_3); blanks replaced with “XXX”
Screenshot of the newly created columns
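
A possible implementation, assuming the qualifications are comma-separated within the Qualification column:

```python
# Split the comma-separated qualifications into (at most) three columns
quals = data["Qualification"].str.split(",", expand=True).reindex(columns=[0, 1, 2])

data["Qual_1"] = quals[0].str.strip()
data["Qual_2"] = quals[1].str.strip()
data["Qual_3"] = quals[2].str.strip()

# Doctors with fewer than three qualifications leave blanks; replace them with "XXX"
for col in ["Qual_1", "Qual_2", "Qual_3"]:
    data[col] = data[col].fillna("XXX")
```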

Experience: In the column ‘Experience’, values are stored as a string. Let us convert them into integers so that we can easily use them as a model feature.

Strip the non-numeric text from the string and convert the number of years to an integer
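
A sketch, assuming the entries look like “24 years experience”:

```python
# Keep only the leading digits (e.g. "24 years experience" -> 24) and cast to int
data["Experience"] = (
    data["Experience"].str.extract(r"(\d+)", expand=False).fillna(0).astype(int)
)
```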

Rating: The column ‘Rating’ is stored as a string. Let’s convert it to an integer percentage. We can treat missing values either with the mean or use 0% as a replacement.
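
One way to do this, assuming ratings are stored like “98%”:

```python
# "98%" -> 98; missing ratings are replaced with 0 (using the mean is another option)
data["Rating"] = (
    data["Rating"]
    .str.replace("%", "", regex=False)
    .astype(float)
    .fillna(0)
    .astype(int)
)
```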

Place: In the column ‘Place’, we can see that there’s a locality followed by a major city. Hence, we’ll split it into two variables and follow similar steps as above.

Screenshot of the newly created columns
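
A sketch of the split, with the new column names Locality and City chosen here for illustration:

```python
# "locality, city" -> two columns; missing values become "XXX" as before
place = data["Place"].str.split(",", expand=True).reindex(columns=[0, 1])

data["Locality"] = place[0].str.strip().fillna("XXX")
data["City"] = place[1].str.strip().fillna("XXX")
```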

Profile: Profile info is already in good shape. So, let's keep it as is.

Miscellaneous_info: This is the most notorious column and we will have to generate as many features as possible. If you observe the entries carefully, you’ll see that generally there’s a rating %, followed by the number of people who rated and then the doctor’s address. So, we can split it into 3 variables.

Split ‘Miscellaneous_info’ into ‘Misc_1’, ‘Misc_2’, and ‘Misc_3’
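
A rough sketch, assuming the column is named Miscellaneous_Info and that the first two space-separated tokens are the rating and the number of raters:

```python
# Split into at most three parts: rating %, number of raters, and the remaining text
misc = data["Miscellaneous_Info"].str.split(" ", n=2, expand=True).reindex(columns=[0, 1, 2])

data["Misc_1"] = misc[0].fillna("XXX")  # rating percentage, e.g. "98%"
data["Misc_2"] = misc[1].fillna("XXX")  # number of people who rated
data["Misc_3"] = misc[2].fillna("XXX")  # remaining text (the doctor's address)
```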

Voila! 70% of the work is done. Now that we have all the features, let’s create a new data frame with just the relevant variables (features).

Step 4: Feature Selection

After generating features, it is often necessary to test transformations of the original features and select a subset of this pool of potential original and derived features for use in our model. Using too many features can result in multicollinearity among them whereas extracting the minimum number of features might not give us the best results.

So, let’s create a new data frame with the relevant features that would go into the model and split it back as train and test data. (Remember we had appended them to perform data transformations simultaneously on both of them?)

Create a new data frame; split it as train and test; drop ‘Fees’ column from the test data
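
Continuing the sketch, with the feature list shown here purely for illustration:

```python
# Keep only the engineered features, the target, and the source flag from Step 2
features = ["Qual_1", "Qual_2", "Qual_3", "Experience", "Rating",
            "Locality", "City", "Profile", "Misc_1", "Misc_2", "Misc_3"]
model_data = data[features + ["Fees", "source"]].copy()

# Split the combined frame back into train and test using the source flag
train_df = model_data[model_data["source"] == "train"].drop(columns=["source"])
test_df = model_data[model_data["source"] == "test"].drop(columns=["source", "Fees"])
```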

Finally, after all the data munging, we now have good features that can be used to train our model. In other words, these features will influence the model output.

Step 5: Choosing a Machine Learning algorithm

Well, there are tons of brilliant algorithms out there that could be used to solve a problem, but it's extremely important to have a basic understanding of how these algorithms work and which one fits well with our case.

Some learning material before we jump into modelling

This article explains the basic concepts of machine learning and gives some intuition of using different kinds of machine learning algorithms in different tasks.

Below is another great article that briefly explains the top machine learning algorithms. As the author, James Le, rightly says:

Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn’t bust out a shovel and start digging.

Now that we are familiar with the most commonly used machine learning algorithms, let’s go ahead and see which one we will be using.

Extreme Gradient Boosting (XGBoost)

XGBoost, short for “Extreme Gradient Boosting”, was introduced in 2014. Since its inception, it has become one of the most widely used machine learning algorithms in hackathons and competitions.

Before moving ahead, I strongly recommend reading this article which beautifully explains the math behind bagging and boosting.

Step 6: Prepare categorical variables for XGBoost using a label encoder

Internally, XGBoost represents every problem as a regression predictive modelling problem that accepts only numerical values as input. If your data is in a different form, it must be prepared into the expected format.

To convert categorical text data into model-understandable numerical data, we use the LabelEncoder class. So, to label encode a column, all we have to do is import the LabelEncoder class from the sklearn library, fit and transform the column, and then replace the existing text data with the new encoded data.

Label Encoding for categorical variables

Note that we encode only the categories present in the test data. Categories that appear in the train data but not in the test data would not be relevant to the model, and this approach avoids encoding them.

Please note that label encoding on the test data is relevant here only because the test data is provided up front. If the test data were not available, we would encode the variables on the train data instead.
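
A sketch of the encoding, continuing with the illustrative column names from earlier; the `_code` suffix is my own naming:

```python
# Categorical columns to encode (illustrative list)
cat_cols = ["Qual_1", "Qual_2", "Qual_3", "Locality", "City",
            "Profile", "Misc_1", "Misc_2", "Misc_3"]

# Fit a LabelEncoder per column on the test data, as described above
for col in cat_cols:
    le = LabelEncoder()
    test_df[col + "_code"] = le.fit_transform(test_df[col].astype(str))
```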

Step 7: Pull encoded features from test data to train data

Create unique lists of [variable, variable code] combinations
Pull the respective encoded variables into the train data (using a left join)
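
A possible implementation of this lookup-and-join, continuing the sketch above:

```python
# Build a unique [value, code] lookup per column from the test data and
# left-join it onto the train data so both share the same encoding
for col in cat_cols:
    lookup = test_df[[col, col + "_code"]].drop_duplicates()
    train_df = train_df.merge(lookup, on=col, how="left")

# Categories that appear only in the train data get no code; fill them with -1
code_cols = [c + "_code" for c in cat_cols]
train_df[code_cols] = train_df[code_cols].fillna(-1)
```

Filling unmatched categories with -1 is an assumption of this sketch; the original article may handle them differently.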

Step 8: Create X and y datasets

X — independent variables; y — dependent variable

Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y): Y = f(X)
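
In code, continuing the sketch (the numeric columns plus the encoded categorical codes form X, and ‘Fees’ is y):

```python
# Independent variables: numeric features plus the encoded categorical codes
X = train_df[["Experience", "Rating"] + code_cols]

# Dependent variable: the consultation fee we want to predict
y = train_df["Fees"]
```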

Step 9: Import XGBoost and convert the dataset to DMatrix

DMatrix is the data matrix used in XGBoost. It is an internal data structure that is used by XGBoost which is optimized for both memory efficiency and training speed.
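
For example (the DMatrix is what XGBoost’s native xgb.train and xgb.cv interfaces consume):

```python
# Wrap the features and target in XGBoost's optimized DMatrix structure
dmatrix = xgb.DMatrix(data=X, label=y)
```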

Step 10: Create cross-validation data sets and execute the model

Test and train data are created for the cross-validation of the results using the train_test_split function from sklearn’s model_selection module, with test_size equal to 30% of the data. Also, random_state is assigned to maintain the reproducibility of results.

The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library with the hyper-parameters passed as arguments. For classification problems, we would have used the XGBClassifier() class.

Finally, fit the regressor to the training set and make predictions on the test set using the familiar .fit() and .predict() methods.
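
A sketch of this step; the hyper-parameter values below are placeholders, not the ones used in the original article:

```python
# 70/30 split for validating the model, with a fixed seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Instantiate the regressor with illustrative hyper-parameters
xg_reg = xgb.XGBRegressor(
    objective="reg:squarederror",
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
)

# Fit on the training split and predict on the validation split
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_val)
```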

Step 11: RMSE and Final Prediction

Compute RMSE; Prepare test-data; Use the model created to predict ‘Fees’ for test-data
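
Putting it together, with the submission file name chosen here for illustration:

```python
# Root-mean-squared error on the validation split
rmse = np.sqrt(mean_squared_error(y_val, preds))
print("RMSE: %.2f" % rmse)

# Prepare the test data with the same feature columns and predict 'Fees'
test_X = test_df[["Experience", "Rating"] + code_cols]
test_df["Fees"] = xg_reg.predict(test_X)

# Save the predictions in whatever format MachineHack expects for submission
test_df[["Fees"]].to_excel("submission.xlsx", index=False)
```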

Final Remarks

Hurray! You just finished your data science project. Go ahead and submit your results! I hope this tutorial was useful in one way or another. In this tutorial, we saw how to transform data, generate new features, build a machine learning model, and predict results on the test data. However, this is what I would call the baseline model. Our goal must be to beat the baseline model by fine-tuning the features and model parameters. Nevertheless, this is a great start and we shall keep improving.

In my next article, I’ll share some interesting insights about this data (Doctor’s Fee Prediction) and how we can accelerate to the top of the leaderboard with some very simple hacks.

Please let me know your thoughts about this tutorial and do comment if you face any issues.

Cheers!
