
Beginner's Guide to Working on Kaggle Datasets, Creating Data Pipelines Using ColumnTransformer, and More!

Chesta Dhingra
Published in Geek Culture · Apr 23, 2023 · 5 min read

In this article we'll focus on how to evaluate the data and which techniques can be applied before building the model. One spoiler alert: we'll be using an ensemble learning algorithm, the Random Forest Regressor, and will look at its benefits.

The content of the article covers the analysis and model development, which will help clear up our theoretical concepts while we implement them side by side. You can find the notebook here.

  1. Introduction to the data
  2. Exploratory Data Analysis: univariate, bivariate and multivariate analysis (in the multivariate part we'll create a correlation matrix that shows how the independent variables are related to each other).
  3. Creating normalization and one-hot encoding pipelines for data preprocessing, which is an important step.
  4. Importance of cross-validation.
  5. Advantages of using an ensemble learning algorithm.

So, let’s begin.

The dataset is from a Kaggle competition named Regression with a Tabular Media Campaign. In it we are predicting the cost, i.e. the amount spent on media campaigns, based on various parameters that include the types of customers who visit the stores, the amenities available in the stores, how many units they can sell, the store sales, and so on.

Introduction

Beginning with the data, two files are available: train and test. The data does not contain any null or duplicated values. It contains categorical as well as numerical features: for example, florist, salad_bar, coffee_bar, low_fat etc. are categorical, whereas sales in millions, store_sqft, gross_weight etc. are numerical.
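A minimal sketch of these first checks with pandas (the file names train.csv and test.csv are an assumption about the competition download, not taken from the notebook):

```python
import pandas as pd

# file names are assumed from the usual Kaggle competition layout
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# confirm there are no missing or duplicated rows
print(train.isnull().sum().sum())   # 0 -> no null values
print(train.duplicated().sum())     # 0 -> no duplicated rows

# a quick look at the mix of categorical/binary flags and numeric measures
print(train.dtypes)
print(train[["florist", "salad_bar", "coffee_bar", "low_fat"]].nunique())
```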

Exploratory Data Analysis

1. Boxplots of the ordinal features against our target variable do not show wide variance among the values. In addition, when calculating averages by grouping the categorical or ordinal features with respect to our Y variable, not much fluctuation in the means is observed.

2. Boxplots of the binary variables show that the variance is small for both categories of each binary variable. We have also plotted graphs for the binary variables that show their distribution with respect to the target variable.

3. Density plots of the numeric variables provide information about their distributions; we can observe that the numeric variables have wide distributions and do not follow a normal (bell-curve) shape.

4. Lastly, we create a correlation matrix, which shows how the independent variables are related to one another. When two independent variables are highly correlated, they provide redundant information to the model and one of them can be dropped from the data. In this particular case prepared_food and salad_bar are perfectly correlated, with a value of 1, which leads to multicollinearity in the data (see the sketch after this list).
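A short sketch of the correlation step (the plotting choices and the decision to drop salad_bar rather than prepared_food are illustrative; the notebook may handle this differently):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# correlation matrix over the numeric (and binary 0/1) columns
corr = train.select_dtypes("number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Correlation matrix of independent variables")
plt.show()

# prepared_food and salad_bar are perfectly correlated (value of 1),
# so one of the two can be dropped to remove the redundant information
print(corr.loc["prepared_food", "salad_bar"])
train = train.drop(columns=["salad_bar"])
```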

Creating Pipelines

1. An essential step before building the model is data preprocessing, in which we normalize the numerical variables. Scaling makes it possible to compare features that have different units or ranges.

2. One-hot encoding helps to deal effectively with the variables that belong to categorical as well as binary classes.

3. While creating the instances of both MinMaxScaler and OneHotEncoder, we will use ColumnTransformer from scikit-learn, which helps in building the pipelines and makes the code much cleaner, especially for production-ready purposes. A sketch of the resulting preprocessing step follows this list.
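As a rough sketch of how these three pieces fit together (the column lists below are illustrative placeholders, not the exact columns used in the notebook):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# placeholder column lists -- substitute the actual numeric and categorical
# columns from the competition data
numeric_cols = ["store_sqft", "gross_weight", "units_per_case"]
categorical_cols = ["florist", "coffee_bar", "prepared_food", "low_fat"]

# ColumnTransformer applies each preprocessing step to its own set of columns
preprocessor = ColumnTransformer(
    transformers=[
        ("scale", MinMaxScaler(), numeric_cols),    # normalize ranges to [0, 1]
        ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",
)

# the transformer can be fitted on its own, or chained with a model in a Pipeline
X_preprocessed = preprocessor.fit_transform(train)
```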

Model Building

1. Cross-validation: what do we understand by cross-validation? This methodology helps in building a more stable model; less randomness due to noise will be observed in our model quality, because we run the modeling process on different subsets of the data. Here we'll use the KFold method for cross-validation: the raw training data is divided into 5 folds, and in each round one fold is held out for evaluation while the model is trained on the remaining four. This helps make our model more generalized and gives more consistent results (see the sketch after these two points).

2. RandomForestRegressor: as we already know, for making predictions we'll use random forest regression, an ensemble learning technique based on bagging (bootstrap aggregation). This ensemble technique helps reduce over-fitting on the training dataset, which otherwise leads to high variance and low bias; that problem is quite prevalent in decision trees, so a random forest makes the model more flexible. We create different bootstrap subsets of the data, in which a single observation may appear more than once in the same subset, so each subset contains fewer unique observations than the original training data, and some observations may not occur in a given subset at all. On each of these subsets we build a decision tree, which means each subset has its own DT model, and every model provides its own result. These models are also known as weak learners; in the second step we combine the results from each model and arrive at the final prediction by aggregating them. Lastly, with an ensemble learning algorithm we can find the features that are most important for the predictions and can improve the model by eliminating the variables that do not contribute much to predicting our target variable.
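A hedged sketch of both points, chaining the ColumnTransformer from the earlier sketch with a RandomForestRegressor and scoring it with 5-fold cross-validation (the hyperparameters, the target column name cost and the RMSE metric are assumptions, not values from the notebook):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# preprocessor is the ColumnTransformer built in the pipeline sketch above
rf_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=200, random_state=42)),  # illustrative settings
])

X = train.drop(columns=["cost"])   # "cost" as the target column name is an assumption
y = train["cost"]

# 5 folds: each fold is held out once for evaluation, the rest used for training
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf_pipeline, X, y, cv=kf,
                         scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("Mean RMSE:", -scores.mean())

# fit on the full training data and inspect which features matter most
rf_pipeline.fit(X, y)
importances = rf_pipeline.named_steps["model"].feature_importances_
names = rf_pipeline.named_steps["preprocess"].get_feature_names_out()
for score, name in sorted(zip(importances, names), reverse=True)[:10]:
    print(f"{name}: {score:.3f}")
```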

These are some of the steps that can be followed. In an upcoming article we'll look at some more techniques for working with datasets and at how we can improve model accuracy via feature engineering and feature selection.

Hope you enjoyed reading the article. If you did, follow me on Medium for more such articles, where I passionately write about data science. Do provide your feedback, and if you have any questions we can connect on LinkedIn.
