Approach to Solve Regression Problems

Harsh Kothari · Published in Geek Culture · Feb 3, 2021 · 7 min read

Some crucial things many data scientists forget to consider

Photo by h heyerlein on Unsplash

Hi Guys! 🌞

I hope you will all like my approach and my way of looking at the small details of the problem. So, let’s get started 🏃

Table of Contents:

  1. Introduction
  2. Normalization and Distribution
  3. Feature Engineering
  4. Target Feature
  5. Categorical Features
  6. Relationships
  7. Model Evaluation
  8. Pipeline

First, I would like to introduce you to the dataset of bicycle rentals. Our target feature is rentals, accompanied by 13 other features. The description of each is mentioned below:

  • instant: A unique row identifier
  • dteday: The date on which the data was observed — in this case, the data was collected daily; so there’s one row per date.
  • season: A numerically encoded value indicating the season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr: The year of the study in which the observation was made (the study took place over two years — year 0 represents 2011, and year 1 represents 2012)
  • mnth: The calendar month in which the observation was made (1:January … 12:December)
  • holiday: A binary value indicating whether or not the observation was made on a public holiday
  • weekday: The day of the week on which the observation was made (0:Sunday … 6:Saturday)
  • workingday: A binary value indicating whether or not the day is a working day (not a weekend or holiday)
  • weathersit: A categorical value indicating the weather situation (1:clear, 2:mist/cloud, 3:light rain/snow, 4:heavy rain/hail/snow/fog)
  • temp: The temperature in Celsius (normalized)
  • atemp: The apparent (“feels-like”) temperature in Celsius (normalized)
  • hum: The humidity level (normalized)
  • windspeed: The windspeed (normalized)
  • rentals: The number of bicycle rentals recorded.
Image by Author

I have already cleaned and pre-processed the dataset, since most of you already know how to deal with missing values, apply normalization, convert categorical features to numeric, and use other pre-processing techniques.

2. Normalization and Distribution

If you look at the data carefully, I have already normalized features such as temp, atemp, hum, and windspeed. Normalization prevents the model from becoming biased toward features with larger scales and from depending too heavily on any one feature.
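(For reference, here is a minimal sketch of how such min-max normalization can be done with scikit-learn. The file name daily-bike-share.csv and the dataframe name bike_data are my assumptions for illustration; in this dataset, these columns already arrive normalized.)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed file name for the daily bike rentals data
bike_data = pd.read_csv('daily-bike-share.csv')

# Hypothetical illustration: min-max scale the weather columns to the 0-1 range
scale_columns = ['temp', 'atemp', 'hum', 'windspeed']
bike_data[scale_columns] = MinMaxScaler().fit_transform(bike_data[scale_columns])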

Now let’s explore these features with the describe() method.
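For instance, assuming the dataframe is named bike_data as above:

# Summary statistics for the numeric columns, including the rentals label
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
print(bike_data[numeric_features + ['rentals']].describe())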

Image by Author
  • The statistics reveal some information about the distribution of the data in each of the numeric fields. From this, we can see that the mean number of daily rentals is around 848; but there’s a comparatively large standard deviation, indicating a lot of variance in the number of rentals per day. We might get a clearer idea of the distribution of rentals values by visualizing the data.
Photo by Author
  • The numeric features seem to be more normally distributed, with the mean and median nearer the middle of the range of values, coinciding with where the most commonly occurring values are.

Note: The distributions are not truly normal in the statistical sense, which would result in a smooth, symmetric “bell-curve” histogram with the mean and mode (the most common value) in the center; but they do generally indicate that most of the observations have a value somewhere near the middle.
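If you want to reproduce these distribution plots yourself, here is a rough sketch (again assuming bike_data; the styling of the figures in this article may differ):

import matplotlib.pyplot as plt

# Histogram of each numeric feature, with mean and median marked
for col in ['temp', 'atemp', 'hum', 'windspeed']:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    bike_data[col].plot.hist(bins=100, ax=ax)
    ax.axvline(bike_data[col].mean(), color='magenta', linestyle='dashed', linewidth=2)  # mean
    ax.axvline(bike_data[col].median(), color='cyan', linestyle='dashed', linewidth=2)   # median
    ax.set_title(col)
plt.show()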

3. Feature Engineering

For example, let’s add a new column named day to the dataframe by extracting the day component from the existing dteday column. The new column represents the day of the month from 1 to 31.
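One way to do this with pandas (assuming bike_data as before):

import pandas as pd

# Parse the date column, then extract the day of the month (1-31)
bike_data['dteday'] = pd.to_datetime(bike_data['dteday'])
bike_data['day'] = bike_data['dteday'].dt.day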

However, this feature turns out to be not much use. Want to know why?

Photo by Author
  • The day feature we created for the day of the month is roughly uniformly distributed and shows little variation, indicating that it’s probably not predictive of the number of rentals.

4. Target Feature

Rental distribution: let’s look at a histogram and a box plot.
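Here is a minimal sketch of how these two plots can be produced (assuming bike_data; the styling is illustrative):

import matplotlib.pyplot as plt

# Histogram and box plot of the rentals label
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(9, 10))
bike_data['rentals'].plot.hist(bins=100, ax=ax1)
ax1.set_title('Rentals distribution')
bike_data['rentals'].plot.box(ax=ax2, vert=False)
ax2.set_title('Rentals box plot')
plt.show()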

Histogram Distribution | Photo by Author
Box Plot | Photo by Author
  • The plots show that the number of daily rentals ranges from 0 to just over 3,400. However, the mean (and median) number of daily rentals is closer to the low end of that range, with most of the data between 0 and around 2,200 rentals. The few values above this are shown in the box plot as small circles, indicating that they are outliers — in other words, unusually high or low values beyond the typical range of most of the data.

5. Categorical Features

We’ve explored the distribution of the numeric values in the dataset, but what about the categorical features? These aren’t continuous numbers on a scale, so we can’t use histograms, but we can plot a bar chart showing the count of each discrete value for each category.
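For example (assuming bike_data, with the day column added earlier):

import matplotlib.pyplot as plt

# Bar chart of the count of each discrete value per categorical feature
for col in ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'day']:
    counts = bike_data[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax=ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
plt.show()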

Photo by Author

Many of the categorical features show a more or less uniform distribution (meaning there’s roughly the same number of rows for each category). Exceptions to this include:

  • holiday: There are many fewer days that are holidays than days that aren’t.
  • workingday: There are more working days than non-working days.
  • weathersit: Most days are category 1 (clear), with category 2 (mist and cloud) the next most common. There are comparatively few category 3 (light rain or snow) days, and no category 4 (heavy rain, hail, or fog) days at all.

6. Looking at Relationships

Now that we know something about the distribution of the data in our columns, we can start to look for relationships between the features and the rentals label we want to be able to predict.

For the numeric features, we can create scatter plots that show the intersection of feature and label values. We can also calculate the correlation statistic to quantify the apparent relationship.
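A minimal version of these scatter plots, with the Pearson correlation shown in each title (assuming bike_data):

import matplotlib.pyplot as plt

# Scatter plot of each numeric feature against the rentals label
for col in ['temp', 'atemp', 'hum', 'windspeed']:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    correlation = bike_data[col].corr(bike_data['rentals'])
    ax.scatter(bike_data[col], bike_data['rentals'])
    ax.set_xlabel(col)
    ax.set_ylabel('Bike Rentals')
    ax.set_title('rentals vs ' + col + ' (correlation: ' + str(round(correlation, 2)) + ')')
plt.show()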

Photo by Author

The results aren’t conclusive, but if you look closely at the scatter plots for temp and atemp, you can see a vague diagonal trend showing that higher rental counts tend to coincide with higher temperatures; and a correlation value of just over 0.5 for both of these features supports this observation. Conversely, the plots for hum and windspeed show a slightly negative correlation, indicating that there are fewer rentals on days with high humidity or windspeed.

Now let’s compare the categorical features to the label.
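One simple way to do that is a box plot of rentals grouped by each categorical feature (assuming bike_data):

import matplotlib.pyplot as plt

# Box plot of rentals for each level of each categorical feature
for col in ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    bike_data.boxplot(column='rentals', by=col, ax=ax)
    ax.set_title('')  # clear the per-axes title added by pandas
    ax.set_ylabel('Bike Rentals')
plt.show()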

Photo by Author
  • There’s a noticeable trend that shows different rental distributions in summer and fall months compared to spring and winter months.
Photo by Author
  • There’s a clear difference in the distribution of rentals on weekends (weekday 0 and 6) and those during the working week (weekday 1 to 5).
Photo by Author
  • Similarly, there are noticeable differences between the working day categories.

7. Model Evaluation

Experiment with different regression models; for model evaluation, you can use RMSE (root mean squared error), R2, and similar metrics. Use hyperparameter tuning for better results.

Note: You can find out more about these and other metrics for evaluating regression models in the Scikit-Learn documentation.

  • For this example, comparing each prediction with its corresponding “ground truth” actual value isn’t a very efficient way to determine how well the model is predicting. Let’s see if we can get a better indication by visualizing a scatter plot that compares the predictions to the actual labels, as sketched below.
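Here is a hedged sketch of the evaluation loop behind the results that follow, using linear regression as the example. The 70/30 split and the random_state value are my assumptions; the exact split behind the reported numbers isn’t stated here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and label, then split into train and test sets
X = bike_data[['season', 'mnth', 'holiday', 'weekday', 'workingday',
               'weathersit', 'temp', 'atemp', 'hum', 'windspeed']].values
y = bike_data['rentals'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit a model and predict on the held-out test set
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# RMSE and R2
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
print('R2:', r2_score(y_test, predictions))

# Scatter plot comparing predictions to actual labels
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions vs Actuals')
plt.show()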

I. Linear Regression

RMSE: 449.4135728595166
R2: 0.6040454736919189
Photo by Author

II. Lasso Regression

RMSE: 448.5038527519959
R2: 0.605646863782449
Photo by Author

III. Decision Tree

RMSE: 490.7097271948421
R2: 0.5279344839454737
Photo by Author

IV. Random Forest

RMSE: 329.46670935564396
R2: 0.7871978460868888
Photo by Author

For good measure, let’s also try a boosting ensemble algorithm.

V. Gradient Boosting

RMSE: 322.4734419735391
R2: 0.7961358554502365
Photo by Author

8. Pipeline

Now that you know which model to use, you can use a pipeline to encapsulate all of the preprocessing steps as well as the regression algorithm. Sample code for the same is shown below:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor

# Define preprocessing for numeric columns (scale them)
numeric_features = [6, 7, 8, 9]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (one-hot encode them)
categorical_features = [0, 1, 2, 3, 4, 5]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])

# Fit the pipeline to train the model on the training set
model = pipeline.fit(X_train, y_train)
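Once fitted, the pipeline applies the same preprocessing to any new data before predicting, so evaluating it looks just like before (assuming the X_test and y_test from the earlier split):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# The pipeline preprocesses the raw test features, then predicts
predictions = model.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
print('R2:', r2_score(y_test, predictions))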

👉Thank you for reading this story ❤.

Harsh Kothari
Data Science & Analytics. Google Cloud Professional Data Engineer (2x GCP). 🛠 Azure Data Scientist Associate (2x Azure). www.linkedin.com/in/harsh-kothari21/