Keyboard Sale Rate Prediction with Poisson Regression

Jung-a Kim · Published in Analytics Vidhya · 7 min read · Aug 23, 2020

Keyboard image by Jay Zhang on unsplash.com

With online shopping malls replacing traditional malls, more and more people are becoming interested in selling online.

The purpose of this article is to offer some insight to online sellers interested in the characteristics of product postings that might increase sales of their products. The data used for this project is the set of query results for ‘keyboard’ on ebay.com, scraped with BeautifulSoup.
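The scraping code lives in the repo linked at the end; as a rough idea, here is a minimal sketch of the kind of request-and-parse loop involved (the URL pattern and CSS class names are assumptions for illustration and may not match the repo or eBay’s current markup):

import requests
from bs4 import BeautifulSoup

# Illustrative only: eBay's markup changes often, so these selectors are assumptions.
url = "https://www.ebay.com/sch/i.html?_nkw=keyboard&_pgn=1"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

listings = []
for item in soup.select("li.s-item"):
    title = item.select_one(".s-item__title")
    price = item.select_one(".s-item__price")
    if title and price:
        listings.append({"title": title.get_text(strip=True),
                         "price": price.get_text(strip=True)})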

The raw data is messy, with many duplicate product postings, since eBay lets users opt to automatically re-list an item that doesn’t sell. There is also plenty of cleaning to do: stripping out less meaningful strings, converting data types, removing sparse columns, and so on.
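Continuing the sketch above, the cleaning might look like this (the duplicate-detection key and the sparsity threshold are assumptions for illustration):

import pandas as pd

df = pd.DataFrame(listings)

# Drop duplicate postings created by eBay's automatic re-listing option.
df = df.drop_duplicates(subset="title")

# Strip currency symbols and thousands separators, then convert to numeric.
df["price"] = pd.to_numeric(df["price"].str.replace(r"[$,]", "", regex=True),
                            errors="coerce")

# Drop columns that are almost entirely missing
# (keep those with at least 5% non-null values).
df = df.dropna(axis=1, thresh=int(0.05 * len(df)))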

After the initial cleaning, 5,211 observations remained, which were then split into training and validation sets at a 7:3 ratio. There is still more engineering to do: imputing missing values, checking multicollinearity, feature engineering, and so on.

Let’s check which variables have missing values.

np.sum(pd.isna(x_train), axis=0)

price                  7
rating              3418
num_ratings            0
watcher                0
shipping               2
free_return            0
open_box               2
pre_owned              2
refurbished            2
benefits_charity       0
price_present          0
rating_present         0
shipping_present       0
status_present         0
dtype: int64

There are 3,418 ratings (92%) missing. Imputing with the mean or median would underestimate the variance of the ratings, which is not an ideal solution. Here, we use MICE (Multiple Imputation by Chained Equations), which uses regression to predict each missing value from the other features. You can check out the “MICE steps” in this paper if you want more details: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
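The exact setup is in the repo; one MICE-style implementation is scikit-learn’s IterativeImputer, which cycles through the features and regresses each one on the others (the estimator choice below is an assumption):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Each feature with missing values is regressed on the other features,
# cycling until the imputed values stabilize.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10,
                           sample_posterior=True, random_state=0)
x_train_imputed = imputer.fit_transform(x_train)
x_val_imputed = imputer.transform(x_val)  # reuse the imputer fitted on training data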

Now let’s check the target’s distribution in the training set.

Distribution of target

It is extremely right-skewed, which may violate the normality assumption behind linear regression. We may need to transform the target (a log transformation can reduce the skewness) or even switch to Poisson regression, since the Poisson distribution is itself right-skewed when the mean is close to zero.

In regression, there are more assumptions to check: linearity between each feature and the (transformed) target, interaction effects, and constant variance of residuals.

Of course, the assumptions are not going to be met perfectly, but they should at least be checked if we want to reduce bias of the estimated coefficients in the model.

Linearizing the relationship between Sale Volume (target) and Price (feature)

The above plot shows that after applying a 1/√x transformation to the price variable, its relationship with the target is more linear.

Interaction plot between ‘watcher’ (number of views) and ‘free_return’ (binary variable: whether the product can be returned for free)

Including interaction terms also relaxes the strict assumption that each feature affects the target identically per unit increase, regardless of the other features. The above plot is one example showing that the interaction term ‘watcher * free_return’ should be included in the model: the number of views (‘watcher’) has less impact on sale volume (‘sold’) when there is a free-return policy. A sketch of this feature engineering follows below.
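As a minimal sketch (assuming the features live in a DataFrame x_train with these column names), the price transform and the interaction term amount to:

import numpy as np

# Linearize the price-target relationship; the +1 guards against division by zero.
x_train["price_tr"] = 1 / np.sqrt(x_train["price"] + 1)

# Views matter less when free returns are offered, so add the interaction.
x_train["watcher * free_return"] = x_train["watcher"] * x_train["free_return"]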

Was the feature engineering helpful overall? Yes! As one metric, R² increased by a factor of roughly 1.5, from 0.262 to 0.399. Feature engineering helps fit the data better, especially when you don’t have enough features, and in this dataset buyers’ “reviews”, possibly one of the most important features for predicting sale rate, are missing.

Before diving into modeling, there’s one more important step: outliers.

Cook’s distance

An observation’s Cook’s distance combines its residual with its leverage (its distance from the centroid of the feature space). In a nutshell, it measures how unusual the observation is in terms of both X (features) and y (target). Treating Cook’s distance as approximately F-distributed, a value of about 0.8 (the 40th percentile of F) means that removing that single observation shifts the estimated coefficients to the edge of a 40% confidence region, a dramatic change for just one data point. It turns out this outlier was a keyboard cover, not an actual keyboard, which seems a legitimate reason to remove it from the data. Removing this observation also helps with the constant-variance assumption.
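Cook’s distances come straight out of a fitted statsmodels OLS model; a minimal sketch (x_train and y_train assumed to be the prepared training data):

import statsmodels.api as sm

ols_fit = sm.OLS(y_train, sm.add_constant(x_train)).fit()
influence = ols_fit.get_influence()
cooks_d = influence.cooks_distance[0]  # first element holds the distances

# Inspect the observations with the largest Cook's distance.
suspects = cooks_d.argsort()[::-1][:5]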

Distribution of fitted values of linear regression vs. Poisson regression

Linear Regression and Poisson Regression were fit to the data. Linear regression seems to estimate the target distribution better.

MAE (Mean Absolute Error) is 43.8 for linear regression and 60.5 for Poisson regression.
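A minimal sketch of fitting both models with statsmodels and scoring them (variable names assumed; the MAE values in the comments are the ones reported above):

import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error

X = sm.add_constant(x_train)

lin_fit = sm.OLS(y_train, X).fit()  # ordinary least squares
poi_fit = sm.GLM(y_train, X, family=sm.families.Poisson()).fit()  # log link

print(mean_absolute_error(y_train, lin_fit.predict(X)))  # ~43.8
print(mean_absolute_error(y_train, poi_fit.predict(X)))  # ~60.5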

On the hold-out validation set, both models overfit, but linear regression did much better than Poisson regression in terms of MAE. In fact, Poisson regression did worse than simply predicting the sample mean.

Linear regression: MAE 62.9, R² 0.03
Poisson regression: MAE 119.3
Sample mean of training set: MAE 95.4

Regularization seems necessary for these models. The ‘statsmodels’ package was used for Poisson regression, and it offers elastic-net regularization only; ‘sklearn’ has Lasso and Ridge. Trying these regularizations gives different results.
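A sketch of both routes (the alpha values are placeholders to be tuned on the validation set):

import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge

# Poisson GLM with an elastic-net penalty (statsmodels):
# L1_wt=1.0 is pure lasso, 0.0 is pure ridge; alpha is the penalty weight.
poi_reg = sm.GLM(y_train, X, family=sm.families.Poisson()).fit_regularized(
    method="elastic_net", alpha=0.1, L1_wt=1.0)

# Linear regression with L1 (Lasso) and L2 (Ridge) penalties (sklearn):
lasso = Lasso(alpha=0.1).fit(x_train, y_train)
ridge = Ridge(alpha=0.1).fit(x_train, y_train)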

MAE against regularization weight for Linear Regression (left) and Poisson Regression (right)

With linear regression, there is no change in MAE after regularization. With Poisson regression, there is a clear dip: the MAE is roughly halved, from 119.3 to 62.1. Given this impressive improvement, Poisson regression is chosen as the final model.

[('price', 0.7030020418009802),
('rating', 0.0),
('num_ratings', 0.0),
('watcher', 0.0),
('shipping', 0.47952629436260674),
('free_return', 0.5762104970838818),
('open_box', 0.0),
('pre_owned', 0.0),
('refurbished', 0.0),
('benefits_charity', 0.0),
('price_present', 0.0),
('rating_present', 0.30835693807441716),
('shipping_present', 0.0),
('status_present', 0.0),
('watcher * free_return', 0.0),
('watcher * refurbished', 0.0),
('shipping * benefits_charity', 0.0),
('price * shipping', 0.0),
('num_ratings * shipping', 0.0)]

With elastic-net regularization, 15 out of 19 features are zeroed out (which is personally a little depressing after all the feature engineering and interaction analysis).

Now, on the test dataset, the final results are as follows.

NMAE: 0.579
MAE for the model: 42.67
MAE with the sample mean of train+val: 73.7

Distribution of Poisson-model-fitted values

The Normalized Mean Absolute Error is 0.579 (the model’s MAE divided by the baseline’s: 42.67 / 73.7), which means about 42% less error than predicting with the sample mean. After regularization, the fitted values look much more like the target distribution.

Interaction between price and shipping cost

There is a weak but interesting interaction between price and shipping cost. Overall, price has the expected negative linear relationship with sale volume, but when shipping costs more than $10, buyers become more price-sensitive. This makes sense: a keyboard is generally a cheap product, so people are reluctant to buy one when delivery is too expensive.

The model may be better after all the effort taken, but the actual sale volume is still far more skewed, with 70% of the items not sold at all. This extreme skewness also violates the usual equidispersion assumption of Poisson regression, namely that the mean and the variance are equal.

Perhaps, in the future, the negative binomial distribution is worth considering: it has a dispersion parameter k in var(Y) = μ + μ²/k, which can be adjusted to let the variance grow beyond the mean for different feature values.
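statsmodels supports this through its NegativeBinomial family; note that its alpha parameter is the reciprocal of the k above, since statsmodels parameterizes the variance as var(Y) = μ + αμ². A sketch (alpha=1.0 is a placeholder):

import statsmodels.api as sm

# Negative binomial GLM: alpha corresponds to 1/k in var(Y) = mu + mu^2/k.
nb_fit = sm.GLM(y_train, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()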

Lastly, let’s interpret this model in a meaningful way.

Sale Volume ~ Poisson(λ), where
λ = C · exp{𝛽1/√(Price+1) + 𝛽2/√(Shipping Cost+1) + 𝛽3·Free Return + 𝛽4·Rating Present}

𝐶 ≈ 0.239, 𝛽1 ≈ 9.731, 𝛽2≈ 1.413, 𝛽3 ≈ 1.307, 𝛽4 ≈ 1.167

𝛽1 ≈ 9.731 means that increasing the price of your keyboard from $29.00 to $39.00 reduces sale volume by about 21.2% on average:

1 − exp{𝛽1(1/√(39+1) − 1/√(29+1))} ≈ 0.212.

𝛽2 ≈ 1.413 means that raising the shipping cost from free to just $1.00 reduces sale volume by about 33.9% on average.

𝛽3 ≈ 1.307 means that removing the cost of returns can increase sale volume by about 3.7 times on average: 𝑒^𝛽3 ≈ 3.70.

𝛽4 ≈ 1.167 means that having at least one review with a rating can increase sale volume by about 3.2 times on average: 𝑒^𝛽4 ≈ 3.21.
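These effect sizes follow directly from the log link; the arithmetic can be checked in a couple of lines:

import numpy as np

b1, b2, b3, b4 = 9.731, 1.413, 1.307, 1.167

print(1 - np.exp(b1 * (1/np.sqrt(39 + 1) - 1/np.sqrt(29 + 1))))  # price $29 -> $39: ~0.212
print(1 - np.exp(b2 * (1/np.sqrt(1 + 1) - 1/np.sqrt(0 + 1))))    # shipping free -> $1: ~0.339
print(np.exp(b3), np.exp(b4))  # free return, rating present: ~3.70, ~3.21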

With these inferences made, I could make a few suggestions to keyboard sellers:

  1. Be cautious about raising the price when it is under $50: a $10 increase can reduce the sale rate by at least 10% on average.
  2. People are more sensitive to shipping costs above $10; raising the price then causes a larger drop in the sale rate.
  3. Free returns are highly encouraged!
  4. Reward buyers for writing product reviews!

There are also some caveats to this model that should not be ignored.

  1. These suggestions come without confidence intervals for the weights: the impacts of price, shipping cost, free returns, and ratings quoted above are only estimated averages, and the actual impact may vary a lot.
  2. The weights are likely biased, since strong predictors such as the content of buyers’ reviews and the sellers’ ratings are missing.

In the future, one could reduce the variance of the weights with dimensionality reduction such as PCA, which would also address the multicollinearity left untreated in this project.

P.S.: If you’d like to see the Python code, it’s available at https://github.com/jung-akim/keyboard

