Elo Merchant Category Recommendation (Case Study): A Kaggle Competition

Abhishek Malik · Published in TheCyPhy · 21 min read · Oct 22, 2020

Author: Abhishek Malik [LinkedIn]

Objective: Customer Loyalty Score Predictions.

Keywords: Predictions, Loyalty score, Regression, Machine Learning, LightGBM.

Table of Contents:

  1. Business Problem/Real-world problem.
  2. Source of Data.
  3. Mapping the real-world problem to an ML problem
  4. Exploratory Data analysis and observations.
  5. Existing approach of the problem.
  6. My first cut approach to solving the problem.
  7. Feature Engineering
  8. Comparison of the models in tabular format.
  9. Kaggle submission.
  10. The final pipeline of the problem.
  11. Challenges and limitations faced in solving the machine learning problem.
  12. Future Work.
  13. References.

1. Business problem/Real-world Problem

1.1 What is ELO?

Elo is one of the biggest and most reliable payment brands in Brazil. It planned a rewards program to attract customers and increase the frequency with which they use its payment brand.

1.2 What is a loyalty Score?

The loyalty score is a numerical score calculated 2 months after the historical and evaluation period. It acts as the target feature in our training data.

1.3 Problem Statement

The Elo merchant category recommendation problem is about predicting the loyalty of Elo's credit card customers. Elo, one of the biggest and most reliable payment brands in Brazil, planned a rewards program to attract customers and increase how often they use its payment brand; such programs strengthen customers' preference for Elo. It is also necessary that the policies made by the company are known to its customers. The quantity we predict is the loyalty score, a numerical score calculated 2 months after the historical and evaluation period.

1.4 Real-world/Business objectives and constraints

  • We predict loyalty scores to serve customers better and to reduce unwanted campaigns for Elo.
  • We use RMSE (root mean square error) to measure the difference between predicted and actual scores (a regression problem).

2. Source of Data

Right now, Elo, one of the largest payment brands in Brazil, has built partnerships with merchants to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is key. Elo has built machine learning models to understand the most important aspects and preferences in their customers’ lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where you come in.

Link: ELO-merchant-category-recommendation

2.1 Data Overview:

We have 6 dataset files for this problem. All the files are in CSV format.

  • Historical_transactions: Contains up to 3 months of transactions for every card at any of the provided merchant_id’s.
  • Merchant: contains the aggregate information for each merchant_id represented in the dataset.
  • New_merchant_transactions: contains the transactions at new merchants(merchant_ids that this particular card_id has not yet visited) over a period of two months.
  • Train: Contains 6 columns: first_active_month, card_id, feature_1, feature_2, feature_3, and target.
  • Test: Contains the same features as the train data, except the target column, which is absent.
  • Sample_submission: This file contains all the card_ids for which we have to make the predictions by our machine learning model.

I observed that all these files contain only categorical and numerical features; no text data is present.
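For reference, a minimal loading sketch is shown below. The filenames are assumed to match the files listed above as they appear on the Kaggle data page.

import pandas as pd

# Filenames assumed from the competition's data page.
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
historical_transactions = pd.read_csv('historical_transactions.csv')
new_merchant_transactions = pd.read_csv('new_merchant_transactions.csv')
merchants = pd.read_csv('merchants.csv')
sample_submission = pd.read_csv('sample_submission.csv')

print(train_data.shape, test_data.shape)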

3. Mapping the real-world problem to an ML problem

Each card_id in the train data has a loyalty score. All the loyalty scores are real numbers, which makes this a regression problem.

  • The transaction data can be reframed as a supervised learning dataset using feature engineering techniques, after which we apply machine learning algorithms to predict loyalty scores for the test data.
  • Because the loyalty scores are real numbers, a machine learning regression model is the natural choice: the features of the train data are the input and the predicted loyalty score is the real-valued output.
  • Metric Function: We use root mean square error to evaluate our predictions against the actual loyalty scores; the closer a predicted score is to the actual score, the lower the RMSE. This tells us, on the basis of the transactions, how close the model's predictions are to the actual values (a small sketch follows this list).
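A minimal sketch of the metric, using hypothetical actual and predicted loyalty scores:

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted loyalty scores.
actual = np.array([0.39, -0.82, 1.25, -33.22])
predicted = np.array([0.10, -0.50, 0.90, -2.00])

rmse = np.sqrt(mean_squared_error(actual, predicted))
print('RMSE:', rmse)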

4. Exploratory Data analysis and observations.

Before solving a machine learning problem, it is essential to first understand the data, which is done through exploratory data analysis. EDA not only gives an understanding of the data but also guides the featurization of the existing data.

The five dataset files (the data is simulated and fictional) have the following sizes:

  • Train data have 201917 rows and 6 columns
  • Test data have 123623 rows and 5 columns.
  • Historical_transactions have 29112361 rows and 14 columns
  • New_merchant_transactions have 1963031 rows and 14 columns
  • Merchant data have 334696 rows and 22 columns.

4.1 Explore Train and Test data.

Train and Test data column description

  1. Card_id: Unique card identifier
  2. First_active_month: ‘YYYY-MM’, the month of first purchase
  3. Feature_1: Anonymized card categorical feature
  4. Feature_2: Anonymized card categorical feature
  5. Feature_3: Anonymized card categorical feature
  6. Target: Loyalty numerical score calculated 2 months after the historical and evaluation period.
  • One basic difference between train and test data worth highlighting: the test data has no target column, because that is what we have to predict.
  • First we explore the feature distributions and compare features between the train and test data.

Observation:

  • feature_1, feature_2, and feature_3 are distributed almost identically in the train and test data.
  • But there is an anomaly in the target feature: some values lie far from the rest of the distribution, below -30. These are outlier values.

Now we find out how many of these outlier values there are and how many values in the target feature are non-outliers.

outliers_in_target = train_data.loc[train_data['target'] < -30]
print('The number of outliers in the data is:', outliers_in_target.shape[0])
non_outliers_in_target = train_data.loc[train_data['target'] >= -30]
print('The number of non-outliers in the data is:', non_outliers_in_target.shape[0])
  • The number of outliers in the data is: 2207
  • The number of non-outliers in the data is: 199710

Separate the outliers

  • After observing the target feature, we have a clear picture of the outlier and non-outlier values. Let us compare the outlier and non-outlier values across the three features of the train data.
# credits: https://www.kaggle.com/batalov/making-sense-of-elo-data-eda
plt.figure(figsize=[12, 6])
for i, j in enumerate(['feature_1', 'feature_2', 'feature_3', 'target']):
    if j != 'target':
        plt.subplot(2, 3, i + 1)
        non_outliers = non_outliers_in_target[j].value_counts() / non_outliers_in_target.shape[0]
        plt.bar(non_outliers.index, non_outliers, label='non_outliers_in_target',
                align='edge', width=-0.3, edgecolor=[0.2] * 3, color=['yellow'])
        outliers = outliers_in_target[j].value_counts() / outliers_in_target.shape[0]
        plt.bar(outliers.index, outliers, label='outliers_in_target',
                align='edge', width=0.3, edgecolor=[0.2] * 3, color=['blue'])
        plt.title(j)
        plt.legend()
plt.tight_layout()
plt.suptitle('The feature distribution in outlier and non-outlier')
plt.show()
The Feature Distribution in outlier and non-outlier.

Observation:

  • For feature_1, feature_2, and feature_3 the difference between the outlier and non-outlier distributions is very small, so these features cannot distinguish the outliers. Keeping the outlier rows would therefore be a problem for our training, and it is better to remove them from the dataset.

Correlation analysis of train and test data features.

Correlation analysis of train data

Observation:

  • In the correlation matrix of the train data, feature_1 and feature_3 are correlated, while feature_1 and feature_2 are the most highly correlated pair, with a score of 0.85.

Observation:

  • Most of the features are fairly well correlated, with feature_1 and feature_2 being the most correlated pair.
  • feature_2 and feature_3 are the second most correlated pair.
  • Compared with the train data, some features here are more correlated with each other.

Violin plots of the train data comparing each feature with the loyalty score:

Violin plots comparison with loyalty score.

Observation:

  • These violin plots do not give a clear picture, because the distribution of the loyalty score looks the same for every feature value. I do not think a model can learn much from these features alone.

The train data has very few features, which is not enough for featurization. Let's look at the other CSV files.

4.2 Explore the Historical transactions data

Observation:

  • Every feature here is of a different type; let us see what they are.
  • There are six ID-type features: card_id, merchant_id, merchant_category_id, subsector_id, city_id, and state_id.
  • Two integer/counter-type features: month_lag and installments.
  • One numerical feature: purchase_amount.
  • Four categorical features in the historical_transactions data: authorized_flag, category_1, category_2, and category_3.

Comparison of historical transaction features with Loyalty score.

  • We observe that -33.21928 is the target value with the maximum count; it is the most frequent target value among the card_ids in the historical transactions.

Binned distribution of historical transaction features with respect to loyalty score.

Boxen plot of historical transactions.

Observation:

  • Number_of_historical_transactions_in_bins: As the bin size increases, the range of target values decreases in historical_transactions.
  • Sum_historical_transactions_in_bins: The loyalty score seems to increase with the sum of the historical transaction value, which is in line with expectation. Let us try the same plot on the mean value of the historical transactions.
  • Mean_historical_transactions_in_bins: Here the loyalty score first decreases and then increases across the binned mean of the historical transactions (a small binning sketch follows this list).
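A minimal sketch of how such a binned comparison can be produced, assuming the per-card transaction counts (hist_transactions_count) have already been merged onto the train dataframe; the bin edges here are illustrative:

# Bin the number of historical transactions per card and compare loyalty scores per bin.
train_eda = data_frame_train.copy()
bins = [0, 10, 20, 30, 50, 75, 100, 150, 200, 500, 10000]
train_eda['hist_transactions_bins'] = pd.cut(train_eda['hist_transactions_count'], bins)
plt.figure(figsize=(12, 6))
sns.boxenplot(x='hist_transactions_bins', y='target', data=train_eda)
plt.xticks(rotation=45)
plt.title('Loyalty score across binned historical transaction counts')
plt.show()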

Histogram of all the features in historical_transactions

Observation:

  • historical_transactions has an authorized_flag feature that tells us which transactions were authorized and which were not.
  • The installments column has some unusual values, suggesting the column is normalized but a few values are not normal.
  • The purchase_amount feature appears sorted, since the bars in the histogram keep decreasing.

4.3 Explore the New_Merchant_transactions.

New_merchant_transactions has the same features as the historical transactions; the main difference is the time of the transactions. Historical transactions are past transactions, while new merchant transactions are the recent ones.

Description of New_merchants_transaction features:

  • card_id: Card identifier
  • month_lag: month lag to reference date
  • purchase_date: purchase date
  • authorized_flag: ‘Y’ if approved, ’N’ if denied.
  • category_3: anonymized category
  • installments: number of installments of purchase.
  • category_1: anonymized category
  • merchant_category_id: merchant category identifier(anonymized)
  • subsector_id: Merchant category group identifier(anonymized)
  • merchant_id: Merchant identifier(anonymized)
  • purchase_amount: Normalized purchase amount
  • city_id: City identifier(anonymized)
  • state_id: State identifier(anonymized)
  • category_2: anonymized category

Binned distribution of new_merchant transaction features with respect to loyalty score.

Observation:

  • Number_of_historical_transactions_in_bins: As the number of new merchant transactions increases, the loyalty score decreases, except for the last bin.
  • Sum_historical_transactions_in_bins: From the violin plots, as the sum of new merchant transactions increases, the loyalty score also increases, except for the last bin.
  • The last bin behaves differently because of the loyalty score values below -30 that act as outliers.
  • Mean_historical_transactions_in_bins: Similarly, for the mean, the loyalty score decreases as the transactions increase, except for the last bin.

Histogram of all the features in new_merchant_transaction

Observation:

  • The green plots show categorical values, while the orange ones show numerical values.
  • We find that all the transactions are approved, so there is no point keeping the authorized_flag feature here, as it has only one value.
  • As seen in the purchase_date plot, the reference date differs between card_ids, but most of the transactions happened in February-March 2018, after the reference date.

4.4 Explore the Merchant dataset

Description of merchant dataset.

  • merchant_id: Unique merchant identifier
  • merchant_group_id: Merchant group(anonymized)
  • merchant_category_id: Unique identifier for merchant category (anonymized)
  • subsector_id: Merchant category group(anonymized)
  • numerical_1: anonymized measure
  • numerical_2: anonymized measure
  • category_1: anonymized category
  • most_recent_sales_range: Range of revenue(monetary units)in last active month->A->B->C->D->E
  • most_recent_purchases_range: Range of quality of transactions in last active month->A->B->C->D->E
  • avg_sales_lag3: Monthly average of revenue in last 3 months divided by revenue in the last active month.
  • avg_purchases_lag3: Monthly average of transactions in the last 3 months divided by transactions in the last active month.
  • active_months_lag3: Quantity of active months within the last 3 months.
  • avg_sales_lag6: Monthly average of revenue in the last 6 months divided by revenue in the last active month.
  • avg_purchases_lag6: Monthly average of transactions in the last 6 months divided by transactions in the last active month.
  • active_months_lag6: Quantity of active months within the last 6 months.
  • avg_sales_lag12: Monthly average of revenue in the last 12 months divided by revenue in the last active month
  • avg_purchases_lag12: Monthly average of transactions in the last 12 months divided by transactions in the last active month.
  • active_months_lag12: Quantity of active month within the last 12 months.
  • category_4: anonymized category
  • city_id: city identifier(anonymized)
  • state_id: state identifier(anonymized)
  • category_2: anonymized category

Histogram of all the features in merchant data

Observation:

  • merchant_group_id, numerical_1, and numerical_2 are sorted in decreasing order.
  • The histograms of the purchase range and sales range are sorted in ascending order.
  • numerical_1 and numerical_2 seem to take discrete sets of values.

Scatter Plot of merchant data

Observation:

  • One thing seems straightforward: the average sales and purchases within the last 3, 6, and 12 months appear to increase as the months pass.
  • Let us explore the recent months' transactions with a histogram.

Histogram plots of sales and purchases to have a sense of transaction productivity.

k = np.array([12, 6, 3]).astype(str)
rates_of_sales = merchants_clean[['avg_sales_lag3', 'avg_sales_lag6', 'avg_sales_lag12']].mean().values
rates_of_purchases = merchants_clean[['avg_purchases_lag3', 'avg_purchases_lag6', 'avg_purchases_lag12']].mean().values
plt.bar(k, rates_of_sales, width=0.3, align='edge', label='average sales', edgecolor=[0.2]*3)
plt.bar(k, rates_of_purchases, width=-0.3, align='edge', label='average purchases', edgecolor=[0.2]*3)
plt.legend()
plt.title('Average sales and number of purchases\n over the last 12, 6, and 3 months', fontsize=17)
plt.show()

Observation:

  • It clearly shows that the business has grown over the last three months, which looks profitable.

Correlation analysis of historical,new_merchant_transactions, and merchant data.

# credits: https://stackoverflow.com/questions/48035381/correlation-among-multiple-categorical-variables-pandas
from scipy.stats import chisquare

def correlation_categorical(data_frame):
    # Factorize each categorical feature to get distinct integer codes.
    data_frame = data_frame.apply(lambda k: pd.factorize(k)[0] + 1)
    # Chi-square of each column against all the others gives the correlation matrix.
    correlation = pd.DataFrame(
        [chisquare(data_frame[k].values, f_exp=data_frame.values.T, axis=1)[0] for k in data_frame],
        columns=data_frame.columns, index=data_frame.columns)
    # Normalize the correlation matrix to the [0, 1] range.
    norm = np.round((correlation - correlation.min()) / (correlation.max() - correlation.min()), 3)
    plt.subplots(figsize=(10, 8))
    # Plot the heatmap with seaborn.
    return sns.heatmap(norm, annot=True)
correlation analysis of historical_transactions, merchant data, and new_merchant_transactions

Observation:

  • Historical_transactions: city_id is the feature most correlated with all the other features.
  • merchant_category_id is the second most correlated feature with the others.
  • Merchant_data: city_id and merchant_category_id are the most correlated features compared with the others.
  • numerical_1 and numerical_2 are the second most correlated features with the others.
  • New_merchant_transactions: city_id and merchant_category_id are highly correlated features.
  • city_id is highly correlated with all the features.
  • merchant_category_id and state_id are the most correlated pair at one point, with a score of 0.55.

5. Existing approach of the problem.

First Solution:

LINK: https://www.kaggle.com/roydatascience/elo-stack-with-goss-boosting

This notebook introduces an interesting technique for reducing memory usage: numeric columns are downcast to one of five smaller dtypes (two integer types and three float types). After that, outliers are removed from the data and additional features are extracted from the transaction data. Missing values are then imputed so that more data is available, which helps improve accuracy. After this preprocessing, featurization is performed so that the model can learn effectively; most of the featurization in this notebook is inspired by other kernels. For training, a LightGBM model is fit with stratified k-folds on the training set, with outliers included. A second LightGBM model is then trained in a similar way, but on the target column instead of the outlier flag, using repeated k-fold cross-validation. Finally, stacking is applied: the data is trained on several models and a meta-model combines their predictions into the final output. Stacking is very common in Kaggle competitions for squeezing out the best results. In short, this kernel contributes the memory-reduction trick, features inspired by various kernels, and, most importantly, stacking of the two models to improve accuracy. The takeaway is that stacking is a useful ensemble technique for improving predictions.
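The memory-reduction idea can be sketched roughly as below. This is a minimal sketch of the dtype-downcasting trick, not the exact function used in that kernel:

import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    # Downcast every numeric column to the smallest int/float dtype that can hold it.
    for col in df.columns:
        col_type = df[col].dtype
        if np.issubdtype(col_type, np.integer):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif np.issubdtype(col_type, np.floating):
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df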

Second Solution:

LINK: https://www.kaggle.com/mfjwr1/simple-lightgbm-without-blending

This kernel first creates some features from the train and test data and applies one-hot encoding to them, which yields integer and float-valued features. It aggregates the historical transactions (max, min, var, and so on) and merges these aggregates onto the train and test data; the same aggregation is applied to the new merchant transactions and merged as well. It also imputes the missing values in category_2, category_3, merchant_id, and installments. Holiday-related features such as Christmas Day, Children's Day, Black Friday, and Mother's Day are added. Finally, further additional features are introduced that are genuinely useful for better training. StratifiedKFold is used for cross-validation and LightGBM for training. This kernel has a well-integrated, structured pipeline, and the introduction of additional features is a beneficial step.

Improvements:

  • First, neither kernel uses the merchant data for featurization. This leaves out information that could help lower the RMSE score, so we include the merchant data in our featurization.
  • Imputation of missing values is done only on the historical transactions and new merchant transactions; we also apply the memory-reduction step to all the data and replace -inf and inf values with NaN (as sketched below).
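The inf/-inf replacement mentioned above amounts to a one-liner per dataframe (the dataframe names here are illustrative):

import numpy as np

for df in [historical_transactions, new_merchant_transactions, merchants]:
    # Ratio features can produce infinite values; turn them into NaN before imputation.
    df.replace([np.inf, -np.inf], np.nan, inplace=True)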

6. My first cut approach to solving the problem.

We divide the first cut approach into various parts for better understanding.

First Step: Preprocessing of data

In the preprocessing technique, we have to look out for various aspects of data.

  • In train.csv we have mostly numerical values; only the card_id column contains text, so that column needs some preprocessing. Normalization may also be needed for all the numerical features.
  • The target column has real values, some positive and some negative, so we could also frame a classification problem by assigning positive values the label 1 and negative values the label 0.
  • Any outliers found during EDA have to be handled.
  • Missing values are filled using mean imputation, median imputation, or categorical imputation (replacing missing categorical values with the most frequent value); a small sketch follows this list.
  • As we do not have any text data, there is no need for BOW, TF-IDF, or Word2Vec here.
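A minimal sketch of the imputation strategies listed above; the column choices here are illustrative (the columns with missing values in this problem include category_2, category_3, merchant_id, and installments):

# Categorical imputation: fill with the most frequent value.
historical_transactions['category_2'] = historical_transactions['category_2'].fillna(
    historical_transactions['category_2'].mode()[0])
# Median imputation for an integer/counter-type column.
historical_transactions['installments'] = historical_transactions['installments'].fillna(
    historical_transactions['installments'].median())
# Mean imputation for a numerical column.
historical_transactions['purchase_amount'] = historical_transactions['purchase_amount'].fillna(
    historical_transactions['purchase_amount'].mean())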

Second Step: Featurization on the basis of data.

  • Only three features are given in this data, which is not sufficient for good predictions; the dimensionality is very low, so the information available for training is limited.
  • More features must be added with the help of domain knowledge of the business problem. This adds information to the data and allows the models to train better.

Third step: Hyperparameter tuning with the help of Gridsearch or Randomsearch.

  • The models have many hyperparameters, so we have to tune them for better predictions.
  • Two ways to find good hyperparameters are GridSearchCV and RandomizedSearchCV: grid search tries every combination in a predefined set of parameter values, while randomized search samples parameter combinations at random (see the sketch after this list).
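A minimal sketch of both search strategies with an LGBM regressor; the parameter grid and the X_train/y_train names are illustrative, not the tuned values used later:

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'num_leaves': [31, 61, 127],
    'learning_rate': [0.005, 0.01, 0.05],
    'min_data_in_leaf': [21, 50, 100],
}
model = lgb.LGBMRegressor(objective='regression')

# Exhaustive search over every combination in the grid.
grid = GridSearchCV(model, param_grid, scoring='neg_root_mean_squared_error', cv=3)
# Randomly samples 10 combinations from the same grid.
random_search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=10,
                                   scoring='neg_root_mean_squared_error', cv=3, random_state=42)

# grid.fit(X_train, y_train)
# random_search.fit(X_train, y_train)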

Fourth Step: Training of models on train and test data with best hyperparameters.

  • Train the base learners with the best hyperparameters, then make predictions on the test data.
  • I will try various models to improve performance and make correct predictions after training.

Fifth step: Compare the predictions with the help of a comparison metric(RMSE).

  • As we know, this business problem uses RMSE as its performance metric, so we use it to compare the training and test results of our base learners.

Sixth Step: Improve the predictions on test data with the help Ensembling technique.

  • Often we do not get the desired results even after all this effort, and we want to squeeze out the best possible performance, so we try ensemble techniques. They come in four flavors: bagging, boosting, stacking, and cascading, and we try one of them to improve our predictions. In ensembling we train many base learners and combine their predictions, through aggregation or a meta-model, to obtain better final predictions (see the stacking sketch below).
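A minimal stacking sketch with scikit-learn's StackingRegressor; the choice of base learners and meta-model here is illustrative, not the exact setup of the referenced kernels:

import lightgbm as lgb
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge

stack = StackingRegressor(
    estimators=[
        ('lgbm', lgb.LGBMRegressor(objective='regression')),
        ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ],
    final_estimator=BayesianRidge(),  # meta-model that combines the base predictions
    cv=5,
)
# stack.fit(X_train, y_train)
# stack_predictions = stack.predict(X_test)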

7. Feature Engineering

  • First, we create some features from the train and test data, as shown in the notebook. There is feature aggregation for every CSV file.
  • These features are made with the help of the train and test data:
'first_active_month', 'card_id', 'target', 'outliers', 'feature_1_mean', 'feature_2_mean', 'feature_3_mean', 'days', 'quarter', 'days_feature_1', 'days_feature_1_ratio', 'days_feature_2', 'days_feature_2_ratio', 'days_feature_3', 'days_feature_3_ratio'
  • first_active_month, card_id, feature_1, feature_2, and feature_3 come from the train and test files. From these basic features we build the mean and ratio features; the quarter feature is derived from first_active_month using pandas datetime functionality.
  • feature_1_mean, feature_2_mean, and feature_3_mean are added to the train data by grouping on feature_1, feature_2, and feature_3 respectively and taking the mean of the outliers column. The code block for these features is as follows:
for k in ['feature_1', 'feature_2', 'feature_3']:
    label = data_frame_train.groupby([k])['outliers'].mean()
    data_frame_train[k + "_mean"] = data_frame_train[k].map(label)
  • days_feature_1, days_feature_2, days_feature_3 and the days_feature ratios are inter-related: we first multiply each train feature by the days value, and then divide the feature by days to get the ratio. The code block makes this clearer:
from datetime import date

# 'days' counts the days between the reference date and the card's first active month
# (sr is assumed to be the first_active_month column parsed as a pandas datetime series).
data_frame_train['days'] = (date(2018, 2, 1) - sr.dt.date).dt.days
k_cols = ['feature_1', 'feature_2', 'feature_3']
for i in k_cols:
    data_frame_train['days_' + i] = data_frame_train['days'] * data_frame_train[i]
    data_frame_train['days_' + i + '_ratio'] = data_frame_train[i] / data_frame_train['days']
  • These features are made with the help of historical transaction data.
'hist_transactions_count',        'hist_purchase_amount_sum', 'hist_purchase_amount_max',        'hist_purchase_amount_min', 'hist_purchase_amount_mean',        'hist_purchase_amount_var', 'hist_purchase_amount_skew',        'hist_installments_sum', 'hist_installments_max',        'hist_installments_mean', 'hist_installments_var',        'hist_installments_skew', 'hist_purchase_date_max',        'hist_purchase_date_min', 'hist_month_lag_max', 'hist_month_lag_min',        'hist_month_lag_mean', 'hist_month_lag_var', 'hist_month_lag_skew',        'hist_month_diff_max', 'hist_month_diff_min', 'hist_month_diff_mean',        'hist_month_diff_var', 'hist_month_diff_skew', 'hist_weekend_sum',        'hist_weekend_mean', 'hist_weekday_sum', 'hist_weekday_mean',        'hist_authorized_flag_sum', 'hist_authorized_flag_mean',        'hist_category_1_sum', 'hist_category_1_mean', 'hist_category_1_max',        'hist_category_1_min', 'hist_card_id_size', 'hist_card_id_count',        'hist_month_nunique', 'hist_month_mean', 'hist_month_min',        'hist_month_max', 'hist_hour_nunique', 'hist_hour_mean',        'hist_hour_min', 'hist_hour_max', 'hist_weekofyear_nunique',        'hist_weekofyear_mean', 'hist_weekofyear_min', 'hist_weekofyear_max',        'hist_day_nunique', 'hist_day_mean', 'hist_day_min', 'hist_day_max',        'hist_subsector_id_nunique', 'hist_merchant_id_nunique',        'hist_merchant_category_id_nunique', 'hist_price_sum',        'hist_price_mean', 'hist_price_max', 'hist_price_min', 'hist_price_var',        'hist_duration_mean', 'hist_duration_min', 'hist_duration_max',        'hist_duration_var', 'hist_duration_skew',        'hist_amount_month_ratio_mean', 'hist_amount_month_ratio_min',        'hist_amount_month_ratio_max', 'hist_amount_month_ratio_var',        'hist_amount_month_ratio_skew', 'hist_purchase_date_diff',        'hist_purchase_date_average', 'hist_purchase_date_uptonow',        'hist_purchase_date_uptomin', 'hist_first_buy', 'hist_last_buy'
  • The historical transaction and new merchant transaction features are made by taking the max, min, mean, var, and skew of each underlying feature, and then combining them into further features. For example, hist_purchase_amount_max is the maximum of the purchase_amount feature.
  • Similarly, we build max, min, var, and skew features for every column in the historical transactions and new merchant transactions.
  • Some features in these lists are made by subtracting, multiplying, or dividing the max and min values of a feature; for example, hist_purchase_date_diff is hist_purchase_date_max minus hist_purchase_date_min.
  • The features made with the help of new_merchant_transaction data.
'new_transactions_count', 'new_purchase_amount_sum',        'new_purchase_amount_max', 'new_purchase_amount_min',        'new_purchase_amount_mean', 'new_purchase_amount_var',        'new_purchase_amount_skew', 'new_installments_sum',        'new_installments_max','new_installments_mean', 'new_installments_var',        'new_installments_skew', 'new_purchase_date_max',        'new_purchase_date_min', 'new_month_lag_max', 'new_month_lag_min',        'new_month_lag_mean', 'new_month_lag_var', 'new_month_lag_skew',        'new_month_diff_max', 'new_month_diff_min', 'new_month_diff_mean',        'new_month_diff_var', 'new_month_diff_skew', 'new_weekend_sum',        'new_weekend_mean', 'new_weekday_sum', 'new_weekday_mean',        'new_authorized_flag_sum', 'new_authorized_flag_mean',        'new_category_1_sum', 'new_category_1_mean', 'new_category_1_max',        'new_category_1_min', 'new_card_id_size', 'new_card_id_count',        'new_month_nunique', 'new_month_mean', 'new_month_min', 'new_month_max',        'new_hour_nunique', 'new_hour_mean', 'new_hour_min', 'new_hour_max',        'new_weekofyear_nunique', 'new_weekofyear_mean', 'new_weekofyear_min',        'new_weekofyear_max', 'new_day_nunique', 'new_day_mean', 'new_day_min',        'new_day_max', 'new_subsector_id_nunique',        'new_merchant_category_id_nunique', 'new_price_sum', 'new_price_mean',        'new_price_max', 'new_price_min', 'new_price_var', 'new_duration_mean',        'new_duration_min', 'new_duration_max', 'new_duration_var',        'new_duration_skew', 'new_amount_month_ratio_mean',        'new_amount_month_ratio_min', 'new_amount_month_ratio_max',        'new_amount_month_ratio_var', 'new_amount_month_ratio_skew',        'new_purchase_date_diff', 'new_purchase_date_average',        'new_purchase_date_uptonow', 'new_purchase_date_uptomin',        'new_first_buy', 'new_last_buy',
  • As noted earlier, the historical transactions and new merchant transactions share the same feature names but cover different time periods, so the features built on the historical transactions are also built on the new merchant transactions.
  • Additional features are made with the help of both the historical and new_merchant_transactions data:
credits:https://www.kaggle.com/mfjwr1/simple-lightgbm-without-blending
'card_id_total', 'card_id_cnt_total', 'card_id_cnt_ratio', 'purchase_amount_total', 'purchase_amount_mean', 'purchase_amount_max', 'purchase_amount_min', 'purchase_amount_ratio', 'month_diff_mean', 'month_diff_ratio', 'month_lag_mean', 'month_lag_max', 'month_lag_min', 'category_1_mean', 'installments_total', 'installments_mean', 'installments_max', 'installments_ratio', 'price_total', 'price_mean', 'price_max', 'duration_mean', 'duration_min', 'duration_max', 'amount_month_ratio_mean', 'amount_month_ratio_min', 'amount_month_ratio_max', 'new_CLV', 'hist_CLV', 'CLV_ratio', 'authorized_flag_new_mean_encoded', 'category_1_new_mean_encoded', 'month_lag_new_mean_encoded', 'installments_new_mean_encoded'
  • These additional features combine the historical and new merchant transaction features. For example, purchase_amount_max is made by adding new_purchase_amount_max and hist_purchase_amount_max. Similarly, the other features are made by simple arithmetic operations on the historical and new merchant transaction aggregates.

Now we use these roughly 200 features, obtained by aggregating the data from these five files, to train our models.
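To make the aggregation step concrete, here is a minimal sketch of how a few of the hist_* features listed above can be built with a groupby over card_id. It assumes the raw transactions are in a dataframe called historical_transactions and shows only a handful of the aggregations:

aggregations = {
    'purchase_amount': ['sum', 'max', 'min', 'mean', 'var', 'skew'],
    'installments': ['sum', 'max', 'mean', 'var', 'skew'],
    'month_lag': ['max', 'min', 'mean', 'var', 'skew'],
}
hist_agg = historical_transactions.groupby('card_id').agg(aggregations)
# Flatten the (column, statistic) MultiIndex into names like hist_purchase_amount_max.
hist_agg.columns = ['hist_' + '_'.join(col) for col in hist_agg.columns]
hist_agg = hist_agg.reset_index()

# Merge the aggregated features back onto the train data.
data_frame_train = data_frame_train.merge(hist_agg, on='card_id', how='left')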

8. Comparison of the models in tabular format.

After training all the models on the train and test data, I found that a single LGBM with repeated k-fold performs very fast and accurately.

Comparison of all models as follows:

If we observe the comparison of results of the above regression models we have Single LGBM with repeated k-Fold as the best model.

The best model which is used for Kaggle submission code is as follows:

Repeated k-fold for cross-validation.

from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import RepeatedKFold

number_of_folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4950)
predictions_target = np.zeros(len(data_frame_train))
predictions_target_test = np.zeros(len(data_frame_test))
data_frame_feature_importance = pd.DataFrame()
dimensions = data_frame_columns

Single LGBM model with repeated K-Folds

for fold_, (trn_idx, val_idx) in enumerate(number_of_folds.split(data_frame_train[data_frame_columns].values, target.values)):
    print("fold {}".format(fold_))
    data_train = lgb.Dataset(data_frame_train.iloc[trn_idx][data_frame_columns], label=target.iloc[trn_idx])
    data_validation = lgb.Dataset(data_frame_train.iloc[val_idx][data_frame_columns], label=target.iloc[val_idx])
    param = {
        'task': 'train',
        'boosting': 'goss',
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.01,
        'subsample': 0.9855232997390695,
        'max_depth': 6,
        'top_rate': 0.9064148448434349,
        'num_leaves': 61,
        'min_child_weight': 41.9612869171337,
        'other_rate': 0.0721768246018207,
        'reg_alpha': 9.677537745007898,
        'colsample_bytree': 0.5665320670155495,
        'min_split_gain': 9.820197773625843,
        'reg_lambda': 8.2532317400459,
        'min_data_in_leaf': 21,
        'verbose': -1,
        'seed': int(2 ** fold_),
        'bagging_seed': int(2 ** fold_),
        'drop_seed': int(2 ** fold_)
    }
    number_of_round = 10000
    clf_r = lgb.train(param, data_train, number_of_round, valid_sets=[data_train, data_validation],
                      verbose_eval=-1, early_stopping_rounds=200)
    # Out-of-fold predictions on the validation rows of the train dataframe.
    predictions_target[val_idx] = clf_r.predict(data_frame_train.iloc[val_idx][data_frame_columns],
                                                num_iteration=clf_r.best_iteration)
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = data_frame_columns
    fold_importance_df["importance"] = clf_r.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    data_frame_feature_importance = pd.concat([data_frame_feature_importance, fold_importance_df], axis=0)
    # Average the test predictions over the 5 splits x 2 repeats.
    predictions_target_test += clf_r.predict(data_frame_test[data_frame_columns],
                                             num_iteration=clf_r.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(predictions_target, target) ** 0.5))
Screenshot of the result of Single LGBM model with Repeated K-fold

9. Kaggle Submission

Kaggle public and private score is as follows:

Kaggle Submission Screenshot

The score is within the top 10 percent, with a rank of 351 out of 4127.
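For reference, a minimal sketch of writing the submission file from the averaged test predictions (predictions_target_test from the code above); the column names are assumed from sample_submission.csv:

import pandas as pd

submission = pd.read_csv('sample_submission.csv')
submission['target'] = predictions_target_test
submission.to_csv('submission.csv', index=False)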

10. Final pipeline of the problem

  1. First, we create features from the train and test data. Then we impute the missing values in the historical_transactions and new_merchant_transactions data, and map categorical values: 'Y'/'N' to 1/0 and 'A', 'B', 'C', 'D' to 1, 2, 3, 4 (see the mapping sketch after this list).
  2. After this, we aggregate the feature values of historical_transactions and new_merchant_transactions and merge these aggregates onto the train and test data.
  3. We also aggregate the merchant data using the most common value of each merchant feature, and add these features to the train and test data.
  4. We train a single LGBM model with repeated k-fold and use root mean square error to compare the predictions against the actual targets.
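A minimal sketch of the mappings described in step 1; the dataframe and column choices here are illustrative:

for df in [historical_transactions, new_merchant_transactions]:
    # 'Y'/'N' flags to 1/0.
    df['authorized_flag'] = df['authorized_flag'].map({'Y': 1, 'N': 0})
    df['category_1'] = df['category_1'].map({'Y': 1, 'N': 0})
    # Ordered letter categories to integers.
    df['category_3'] = df['category_3'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4})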

11. Challenges and limitations faced in solving the machine learning problem.

  • First, the train and test data have only three features, which is insufficient to train a model that achieves a low RMSE; training on these features alone gives a noticeably higher RMSE.
  • Second, these features are not enough for a good Kaggle rank, so we explore the other files of the Elo case study: the historical transactions and the new merchant transactions (past and recent transaction data respectively). I therefore did my feature engineering on these two files; in the end, most of my features are based on them, taking the mean, max, min, var, and skew of the columns present in these transaction files.
  • Imputing missing values was also a painstaking task in this case study. I used the most frequent value of a feature to fill its missing values: I counted the values of each feature containing missing entries and selected the most common one for imputation.
  • Choosing the right model for the regression task is also a big challenge. I first used an XGBoost model, but it took too much time and gave an RMSE beyond my expectation, so it did not seem suitable for this task. I then moved to LightGBM, which is fast, flexible, and gives impressive results.
  • There is always scope for improvement, so I chose the LGBM model with the repeated k-fold technique. Surprisingly, my RMSE decreased from 3.65157 to 3.59687, which shows the power of LGBM with repeated k-fold.
  • Due to time constraints, I was not able to explore the first place solution's trick of linear stacking, which reportedly gives about a 0.015 boost in local CV compared with training directly on the same features.

12. Future Work

  • We have to be very careful about data preprocessing and feature engineering; some features need one-hot encoding to transform ordinal categorical features.
  • Another promising direction is the linear stacking described in the first place solution.
  • Adding more features, such as special days occurring during the year, can prove useful with a proper featurization technique.

13. References:

  1. https://medium.com/@blogsupport/elo-merchant-loyalty-recommendation-b28096c882b7
  2. https://medium.com/@narender.buchireddy/elo-merchant-category-recommendation-competition-on-kaggle-lightgbm-implementation-a-case-83b0912fa7f3
  3. https://www.kaggle.com/batalov/making-sense-of-elo-data-eda
  4. https://www.kaggle.com/roydatascience/elo-stack-with-goss-boosting
  5. Platform used: https://colab.research.google.com/

Course:

https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

Github Repo:

  • If you are interested in this case study or want to improve it further, the Jupyter Notebook with all the code is available at my following repo:

LinkedIn: linkedin.com/in/abhishek-malik-905870135
