Building a Recommender Engine Part III: Feature Engineering

Rachel Koenig
8 min read · Jan 3, 2020


This is part III of a four-part series describing my experience working on a project to build a recommender system. You can read part I about data collection with web scraping here and part II about cleaning and joining DataFrames here.

In this section, I’ll go over the types of feature engineering I used to prepare my data for machine learning. In the Data Science Immersive course I took recently, I remember frequently hearing “the key to a better model is feature engineering” and constantly wondering “what the heck does that even mean?!”. Eventually I understood it to mean that no matter which fancy algorithm you force your features into, if you haven’t done proper EDA and feature engineering, your scoring metrics will never improve all that drastically. To me, feature engineering is a catchall term for manipulating your table’s columns (features) to be either readable for a computer, like changing categorical data to 1’s and 0’s, or more predictive, like multiplying them together to create interaction terms and polynomial features.

At this point in my project, I have these columns in my product_info table:

asin
category
color
demographic
department
description
detail_type
details
division
name
size
subcategory
type

a snippet of the table

As you can see above, columns like department, demographic, details, category, and division have short descriptions that are often the same across multiple rows. Department, for example, has 17 different unique values, with 12,876 rows containing “Clothing, Shoes & Jewelry”.

I want to turn these values into dummy columns, which means creating a new column for each distinct value. But first I need to prep the text to be better suited for column names by stripping out commas and ampersands, replacing spaces with underscores, and collapsing any double underscores that result, like so:

product_info['department'] = (product_info['department']
                              .str.strip(' ')
                              .str.replace(',', '')
                              .str.replace('&', '')
                              .str.replace(' ', '_')
                              .str.replace('__', '_'))

This returns:

Then I cleaned the rest of the relevant Series’ text the same way and applied the pandas built-in function get_dummies() to each Series individually, dropping the first column to avoid redundancy and saving the result as a new DataFrame. That way, I do not lose the original Series from the product_info table.

example with product_info[‘department’] Series
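For illustration, a rough sketch of that step could look like the following (my own approximation of the approach, not the exact project code):

import pandas as pd

# one dummy DataFrame per cleaned categorical Series;
# drop_first=True drops one redundant column from each set of dummies
departments = pd.get_dummies(product_info['department'], drop_first=True)
demographic = pd.get_dummies(product_info['demographic'], drop_first=True)
category = pd.get_dummies(product_info['category'], drop_first=True)
# ...and the same for division, subcategory, details, and so on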

The dummy columns now have 1’s and 0’s for values instead of text. For example, the Clothing_Shoes_Jewelry column will have a 1 in every row of a product that was categorized as clothing, shoes or jewelry and a 0 in every row that is anything else, e.g. Sports & Outdoors, Toys & Games, Health & Household, etc.

Now I want to join all of my new dummy DataFrames back onto product_info. So I create a list of all the dummy DataFrames, call it dummies, and apply the pd.concat() function that we learned about in my part II post. At this point I feel comfortable dropping the original columns, so I add .drop() with the list of columns I want to drop at the end.

dummies = [product_info, departments, demographic, details, category, division, subcategory, extra_dummy]

product_dummies = pd.concat(dummies, axis=1).drop(columns=['department', 'demographic', 'detail_type', 'category', 'division', 'subcategory', 'type', 'extra_split'])

I then merged the product_dummies table with the reviews table and called it df.
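For reference, that merge could be as simple as something like this (assuming asin is the join key, which is my assumption here):

# join the product features onto the review data by product id
df = product_dummies.merge(reviews, on='asin')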

And this is about the time when I discovered a new problem! df.columns returns all of the column names, while set(df.columns) returns only the unique ones. When I checked the length of each, they were not the same, which meant some of my dummy columns must have ended up with identical names and I now had duplicates!

The first thing I needed to do was figure out which of my features (columns) were causing this double trouble. To do this, I turned df.columns into a list and saved it as the variable list_cols. Then I used a for loop to go through every item in the de-duplicated set of df.columns and remove it from list_cols. That strips out one copy of every unique column name, leaving me with just the duplicates.
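A minimal sketch of that idea (my reconstruction; the full code is on my GitHub):

# every column name, duplicates included
list_cols = list(df.columns)

# remove one occurrence of each unique name;
# whatever is left over is a name that appears more than once
for col in set(df.columns):
    list_cols.remove(col)

print(list_cols)  # the duplicated column names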

I also needed to be careful not to just combine the duplicate columns by adding them together, because I didn’t want any values higher than 1. So I looped through list_cols again, but this time created a new column for each column in the list (remember, this is only the duplicates now). These new columns were appended onto df, renamed with the original column name plus the string “_agg” on the end, and kept only the highest (aka max) value for each row, which could only be 1 or 0. df is now larger in size, so I dropped all columns with a name in list_cols, which left me with a perfectly de-duplicated 823 columns. Finally, I renamed the columns ending in _agg to strip the suffix back off. If you’re interested in the code, you can find it on my GitHub.
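Roughly, that aggregation step could look like this (again a sketch, not the exact project code):

# for each duplicated name, collapse its copies into one column by taking
# the row-wise max, so values stay binary (1 if any copy had a 1)
for col in set(list_cols):
    df[col + '_agg'] = df[col].max(axis=1)

# drop the original duplicated columns, then strip the _agg suffix back off
df = df.drop(columns=list(set(list_cols)))
df.columns = [c.replace('_agg', '') for c in df.columns]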

Next, so I could drop duplicate rows and work with a smaller DataFrame, I grouped the text columns, like summary and reviewtext, together by asin. To get the text ready to be used as content/features in the content-based recommender, I made a new column called all_text that combines all the text columns, like all_reviews, one_sum, and description, into one. I decided not to use details right away because it was messy and most cells were missing, but I did not get rid of it in case I end up needing it later.
I did determine, though, that the size scrape came out far too messy to ever be useful: the majority of cells were empty or missing, and there was no way to parse the rest consistently enough to pull out the size text.
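A sketch of that aggregation might look like the following (column names like reviewtext and summary are taken from the description above; the exact project code may differ):

# one row per product: join every review's text into a single string per asin
text_agg = df.groupby('asin').agg(
    all_reviews=('reviewtext', lambda s: ' '.join(s.astype(str))),
    one_sum=('summary', lambda s: ' '.join(s.astype(str))),
    description=('description', 'first'),  # identical for every row of a product
).reset_index()

# combine the text fields into one content column for the recommender
text_agg['all_text'] = (
    text_agg['all_reviews'] + ' ' + text_agg['one_sum'] + ' ' + text_agg['description']
)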

Using the groupby() function, I found the average rating per product and created a new column, overall_mean, containing that value.
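That can be done in one line with a transform (assuming the raw star-rating column is named overall, which is my assumption here):

# broadcast the per-product average rating back onto every row
df['overall_mean'] = df.groupby('asin')['overall'].transform('mean')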

Next, I dropped all the columns except name, asin, overall_mean and all the dummied department columns to create a category-only DataFrame, and saved it as a csv to use for the model.
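Something along these lines (dept_cols here is a hypothetical list holding the names of the dummy columns):

# keep only the identifier, rating, and category dummy columns, one row per product
keep_cols = ['name', 'asin', 'overall_mean'] + dept_cols
categories = df[keep_cols].drop_duplicates()
categories.to_csv('categories.csv', index=False)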

Working with only the text columns now, I narrowed in on the name and color columns to clean up and use in the model. I made the decision to start with those and add or eliminate more features later as needed. Using NLTK, I tokenized the text, removed stopwords, and lemmatized.
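A minimal version of that NLTK pipeline could look like this (a sketch; the exact tokenizer, stopword list, and lemmatizer settings in the project may differ):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# requires nltk.download('stopwords') and nltk.download('wordnet')
tokenizer = RegexpTokenizer(r'\w+')   # keeps word characters, drops punctuation
lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words('english'))

def clean_text(text):
    tokens = tokenizer.tokenize(str(text).lower())
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens if t not in stops)

df['name'] = df['name'].apply(clean_text)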

For the colors, I created a new column called colors_only, using a list of colors from a dataset found on data.world here, to pull out any words that are actual colors from the name and color columns of both the product DataFrame and the users DataFrame and eliminate the noise.

Lastly, I turned every color into its own binary column: a 1 if a product is that color or if a user purchased an item in that color, and a 0 if not. Unfortunately, this method of extracting color names from a string is not foolproof. It over-catches words and phrases that merely contain a color name. For example, “Learn French Rosetta Stone” returns both French Rose and Rose as colors. For the purposes of this project, though, I decided to still use it, because if it catches those “colors” in one product it will catch them in another, and that most likely means those products have the same words in the title and are therefore probably similar.
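A simplified sketch of both steps (color_list here stands in for the color names pulled from the data.world dataset; the substring matching is also what produces the French Rose false positive described above):

# keep only the words/phrases that appear in the color list (simple substring matching)
def extract_colors(text):
    text = str(text).lower()
    return [c for c in color_list if c in text]

df['colors_only'] = (df['name'].fillna('') + ' ' + df['color'].fillna('')).apply(extract_colors)

# one binary column per color: 1 if that color was found, 0 if not
for c in color_list:
    df[c] = df['colors_only'].apply(lambda colors, color=c: int(color in colors))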

From here, I already had more than 800 features from the categories alone, so I wanted to attempt to run the model and get an initial score before continuing.

After finding that a lot of products had the exact same departments/categories in common, I realized I needed to add more features after all. I had the product names in a column, so I used CountVectorizer to split the names up into individual words and 2- and 3-gram phrases. I set the binary parameter to True so that each word and phrase became a feature/column: each row/product got a 1 in the column if its name contained that word or phrase and a 0 if it did not, instead of a count of the total times the word or phrase appeared.

Other parameters included a min_df of 2, which requires a word or phrase to appear in at least two product names to count, and the stop words were set to a list of colors I created from my colors dataset. There was no limit set on max features, so I ended up with about 23,000 columns. I was able to eliminate almost 200 of those by taking out any columns with integer-only names.
I joined the new features onto the category features and ran the recommender again.
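Roughly, that vectorizer setup would look like this (a sketch; color_stopwords stands in for the color list used as stop words):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(
    ngram_range=(1, 3),          # single words plus 2- and 3-gram phrases
    binary=True,                 # 1 if the name contains the term, not a count
    min_df=2,                    # term must appear in at least two product names
    stop_words=color_stopwords,  # the color list, so colors stay out of these features
)

name_features = pd.DataFrame(
    cv.fit_transform(df['name']).toarray(),
    columns=cv.get_feature_names_out(),
    index=df.index,
)

# drop the purely numeric "words" (sizes, model numbers, etc.)
name_features = name_features.loc[:, ~name_features.columns.str.isdigit()]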

To iterate further, I joined my color features onto the CountVectorized names and categories. This actually made the scores a little worse, which led me to believe I now had too many features. From there, I kept playing around with the preprocessing parameters until I was happy with the results.

Speaking of results, we can finally get to the fun part and find out what they are, as well as find out why Homer Simpson has been making an appearance this whole time!

Part IV: Modeling a Recommender Engine

