Building a Recommender Engine Part II: Data Transformation

Rachel Koenig
6 min read · Jan 2, 2020


This is part II in a series describing my experience working on a project to build a recommender system. You can read part I here.

In this section, you will learn how I cleaned the text data from my web scrape so it could be used in a model. Some of it overlaps with my previous post about NLP, so I won’t go into great detail on that portion; if you’d like to learn more specifics on that topic, you can find my article here. You will also learn how to concatenate and merge DataFrames.

Data Cleaning Function

I start with the imports.

import pandas as pd
from bs4 import BeautifulSoup

I load in my first .csv file and call it df1.

If you’ve used pandas before, you’re probably familiar with the extra index column that shows up on the left side of the DataFrame when you read in a CSV that was saved with its index. Rather than dropping it later, I like to prevent it from happening at all by adding the index_col=[0] parameter.

df1 = pd.read_csv('scrapes/product8000.csv', index_col=[0])

Then, since I know I have 14 CSVs to clean and combine, I define two functions: one to remove HTML, and one that applies remove_html() and performs a number of other cleaning steps. This way I can easily run the same steps on each CSV without having to rewrite the code every time. Note that each dataset has seven columns: asin, category, color, description, details, name, size.

Define remove_html() function:

def remove_html(text):
    try:
        soup = BeautifulSoup(text, 'lxml')      # Instantiate BeautifulSoup with the lxml parser
        html_free = soup.get_text(strip=True)   # Strip HTML tags, leaving just the text
        return html_free
    except:
        return "missing"                        # if the cell is null, fill it with the string "missing"

Define EDA() function:

def EDA(df):
    # Remove html and strip off brackets and 'Color:' from the color column
    df['color'] = df['color'].apply(lambda x: remove_html(x)).str.replace(']', '').str.replace('[', '').str.replace('Color:', '')

    # Replace escape characters and white space in the category column
    df['category'] = df['category'].str.replace('\n', '').str.replace(' ', '')

    # Split the category column on the '›' symbol, up to 6 times, and return the splits in a new df where each split is a column
    category = df['category'].str.split("›", n=6, expand=True)

    # Rename each category split column and add it onto the original df
    df['department'] = category[0]
    df['demographic'] = category[1]
    df['division'] = category[2]
    df['category'] = category[3]
    df['subcategory'] = category[4]
    df['type'] = category[5]
    df['detail_type'] = category[6]

    # Remove special characters from the description column
    df['description'] = df['description'].str.replace('\t', '').str.replace('\n', '')

    # Remove html and strip off brackets from the details column
    df['details'] = df['details'].apply(lambda x: remove_html(x)).str.replace(']', '').str.replace('[', '')

    # Remove html and cut off 'Size:' from the size column
    df['size'] = df['size'].apply(lambda x: remove_html(x)).str.replace('Size:', '')

    return df.head()

The function EDA() is specific to my data, but it could easily be adapted if you have a similar dataset with different column names. It will:

  • remove HTML, strip off brackets, and remove the word ‘Color:’ from the color column
  • replace escape characters and white space in the category column
  • split the category column on the › symbol up to 6 times, return each split as a new column in a new df, then rename the split columns and add them onto the original df
  • remove special characters from the description column
  • remove HTML and strip off brackets from the details column
  • remove HTML and cut off ‘Size:’ from the size column

Joining Dataframes

After loading all 14 of my CSVs and applying the EDA() function to all of my scraped data, I now want to combine everything into one giant DataFrame.
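
Loading and cleaning all of the files looks roughly like this; the file names and the list name are placeholders, not the exact code from my notebook:

# read each scrape file, clean it with EDA(), and keep the cleaned DataFrames
cleaned_dfs = []
for path in ['scrapes/product8000.csv', 'scrapes/product9000.csv']:   # ...and so on for all 14 files
    df = pd.read_csv(path, index_col=[0])
    EDA(df)                  # EDA() modifies df in place; its return value is just a preview
    cleaned_dfs.append(df)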

To combine them I used pd.concat(). Since I have more than one DataFrame (df), I can create a variable holding the list of df names and then apply the concat function to that list. I wanted my DataFrames to stack on top of each other, not side by side, so I also specified axis=0 for the row axis; this is the default, but I wanted to be explicit. I also included sort=True, which puts the columns (or whichever axis was not concatenated) in alphabetical order (the default is False). The ignore_index=True parameter discards the old indexes so that the new index runs in order from 0 to n-1. I needed this because each df had its own index starting at 0, and once they were stacked the combined index would count 0 to n-1, then start over at 0, and so on for as many DataFrames as were stacked. Lastly, I added .fillna(0) to the end to fill any null cells with a zero. All of this created a new df, which I named total_product.

dataframes = [df1, df2, df3, df4]
total_product = pd.concat(dataframes, axis=0, sort=True,
                          ignore_index=True).fillna(0)
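
To see what ignore_index=True does, here is a tiny toy example (made-up frames, not my data): without it the stacked index repeats, and with it the result is numbered 0 through n-1.

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
pd.concat([a, b]).index.tolist()                      # [0, 1, 0, 1], each df keeps its own index
pd.concat([a, b], ignore_index=True).index.tolist()   # [0, 1, 2, 3], one continuous index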

If you only have two DataFrames to join, it is mostly the same, except both df names go straight into the function as a list.

final_df = pd.concat([total_product, df5], sort=True).fillna(0)

Once I had my final_df, I still had some cleaning to do. I checked the value counts of the name column again and found that I still had some products whose name was unknown.

I need to know the product names because that’s what the user will search by in my final recommender engine, so the unknown rows are of no use to me and I’ll have to drop them. This check also caught my attention because I didn’t think there should be more than one of any product, yet quite a few names were returning 6 and 7 results in the value counts. I suppose it is possible for two products to share a name, though, so to be sure, I checked the asin column, since asins are a unique identifier.

[Image: checking the first 25 rows of the df]
[Image: checking the last 25 rows of the df]
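
In code, those checks are just value_counts() calls on the name and asin columns, something like:

final_df['name'].value_counts().head(25)   # several names show up 6 or 7 times
final_df['asin'].value_counts().head(25)   # some asins repeat too, so the rows really are duplicates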

Sure enough, these value counts showed me that some products definitely got scraped twice, which is a problem! So I proceeded to drop the duplicates and then checked the df shape to confirm the number of rows decreased from the current 19,766.

final_df.drop_duplicates(inplace=True) 
#inplace = True makes it a permanent change
final_df.shape

The output is 16,816 rows now, confirming I deleted almost 3000 rows.

Then I noticed that about 7% of the rows still had an ‘unknown category’ department and another 11% had a department of 0, which meant that none of their subdepartment columns would have values either. I decided to remove these as well because they would create similarities between unrelated products and hurt my model. To do this, I used a boolean mask to keep only the rows of final_df where the department is NOT equal (!=) to unknown category AND also not equal to 0, and saved the result as final_df again.
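
Those percentages can be read straight off the department column’s value counts (a quick check, shown here as a sketch); the filter itself follows:

final_df['department'].value_counts(normalize=True)   # fraction of rows per department value, including 'unknown category' and 0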

final_df = final_df[(final_df['department'] != 'unknown category') & (final_df['department'] != 0)]

This time when I check the shape I’m down to only 13,732 unique rows.

Merging Dataframes

Next, I need to merge my newly collected product content DataFrame, which I call product_info, with my previously acquired Amazon review DataFrame (reviews) using pd.merge(). With this function, I needed to specify which column the two datasets have in common with the on= parameter and which type of join I needed with the how= parameter. I merged on the asin column, and I chose a left join because I wanted to keep all the information from the reviews table even if an asin was not in the product_info table; the default inner join would only keep rows that appear in both tables. I did not have to specify which axis to merge on because pd.merge() can only attach tables side by side.

df = pd.merge(reviews, product_info, on='asin', how='left')

After merging the reviews and product info DataFrames, I check for nulls and see that almost 39% of the rows are missing product information. These are the rows for products that I was not able to scrape. This type of data is called missing at random, and I have to drop it because it doesn’t exist and I have no access to accurate information to fill it in.
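
A hedged sketch of that check and the drop, assuming the scraped name column is what identifies the unmatched rows:

df.isnull().mean()                # fraction of missing values per column; the scraped columns sit near 0.39
df = df.dropna(subset=['name'])   # drop review rows that never matched a scraped product (assumes 'name' only comes from product_info)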

The datasets used for this project can be found on my GitHub.

Part III: Feature Engineering with dummy columns and CountVectorization

Part IV: Modeling & Results!
