Alternatives For Favorite Perfumes

Published in INST414: Data Science Techniques, Dec 4, 2023

Insight

Perfumes are a luxury item that many consumers choose to invest in. Many buyers build collections, which are sometimes diverse and sometimes made up of perfumes with similar scents and notes. Using a Sephora dataset, I chose three target perfumes and looked for the perfumes most similar to each one. A list of similar perfumes can help buyers and collectors find alternatives to their favorite fragrances and build their collections. It can also tell beauty advisors which perfumes to recommend when a customer wants something similar to a specific fragrance.

Data Source and Similarity Metric

For this analysis, I wanted to use perfumes that were listed on the Sephora website. After some research, I found a dataset on Kaggle containing thousands of products and their information from the Sephora website. The dataset contains 9,168 rows and 21 columns, including brand, category, rating, details, ingredients, and more. To measure the similarity between each of the three target perfumes and every other perfume in the dataset, I used the “details” column, which includes each perfume’s fragrance family, scent type, key notes, and description. The similarity metric I chose is Jaccard similarity. In this context, the intersection is the set of words that appear in the details of both perfumes, and the union is the set of all unique words that appear in the details of either perfume. The Jaccard similarity score is the size of the intersection divided by the size of the union, so it is always a value between 0 and 1: 0 means the two perfumes share no words, and 1 means their word sets are identical.
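As a rough illustration of the calculation described above, here is a minimal Python sketch. The function name jaccard_similarity and the two example word sets are made up for illustration; they are not taken from the project code or from the dataset.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are scored 0 to avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


# Invented word sets, for illustration only.
perfume_a = {"warm", "floral", "vanilla", "jasmine"}
perfume_b = {"warm", "sweet", "vanilla", "amber"}

score = jaccard_similarity(perfume_a, perfume_b)
print(score)  # 2 shared words out of 6 unique words
```

Here the two sets share "warm" and "vanilla", and together they contain six unique words, so the score is 2 divided by 6.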

Software Used

To facilitate my analysis, I used Pandas, a Python library for working with tabular data. Pandas allowed me to read the dataset into a DataFrame, select specific columns, and perform some data cleaning on the “details” column. I also used Matplotlib, a Python visualization library, to present my results as bar graphs.

Data Cleaning and Bugs

The Sephora dataset contains rows for all types of products, including makeup, hair products, skincare, makeup tools, gift sets, and more. However, I only needed information about perfumes. To drop the rows and columns that I did not need, I used .loc to return only the rows where the category was perfume, and I kept only the columns for brand, category, name, price, rating, and details. I was originally going to include ingredients as well, but multiple rows listed “unknown” for ingredients. After creating a new DataFrame with only the information that I needed, I converted all the text in the “details” column to lowercase and removed non-alphanumeric characters. I also used split() to split the text in the “details” column into a set of words.
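The filtering and cleaning steps above could look something like the sketch below. It is not the project's actual code: the tiny DataFrame, the lowercase column names, the "Perfume" category label, the detail_words column, and the to_word_set helper are all assumptions made for illustration. It also replaces non-alphanumeric characters with spaces rather than deleting them outright, so that neighboring words do not get fused together.

```python
import re

import pandas as pd

# Tiny stand-in for the Sephora data. The rows and the column names are
# invented for illustration and may not match the real file.
products = pd.DataFrame(
    {
        "brand": ["Brand A", "Brand B", "Brand C"],
        "category": ["Perfume", "Lipstick", "Perfume"],
        "name": ["Scent One", "Red Shade", "Scent Two"],
        "price": [80.0, 20.0, 95.0],
        "rating": [4.5, 4.0, 4.2],
        "details": [
            "Fragrance Family: Warm & Spicy. Key Notes: Vanilla, Jasmine.",
            "A long-wearing matte lipstick.",
            "Fragrance Family: Floral. Key Notes: Rose, Vanilla!",
        ],
    }
)

columns = ["brand", "category", "name", "price", "rating", "details"]

# Keep only the perfume rows and the columns of interest.
perfumes = products.loc[products["category"] == "Perfume", columns].copy()


def to_word_set(text):
    """Lowercase the text, blank out non-alphanumeric characters, split into a set."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return set(cleaned.split())


# Hypothetical helper column holding the set of words for each perfume.
perfumes["detail_words"] = perfumes["details"].apply(to_word_set)
print(perfumes[["name", "detail_words"]])
```

Storing each perfume's words as a set means duplicates within a description are dropped, which is what the Jaccard calculation expects.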

One bug or error that someone could encounter with this analysis is missing values. For example, I originally wanted to calculate Jaccard similarity using the “ingredients” column, but multiple rows contained “unknown.” To work around this problem, I used the “details” column instead, which contains sufficient information about the perfumes, including some ingredients. Another possible issue is an error in calculating the Jaccard similarity scores: if the formula is implemented incorrectly, the scores and the resulting rankings will be inaccurate. Fortunately, I was able to calculate the similarity scores properly.

Results

The three target perfumes I selected were “KAYALI VANILLA | 28” by Kayali, “Juicy Couture” by Juicy Couture, and “Her Eau de Parfum” by Burberry. A for loop was used to iterate through the target perfumes, count the number of elements shared by a target perfume and a non-target perfume (intersection), count the number of unique elements across the two perfumes (union), and then divide the intersection by the union. The similarity scores were then stored in a similarities dictionary.
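A loop of the kind described above might be sketched as follows. The perfume names and word sets are placeholders rather than real data, and the jaccard and top_n helpers, as well as the nested layout of the similarities dictionary, are assumptions and not necessarily how the project code is organized.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented word sets keyed by perfume name (placeholders, not real data).
detail_words = {
    "Target X": {"warm", "vanilla", "amber", "sweet"},
    "Target Y": {"fresh", "citrus", "floral"},
    "Perfume 1": {"warm", "vanilla", "musk"},
    "Perfume 2": {"fresh", "citrus", "green", "floral"},
    "Perfume 3": {"woody", "leather"},
}
targets = ["Target X", "Target Y"]

# similarities[target][other] holds the score for that pair.
similarities = {}
for target in targets:
    scores = {}
    for name, words in detail_words.items():
        if name in targets:
            continue  # compare targets only against non-target perfumes
        scores[name] = jaccard(detail_words[target], words)
    similarities[target] = scores


def top_n(scores, n=10):
    """Return the n highest-scoring (name, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


for target in targets:
    print(target, top_n(similarities[target], n=2))
```

Sorting each target's scores in descending order and keeping the first ten entries gives a top 10 list like the ones shown in the results.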

Below are the top 10 similar perfumes for each target perfume based on their Jaccard similarity score.

Below is a DataFrame which shows more details about the similar perfumes.

Limitations and Biases

One limitation of this analysis is that the dataset is not up to date; it was last updated around 4 years ago. This means that there might be newer perfumes that are more similar to the target perfumes than the ones found in this analysis. Another limitation is the inclusion of different sizes of the same perfume. The dataset includes regular-sized perfumes as well as mini and rollerball versions of them, so some results listed both a regular perfume and a rollerball version of that same perfume. My analysis could also have used further data cleaning. The “details” column often contained long descriptions, and the similarity scores could have been more accurate if it had included only the fragrance family, scent type, and key notes and excluded the descriptions.

One possible bias in this analysis is incomplete comparison. Comparing ingredients could have been a more direct way to compare perfumes and find similarities. Because the details were long and wordy, some words that say little about the scent may have been used in the calculations.

You can find the code for this analysis here: https://github.com/adasti/INST414

You can find the Sephora dataset here: https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website/data