Wine Recommendation System

Saleena John
Web Mining [IS688, Spring 2021]
5 min read · May 8, 2021

Apart from tasting good, wine is a complex topic intertwined with history, culture, geology and genetics. For a long time, wines were a regional delicacy: only certain regions of the world could produce certain kinds of wine, mainly because of regional peculiarities that help in fermenting the wine. Now that technology has evolved, wine can be produced in nearly every part of the world, which has made it affordable for most consumers. There are six main variants of wine (in reality there could be more), each with many sub-variants, so it can be confusing to choose a wine as a gift or for yourself.

So, in this project, I use the Wine Reviews data set from Kaggle to build a wine recommendation system that can suggest wines according to variant, price or region of origin. The data set contains wine reviews written by different customers and wine tasters across the globe.

The data set came as two different .csv files, which I concatenated into one data set:

import numpy as np
import pandas as pd
import seaborn as sns
import missingno as msno
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

path = '/Users/saleenajohn/Desktop/WebMining/Assignment5/'
wine_150k = pd.read_csv(path + 'winemag-data_first150k.csv', index_col=0)
wine_130k = pd.read_csv(path + 'winemag-data-130k-v2.csv', index_col=0)
wine = pd.concat([wine_150k, wine_130k], axis=0)
print("Number of rows and columns:", wine.shape)
Output:
Number of rows and columns: (280901, 13)
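The combined frame has as many columns as the union of the columns in both files, because pd.concat keeps every column and fills the gaps with NaN. The sketch below shows this behavior on two made-up frames; the column values and the taster name are hypothetical and are not taken from the Kaggle files.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (all values are made up).
older = pd.DataFrame({"country": ["US", "Spain"], "points": [96, 95]})
newer = pd.DataFrame({
    "country": ["Italy"],
    "points": [87],
    "taster_name": ["Hypothetical Taster"],  # column present in only one frame
})

# Rows are stacked; the result keeps the union of the columns,
# and cells that a frame did not provide become NaN.
combined = pd.concat([older, newer], axis=0)
missing_tasters = int(combined["taster_name"].isna().sum())

print(combined.shape)
print(missing_tasters)
```

This also explains why some columns in the merged data have many missing values: rows from the file that lacks a column have nothing to contribute to it.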

There are 13 columns in the data set, including:

  • country: The country that the wine is from.
  • description: Description about the wine.
  • designation: The vineyard within the winery where the grapes that made the wine are from.
  • points: The number of points WineEnthusiast rated the wine on a scale of 1–100 (though they say they only post reviews for wines that score 80 or above).
  • price: The cost for a bottle of the wine in USD.
  • province: The province or state that the wine is from.
  • region_1: The wine growing area in a province or state (e.g., Napa).
  • region_2: Sometimes a more specific region is given within a wine growing area (e.g., Rutherford inside the Napa Valley), but this value can be blank.
  • taster_name: Name of the consumer or wine taster.

Data Analysis

Description of the columns

wine.describe(include='all').T

In order to clean the data we have to find the missing values.

msno.bar(wine, color=sns.color_palette('viridis'))

From the bar chart we can see that only description, points and winery columns have no missing fields.
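The same information can be read as numbers instead of a chart. As a minimal sketch on a made-up frame (the rows below are hypothetical, not from the data set), isna() combined with sum() or mean() gives the count and share of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up for illustration).
toy = pd.DataFrame({
    "description": ["ripe fruit", "crisp", "earthy"],
    "price": [20.0, np.nan, 35.0],
    "region_2": [np.nan, np.nan, "Napa"],
})

# Number of missing cells per column.
missing_counts = toy.isna().sum()
# Fraction of missing cells per column, rounded for readability.
missing_share = toy.isna().mean().round(2)

print(missing_counts)
print(missing_share)
```

On the real data, the equivalent call would be wine.isna().sum(); columns with a count of zero are the ones the bar chart shows as complete.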

Now, before jumping to recommendations, let’s find out some interesting insights from these data.

The graph of wine reviews per country shows some discrepancies with real-world production. The largest wine-producing country in the world is Italy, followed by France. So why does this graph show the US as the top country? The main reason is that this is not a wine production data set but a wine reviews data set: the counts depend on the wines that the reviewers tasted.
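The code behind this chart is not included in the article. One way such per-country counts could be computed is with value_counts, sketched here on made-up rows rather than the real data:

```python
import pandas as pd

# Toy review table: one row per review (values are made up).
toy = pd.DataFrame({"country": ["US", "US", "Italy", "France", "US", "Italy"]})

# Count reviews per country, most reviewed first.
reviews_per_country = toy["country"].value_counts()
print(reviews_per_country)

# reviews_per_country.plot(kind="bar") would draw the bar chart
# (requires matplotlib; left out so the sketch has no plotting dependency).
```

Because each row is a review, the count reflects how often a country's wines were reviewed, not how much wine the country produces.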

Most wines are priced under $300, and only very few cost more than $750.

The country-wise average prices show that wines are cheaper in North African countries and more expensive in Europe and North America.
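A country-wise average like this is a group-by aggregation. The following is a hedged sketch on made-up prices (the article does not show its own plotting code, so this is only one possible way to compute it):

```python
import numpy as np
import pandas as pd

# Toy price table (values are made up for illustration).
toy = pd.DataFrame({
    "country": ["France", "France", "US", "US", "Morocco"],
    "price": [60.0, 40.0, 30.0, np.nan, 15.0],
})

# Mean price per country; NaN prices are skipped by mean().
avg_price = (
    toy.groupby("country")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price)
```

Since missing prices are ignored rather than treated as zero, a country with many unpriced reviews is averaged only over the reviews that do list a price.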

The figure above shows the number of wine tasters per country. The US has the largest number of wine tasters, which explains the distribution in the wine reviews graph.

From the figure above, we can see that France sells the most expensive wines, followed by the US and Austria, while the lowest-priced wines are sold by Portugal, Argentina and Spain. The US has consumers of both expensive and cheap wines. European countries generally sell average-priced wines.

Collaborative Filtering

For this project, I use collaborative filtering to recommend wine varieties. Collaborative filtering looks for patterns by pooling the ratings of similar items. This project specifically uses a nearest-neighbor algorithm, with similarity measured by cosine similarity, defined as the cosine of the angle between two item vectors. The cosine grows toward 1 as the items become more alike. Note that scikit-learn's metric='cosine' reports the cosine distance, which is 1 minus the cosine similarity, so a smaller distance means a more similar wine.
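To make the similarity measure concrete, here is a small worked example. The three vectors are made-up "points per province" profiles, and cosine_similarity is a helper written for this sketch, not a function from the article:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up item vectors: one entry per province, 0 means "not rated there".
a = np.array([90.0, 0.0, 88.0])
b = np.array([45.0, 0.0, 44.0])  # same direction as a, half the length
c = np.array([0.0, 92.0, 0.0])   # shares no province with a

sim_ab = cosine_similarity(a, b)  # close to 1: same direction
sim_ac = cosine_similarity(a, c)  # 0: orthogonal vectors
dist_ab = 1.0 - sim_ab            # cosine distance, close to 0

print(sim_ab, sim_ac, dist_ab)
```

The example also shows that cosine similarity ignores vector length: b has half the magnitude of a but counts as an exact match, because only the direction of the vectors matters.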

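# Keep one rating per (province, variety) pair, restricted to wines scoring above 85
col = ['province', 'variety', 'points']
wine1 = wine[col]
wine1 = wine1.dropna(axis=0)
wine1 = wine1.drop_duplicates(['province', 'variety'])
wine1 = wine1[wine1['points'] > 85]
# Rows are varieties, columns are provinces; unrated pairs become 0
wine_pivot = wine1.pivot(index='variety', columns='province', values='points').fillna(0)
wine_pivot_matrix = csr_matrix(wine_pivot.values)
# Brute-force nearest neighbors under cosine distance
knn = NearestNeighbors(n_neighbors=10, algorithm='brute', metric='cosine')
model_knn = knn.fit(wine_pivot_matrix)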

Prediction

Now, using the above algorithm, let’s find the top five recommendations for a randomly selected wine variety.

# Ask for 6 neighbors: the query variety itself plus 5 recommendations
query_index = np.random.choice(wine_pivot.shape[0])
distance, indice = model_knn.kneighbors(wine_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)
for i in range(0, len(distance.flatten())):
    if i == 0:
        print('Recommendation for {0}:\n'.format(wine_pivot.index[query_index]))
    else:
        print('{0}: {1} with distance: {2}'.format(i, wine_pivot.index[indice.flatten()[i]], distance.flatten()[i]))
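The whole pipeline can be tried end to end without the Kaggle files. The sketch below rebuilds it on a tiny made-up pivot table; the scores are invented, recommend is a hypothetical helper, and the model is fit on a dense array instead of a sparse matrix because the toy table is so small.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy variety-by-province table of points (all numbers are made up).
toy_pivot = pd.DataFrame(
    {
        "California": [90, 88, 0, 0],
        "Oregon": [87, 90, 0, 0],
        "Tuscany": [0, 0, 91, 0],
        "Sicily": [0, 0, 86, 89],
    },
    index=["Pinot Noir", "Chardonnay", "Sangiovese", "Nero d'Avola"],
)

knn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine")
knn.fit(toy_pivot.values)

def recommend(variety, k=1):
    """Return the k nearest varieties to `variety` with their cosine distances."""
    row = toy_pivot.loc[[variety]].values
    distances, indices = knn.kneighbors(row, n_neighbors=k + 1)
    # The nearest hit is the query variety itself (distance about 0), so skip it.
    return [
        (toy_pivot.index[i], float(d))
        for i, d in zip(indices.flatten()[1:], distances.flatten()[1:])
    ]

result = recommend("Pinot Noir")
print(result)
```

Here Pinot Noir and Chardonnay are rated in the same two provinces with similar scores, so their vectors point in almost the same direction and Chardonnay comes back as the nearest neighbor.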

But a random selection of wine is not what most consumers need. So I constructed a pivot table to filter recommendations by a specific wine that the user has already tasted.

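# wine_rec is not defined earlier in the article; assumed here to be the price, variety and points columns of wine
wine_rec = wine[['price', 'variety', 'points']]
# Rows are prices, columns are varieties, cells hold the mean points
pivot_table = wine_rec.pivot_table(index=["price"], columns=["variety"], values="points")
pivot_table.head(10)
# Correlate every variety's points-by-price profile with the tasted wine
tasted_wine = pivot_table["Aglianico"]
similarity_with_other_wines = pivot_table.corrwith(tasted_wine)
similarity_with_other_wines = similarity_with_other_wines.sort_values(ascending=False)
similarity_with_other_wines.head()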

Using this technique, users can find recommendations for a wine that they have tasted.
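To see what the correlation step does, here is a minimal sketch on made-up reviews. The variety names are real, but every price and score below is invented for illustration:

```python
import pandas as pd

# Toy reviews: three varieties, each rated at three price points (made-up values).
toy_reviews = pd.DataFrame({
    "price": [10, 20, 30, 10, 20, 30, 10, 20, 30],
    "variety": ["Aglianico"] * 3 + ["Merlot"] * 3 + ["Riesling"] * 3,
    "points": [85, 88, 91, 84, 87, 90, 90, 88, 85],
})

# Rows are prices, columns are varieties, cells are mean points.
pivot = toy_reviews.pivot_table(index="price", columns="variety", values="points")

# Pearson correlation of each variety's column with the tasted wine's column.
tasted = pivot["Aglianico"]
similarity = pivot.corrwith(tasted).sort_values(ascending=False)

# Drop the tasted wine itself before picking the best match.
best_match = similarity.drop("Aglianico").idxmax()
print(similarity)
print(best_match)
```

In this toy data, Merlot's score rises with price exactly as Aglianico's does, so it correlates strongly, while Riesling's score falls with price and gets a negative correlation. A high correlation therefore means "points change with price in the same way", which is a narrower notion of similarity than taste.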
