Machine Learning/Statistical Models
Since we are not trying to predict anything, the only machine learning techniques applicable to our project are unsupervised ones. Clustering, however, is not a natural fit: our model is built on words, and raw words have no meaningful distance between them, so there is nothing sensible to cluster on.
Therefore, we calculate the cosine similarity of wines in the feature space of their descriptive words. An example abstraction of this is below:
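Since we are creating a feature space of over 200,000 words, the model is abstract and very high-dimensional. Each wine is a vector whose weight in each dimension is the TF-IDF value of the corresponding word in that wine's review. The cosine similarity between two wines is then the cosine of the angle between their vectors: values near 1 mean the wines are described with very similar words, and 0 means they share no weighted words at all.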
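As a minimal sketch of the cosine similarity calculation, the function below treats each wine as a sparse vector stored as a word-to-weight dictionary. The function name, the wine names, and the weights are hypothetical and chosen only for illustration; they are not taken from the project data or code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention used here: an empty vector is similar to nothing.
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF weights for three wines.
wine_a = {"cherry": 0.8, "oak": 0.6}
wine_b = {"cherry": 0.6, "oak": 0.8}
wine_c = {"citrus": 1.0}

print(cosine_similarity(wine_a, wine_b))  # high: the two wines share both words
print(cosine_similarity(wine_a, wine_c))  # zero: no words in common
```

Storing only the non-zero weights keeps each vector small even when the full vocabulary has hundreds of thousands of dimensions.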
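TF-IDF stands for Term Frequency - Inverse Document Frequency. Essentially, TF-IDF vectorization weights each wine-word combination by the frequency of the word within that wine's review, scaled down by how common the word is across all wine reviews. In most implementations the scaling factor is the logarithm of the inverse document frequency rather than a plain division. As a result, words that appear in nearly every review carry little weight, while words that are distinctive to a few wines carry a lot.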
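The weighting can be sketched as follows, assuming the common textbook variant tf * log(N / df) and simple whitespace tokenization. Both are assumptions made for illustration, as are the function name and the sample reviews; the project's actual vectorizer may use different smoothing, normalization, or tokenization.

```python
import math
from collections import Counter


def tfidf_vectors(reviews):
    """Return one {word: weight} dict per review, using tf * log(N / df)."""
    tokenized = [review.lower().split() for review in reviews]
    n_docs = len(tokenized)
    # Document frequency: the number of reviews that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency within the review times log inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


# Hypothetical reviews, not taken from the project data.
reviews = [
    "ripe cherry and oak",
    "bright citrus and lime",
    "dark cherry and spice",
]
vectors = tfidf_vectors(reviews)
print(vectors[0])
```

In this toy example "and" appears in every review, so its weight is zero, while "oak" appears in only one review and receives a larger weight than "cherry", which appears in two.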
Ultimately, we are left with a matrix of each wine’s cosine similarity to every other wine. In this way, given a wine we can return the top ten most similar wines.
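The lookup step can be illustrated with a small sketch that builds the pairwise similarity matrix and returns the most similar wines for a given wine. The helper names, wine names, and weights are hypothetical, and a nested dictionary stands in for the matrix; with the full data set a sparse matrix library would be a more practical choice.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Similarity of every wine to every other wine, as nested dicts."""
    names = list(vectors)
    return {a: {b: cosine(vectors[a], vectors[b]) for b in names} for a in names}


def most_similar(matrix, wine, top_n=10):
    """Return the top_n (name, score) pairs for a wine, excluding the wine itself."""
    others = [(name, score) for name, score in matrix[wine].items() if name != wine]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:top_n]


# Hypothetical TF-IDF vectors for four wines.
vectors = {
    "Wine A": {"cherry": 0.9, "oak": 0.4},
    "Wine B": {"cherry": 0.7, "spice": 0.5},
    "Wine C": {"citrus": 0.8, "lime": 0.6},
    "Wine D": {"oak": 0.9, "cherry": 0.1},
}
matrix = similarity_matrix(vectors)
print(most_similar(matrix, "Wine A", top_n=2))
```

The wine itself is excluded from its own results because its similarity to itself is always 1 and would otherwise take the first slot.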
In addition to our wine-to-wine recommender, we also created a tasting-note-to-wine recommender. It takes in a string of tasting notes and returns the five wines most similar to that string. To do this, we match the string to the wines whose reviews most frequently use the words in the string. This is essentially a more efficient way of approximating the cosine similarity between the given string and each wine's review.
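One way this matching could work is sketched below, assuming an inverted index from each word to the number of times it appears in each wine's review. The index structure, function names, and sample reviews are assumptions for illustration only; the source does not specify how the matching was implemented.

```python
from collections import Counter, defaultdict


def build_index(reviews):
    """Map each word to a Counter of how often it appears in each wine's review."""
    index = defaultdict(Counter)
    for wine, text in reviews.items():
        for word in text.lower().split():
            index[word][wine] += 1
    return index


def recommend_from_notes(index, notes, top_n=5):
    """Rank wines by how often their reviews use the words in the tasting notes."""
    scores = Counter()
    for word in notes.lower().split():
        # .get avoids adding empty entries to the index for unseen words.
        scores.update(index.get(word, {}))
    return [wine for wine, _ in scores.most_common(top_n)]


# Hypothetical reviews, not taken from the project data.
reviews = {
    "Wine A": "ripe cherry and toasted oak with cherry finish",
    "Wine B": "bright citrus and lime",
    "Wine C": "dark cherry and spice",
}
index = build_index(reviews)
print(recommend_from_notes(index, "cherry oak"))
```

Only the wines that share at least one word with the query are ever scored, which is where the efficiency gain over comparing the string against every review comes from.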