Exploration Of Wines

Hina Sharma
9 min read · Apr 26, 2018


Outline of the project:

In this project, as the name "Exploration Of Wines" suggests, we are going to identify different varieties of wine based on their review descriptions, by analysing the text and applying some text-based Machine Learning predictive models.

I know most of you are familiar with wines ;-) But before moving on to the ML models, let's start with a basic introduction.

What is wine ?

Wine is an alcoholic beverage made from the fermented juice of grapes, though wine grapes are different from the table grapes you'll find at the grocery store. Technically, wine can be made from any fruit (e.g. apples, cranberries, plums), but most wine is made with wine grapes.

Wine grapes (the Latin name is Vitis vinifera) are small and sweet, have thick skins, and contain seeds. There are over 1,300 identified commercial wine grape varieties, but only about 150 of them make up the majority of the wine made in the world.

Speaking of differences, the difference between wine and beer is that beer is made by fermenting brewed grains. So, very simply, wine is made from fruit and beer is made from grains [for more details].

Dataset:

The first step is to collect the data set, which you can find on Kaggle (link). It contains 14 columns and about 130k rows of wine reviews. Once you have downloaded the data set, you can load it with pandas in Python:

import pandas as pd

data = pd.read_csv('winemag-data-130k-v2.csv')

The data consists of 14 fields:

  • Country: the country the wine is from.
  • Description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
  • Designation: the vineyard within the winery where the grapes that made the wine are from.
  • Variety: the type of grapes used to make the wine (e.g. Riesling).
  • Points: the number of points WineEnthusiast rated the wine on a scale of 1–100 (though they say they only post reviews for wines that score >= 80).
  • Price: the cost of a bottle of the wine.
  • Province: the province or state the wine is from.
  • Region_1: the wine-growing area within a province or state.
  • Region_2: sometimes a more specific region within a wine-growing area (e.g. Rutherford inside the Napa Valley).
  • Taster_name: the name of the person who tasted the wine.
  • Taster_twitter_handle: the Twitter handle of the taster.
  • Title: a brief title of the wine review.
  • Winery: the winery that made the wine.
# Print the first few rows of the data frame
data.head()

As you can see, some features contain NaN values, and some of them (like taster_name) are not useful for identifying the variety. So we have to do some preprocessing on the data before we move further.

Reading and Cleaning of data:

In my project, I am only keeping certain features and removing the rest of them.

data = data[['description', 'designation', 'country', 'title', 'variety', 'price', 'winery']]

After removing those features, the next task is to check for null values in the remaining columns.
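A quick way to get these percentages is a loop like the one below (a minimal sketch; the exact snippet is not shown in the post, so the list of columns checked here is an assumption):

for col in ['country', 'designation', 'variety', 'price']:
    null_pct = data[col].isnull().sum() * 100.0 / len(data)
    print(null_pct, '% of data points where', col, 'is null')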

0.04847235152457087 % of data points where country is null
28.825661109016625 % of data points where designation is null
0.0007694024051519185 % of data points where variety is null
6.921544036746659 % of data points where price is null

So now I am going to drop all the rows that contain null values (identified with the isnull() method).
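A minimal sketch of this step, assuming a straightforward dropna() over the selected columns:

before = len(data)
data = data.dropna()
print(len(data) * 100.0 / before, '% of data remained after eliminating null values')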

66.28247839902748 % of data remained after eliminating null values

Check for duplicates:

You can see the number of duplicates using the code below:

dup_description = sum(data.duplicated('description'))
dup_title = sum(data.duplicated('title'))
print(dup_description)
print(dup_title)

I decided to drop all duplicates based on the description column and the title column.
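A minimal sketch of that step, assuming duplicates are dropped on each of the two columns in turn:

data = data.drop_duplicates(subset='description')
data = data.drop_duplicates(subset='title')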

Exploratory Analysis:

count = data['variety'].value_counts()
count.tail(20)

As you can see from the counts above, some wine varieties have very few reviews, so I am only keeping the varieties that occur more than 100 times.

data = data.groupby('variety').filter(lambda x: len(x) > 100)

Removal of stop words with NLTK in python

Definition of stop words: a stop word is a commonly used word (such as "the", "a", "an", "in"). There is no single universal list of stop words used by all natural language processing tools, and not all tools even use such a list. Some tools specifically avoid removing stop words to support phrase search [for more details].

We do not want these words taking up space in our database or valuable processing time, because they carry little information for text problems like this one. We can remove them easily by keeping a list of the words we consider stop words. NLTK (Natural Language Toolkit) in Python provides such a list of stop words.

list of stop words: {'not', 'an', 'other', 'mightn', 'didn', 'myself', 'was', 'are', 'very', 'be', 'during', 'over', 'only', 'we', 'both', 'have', 'which', 'his', 'its', 'them', 'against', 'ma', 'after', 'more', 'most', 'won', 'your', 'do', 'you', 'o', 'isn', 'so', 'has', 'shan', 'own', 'couldn', 'their', 'my', 'yours', 'nor', 'is', 'when', 'by', 'each', 'm', 'am', 'don', 'd', 'in', 'further', 'all', 'these', 'does', 'ain', 'weren', 're', 'haven', 'hers', 'he', 'did', 'ours', 'hasn', 'himself', 'she', 'having', 'for', 'once', 'they', 'can', 'wasn', 'theirs', 'ourselves', 'under', 'hadn', 'too', 'any', 'no', 'before', 'shouldn', 'through', 't', 'yourself', 'those', 'had', 'needn', 'because', 'at', 'been', 'to', 'just', 'our', 'who', 'him', 'yourselves', 'should', 'same', 'then', 'and', 've', 'that', 'y', 'were', 'where', 'doesn', 'while', 'as', 'down', 'out', 'how', 'above', 'again', 'until', 'there', 'why', 'herself', 'it', 'here', 'some', 'will', 'i', 'itself', 'of', 'her', 'from', 'between', 'doing', 'this', 'what', 'a', 'such', 'wouldn', 'off', 'with', 'or', 'up', 'the', 'll', 'few', 'themselves', 'if', 'but', 'on', 's', 'about', 'me', 'than', 'now', 'being', 'whom', 'mustn', 'into', 'aren', 'below'}

Since we are going to use the text in the description to identify the variety of wine, we have to remove all these stop words from data['description'].

from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stopwords corpus is not yet available
stop_words = set(stopwords.words('english'))

def nlp_preprocessing(total_text, index, column):
    if type(total_text) is not int:
        string = ""
        for words in total_text.split():
            # remove special chars in the review like '"#$@!%^&*()_+-~?>< etc.
            word = "".join(e for e in words if e.isalnum())
            # convert all letters to lower-case
            word = word.lower()
            # stop-word removal
            if word not in stop_words:
                string += word + " "
        data.loc[index, column] = string

for index, row in data.iterrows():
    nlp_preprocessing(row['description'], index, 'description')

You can see the difference in text data of description feature below:

[Image: description text before removal of stop words]
[Image: description text after removal of stop words]

Training and Testing data:

The reason for splitting the data set: evaluating the performance of a classifier on the same set it was trained on is poor practice, because we are not interested in how well the classifier memorizes the training set. Rather, we are interested in how well the classifier generalizes to unseen data [for more details].

Before dividing our data set into three subsets (train, cross-validation, test), let's encode the different classes, i.e. the varieties, as unique labels 0, 1, 2, ..., 67, since there are 68 unique varieties of wine.

data['variety'] = data['variety'].astype('category')
cat_columns = data.select_dtypes(['category']).columns
data[cat_columns] = data[cat_columns].apply(lambda x: x.cat.codes)

Now we can randomly split our data set into 64% training data, 16% cross-validation data, and 20% test data using the train_test_split method of sklearn in Python.
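A minimal sketch of such a split (the variable names and random_state are illustrative, not taken from the post):

from sklearn.model_selection import train_test_split

X = data['description']
y = data['variety']

# hold out 20% as the test set, then take 20% of the remaining 80%
# (i.e. 16% of the whole data) as the cross-validation set, leaving 64% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.2, random_state=42)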

Conversion of text into vectors:

We cannot work with text directly when using machine learning algorithms; we need to convert the text to numbers. When we classify text or documents, each document is the "input" and a class label is the "output" of our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers [for more details].

I. Using Bag of Words:

The bag-of-words model is one of the simplest feature extraction techniques for text: each description is represented by the counts of the words it contains. We'll use the CountVectorizer() class of the sklearn library in Python to create vectors from the descriptions.
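A minimal sketch of the bag-of-words featurization, assuming the train/CV/test splits from above:

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train)   # learn the vocabulary on the training descriptions only
X_cv_bow = bow.transform(X_cv)
X_test_bow = bow.transform(X_test)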

II. Using Tfidf:

tf-idf or TFIDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The tf-idf value increases with the number of times a word appears in the document and is offset by the frequency of the word across the corpus, which adjusts for the fact that some words appear more frequently in general [for more details]. We'll use the TfidfVectorizer() class of the sklearn library in Python to create vectors from the descriptions.
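A minimal sketch of the TF-IDF featurization, again assuming the splits from above:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_cv_tfidf = tfidf.transform(X_cv)
X_test_tfidf = tfidf.transform(X_test)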

III. Using Word2vec:

Word2vec is a two-layer neural net and another common technique for processing text. Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.

The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply other ML models[for more details].
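The code later in this post looks words up in a 300-dimensional word2vec model. A minimal sketch of loading such a model with gensim (the use of the pretrained Google News vectors and the file name are assumptions, based only on the comment in that code):

from gensim.models import KeyedVectors

# pretrained 300-dimensional Google News vectors (file path assumed)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# words that actually have a vector; used to skip out-of-vocabulary words later
vocab = set(model.index_to_key)   # on gensim < 4.0 use set(model.vocab) instead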

IV. Tfidf word2vec:

After we compute the TF-IDF scores, we convert each description into a weighted average of its word2vec vectors, using those scores as weights.

import numpy as np

# tfidf is the fitted TfidfVectorizer and tfidf_desc_features the TF-IDF matrix of the descriptions
# (from the previous step); model and vocab come from the word2vec model loaded earlier
def get_word_vec(sentence, doc_id, m_name):
    # sentence: description of the wine
    # doc_id: description id in our corpus
    # m_name: model information, it takes two values
    #   if m_name == 'avg', we append model[i], the w2v representation of word i
    #   if m_name == 'weighted', we multiply each w2v[word] with its tfidf(word) score
    vec = []
    for i in sentence.split():
        if i in vocab:
            if m_name == 'weighted' and i in tfidf.vocabulary_:
                vec.append(tfidf_desc_features[doc_id, tfidf.vocabulary_[i]] * model[i])
            elif m_name == 'avg':
                vec.append(model[i])
        else:
            # if a word from our corpus is not in the word2vec vocabulary, use a zero vector for it
            vec.append(np.zeros(shape=(300,)))

    # each row of vec is the (weighted/avg) word2vec representation of one word in the sentence
    vec = np.array(vec)
    vec = vec.mean(axis=0)
    return vec

tfidf_w2v_descr = []
doc_id = 0
for i in data['description']:
    tfidf_w2v_descr.append(get_word_vec(i, doc_id, 'weighted'))
    doc_id += 1
[Image: result of converting descriptions into 300-dimensional vectors using word2vec]
[Image: result of converting descriptions into 300-dimensional vectors using tfidf word2vec]

Applying ML models:

Before training any classification algorithm, let's first look at some constraints and the performance metric we will follow:

Constraints:

  • Low-latency requirement.
  • Interpretability is not that important.
  • Errors should not be very costly.
  • An actual prediction of the class each data point belongs to is needed.

Performance Metric(s):

  • Accuracy score.

Results:

Comparing the accuracy of the models on each featurization, logistic regression with TF-IDF features gives better results than the other combinations.
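A minimal sketch of training and scoring one such model, logistic regression on the TF-IDF features (the hyperparameters are illustrative defaults, not necessarily the ones used in the post):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

print('CV accuracy  :', accuracy_score(y_cv, clf.predict(X_cv_tfidf)))
print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test_tfidf)))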

That's all for now! We could include other features or try other models to see if the accuracy increases, but for now I'll settle with this and have a glass of wine myself ;-)

Thanks for reading…. Cheers!
