Machine Learning: Intent Classification

Shivansh Dave
Published in Analytics Vidhya · 4 min read · Jun 8, 2020

In today's world, where people pour their hearts and thoughts onto social media, a person's desires and intentions can be inferred from what they share and comment on. These comments can be a strong signal for finding potential customers for any product. Recently, I built a project to understand whether a comment shows an intent to purchase or not, a task called Intent Classification.

A little bit about intent classification

Intent classification is a part of Natural Language Processing (NLP) that targets the classification of text into various categories, for a better understanding of the text. For example, websites that check grammar and sentence errors also provide a feature that shows the tone of a sentence. Now, without any delay, let's dive into the code.

Data preprocessing

For this project, I used data consisting of comments about a product, each labelled with an intent: PI (potential intent) or no. The data was saved as a comma-separated values (CSV) file.

Figure 1
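For illustration, a few made-up rows in the same shape (hypothetical column names and comments; the real file differs):

index,class,tweets
0,PI,"Where can I buy this phone? It looks amazing"
1,no,"The weather is really nice today"
2,PI,"I need a new laptop, is this one worth the price?"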

The code for reading and pre-processing the data is below:

import pandas as pd

df = pd.read_csv("intent_data.csv")
# Removing all the duplicate rows
df.drop_duplicates(inplace=True)
# Renaming column names for easy understanding
df.columns = ["index", "class", "tweets"]

So far, we have all the data stored in a data frame with the columns "index", "class" (the purchase-intent label) and "tweets" (the comment about the product). Our class data is still text, which is not suitable for training a model, so we will convert it to binary, where 1 means purchase intent and 0 means no purchase intent.

def prepareY(dfy):
    dfy = dfy.str.lower()
    dfy[dfy == 'pi'] = 1
    dfy[dfy == 'no'] = 0
    return dfy.astype('int')  # ensure numeric labels

df['class'] = prepareY(df['class']).copy()
y = df.iloc[:, 1].values  # Labels, ready for training
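With the labels converted, a quick look at their distribution confirms the encoding (the exact counts depend on the data):

print(df['class'].value_counts())  # rows per label: 1 = purchase intent, 0 = no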

Now, we need to process our text data for the training. Normal sentences contain a lot of punctuation, as well as redundant words, and these create unnecessary noise while training. Sentences also contain different forms of the same word, which makes it harder for the model to learn its meaning; for this problem, we will use lemmatization. So, in the code below we restructure our tweets column.

import re, string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = df.iloc[:, 2].values
wordnet_lemmatizer = WordNetLemmatizer()
stops = stopwords.words('english')
nonan = re.compile(r'[^a-zA-Z ]')  # keep only letters and spaces
# Part-of-speech tags to keep (nouns, verbs, adjectives, adverbs, ...)
pos = ['NN', 'NNS', 'NNP', 'NNPS', 'RP', 'MD', 'FW', 'VBZ', 'VBD', 'VBG',
       'VBN', 'VBP', 'RBR', 'JJ', 'RB', 'RBS', 'PDT', 'JJR', 'JJS', 'TO', 'VB']

x = []
for i in range(len(text)):
    sentence = nonan.sub('', text[i])
    words = word_tokenize(sentence.lower())
    # Drop digits, stop words and punctuation
    filtered_words = [w for w in words if not w.isdigit()
                      and w not in stops and w not in string.punctuation]
    tags = pos_tag(filtered_words)
    cleaned = ''
    for word, tag in tags:
        if tag in pos:
            cleaned = cleaned + wordnet_lemmatizer.lemmatize(word) + ' '
    x.append(cleaned)

After processing, our text data from Figure 1 would look like the image below, with stop words and punctuation removed.
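As a rough illustration with a made-up comment (note that lemmatize() defaults to treating words as nouns, so verb forms like "buying" pass through unchanged):

sample = "I was thinking about buying these headphones!!!"
sample_words = word_tokenize(nonan.sub('', sample).lower())
sample_filtered = [w for w in sample_words if w not in stops]
print(' '.join(wordnet_lemmatizer.lemmatize(w) for w in sample_filtered))
# -> thinking buying headphone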

Model training

Now, we will move towards the training part. For the training, I transformed the data into TF-IDF vectors and trained an XGBoost classifier.
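For intuition, TF-IDF gives a word a high score in a comment when it is frequent in that comment but rare across all comments, so generic words carry little weight. A tiny standalone sketch with made-up sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

demo = TfidfVectorizer()
m = demo.fit_transform(["want buy phone", "nice phone", "want buy laptop"])
print(sorted(demo.vocabulary_))  # ['buy', 'laptop', 'nice', 'phone', 'want']
print(m.toarray().round(2))      # one TF-IDF row per toy sentence

Applied to our cleaned comments, the transformation and training look like this: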

from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Convert the cleaned comments into TF-IDF features
tfidf_vectorize = TfidfVectorizer()
vectors = tfidf_vectorize.fit_transform(x)
features = tfidf_vectorize.get_feature_names()
dense = vectors.todense().tolist()
x = pd.DataFrame(dense, columns=features)

# Splitting data into 80% train and 20% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

classifier = XGBClassifier()
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)
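To try the trained classifier on a brand-new comment, the raw text has to go through the same cleaning steps and the already-fitted vectorizer. The helper below is a minimal sketch of that flow (predict_intent is a hypothetical name, not from the original post); note that it calls transform, never fit_transform, at prediction time:

def predict_intent(comment):
    # Hypothetical helper: repeats the cleaning from the pre-processing loop
    words = word_tokenize(nonan.sub('', comment).lower())
    filtered = [w for w in words if not w.isdigit()
                and w not in stops and w not in string.punctuation]
    cleaned = ' '.join(wordnet_lemmatizer.lemmatize(w)
                       for w, t in pos_tag(filtered) if t in pos)
    # Re-use the fitted TF-IDF vectorizer and classifier
    vec = tfidf_vectorize.transform([cleaned]).todense().tolist()
    return classifier.predict(pd.DataFrame(vec, columns=features))[0]

print(predict_intent("I really want to buy this phone"))  # 1 means purchase intent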

Now, let's plot the performance of the model with cross-validation.

import numpy
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None,
                        ylim=None, cv=None, n_jobs=None,
                        train_sizes=numpy.linspace(.1, 1.0, 5)):
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))
    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")
    train_sizes, train_scores, test_scores = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes)
    train_scores_mean = numpy.mean(train_scores, axis=1)
    train_scores_std = numpy.std(train_scores, axis=1)
    test_scores_mean = numpy.mean(test_scores, axis=1)
    test_scores_std = numpy.std(test_scores, axis=1)
    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std,
                         alpha=0.1, color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std,
                         alpha=0.1, color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

plot_learning_curve(classifier, "XGBoost", x, y, ylim=(0.7, 1.01), cv=5)
plt.show()

Visualization

The performance curve for the trained model is shown in the image below.

Summary

Here, I have shown how an intent classification model can be trained and its performance visualized for classifying purchase intent. The code for this can be found here.
