Learn how to use spaCy for Natural Language Processing

Suraj Desai · Published in Analytics Vidhya · Oct 11, 2019

Why we need NLP:

Natural Language Processing (NLP) is a field of machine learning concerned with a computer's ability to understand, analyze, manipulate, and potentially generate human language. Under the hood, machine learning algorithms are nothing but a bunch of math operations, so if we pass raw words or sentences to a machine learning model, it has no idea what to do with them. What we need to do is convert them into vectors that ML models can understand and operate on. This is where NLP libraries like spaCy and NLTK come into the picture.

Why spaCy?

spaCy provides a one-stop shop for tasks commonly used in any NLP project, including:

  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Dependency parsing
  • Word-to-vector transformations
  • Many conventional methods for cleaning and normalizing text

In this article, we are going to focus only on Word-to-vector transformations.

Let’s get started:

But first things first: we need to install spaCy, and pip handles that for us. Just two commands to execute and we are good to go.

pip install spacy
python -m spacy download en_vectors_web_lg

The first command installs spaCy, and the second downloads a spaCy model that comes with built-in word vectors.
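To verify that both steps worked, you can load the freshly downloaded model and inspect its vector table; this is just a quick sanity check, assuming the download completed successfully:

import spacy

>>nlp = spacy.load("en_vectors_web_lg")
>>print(nlp.vocab.vectors.shape)
# (number of words in the vector table, vector dimension)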

As we are done installing spaCy, let's download a dataset containing tweets from Twitter using the link below.

Now that you have downloaded the dataset, let's load the data using pandas.

import pandas as pd

>>data = pd.read_csv("train.csv")
>>tweets = data.tweet[:100]

The tweets variable holds only the tweet column, and for our learning purposes we keep just the top 100 tweets. Let's have a look at the first five.

>>tweets.head().tolist()

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

Before we use spaCy, we need to clean the data so that we are left with meaningful words we can make sense of. Since our main focus is word-to-vector transformation with spaCy, I won't spend time discussing how to clean data; I'll just paste the cleaning code and skip the explanation, assuming you already know how to clean text.

""" Cleaning Tweets """
tweets = tweets.str.lower()
# removing special characters and numbers
tweets = tweets.apply(lambda x : re.sub("[^a-z\s]","",x) )
# removing stopwords
from nltk.corpus import stopwords
stopwords = set(stopwords.words("english"))
tweets = tweets.apply(lambda x : " ".join(word for word in x.split() if word not in stopwords ))
>>tweets.head().tolist()

['user father dysfunctional selfish drags kids dysfunction run',
 'user user thanks lyft credit cant use cause dont offer wheelchair vans pdx disapointed getthanked',
 'bihday majesty',
 'model love u take u time ur',
 'factsguide society motivation']

Now that the tweets are clean, let's jump right into spaCy.

Creating Tokens using spaCy:

Creating tokens using spaCy is a piece of cake.

import spacy
import en_vectors_web_lg

>>nlp = en_vectors_web_lg.load()
>>document = nlp(tweets[0])
>>print("Document : ", document)
>>print("Tokens : ")
>>for token in document:
    print(token.text)

Document :  user father dysfunctional selfish drags kids dysfunction run
Tokens :
user
father
dysfunctional
selfish
drags
kids
dysfunction
run

en_vectors_web_lg.load() loads the spaCy model and stores it in the nlp variable. This model ships with pretrained 300-dimensional word vectors covering over a million unique tokens.

In nlp(string), we pass in the text, which is converted into a "spacy.tokens.doc.Doc" object and stored in the variable document. When we print it, it looks like a plain string, but don't be misled: it is actually a spaCy object that can be iterated, and iterating over it yields tokens.
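A handy detail while we are here: each token knows whether the loaded model actually has a vector for it, which makes it easy to spot out-of-vocabulary words. A minimal sketch, reusing the nlp object from above (the second word is deliberately gibberish):

>>for token in nlp("father qwertyuiopp"):
    print(token.text, token.has_vector)
# words missing from the vector table report False and fall back to an all-zeros vector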

Token-to-vector:

The road from token to vector is also easy. Let me show you with the help of code:

>>document = nlp(tweets[0])
>>print(document)
>>for token in document:
    print(token.text, token.vector.shape)

user (300,)
father (300,)
dysfunctional (300,)
selfish (300,)
drags (300,)
kids (300,)
dysfunction (300,)
run (300,)

"token.vector" returns a vector of shape (300,), i.e., 300 numbers per word. The code above produced a vector for every single word of a single sentence/document. But what if we have 100 such sentences/documents in our corpus? Are we going to iterate over every sentence, create a vector for every word, and combine them by hand? No, that's the wrong way to do it. What we can do instead is use nlp.pipe().
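In fact, there is no need to combine token vectors manually at all: a spaCy Doc already exposes a document-level vector, which by default is the average of its token vectors. A quick check of that, reusing the document object from above:

import numpy as np

>>token_mean = np.mean([token.vector for token in document], axis=0)
>>print(np.allclose(token_mean, document.vector))
# should print True, since doc.vector defaults to the mean of the token vectors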

Sentence-to-vector using pipe:

nlp.pipe() processes texts as a stream and buffers them in batches, instead of one by one, converting each text into a spaCy Doc object. This is usually much more efficient. Then, instead of iterating over each token of each document, we can iterate over the documents themselves and get one vector per document instead of per word. Isn't that impressive? Well, I find it so.

>>document = nlp.pipe(tweets)
>>tweets_vector = np.array([tweet.vector for tweet in document])
>>print(tweets_vector.shape)
(100, 300)
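For larger corpora, nlp.pipe() also accepts a batch_size argument that controls how many texts are buffered at a time (the value below is just an illustrative choice):

>>document = nlp.pipe(tweets, batch_size=50)

Note that nlp.pipe() returns a generator, so it can only be iterated once; rebuild it if you need a second pass.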

And there we have it: vectors for 100 tweets, 300 dimensions each. Now we can use these vectors to train a simple model such as logistic regression or an SVM to detect whether a tweet is racist or not. And not only simple models: we can also use these vectors to train neural networks.

Logistic Regression Model:

Now that we have cleaned our tweets and converted them into vectors, we can use those vectors to predict whether a tweet is racist or not. The label column in the dataset has values 0 and 1: 0 means the tweet is not racist, and 1 means it is. Note that for modelling purposes I have used the whole dataset, not just the top 100 tweets. So let's predict by building a simple model.
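One consequence of using the whole dataset: the cleaning and nlp.pipe() steps from earlier have to be rerun over every tweet, not just the first 100. A sketch of that rerun, reusing the cleaning code from above:

>>tweets = data.tweet  # all tweets this time, not just the first 100
# ... apply the same cleaning block as earlier to this full series ...
>>tweets_vector = np.array([tweet.vector for tweet in nlp.pipe(tweets)])
>>print(tweets_vector.shape)
# (number of tweets in train.csv, 300)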

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = tweets_vector
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

model = LogisticRegression(C=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy on test data is : %0.2f" % (accuracy_score(y_test, y_pred) * 100))
y_train_pred = model.predict(X_train)
print("Accuracy on train data is : %0.2f" % (accuracy_score(y_train, y_train_pred) * 100))
-> Accuracy on test data is : 94.49
-> Accuracy on train data is : 94.50

Well, now that you are familiar with spaCy, you can play around with it and explore more in spaCy's documentation, as we have only scratched the surface here. Things like part-of-speech tagging and dependency parsing, listed earlier, as well as named entity recognition, can also be done using spaCy, as the sketch below hints.
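For instance, here is a minimal taste of part-of-speech tagging and named entity recognition. It assumes a full pipeline model such as en_core_web_sm has been downloaded (en_vectors_web_lg only supplies word vectors, not these components), and the input sentence is just a made-up example:

import spacy

# requires: python -m spacy download en_core_web_sm
nlp_full = spacy.load("en_core_web_sm")
doc = nlp_full("Apple is looking at buying a London startup")
for token in doc:
    print(token.text, token.pos_, token.lemma_)  # word, part of speech, lemma
for ent in doc.ents:
    print(ent.text, ent.label_)  # named entity and its type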

Thank You.
