What is spaCy?

Divya P · Published in featurepreneur · Dec 17, 2021 · 3 min read

In this article, we will learn what spaCy is, why it is needed, and walk through its basic features.

spaCy is an open-source Python library for Natural Language Processing (NLP), designed to process large volumes of text efficiently. Internally, spaCy stores strings as hash values in its vocabulary, which keeps memory usage low.
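As a rough sketch of this hashing, spaCy exposes a `StringStore` that maps strings to 64-bit hashes and back (the exact hash values are an implementation detail, and the word "coffee" here is just an illustration):

```python
from spacy.strings import StringStore

# a StringStore maps strings to 64-bit hash values and back
strings = StringStore(["coffee"])
coffee_hash = strings["coffee"]   # look up the hash for a string
print(coffee_hash)                # a large integer
print(strings[coffee_hash])       # look up the string for a hash -> "coffee"
```

This is the same mechanism behind `nlp.vocab.strings`, which is why two `Doc` objects from the same vocabulary can share one copy of each string.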

spaCy can recognize parts of speech such as nouns, verbs, and adjectives, and many text-classification models can be built on top of it. It also supports rule-based matching, where patterns over tokens play a role similar to regular expressions.

Before processing, the text must be turned into an NLP `Doc` object, which tokenizes it. Tokenization is the process of splitting text into small units (tokens), such as words and punctuation marks.
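A minimal sketch of tokenization, using a blank English pipeline (which contains only the tokenizer, so no model download is needed; the sentence is just an illustration):

```python
import spacy

# a blank pipeline contains only the tokenizer, no trained components
nlp = spacy.blank("en")
doc = nlp("Dr. Kalam was born in Rameswaram, India.")
print([token.text for token in doc])
```

Note that punctuation such as the comma and the final period become tokens of their own, while the abbreviation "Dr." stays intact thanks to tokenizer exceptions.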

1. Install the Spacy library using pip

pip install spacy
python -m spacy download en_core_web_sm

2. Load spacy

import spacy

nlp = spacy.load("en_core_web_sm")

“en” indicates the English language model, and “sm” indicates a small model. The model is available in various sizes.

3. Parsing

text = "A.P.J. Abdul Kalam (born October 15, 1931, Rameswaram, India—died July 27, 2015), Indian scientist and politician who played a leading role in the development of India."
doc = nlp(text)
[(token.text, token.dep_) for token in doc]

4. To find the parts of speech in the given text

Here we have defined only four parts of speech, so the text is classified based on these four tags alone.

from string import punctuation

result = []
grammar = ['PROPN', 'ADJ', 'NOUN', 'VERB']
doc = nlp(text.lower())
for token in doc:
    # skip stop words and punctuation
    if token.text in nlp.Defaults.stop_words or token.text in punctuation:
        continue
    if token.pos_ in grammar:
        result.append(token.text + "-" + token.pos_)
print(result)

5. Rule-based Matching

It matches words against token patterns; with the "LOWER" attribute, the match is case-insensitive.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "python"}]
matcher.add("Tech", [pattern])

Here the word "python" is the keyword; the pattern will identify Python, PYTHON, etc. The pattern is added to the previously defined matcher.

doc = nlp("Python is an interpreted high-level general-purpose programming language. PYTHON is dynamically-typed and garbage-collected.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # "Tech"
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

6. Merging

nlp = spacy.load("en_core_web_sm")
doc = nlp("Stock Market is volatile")
print("Before:", [token.text for token in doc])
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2], attrs={"LEMMA": "stock market"})
print("After:", [token.text for token in doc])

Here "Stock Market" is initially tokenized as two separate tokens, but after retokenizing it becomes a single token.

Thanks for reading. Hope you found it useful.
