Introduction to spaCy

Krati Agarwal
Published in Analytics Vidhya
Feb 9, 2020 · 4 min read

In today’s world, billions of emails and chat messages are generated every second, producing an enormous amount of text data that needs to be processed. The main difficulty in processing this text is that it arrives in an unstructured format.

Therefore, to process this kind of unstructured text, we turn to natural language processing.

What is Natural Language Processing?

Natural Language Processing (NLP) is the branch of computer science, machine learning, and artificial intelligence that deals with the analysis and synthesis of natural language exchanged between computers and humans. But how is text processed by computers so that it can be easily analyzed and understood?

Natural Language Processing has a large number of applications across different fields: sentiment analysis for identifying the sentiments expressed in a text, identifying the subject of a particular text, automatically assigning relevant advertisements, and powering chatbots and voice assistants that understand human speech and respond as soon as they receive input.

What is spaCy?

spaCy is a very popular open-source Python library for natural language processing. It is widely used to process large amounts of text data and to develop applications that can handle such data effectively.

spaCy has many features and much functionality related to text processing, like tokenization, part-of-speech tagging, dependency parsing, lemmatization, sentence boundary detection, named entity recognition, entity linking, similarity, text classification, rule-based matching, training, and serialization.

Tokenization: It is the process of segmenting text into words, punctuation marks, and other meaningful units called tokens.
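Tokenization can be tried with a blank pipeline, since no trained model is needed just to split text into tokens. A minimal sketch:

```python
import spacy

# A blank English pipeline is enough for tokenization;
# no trained model needs to be downloaded.
nlp = spacy.blank("en")

doc = nlp("Let's process some text with spaCy!")
tokens = [token.text for token in doc]
print(tokens)
# Note how the contraction "Let's" is split into "Let" and "'s".
```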

Part-of-speech tagging (POS): It is the process of determining the grammatical role of each word in a sentence and tagging that word with a label, such as noun or verb, that explains its function in the context of the sentence.

Dependency parsing: It helps in assigning syntactic dependency labels that describe the relations between individual tokens, such as subject or object.

Lemmatization: It is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, sleeping is mapped to sleep, and gone is mapped to go.

Sentence Boundary Detection: It helps in segmenting large chunks of text data into individual sentences.
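For simple cases, sentence boundaries can be detected without a trained model by adding spaCy's rule-based `sentencizer` component, which splits on sentence-final punctuation:

```python
import spacy

# A blank pipeline plus the rule-based sentencizer is enough
# to split on punctuation; no statistical model is required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("spaCy is fast. It is also easy to use. Try it out!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```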

Named entity recognition: It helps in labelling real-world objects mentioned in the text, like persons, countries, locations, etc.

Entity Linking: To ground the named entities into the “real world”, spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB).

Similarity: It helps in checking the similarity between two text documents by comparing their words and text spans.

Text Classification: It helps in assigning categories or labels to whole documents or to parts of documents.

Rule-based matching: Using rule-based matching, one can define a particular set of rules over token attributes and patterns and extract the spans of text that match them.
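Rule-based matching is available through spaCy's `Matcher` class and works on a blank pipeline. A minimal sketch matching the phrase "hello world" case-insensitively, with optional punctuation in between:

```python
import spacy
from spacy.matcher import Matcher

# No trained model needed; the Matcher works on token attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern: "hello", an optional punctuation token, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)
```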

Major features in spaCy

Some spaCy features work completely independently, but others require trained statistical models to predict linguistic annotations, for example, whether a word is a noun or a verb. spaCy provides statistical models for a large number of languages; these models differ in size, speed, accuracy, and training data. Each model has different capabilities and can be chosen according to the use case; general-purpose functionality can be built on the small, simple models.

The major components of these statistical models are binary weights, lexical entries, data files, word vectors, and configuration.
