NLP: All the Features. Every Feature That Can Be Extracted From the Text

Divish Dayal
Published in The Startup · Sep 27, 2020 · 5 min read

tl;dr: I will share all the NLP features you can extract from unstructured text for use in downstream tasks. I also list the Python libraries I prefer for computing these features.

Text data and its analysis have been one of the biggest trends and leaps of the last decade, especially its second half. This is evident in the advancement of algorithms used by the tech titans in FAANG, as well as by the hundreds of startups disrupting various industries at breakneck speed. A few of these industries have been completely transformed, while others are inevitably in the process. An example from personal experience: there is a huge disruption in the marketing and advertising sector, where the big agencies (like Dentsu, WPP, Publicis) are seeing a significant change in how they execute and deliver work.

I have spent the last 3 years working with a variety of textual datasets. Primarily, I have worked with advertisement datasets, for optimization tasks as well as generation from scratch.

When the sun shines brighter, you get to work with organized, ordered datasets; otherwise, it's all over the place, from all sorts of sources. [jk, working with the latter is exciting, though quite unpredictable, literally. You never know what patterns will emerge to surprise you.]

I will detail the shallow as well as deep features that one can use as signals for downstream tasks like classification, insights, visualization, and so forth.

Shallow Features

By shallow features, I mean simple features that are relatively easy and cheap to compute. They are also directly interpretable, unlike their more complex siblings discussed later.

text length counts

Features like character length and word length are commonly significant in text datasets. For documents or paragraph-like data points, you can also use the mean and standard deviation of sentence lengths as features.
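As a minimal sketch (the helper name and the naive period-based sentence split are my own placeholder choices, not a fixed recipe):

```python
import statistics

def length_features(text: str) -> dict:
    """Character, word, and sentence-length features for one data point."""
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    sent_lens = [len(s.split()) for s in sentences]
    return {
        "char_len": len(text),
        "word_len": len(words),
        "sent_len_mean": statistics.mean(sent_lens) if sent_lens else 0,
        "sent_len_std": statistics.pstdev(sent_lens) if sent_lens else 0,
    }
```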

non-dictionary word counts

Count or ratio of non-dictionary or OOV (out-of-vocabulary) words in the text. It can serve as a pseudo-feature for how formal the text is; Twitter-style social media datasets will have higher ratios than Wikipedia articles.
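A quick sketch of the ratio, assuming you have a `vocab` set of known dictionary words (loaded from a word list or a tokenizer's vocabulary):

```python
def oov_ratio(text: str, vocab: set) -> float:
    """Fraction of words not found in the given vocabulary."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(1 for w in words if w not in vocab) / len(words)
```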

readability metrics

Metrics like the Flesch-Kincaid readability test and SMOG can be used. Essentially, these are ratios of complex words (polysyllables) to the total number of words or sentences, which are hypothesized to signal how easily readable a piece of text is.
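For example, the textstat package implements both (one option among several, shown only as an illustration):

```python
import textstat

text = "The agency launched a bold new campaign last week."
readability = {
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    "smog_index": textstat.smog_index(text),  # SMOG is intended for longer texts
}
```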

unique word ratios

Ratio of the number of unique words to total words. This feature gives you a sense of word repetition in the data points. For a deeper analysis of this type of feature, you can look into TF-IDF models.
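It takes only a couple of lines to compute:

```python
def unique_word_ratio(text: str) -> float:
    """Unique words divided by total words; lower means more repetition."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0
```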

sentence types

Count or boolean features for the presence of questions, exclamations, and particular punctuation marks.
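A rough sketch of such features:

```python
def punctuation_features(text: str) -> dict:
    """Count and boolean features for questions and exclamations."""
    return {
        "question_count": text.count("?"),
        "exclamation_count": text.count("!"),
        "has_question": "?" in text,
        "has_exclamation": "!" in text,
    }
```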

emojis and hashtags

Count or boolean features for their presence in the text. The sentiment of these elements can also be used as a further feature on top of them.
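A rough sketch using regexes (the emoji ranges below are approximate; third-party packages give more complete coverage):

```python
import re

# Rough Unicode ranges covering most emoji; not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
HASHTAG_RE = re.compile(r"#\w+")

def social_features(text: str) -> dict:
    emojis = EMOJI_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    return {
        "emoji_count": len(emojis),
        "has_emoji": bool(emojis),
        "hashtag_count": len(hashtags),
        "has_hashtag": bool(hashtags),
    }
```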

Complex/Deep Features

Now let’s look at slightly more complex features. These typically require more compute; GPUs for large datasets (100K+ data points). They are also harder and more time-consuming to construct on your own, so you’ll end up using standard libraries. I will mention and compare the popular libraries for each of the following.

part of speech(pos)

If there’s a single feature I have used the most in my experience, it’s Part-of-Speech (POS) tags. They are incredibly powerful.

POS illustration as visualized on spaCy’s Documentation website.

A lot can be learned from which POS tags are present in a sentence. This can be extended to include dependency features, which are the relations between the words; together they form a dependency tree for each sentence.

Practically, I often use POS to extract the relevant tags (like NOUN, VERB, ADJ). Then you can perform operations on the corresponding word vectors (GloVe/word2vec), like similarity, topic modeling, or classification. I use spaCy for this task, which is unquestionably one of the best libraries for this and other NLP tasks.
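A minimal sketch with spaCy (assuming the small English model en_core_web_sm is installed; the example sentence is arbitrary):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The agency launched a bold new campaign last week.")

# Keep only the content words for downstream word-vector operations.
content_words = [tok.text for tok in doc if tok.pos_ in {"NOUN", "VERB", "ADJ"}]

# Full POS and dependency view, useful for eyeballing patterns.
tags = [(tok.text, tok.pos_, tok.dep_) for tok in doc]
```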

To use POS, often the best way is to generate POS tags for the sentences and then try to find patterns in the dataset yourself.

Manually going through the dataset to find patterns is the most important and often under-rated task in the field.

named entity recognition(ner)

Another close sibling of POS tagging is NER tagging. Again, spaCy is my choice for it. NER identifies named entities like person names, organizations, locations, quantities, monetary values, percentages, etc. This is also a good feature for filtering my datasets in text generation tasks, as I don’t want the model to learn to output some of these proper-noun entities.
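A minimal sketch, again with spaCy (the entity labels used for filtering are just an example):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dentsu acquired a 20% stake for $50 million in March.")

entities = [(ent.text, ent.label_) for ent in doc.ents]

# Filtering example: flag data points that mention a person or organization.
mentions_proper_entity = any(ent.label_ in {"PERSON", "ORG"} for ent in doc.ents)
```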

sentiment analysis

Sentiment and emotion are important underlying features of any text. The sentiment of a sentence usually translates strongly to the target variable in common classification tasks. Sentiment also makes a great post-processing filter for most industrial applications; you usually don’t want to show negative-sentiment sentences to a client.

Which library to use for sentiment analysis?

There are many libraries, models, and datasets that you can use to compute it. I usually find myself choosing one of two: TextBlob and Flair. My decision depends on a simple question: how important is this feature for my task? TextBlob is a simple sentiment analyzer that uses a dictionary-based model to output two metrics, polarity (sentiment) and subjectivity. It runs on CPU and is very fast because of this simpler modeling. Where it falls short is on more complex sentences like “the restaurant was great, but not enough”.

On the other hand, Flair uses a deep-learning-based model to compute sentiment; its recent releases use a DistilBERT model for the task. It’s reasonably accurate, though it’s quite heavy and slower to compute. You will need a GPU for any reasonable dataset size.
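A rough side-by-side of the two (the Flair model identifier "en-sentiment" may vary with the library version):

```python
from textblob import TextBlob
from flair.data import Sentence
from flair.models import TextClassifier

text = "the restaurant was great, but not enough"

# TextBlob: fast, dictionary-based; polarity in [-1, 1] plus subjectivity.
blob = TextBlob(text)
polarity, subjectivity = blob.sentiment.polarity, blob.sentiment.subjectivity

# Flair: deep-learning based; heavier, but handles harder sentences better.
classifier = TextClassifier.load("en-sentiment")
sentence = Sentence(text)
classifier.predict(sentence)
label = sentence.labels[0]  # POSITIVE/NEGATIVE label with a confidence score
```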

Sentence Vectors

Just as with word vectors, you can represent sentences or paragraphs with vectors in a corresponding latent space. I use this technique quite often, especially in unsupervised tasks for clustering and similarity/diversity-based selection. The naive way to get a sentence vector is to average the non-stopword word vectors in the sentence, but it loses relevance quickly as the sentence gets longer. My go-to library for this is Sentence Transformers. It is a BERT-based model fine-tuned in a siamese-network setting to optimize for similarity-based tasks on sentence vectors, and it’s faster than plain BERT. For classification tasks, the sentence vectors can be used as input to other classifiers, say random forests, after passing through a dimensionality-reduction algorithm (like PCA).
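A minimal sketch with sentence-transformers (the checkpoint name below is just one pretrained option, and the sentences are arbitrary):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

sentences = [
    "We optimize ad copy at scale.",
    "Our campaigns run across every channel.",
    "The quarterly report is due on Friday.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)  # shape: (n_sentences, dim)

# Optional: reduce dimensionality before feeding a downstream classifier.
reduced = PCA(n_components=2).fit_transform(embeddings)
```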

These are all the commonly used and easily available features for textual datasets. I would love to expand this list if you can add any in the comments section. Leave any feedback here or in DMs. All the best for hacking datasets!


Divish Dayal

Fullstack ML engineer | I work on AI pipelines in the advertisements industry to generate ads, predict performance, and optimize.