Text Classification in Python: Pipelines, NLP, NLTK, Tf-Idf, XGBoost and more
In this first article about text classification in Python, I’ll go over the basics of setting up a pipeline for natural language processing and text classification. I’ll focus mostly on the most challenging parts I faced and give a general framework for building your own classifier.
The problem is simple: we take training data in the form of paragraphs of text, each labeled 1 or 0. For more background, I was working with corporate SEC filings, trying to identify whether a filing would result in a stock price hike or not. It’s very similar to sentiment analysis, except that we have only two classes: Positive and Neutral (which also includes Negative).
As an additional example, we add a feature to the text which is the number of words, just in case the length of a filing has an impact on our results — but it’s more to demonstrate using a FeatureUnion in the Pipeline. Skipping over loading the data (you can use CSVs, text files, or pickled information), we extract the training and test sets for Pandas data:
X = df[['Text', 'TotalWords']]
Y = df['Label']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
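If your data does not already have a word-count column, it can be derived from the text. The sketch below uses plain Python lists instead of Pandas so it runs without any dependencies; the sample sentences, labels and the `split_rows` helper are made up for illustration. It mirrors the 75/25 split above, with a fixed seed so the split is reproducible.

```python
import random

# Made-up sample rows; in the article these live in a Pandas DataFrame.
rows = [
    {"Text": "Quarterly revenue rose sharply on strong demand", "Label": 1},
    {"Text": "The company filed its annual report", "Label": 0},
    {"Text": "Net income doubled year over year", "Label": 1},
    {"Text": "No material changes were reported", "Label": 0},
    {"Text": "The board approved a share buyback", "Label": 1},
    {"Text": "Routine disclosure of executive compensation", "Label": 0},
    {"Text": "Auditor change announced", "Label": 0},
    {"Text": "Guidance was left unchanged", "Label": 0},
]

# Derive the word-count feature from the raw text.
for row in rows:
    row["TotalWords"] = len(row["Text"].split())


def split_rows(rows, test_size=0.25, seed=42):
    """Shuffle a copy of the rows and hold out the first `test_size` fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(rows)
print(len(train), len(test))
```

With scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of reproducibility.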
Building a pipeline
While you can do all the processing sequentially, the more elegant way is to build a pipeline that includes all the transformers and estimators. I’ll post the pipeline definition first, and then I’ll go into step-by-step details:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
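from xgboost import XGBClassifier
classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('Text')),
            ('tfidf', TfidfVectorizer(tokenizer=Tokenizer, stop_words=stop_words,
                     min_df=.0025, max_df=0.25, ngram_range=(1,3))),
            ('svd', TruncatedSVD(algorithm='randomized', n_components=300)), #for XGB
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('TotalWords')),
            ('wscaler', StandardScaler()),
        ])),
    ])),
    ('clf', XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.1)),
    # ('clf', RandomForestClassifier()),
])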
The reason we use a FeatureUnion is to allow us to combine different Pipelines that run on different features of the training data. Incorporating it into the main pipeline can be a bit finicky, but once you build your first one you’ll get the hang of it.
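To make the idea concrete, here is a toy imitation of what a feature union does: it runs every branch on the same input and glues the resulting columns together. This is only a sketch in plain Python, not scikit-learn's implementation, and `FieldPicker`, `CharCount` and `TinyUnion` are hypothetical names invented for the example.

```python
class FieldPicker:
    """Hypothetical selector: returns one field per row as a single column."""

    def __init__(self, field):
        self.field = field

    def fit(self, rows, y=None):
        return self  # nothing to learn

    def transform(self, rows):
        return [[row[self.field]] for row in rows]


class CharCount(FieldPicker):
    """Hypothetical text branch: replaces the selected text by its length."""

    def transform(self, rows):
        return [[len(row[self.field])] for row in rows]


class TinyUnion:
    """Toy union: fit each branch, then concatenate their output columns."""

    def __init__(self, branches):
        self.branches = branches

    def fit(self, rows, y=None):
        for _, transformer in self.branches:
            transformer.fit(rows, y)
        return self

    def transform(self, rows):
        blocks = [transformer.transform(rows) for _, transformer in self.branches]
        # Glue the per-branch columns together, row by row.
        return [sum(parts, []) for parts in zip(*blocks)]


rows = [
    {"Text": "revenue rose sharply", "TotalWords": 3},
    {"Text": "no change", "TotalWords": 2},
]
union = TinyUnion([("text", CharCount("Text")), ("words", FieldPicker("TotalWords"))])
features = union.fit(rows).transform(rows)
print(features)
```

Each row ends up with one column per branch, which is the shape the final classifier receives.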
Each feature pipeline starts with a transformer which selects that specific feature. You can build quite complex transformers, but in this case we only need to select a feature. Transformers only need to implement fit and transform methods. Here are the ones I use to extract columns of data (note that they’re different for text and numeric data):
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X[self.field]  # 1-D Series of strings for the vectorizer

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[[self.field]]  # 2-D DataFrame, as the scaler expects
We process the numeric columns with the StandardScaler, which standardizes the data by removing the mean and scaling to unit variance. This is a common requirement of machine learning classifiers. Most of them wouldn’t behave as expected if the individual features do not more or less look like standard normally distributed data.
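As a rough illustration of what standardizing means, the following sketch applies the same formula, (value - mean) / standard deviation, to a handful of made-up word counts using only the standard library. It uses the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

total_words = [120, 340, 560, 980, 1500]  # made-up word counts

mean = fmean(total_words)
std = pstdev(total_words)  # population standard deviation (divides by n)

# Subtract the mean and divide by the standard deviation.
scaled = [(value - mean) / std for value in total_words]
print([round(z, 2) for z in scaled])
```

After scaling, the values have mean 0 and unit variance, so a long filing shows up as a positive number and a short one as a negative number.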
Vectorizing text with the Tfidf-Vectorizer
The text processing is the more complex task, since that’s where most of the data we’re interested in resides. You can read tons of information on text pre-processing and analysis, and there are many ways of classifying it, but in this case we use one of the most popular text transformers, the TfidfVectorizer.
Compared to a CountVectorizer, which just counts the number of occurrences of each word, Tf-Idf takes into account the frequency of a word in a document, weighted by how rare it is across the entire corpus. Common words like “the” or “that” will have high term frequencies, but their inverse document frequency (total documents divided by the number of documents containing the word) is 1, because they appear in every document, and since Tf-Idf takes the log of that ratio, the weight becomes 0 (log 1 = 0). By comparison, if one document contains the word “soccer”, and it’s the only document on that topic out of a set of 100 documents, then the inverse document frequency will be 100, so its Tf-Idf value will be boosted, signifying that the document is uniquely related to the topic of “soccer”. The TfidfVectorizer in sklearn returns a matrix with the tf-idf of each term in each document, with higher values for terms that are specific to that document and low values for terms that appear throughout the corpus. (By default sklearn adds 1 to the idf, so those ubiquitous terms get a small weight rather than exactly 0.)
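The arithmetic for the “the” versus “soccer” example can be checked with a few lines of Python. This is a sketch with made-up document frequencies; `idf` is the textbook formula, and `smoothed_idf` is the variant the scikit-learn documentation gives for its default `smooth_idf=True` setting.

```python
import math

n_documents = 100
# Made-up document frequencies: how many of the 100 documents contain each word.
document_frequency = {"the": 100, "soccer": 1}


def idf(word):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(n_documents / document_frequency[word])


def smoothed_idf(word):
    """Smoothed variant: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency[word])) + 1


for word in document_frequency:
    print(word, round(idf(word), 3), round(smoothed_idf(word), 3))
```

In both variants the rare word gets a much larger weight than the ubiquitous one; the smoothed version just never drops a term all the way to zero.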
You can play with the parameters, use GridSearch or other hyperparameter optimizers, but that would be the topic of another article. What the current parameters mean is: we select n-grams in the (1,3) range, meaning individual words, bigrams and trigrams; we keep only the n-grams whose document frequency (the share of documents in the corpus that contain them) is between .0025 and .25; and we use a custom tokenizer, which extracts only number-and-letter-based words and applies a stemmer. A stemmer reduces inflectional forms and derivationally related forms of a word to a common base form, so it reduces the feature space. For example, the Porter Stemmer we use here would reduce “saying”, “say” or “says” to just “say” (irregular forms such as “said” are left as they are, since the stemmer only strips suffixes). The resulting tokenizer is this:
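import re
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
def Tokenizer(str_input):
    # keep letters, digits and hyphens, then lowercase and split on whitespace
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words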
This is actually the only instance of using the NLTK library, a powerful natural language toolkit for Python. You can read the basics of what you can do with it, starting with installation instructions, from this comprehensive NLTK guide.
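To see what the ngram_range and min_df/max_df settings do, here is a small dependency-free sketch. The four documents and the 0.3 to 0.6 bounds are made up to suit such a tiny corpus, and the `ngrams` helper is a hypothetical stand-in for what the vectorizer does internally.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


docs = [
    "profit rose this quarter",
    "profit fell this quarter",
    "the board approved a dividend",
    "profit rose again",
]

# Document frequency: the share of documents that contain each n-gram.
doc_ngrams = [set(ngrams(doc.split())) for doc in docs]
vocabulary = set().union(*doc_ngrams)
doc_freq = {
    term: sum(term in grams for grams in doc_ngrams) / len(docs)
    for term in vocabulary
}

min_df, max_df = 0.3, 0.6  # made-up bounds for this tiny corpus
kept = sorted(term for term, share in doc_freq.items() if min_df <= share <= max_df)
print(kept)
```

Terms that appear in almost every document (“profit”) and terms that appear only once are both dropped, which is the same pruning the real parameters apply at a much larger scale.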
After vectorizing the text, if we use the XGBoost classifier we need to add the TruncatedSVD transformer to the pipeline. Its role is to perform linear dimensionality reduction by means of truncated singular value decomposition (SVD). Applied to the tf-idf matrices generated by sklearn, this is known as latent semantic analysis (LSA). That is beyond the scope of this article, but keep in mind that I needed it for XGBoost to work in my setup, where the sparse tf-idf matrix was not accepted. Recent XGBoost releases do accept SciPy sparse matrices, so it is worth testing whether you still need this step. For other classifiers you can just comment it out.
And now we’re at the final and most important step of the processing pipeline: the main classifier. In this example, we use XGBoost, one of the most powerful available classifiers, made famous by its long string of Kaggle competition wins. You can try other ones too, which will probably do almost as well, so feel free to play with several of them. In my experience and trials, RandomForestClassifier and LinearSVC had the best results among the other classifiers.
XGBoost stands for eXtreme Gradient Boosting and is an implementation of gradient boosting machines that pushes the limits of computing power for boosted tree algorithms, as it was built for the sole purpose of model performance and computational speed. Specifically, it was engineered to exploit every bit of memory and hardware resources for the boosting. XGBoost offers several advanced features for model tuning, computing environments and algorithm enhancement. It is capable of performing the three main forms of gradient boosting (Gradient Boosting (GB), Stochastic GB and Regularized GB) and it is robust enough to support fine tuning and the addition of regularization parameters.
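For intuition about what “boosting” means, the toy sketch below fits a sequence of one-split trees (stumps) to the residuals of the previous rounds, using squared error on a made-up one-dimensional dataset. It is a bare-bones illustration of plain gradient boosting, not XGBoost’s algorithm, and `fit_stump` and `boost` are hypothetical helper names; the `n_estimators` and `learning_rate` arguments play the same role as in the pipeline above.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Additive model: start from the mean, then add shrunken stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up feature values
ys = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up targets
model = boost(xs, ys)
print(round(model(1), 3), round(model(8), 3))
```

Each round corrects a fraction of the remaining error, which is why a smaller learning rate usually needs more estimators.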
Installing XGBoost on Windows can be quite challenging, and most solutions I found online didn’t work. The only thing that worked, and it’s quite simple, is to download the appropriate .whl file for your environment from here, and then run pip with that wheel from the download folder, like:
pip install xgboost-0.71-cp27-cp27m-win_amd64.whl
Now all you have to do is fit the classifier on the training data and start making predictions! Here’s how to fit the pipeline and predict on the test data:
classifier.fit(X_train, y_train)
preds = classifier.predict(X_test)
Analyzing the results
Analyzing a classifier’s performance is a complex statistical task but here I want to focus on some of the most common metrics used to quickly evaluate the results.
Most programmers, when they evaluate a machine learning algorithm, use the total accuracy score, which shows what share of the predictions were correct. This is very good, and most of your programming work will be to engineer the features, process the data, and tune the parameters to increase that number. But sometimes, that might not be the best measure. Let’s take this particular case, where we are classifying financial documents to determine whether the stock will spike (so we decide to buy), or not. For this reason, we’re interested in the positive predictions (where the algorithm predicts 1). A common way to visualize this is the confusion matrix. Let’s take one early example, before the algorithm was fine-tuned:
[[217  24]
 [ 72  32]]
On the first line, we have the number of documents labeled 0 (neutral), while the second line has positive (1) documents. Now the columns: First one has the 0 predictions and the second one has the documents classified as 1. So what the numbers above mean is:
[[ true_negatives false_positives]
[ false_negatives true_positives]]
So in our case, the false positives hurt us, because we buy stock but it doesn’t create a gain for us. That’s why we want as many true positives as possible relative to false positives, which is measured as tp / (tp + fp) and is called precision. Therefore, the precision of the 1 class is our main measure of success. In this example, that is over 50%, which is good because it means we’ll make more good trades than bad ones. False negatives, on the other hand, mean missed opportunities for us. They don’t hurt us directly because we don’t lose money; we just don’t make it. The share of actual positives that we catch, tp / (tp + fn), is called recall. We’d want to maximize it as well, but it’s not as important as the precision. To sum up all these numbers, sklearn offers us a classification report:
             precision    recall  f1-score   support

          0       0.75      0.90      0.82       241
          1       0.57      0.31      0.40       104

avg / total       0.70      0.72      0.69       345
This confirms our calculations based on the confusion matrix. We get 57% precision (pretty good for starters!) and 31% recall (we miss most of the opportunities). The code to display the metrics is:
from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
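As a sanity check, the headline numbers can be recomputed by hand from the four confusion-matrix counts. In this sketch the bottom row (72 and 32) is taken from the matrix shown earlier, while the top row is inferred from the classification report: class 0 has a support of 241, and a precision of 0.57 for class 1 with 32 true positives implies 24 false positives.

```python
# Confusion-matrix counts; the top row is inferred from the report (see above).
tn, fp = 217, 24
fn, tp = 72, 32

precision = tp / (tp + fp)  # how many of our "buy" calls were right
recall = tp / (tp + fn)     # how many real opportunities we caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The results line up with the class 1 row of the report and with the overall accuracy of about 72%.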
That concludes our introduction to text classification with Python, NLTK, Sklearn and XGBoost. In future stories we’ll examine ways to improve our algorithm, tune the hyperparameters, enhance the text features and maybe try some auto-ML (yes, automating the automation).
Chris Fotache is an AI researcher with CYNET.ai based in New Jersey. He covers topics related to artificial intelligence in our life, Python programming, machine learning, computer vision, natural language processing and more.