Text Classification using Apache Spark MLlib | Towards AI
Multi-Class Text Classification Using PySpark, MLlib & Doc2Vec
How to use Apache Spark MLlib with PySpark for NLP problems and how to simulate Doc2Vec in Spark MLlib
Apache Spark is nowadays a popular choice for scaling up data processing applications. For machine learning, it provides a library called ‘MLlib’, which takes a distributed programming approach to solving ML problems. In this article, we will see how to use MLlib with PySpark, and how to simulate Doc2Vec in Spark MLlib for solving text classification problems.
Before going ahead, we need to know what ‘Doc2Vec’ is. It is an NLP model that describes a text or document: it converts a text into a vector of numerical features that can be used in any ML algorithm. Basically, it is a feature engineering technique. It tries to capture the context of documents by sampling words from them and training a neural network on those samples; the hidden-layer vectors of that neural network become the document vectors, a.k.a. ‘Doc2Vec’. There is another technique called ‘Word2Vec’ which works on similar principles, but instead of documents/texts it works on a word corpus and provides vectors for individual words. For more details on ‘Doc2Vec’ & ‘Word2Vec’, the following resources will be helpful:
- Doc2Vec Tutorial by Rare Technologies
- Distributed Representations of Sentences & Documents
- Beginners Guide to Word2Vec
Setting up PySpark
A local PySpark setup is required for this article. We will use PySpark 2.4.3 and Python 2.7 for compatibility reasons, and will allocate sufficient memory to the application. We can achieve this like below:
Printing the `spark` object confirms that the session is up and running.
Data Exploration & Problem formulation
We will use the ‘Sentence Classification’ data set from the UCI Machine Learning Repository. It contains a total of 3,297 labeled sentences spread across different files, and each sentence is assigned a specific category. So, clearly, it is a typical text classification problem.
Let’s first see what’s there in the dataset!
Total number of records
As we can see, the dataset contains some unnecessary lines like ‘### abstract ###’ & ‘### introduction ###’. These are nothing but comments in those files. The dataset is also not yet divided into separate ‘label’ & ‘content’ columns, which is the usual layout for classification problems. So, it has to be cleaned & divided into proper columns for further processing.
Let’s see how we can do that.
input_df = input_rdd.toDF()
input_df.show()
‘_1’ is the label and ‘_2’ is the actual text for our problem. Now we can use this dataset for further processing and actual problem-solving.
Let’s see how many different categories there are for the sentences, i.e. how many distinct values column ‘_1’ has.
So, there are a total of 5 different classes/categories, and this is a 5-class text classification problem.
Basic Text Cleaning
Before jumping into ‘Doc2Vec’ processing, basic text cleaning is necessary. A typical text cleaning involves the following steps:
- Conversion to lowercase
- Removal of punctuations
- Removal of integers, numbers
- Removal of extra spaces
- Removal of tags (like <html>, <p> etc)
- Removal of stop words (like ‘and’, ‘to’, ‘the’ etc)
- Stemming (Conversion of words to root form)
We will use Python ‘gensim’ library for all text cleaning.
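The article relies on gensim’s preprocessing filters; the dependency-free sketch below mirrors the listed steps with plain Python so the idea is visible (the stop-word list is a tiny illustrative subset of gensim’s, and the stemming step is omitted for brevity — gensim’s `stem_text` applies a Porter stemmer):

```python
import re

# Tiny illustrative stop-word list; gensim ships a much larger one
STOPWORDS = {"and", "to", "the", "of", "a", "an", "in", "is", "it"}

def clean_text(text):
    """Apply the cleaning steps listed above (stemming omitted)."""
    text = text.lower()                      # conversion to lowercase
    text = re.sub(r"<[^>]+>", " ", text)     # removal of tags like <html>, <p>
    text = re.sub(r"[^a-z\s]", " ", text)    # removal of punctuation and numbers
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                   # also collapses extra spaces

print(clean_text("The <p>Quick, Brown Fox!</p> jumped over 2 lazy dogs."))
# -> quick brown fox jumped over lazy dogs
```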
Let’s see the content of a particular sentence and how this ‘clean_text’ function works on it.
Though the ‘cleaned’ sentence is no longer grammatically correct, it still holds the context, which is essential for ‘Doc2Vec’ processing.
Now, let’s see how we can use this function in PySpark to clean all of the sentences.
cleaned_rdd = input_rdd.map(lambda x: clean_text(x))
cleaned_df = cleaned_rdd.toDF()
cleaned_df.show()
Machine Learning Pipeline
Now, it’s time to do the actual work. As of now, Apache Spark does not provide any API for ‘Doc2Vec’, but it does provide a ‘Word2Vec’ transformer, which is based on the ‘Skip-Gram’ approach. As per the Apache Spark documentation:
Word2VecModel transforms each document into a vector using the average of all words in the document
Let’s say, for our use case, one sentence has 5 words, and a typical ‘Word2Vec’ converts each word into a feature vector of size 100. In this case, the ‘Doc2Vec’ representation will be the average of these five 100-length vectors, and its length will also be 100. This is a simplified ‘average-out’ scheme of the ‘Doc2Vec’ model. We will use this averaging ‘Word2Vec’ of Apache Spark as our ‘Doc2Vec’ model.
Our Machine Learning pipeline will consist of two stages
- A Tokenizer
- A ‘Word2Vec’ model
We will use Apache Spark Pipeline API for this.
Let’s print the ‘Doc2Vec’ contents.
The ‘features’ column contains the actual ‘Doc2Vec’ dense vectors. We have used ‘Doc2Vec’ vectors of size 300; generally, the preferred size is kept between 100 and 300.
Model Training & Evaluation
The next step is to feed these ‘Doc2Vec’ features into a classifier model. We will try the RandomForest & LogisticRegression models.
We need to split the data into training & test sets and evaluate the model accuracy.
w2v_train_df, w2v_test_df = doc2vecs_df.randomSplit([0.8, 0.2])
Spark MLlib does not understand typical categorical variables, so our class labels (column ‘_1’) have to be converted into indices. The ‘StringIndexer’ API does that for us.
Here also, we have to build a pipeline with the following stages:
- StringIndexer (input = ‘_1’, output = ‘label’)
- RandomForest Classifier (label column = ‘label’, features column = ‘features’. This ‘features’ is coming from ‘Doc2Vec’ transformation)
accuracy = rf_model_evaluator.evaluate(rf_predictions)
print("Accuracy = %g" % (accuracy))
For LogisticRegression, the same kind of pipeline stages are applicable.
accuracy = lr_model_evaluator.evaluate(lr_predictions)
print("Accuracy = %g" % (accuracy))
We can see that, accuracy-wise, both classifiers perform more or less the same. Accuracy can be improved further by hyperparameter tuning or by changing the classifier, but that is out of the scope of this article; readers can try it on their own.
This brings us to the end. We learned how to use Spark MLlib with PySpark, how to simulate Doc2Vec, and how to build ML pipelines.
The Jupyter Notebook for this article can be found on GitHub. Any feedback will be highly appreciated.