Natural Language Processing (NLP) with PySpark

Salvapathi Naidu Manapaka
3 min read · Nov 5, 2023


“PySpark is the best choice for handling large datasets, thanks to its powerful capabilities that combine the excellence of Apache Spark with the ease and expressiveness of Python.”

Why Spark? Wait, let me explain:

“PySpark is the bridge that transforms big data into actionable insights, offering the power of Apache Spark with the ease and expressiveness of Python.” In simple words, it is fast at everything: data pre-processing, cleaning, loading, extracting, and transforming. It provides an API that allows multiple CPUs to work on different parts of a task simultaneously.

Why not Databricks?

Because PySpark is an open-source distributed computing framework that supports multiple programming languages.

Why Python?

Easy to learn: Python is comparatively easier for programmers to learn because of its syntax and standard libraries. Moreover, it is a dynamically typed language, which means RDDs can hold objects of multiple types.

It works up to 100x faster, like Lightning McQueen.

Let's install and use it:

# create your conda environment in the Anaconda prompt
# remember to install Python
conda create -n salvaenv
conda activate salvaenv
conda install pyspark  # install pyspark

# to test whether PySpark is installed
import pyspark  # your PySpark is ready

# to use PySpark in your code editor, we need to import SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkExample").getOrCreate()  # we need to start the Spark session

# loading the dataset
df = spark.read.csv("your file name", header=True, inferSchema=True)
# header=True: makes the first row the column headings
# inferSchema=True: infers each column's data type from the data
df.show()  # to view the data

# let's get the length of the text for each row
from pyspark.sql.functions import length
df = df.withColumn("length", length(df["text"]))  # the length function acts like Python's len

SMS spam classification

Predicting whether a given SMS is spam or not spam (ham). Let's do some basic data analysis by finding the length of the ham and spam messages.

This shows that spam messages tend to be longer than ham messages.

PRE-PROCESSING: it removes meaningless words (which reduces memory use), tokenizes the text, and converts it into numeric form so that the machine can understand it.

from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, StringIndexer

All the preprocessing and vector assembly is done.

In PySpark, the StringIndexer is a feature transformer used for encoding categorical (string) columns into numerical values.

VectorAssembler is a feature transformer that is commonly used for feature engineering in machine learning pipelines. Its primary purpose is to combine multiple feature columns into a single vector column.

In short: just convert everything to numbers.

Let's build the model:

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()  # Naive Bayes performs well on this kind of binary text classification

The Pipeline code ..
The Pipeline Structure

Train and evaluate the model 😊

Conclusion 👍: In this article, I explored the powerful combination of PySpark and Natural Language Processing (NLP) for analyzing text data. Using PySpark, we preprocessed text, extracted features, and built a spam classifier. This approach allowed us to gain valuable insights from our text data and classify messages at scale.
