Natural Language Processing (NLP) with PySpark
“PySpark is the best choice for handling large datasets, thanks to its powerful capabilities that combine the excellence of Apache Spark with the ease and expressiveness of Python.”
Why Spark? Wait, let me explain:
“PySpark is the bridge that transforms big data into actionable insights, offering the power of Apache Spark with the ease and expressiveness of Python.” In simple words, it is fast at everything: whether you consider data pre-processing, cleaning, loading, or Extract-Transform steps, it provides an API for each, and it allows multiple CPUs to work on different parts of a task simultaneously.
Why not Databricks?
Because PySpark is an open-source distributed computing framework that supports multiple programming languages.
Why Python?
Easy to learn: for programmers, Python is comparatively easier to learn because of its syntax and standard libraries. Moreover, it is a dynamically typed language, which means RDDs can hold objects of multiple types.
Let's install and use it
# Create your conda environment in the Anaconda prompt
# Remember to install Python first
conda create -n salvaenv
conda activate salvaenv
conda install pyspark  # install PySpark
# To test whether PySpark is installed
import pyspark  # your PySpark is ready
# To use PySpark in your code editor we need to import SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkExample").getOrCreate()  # we need to start the Spark session
# Loading the dataset
df = spark.read.csv("your file name", header=True, inferSchema=True)
# header=True: treats the first row as the column headings
# inferSchema=True: infers each column's data type from the data
df.show()  # to view the data
# Let's get the length of the text in each row
from pyspark.sql.functions import length
df = df.withColumn("length", length(df["text"]))  # length() acts like Python's len()
SMS Spam Classification
We will predict whether a given SMS is spam or not spam. Let's start with some basic data analysis by comparing the lengths of ham and spam messages.
PRE-PROCESSING: this step removes meaningless words (which also reduces memory usage), tokenizes the text, and converts it into numeric form so that the machine can understand it.
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, StringIndexer
In PySpark, the StringIndexer is a feature transformer used for encoding categorical (string) columns into numerical values.
The VectorAssembler is a feature transformer commonly used for feature engineering in machine learning pipelines. Its primary purpose is to combine multiple feature columns into a single vector column.
Let's build the model:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()  # Naive Bayes performs well on binary classification tasks like this
Train and Evaluate the Model 😊
Conclusion 👍: In this article, I explored the powerful combination of PySpark and Natural Language Processing (NLP) for analyzing text data. Using PySpark, we preprocessed text, extracted features, and trained a spam classifier. This approach lets us gain valuable insights from text data at scale, such as classifying messages, and the same pipeline pattern extends to tasks like sentiment analysis and topic modeling.