Is Parquet Faster than CSV for Sentiment Analysis?
If you look at the graph below, it clearly shows that Parquet consumes much less memory than the other formats, but why?
Let's look at some of the features of CSV and Parquet files.
The only drawback of the Parquet format is that we can't read it manually the way we can read a CSV, because the data is stored in a columnar, binary layout.
Note: We are using the Flair module for the sentiment analysis.
%%capture
!pip install flair
This will install the Flair module; if you already have it installed, you can skip this step.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from flair.models import TextClassifier
from flair.data import Sentence
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:.4f}'.format
Import all the libraries you will need for this.
Reading the dataset
I am using the IMDB reviews dataset from Kaggle.
URL for IMDB dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
# Reading the dataset
df_review = pd.read_csv("../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
print("Number of rows in the review data -", df_review.shape[0])
display(df_review.head())
Checking for NULL values
print("Percentage of Missing values in the Reviews data - \n")
print(df_review.isnull().sum() / df_review.shape[0] * 100.00)
So there are no missing values in any column.
Sentiment Analysis using Flair
# It will be used by the cal_score function
def senti_score(n):
    s = Sentence(n)
    classifier.predict(s)
    total_sentiment = s.labels[0]
    assert total_sentiment.value in ["POSITIVE", "NEGATIVE"]
    sign = 1 if total_sentiment.value == "POSITIVE" else -1
    score = total_sentiment.score
    return sign * score

# Classify the scores as positive or negative
def sentiment_type(score):
    if score <= 0:
        return "Negative"
    elif score > 0:
        return "Positive"

# Calculate the scores for the sentiment
def cal_score(text_col):
    scores = []
    new_textcol = text_col.tolist()
    for i in tqdm(new_textcol):
        s = senti_score(i)
        scores.append(s)
    return scores
Selecting only the top 100 reviews
df_top_500 = df_review.iloc[0:100]
df_top_500a = df_review.iloc[0:100]
print("Dimension - ",df_top_500.shape)
Applying sentiment analysis on CSV file with 100 reviews
classifier=TextClassifier.load('en-sentiment')
#Calculating the flair score in a new column
df_top_500['Flair Score']=cal_score(df_top_500['review'])
#Applying the function
df_top_500["Sentiment_Type"]=df_top_500["Flair Score"].apply(sentiment_type)
Flair took 27 sec to process 100 reviews using CSV format.
Applying sentiment analysis on Parquet file with 100 reviews
#Convert into parquet format
df_top_500a.to_parquet('df.parquet.gzip', compression='gzip')
#Reading the parquet file
df_parquet = pd.read_parquet('df.parquet.gzip')
#Calculating the flair score in a new column
df_parquet['Flair Score']=cal_score(df_parquet['review'])
#Applying the function
df_parquet["Sentiment_Type"]=df_parquet["Flair Score"].apply(sentiment_type)
Flair took 27 sec to process 100 reviews using Parquet format too.
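The timings quoted in this article came from watching the notebook run. A minimal sketch for measuring a step yourself, using Python's standard `time.perf_counter` (with a lightweight stand-in workload, since loading Flair is too heavy for a quick demo):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook you would time cal_score instead.
_, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed: {elapsed:.4f} s")
```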
Testing again with top 500 reviews
df_top_500 = df_review.iloc[0:500]
df_top_500a = df_review.iloc[0:500]
print("Dimension - ",df_top_500.shape)
Applying sentiment analysis on CSV file with 500 reviews
#Calculating the flair score in a new column
df_top_500['Flair Score']=cal_score(df_top_500['review'])
#Applying the function
df_top_500["Sentiment_Type"]=df_top_500["Flair Score"].apply(sentiment_type)
Flair took 2 min 15 sec to process 500 reviews using CSV format
Applying sentiment analysis on Parquet file with 500 reviews
#Convert into parquet format
df_top_500a.to_parquet('df.parquet.gzip', compression='gzip')
#Reading the parquet file
df_parquet = pd.read_parquet('df.parquet.gzip')
#Calculating the flair score in a new column
df_parquet['Flair Score']=cal_score(df_parquet['review'])
#Applying the function
df_parquet["Sentiment_Type"]=df_parquet["Flair Score"].apply(sentiment_type)
Flair took 2 min 17 sec to process 500 reviews using Parquet format
Testing again with top 1000 reviews
df_top_500 = df_review.iloc[0:1000]
df_top_500a = df_review.iloc[0:1000]
print("Dimension - ",df_top_500.shape)
Applying sentiment analysis on CSV file with 1000 reviews
#Calculating the flair score in a new column
df_top_500['Flair Score']=cal_score(df_top_500['review'])
#Applying the function
df_top_500["Sentiment_Type"]=df_top_500["Flair Score"].apply(sentiment_type)
Flair took 4 min 24 sec to process 1000 reviews using CSV format.
Applying sentiment analysis on Parquet file with 1000 reviews
#Convert into parquet format
df_top_500a.to_parquet('df.parquet.gzip', compression='gzip')
#Reading the parquet file
df_parquet = pd.read_parquet('df.parquet.gzip')
#Calculating the flair score in a new column
df_parquet['Flair Score']=cal_score(df_parquet['review'])
#Applying the function
df_parquet["Sentiment_Type"]=df_parquet["Flair Score"].apply(sentiment_type)
Flair took 4 min 24 sec to process 1000 reviews using the Parquet format.
Final summary
For 100 reviews -
- Flair took 27 sec to process 100 reviews using CSV format
- Flair took 27 sec to process 100 reviews using the Parquet format
For 500 reviews -
- Flair took 2 min 15 sec to process 500 reviews using CSV format
- Flair took 2 min 17 sec to process 500 reviews using the Parquet format
For 1000 reviews -
- Flair took 4 min 24 sec to process 1000 reviews using CSV format
- Flair took 4 min 24 sec to process 1000 reviews using the Parquet format
We can see that as the number of reviews increases, Parquet keeps pace with CSV; if we extended the test to 2000 or 5000 reviews, Parquet might even beat CSV.
If this article helped you, don't forget to like and share it with your friends. Happy Learning!