Is Parquet Faster than CSV for Sentiment Analysis?
If you look at the graph below, it clearly shows that Parquet consumes much less memory than the other formats, but why?
Let's look at some of the features of CSV and Parquet files.
The only drawback of the Parquet format is that we can't read it manually the way we can read a CSV, because the data is stored in a columnar, binary layout.
Note: We are using the Flair module for the sentiment analysis.
%%capture
!pip install flair
This will install the Flair module; if you already have it installed, you can skip this step.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from flair.models import TextClassifier
from flair.data import Sentence
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:.4f}'.format
Import all the libraries you will need for this.
Reading the dataset
I am using the IMDB reviews dataset from Kaggle.
URL for IMDB dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
# Reading the dataset
df_review = pd.read_csv("../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
print("Number of rows in the review data -", df_review.shape[0])
display(df_review.head())
Checking for NULL values
print("Percentage of Missing values in the Reviews data - \n")
print(df_review.isnull().sum() / df_review.shape[0] * 100.00)
So there are no missing values in any column.
Sentiment Analysis using Flair
# It will be used by the cal_score function
def senti_score(n):
    s = Sentence(n)
    classifier.predict(s)
    total_sentiment = s.labels[0]
    assert total_sentiment.value in ["POSITIVE", "NEGATIVE"]
    sign = 1 if total_sentiment.value == "POSITIVE" else -1
    score = total_sentiment.score
    return sign * score

# Classify the scores as positive or negative
def sentiment_type(score):
    if score <= 0:
        return "Negative"
    elif score > 0:
        return "Positive"

# Calculate the scores for the sentiment
def cal_score(text_col):
    scores = []
    new_textcol = text_col.tolist()
    for i in tqdm(new_textcol):
        s = senti_score(i)
        scores.append(s)
    return scores
Selecting only the top 100 reviews
df_top_500 = df_review.iloc[0:100]
df_top_500a = df_review.iloc[0:100]
print("Dimension - ",df_top_500.shape)
Applying sentiment analysis on CSV file with 100 reviews
classifier=TextClassifier.load('en-sentiment')
#Calculating the flair score in a new column
df_top_500['Flair Score']=cal_score(df_top_500['review'])
#Applying the function
df_top_500["Sentiment_Type"]=df_top_500["Flair Score"].apply(sentiment_type)
Flair took 27 sec to process 100 reviews using CSV format.
Applying sentiment analysis on Parquet file with 100 reviews
#Convert into parquet format
df_top_500a.to_parquet('df.parquet.gzip', compression='gzip')
#Reading the parquet file
df_parquet = pd.read_parquet('df.parquet.gzip')
#Calculating the flair score in a new column
df_parquet['Flair Score']=cal_score(df_parquet['review'])
#Applying the function
df_parquet["Sentiment_Type"]=df_parquet["Flair Score"].apply(sentiment_type)
Flair took 27 sec to process 100 reviews using Parquet format too.
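The timings quoted in this article came from watching the notebook run. A minimal sketch for measuring a step yourself, using Python's standard `time.perf_counter` (with a lightweight stand-in workload, since loading Flair is too heavy for a quick demo):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook you would time cal_score instead.
_, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed: {elapsed:.4f} s")
```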
Testing again with top 500 reviews
df_top_500 = df_review.iloc[0:500]
df_top_500a = df_review.iloc[0:500]
print("Dimension - ",df_top_500.shape)
Applying sentiment analysis on CSV file with 500 reviews
#Calculating the flair score in a new column
df_top_500['Flair Score']=cal_score(df_top_500['review'])
#Applying the function
df_top_500["Sentiment_Type"]=df_top_500["Flair Score"].apply(sentiment_type)
Flair took 2 min 15 sec to process 500 reviews using CSV format
Applying sentiment analysis on Parquet file with 500 reviews
#Convert into parquet format
df_top_500a.to_parquet('df.parquet.gzip', compression='gzip')
#Reading the parquet file
df_parquet = pd.read_parquet('df.parquet.gzip')
#Calculating the flair score in a new column
df_parquet['Flair Score']=cal_score(df_parquet['review'])
#Applying the function
df_parquet["Sentiment_Type"]=df_parquet["Flair Score"].apply(sentiment_type)
Flair took 2 min 17 sec to process 500 reviews using Parquet format
Testing again with top 1000 reviews
df_top_500 = df_review.iloc[0:1000]
df_top_500a = df_review.iloc[0:1000]
print("Dimension - ",df_top_500.shape)
Applying sentiment analysis on CSV file with 1000 reviews
#Calculating the flair score in a new column
df_top_500['Flair Score']=cal_score(df_top_500['review'])
#Applying the function
df_top_500["Sentiment_Type"]=df_top_500["Flair Score"].apply(sentiment_type)
Flair took 4 min 24 sec to process 1000 reviews using CSV format.
Applying sentiment analysis on Parquet file with 1000 reviews
#Convert into parquet format
df_top_500a.to_parquet('df.parquet.gzip', compression='gzip')
#Reading the parquet file
df_parquet = pd.read_parquet('df.parquet.gzip')
#Calculating the flair score in a new column
df_parquet['Flair Score']=cal_score(df_parquet['review'])
#Applying the function
df_parquet["Sentiment_Type"]=df_parquet["Flair Score"].apply(sentiment_type)
Flair took 4 min 24 sec to process 1000 reviews using the Parquet format.
Final summary
For 100 reviews -
- Flair took 27 sec to process 100 reviews using CSV format
- Flair took 27 sec to process 100 reviews using the Parquet format
For 500 reviews -
- Flair took 2 min 15 sec to process 500 reviews using CSV format
- Flair took 2 min 17 sec to process 500 reviews using the Parquet format
For 1000 reviews -
- Flair took 4 min 24 sec to process 1000 reviews using CSV format
- Flair took 4 min 24 sec to process 1000 reviews using the Parquet format
We can see that as the number of reviews increases, Parquet keeps pace with CSV; if we extended the test to 2000 or 5000 reviews, Parquet might even beat CSV.
If this article helped you, don't forget to like and share it with your friends. Happy Learning!