(Not IMDb) Movie Reviews Dataset EDA
This is the next article in a series where I work with the dataset from a popular online database of information related to movies and television series and use this data to practice different data science techniques and improve my skills.
Here I’m going to do Exploratory Data Analysis (EDA) of my dataset. After doing EDA we will be ready to build some ML models. For example, perform sentiment analysis.
Table of Contents
- Reading data
— Dataset Overview
— Marking Quality Assessment
- Exploratory Data Analysis
— General Figures
— Target
— Users
— Shows
— Date and Time of Review
— Subtitle
— Review
— Review Score
— Score
— Interactions
Reading data
The first part of the process is always reading the data:
import os
import pandas as pd

reviews = pd.read_parquet(os.path.join(cleaned_reviews_path, "reviews.parquet"))
reviews.info()
After reading a dataset I always use the info method to print information about the DataFrame, including the index dtype, the columns, non-null counts, and memory usage.
The subtitle and score columns have null values. In this article we are not going to impute them.
Dataset overview
reviews.sample(n=10, random_state=SEED)
Usually I look at random rows to grasp the dataset’s diversity. In my opinion, it is better than printing the first/last N rows.
Marking Quality Assessment
Let’s print a few random samples to see if the sentiment label corresponds to the sentiment of the review:
for sentiment, text in (
    reviews[["sentiment", "review"]].sample(30, random_state=SEED).values
):
    print(f"Sentiment: {sentiment}")
    pprint(text, width=180)
    print("\n")
After analysing the sentiment-review pairs it turned out that neutral reviews are hard to classify even manually, so let’s get rid of them for the sentiment classification task. Later we can try to classify text as neutral using predicted probabilities.
reviews = reviews.query("sentiment != 'neutral'")
reviews["sentiment"] = reviews["sentiment"].cat.remove_unused_categories()
reviews.shape
# (175551, 9)
Exploratory Data Analysis
General figures
Overall, users left 175 551 reviews.
reviews.shape[0]
The number of unique users that left reviews is 61 703, while the number of unique shows (movies/series) is 1 851.
reviews[["show_id", "user_id"]].nunique()
144 854 reviews were left for movies, which is approximately 83%.
reviews["type"].value_counts()
Now, I’m going to go through all the features and the target, and also look at the interactions between the features and the target.
Target
sentiment
Let’s look at sentiment by show type.
sentiment_by_type = (
    reviews.groupby(["type"])["sentiment"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percent")
    .reset_index()
    .rename(columns={"level_1": "sentiment"})
    .round(1)
)
sentiment_by_type
We can see that the proportions are roughly the same for movies and series. People tend to leave positive reviews (83%–85%), while negative reviews account for the remaining 15%–17%.
Users
user_id
Let’s look at the activity of the users.
number_of_reviews_per_user_per_type = (
    reviews.groupby("type")
    .agg({"user_id": "value_counts"})
    .rename(columns={"user_id": "reviews_per_user"})
    .reset_index()
)
number_of_reviews_per_user_per_type.groupby("type")["reviews_per_user"].describe(
    percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round(1)
As we can see, on average users leave slightly more than 2 movie reviews, but the median is 1 review. This means we have a lot of outliers: people who leave enormous numbers of reviews.
number_of_reviews_per_user_per_type_cut = number_of_reviews_per_user_per_type[
    number_of_reviews_per_user_per_type["reviews_per_user"] <= 7
]
plot_per_type(
    dataframe=number_of_reviews_per_user_per_type_cut,
    column="reviews_per_user",
    title="Number of reviews per user",
    title_shift=1.05,
    bins=8,
)
After hiding the outliers we can look at the distribution of the number of reviews. We can see that most users leave only a few reviews.
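The plot_per_type helper used throughout this article is custom code from the project. A minimal sketch of what such a function could look like with plain matplotlib (the real implementation may differ, e.g. in styling):

```python
import matplotlib.pyplot as plt


def plot_per_type(dataframe, column, title, bins=10, title_shift=1.0,
                  figsize=(10, 4)):
    """Plot a histogram of `column` separately for movies and for series."""
    fig, axes = plt.subplots(1, 2, figsize=figsize)
    for ax, show_type in zip(axes, ("movie", "series")):
        subset = dataframe.loc[dataframe["type"] == show_type, column]
        ax.hist(subset, bins=bins)
        ax.set_title(show_type)
        ax.set_xlabel(column)
    fig.suptitle(title, y=title_shift)
    fig.tight_layout()
```

The per-type split mirrors the groupby("type") used in the statistics above, so the plots and the tables describe the same slices of the data.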
Shows
show_id
Let’s analyze the popularity of the shows.
number_of_reviews_per_show_per_type = (
    reviews.groupby("type")
    .agg({"show_id": "value_counts"})
    .rename(columns={"show_id": "reviews_per_show"})
    .reset_index()
)
number_of_reviews_per_show_per_type.groupby("type")["reviews_per_show"].describe(
    percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round(1)
On average, people leave approximately 150 reviews per movie and 46 reviews per series.
number_of_reviews_per_show_per_type_cut = number_of_reviews_per_show_per_type[
    number_of_reviews_per_show_per_type["reviews_per_show"] <= 256
]
plot_per_type(
    dataframe=number_of_reviews_per_show_per_type_cut,
    column="reviews_per_show",
    title="Number of reviews per show",
    bins=40,
    title_shift=1.05,
)
Here we see the expected pattern: only a few movies/series have a lot of reviews.
Date and time of review
datetime
First, let’s create additional features based on datetime.
reviews["hour"] = reviews["datetime"].dt.hour
reviews["weekday"] = reviews["datetime"].dt.weekday
reviews["month"] = reviews["datetime"].dt.month - 1
Hour of review
plot_dt_per_type(
dataframe=reviews,
column="hour",
title="Hour distribution",
title_shift=1.0
)
People tend to publish reviews closer to night, with the absolute maximum around 20:00–23:00 and the absolute minimum around 04:00–07:00. However, I’m not sure whether the time zone was taken into account (all data is UTC+3) or whether these dates and times are actually spread across 11 time zones.
Let’s assume that the dates and times are local to the time zone where people were located when they left their reviews.
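If the timestamps turned out to all be stored as naive Moscow time (UTC+3), normalizing them would be straightforward with pandas. A toy sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical example: naive timestamps assumed to be Moscow time (UTC+3).
ts = pd.Series(pd.to_datetime(["2020-01-01 22:30", "2020-01-02 03:15"]))

moscow = ts.dt.tz_localize("Europe/Moscow")  # attach the assumed zone
utc = moscow.dt.tz_convert("UTC")            # convert to UTC

print(utc.dt.hour.tolist())  # [19, 0]
```

Without knowing the true storage convention, though, the hour-of-day distribution should be read with this caveat in mind.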
Weekday of review
plot_dt_per_type(
dataframe=reviews,
column="weekday",
title="Weekday distribution",
bins=7,
figsize=(10, 4),
)
People tend to publish more reviews on Sundays, while the differences between Mondays, Thursdays, Fridays, and Saturdays are not that significant.
Month of review
plot_dt_per_type(
    dataframe=reviews,
    column="month",
    title="Month distribution",
    bins=12,
    figsize=(12, 4),
)
People are prone to publish more reviews in the winter months (especially in January). This may be due to the number of holidays in Russia in January, and also to the cold weather.
Subtitle
85% of reviews have subtitles.
reviews["subtitle"].notna().mean()  # ~0.85
This field is not very interesting, so we are not going to analyze it further.
Review
We will look at the distribution of tokens for reviews to estimate the average length of review for sentiment classification in the future.
Distribution of Tokens
reviews["review"] = reviews["review"].str.replace("<p>", " ")
tokenizer = Tokenizer(max_length=None)
reviews["number_of_tokens"] = reviews["review"].parallel_apply(
    lambda review: tokenizer.tokenize(review, truncation=False)["input_ids"].shape[1]
)
Here we are using the tokenizer from the HuggingFace model, which I will show in subsequent articles.
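The Tokenizer class above is the author’s own wrapper. For illustration, counting tokens with a raw HuggingFace tokenizer could look like this; the checkpoint name is an assumption, since the article does not say which model is used:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint -- substitute the model actually used for classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")


def count_tokens(text: str) -> int:
    """Return the number of token ids produced for `text`, without truncation."""
    return len(tokenizer(text, truncation=False)["input_ids"])


print(count_tokens("A short example review."))
```

Note that the count includes special tokens such as [CLS] and [SEP], which also consume part of the model’s length budget.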
Analysis
reviews.groupby("type")["number_of_tokens"].describe(
percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round()
plot_per_type(
dataframe=reviews,
column="number_of_tokens",
title="Distribution of number of tokens in a review",
bins=20,
)
As we can see, some reviews are longer than 512 tokens, which is a problem, because BERT-like architectures usually have a maximum sequence length of 512 tokens.
Basically, we have two options in the future:
- Cut the longer texts off and only use the first/last 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
- Split reviews in multiple subtexts, classify each of them and combine the results back together (choose the class which was predicted for most of the subtexts for example). This option is obviously more computationally expensive.
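The second option can be sketched in a few lines. The helper names below are made up for illustration; a production version would also handle overlapping windows and probability averaging:

```python
from collections import Counter


def chunk_tokens(token_ids, max_len=512, stride=0):
    """Split a token id sequence into windows of at most `max_len` tokens.

    A non-zero `stride` makes consecutive windows overlap.
    """
    step = max_len - stride
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), step)]


def majority_vote(labels):
    """Pick the label predicted for most of the chunks."""
    return Counter(labels).most_common(1)[0][0]


chunks = chunk_tokens(list(range(1200)), max_len=512)
print([len(c) for c in chunks])              # [512, 512, 176]
print(majority_vote(["pos", "neg", "pos"]))  # pos
```

Each chunk is classified independently, and the per-chunk predictions are then combined, so a 1 200-token review costs three forward passes instead of one.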
Frequency distribution of words
A word frequency distribution is the frequency (number of occurrences) of each word (or phrase) in a dataset.
Knowing that our baseline solution will be based on TF-IDF, it is important to take a look at the top n-grams.
Review score
On this online platform users are allowed to leave feedback for others reviews. They can either upvote or downvote a review.
reviews.groupby("type")["review_score"].describe(
percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round(1)
plot_per_type(
dataframe=reviews[
(reviews["review_score"] >= -11) & (reviews["review_score"] <= 64)
],
column="review_score",
title="Distribution of review scores",
bins=20,
)
Overall, more reviews are rated positively, but some reviews are heavily downvoted and have a very low user score.
Score
Last, but not least, we have the score, which I extracted from the review texts before this analysis. Approximately 65% of reviews contain such a score.
plot_per_type(
    dataframe=reviews,
    column="score",
    title="Distribution of scores",
    bins=10,
    figsize=(10, 4),
)
As expected, people tend to leave high scores (me too, personally). When people don’t like a movie, they usually don’t include a quantitative estimate of their dislike, so low scores are underrepresented.
Interactions
Score and Sentiment
plot_catplot(
    y="score",
    x="sentiment",
    hue="type",
    data=reviews,
    title="Relationship between score and sentiment",
)
Clearly, score is a very good feature. The median score of negative reviews is very low (4 for movies and 3 for series), while positive reviews have the highest median score (9 for movies and 10 for series).
That is why I’ve extracted scores from the reviews (where it was possible): I don’t want them to influence the classifier.
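The article does not show how the scores were pulled out of the review texts. A hedged sketch of the idea with a regular expression, using a hypothetical English pattern (the real reviews and the real pattern on the platform will differ):

```python
import re

# Hypothetical pattern: many reviews end with a line like "8 out of 10".
SCORE_RE = re.compile(r"\b(10|[0-9])\s+out\s+of\s+10\b", re.IGNORECASE)


def extract_score(review: str):
    """Return the numeric score mentioned in the review, or None."""
    match = SCORE_RE.search(review)
    return int(match.group(1)) if match else None


print(extract_score("Loved every minute. 9 out of 10"))  # 9
print(extract_score("No explicit score here."))          # None
```

After extraction, the matched phrase would also be removed from the review text, so the classifier cannot simply read the score off the input.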
Number of tokens in review and Sentiment
plot_catplot(
    y="number_of_tokens",
    x="sentiment",
    hue="type",
    data=reviews,
    medianprops={},
    title="Relationship between the number of tokens in review and sentiment",
)
As we can see from the plot, the number of tokens (the length of the review) is roughly the same across sentiments. So using the length of the review directly as a feature is not an option: it would be useless.
Code to perform the research can be found here: