(Not IMDb) Movie Reviews Dataset EDA
This is the next article in a series where I work with the dataset from a popular online database of information related to movies and television series and use this data to practice different data science techniques and improve my skills.
Here I’m going to do Exploratory Data Analysis (EDA) of my dataset. After doing EDA we will be ready to build some ML models. For example, perform sentiment analysis.
Table of Contents
- Reading data
— Dataset Overview
— Marking Quality Assessment
- Exploratory Data Analysis
— General Figures
— Target
— Users
— Shows
— Date and Time of Review
— Subtitle
— Review
— Review Score
— Score
— Interactions
Reading data
The first part of the process is always reading the data:
import os
import pandas as pd

reviews = pd.read_parquet(os.path.join(cleaned_reviews_path, "reviews.parquet"))
reviews.info()
After reading a dataset I always use the info method to print information about the DataFrame, including the index dtype, the columns, non-null counts, and memory usage.
The subtitle and score columns have null values. In this article we are not going to impute them.
Dataset overview
reviews.sample(n=10, random_state=SEED)
Usually I look at random rows to grasp the dataset’s diversity. In my opinion, it is better than printing the first/last N rows.
Marking Quality Assessment
Let’s print a few random samples to see if the sentiment label corresponds to the sentiment of the review:
for sentiment, text in (
    reviews[["sentiment", "review"]].sample(30, random_state=SEED).values
):
    print(f"Sentiment: {sentiment}")
    pprint(text, width=180)
    print("\n")
After analysing the sentiment-review pairs it turned out that neutral reviews are hard to classify even manually, so let’s get rid of them for the sentiment classification task. Later we can try to classify text as neutral using predicted probabilities.
reviews = reviews.query("sentiment != 'neutral'")
reviews["sentiment"] = reviews["sentiment"].cat.remove_unused_categories()
reviews.shape
# (175551, 9)
Exploratory Data Analysis
General figures
Overall, users left 175 551 reviews.
reviews.shape[0]
The number of unique users that left reviews is 61 703, while the number of unique shows (movies/series) is 1 851.
reviews[["show_id", "user_id"]].nunique()
144 854 reviews were left for movies, which is approximately 83%.
reviews["type"].value_counts()
Now, I’m going to go through all the features and the target, and also look at the interactions between the features and the target.
Target
sentiment
Let’s look at sentiment by show type.
sentiment_by_type = (
    reviews.groupby(["type"])["sentiment"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percent")
    .reset_index()
    .rename(columns={"level_1": "sentiment"})
    .round(1)
)
sentiment_by_type
We can see that the proportions are roughly the same for movies and series. People tend to leave positive reviews (83%–85%), while negative reviews account for the remaining 15%–17%.
Users
user_id
Let’s look at the activity of the users.
number_of_reviews_per_user_per_type = (
    reviews.groupby("type")
    .agg({"user_id": "value_counts"})
    .rename(columns={"user_id": "reviews_per_user"})
    .reset_index()
)
number_of_reviews_per_user_per_type.groupby("type")["reviews_per_user"].describe(
    percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round(1)
As we can see, on average users leave slightly more than 2 movie reviews, but the median is 1 review. This means we have a lot of outliers: people who leave enormous numbers of reviews.
number_of_reviews_per_user_per_type_cut = number_of_reviews_per_user_per_type[
    number_of_reviews_per_user_per_type["reviews_per_user"] <= 7
]
plot_per_type(
    dataframe=number_of_reviews_per_user_per_type_cut,
    column="reviews_per_user",
    title="Number of reviews per user",
    title_shift=1.05,
    bins=8,
)
After hiding the outliers we can look at the distribution of the number of reviews. We can see that most users leave only a few reviews.
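The plot_per_type helper used throughout this article is custom code from the project. A minimal sketch of what such a function could look like with plain matplotlib (the real implementation may differ, e.g. in styling):

```python
import matplotlib.pyplot as plt


def plot_per_type(dataframe, column, title, bins=10, title_shift=1.0,
                  figsize=(10, 4)):
    """Plot a histogram of `column` separately for movies and for series."""
    fig, axes = plt.subplots(1, 2, figsize=figsize)
    for ax, show_type in zip(axes, ("movie", "series")):
        subset = dataframe.loc[dataframe["type"] == show_type, column]
        ax.hist(subset, bins=bins)
        ax.set_title(show_type)
        ax.set_xlabel(column)
    fig.suptitle(title, y=title_shift)
    fig.tight_layout()
```

The per-type split mirrors the groupby("type") used in the statistics above, so the plots and the tables describe the same slices of the data.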
Shows
show_id
Let’s analyze the popularity of the shows.
number_of_reviews_per_show_per_type = (
    reviews.groupby("type")
    .agg({"show_id": "value_counts"})
    .rename(columns={"show_id": "reviews_per_show"})
    .reset_index()
)
number_of_reviews_per_show_per_type.groupby("type")["reviews_per_show"].describe(
    percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round(1)
On average, people leave approximately 150 reviews per movie and 46 reviews per series.
number_of_reviews_per_show_per_type_cut = number_of_reviews_per_show_per_type[
    number_of_reviews_per_show_per_type["reviews_per_show"] <= 256
]
plot_per_type(
    dataframe=number_of_reviews_per_show_per_type_cut,
    column="reviews_per_show",
    title="Number of reviews per show",
    bins=40,
    title_shift=1.05,
)
Here we see the expected pattern: only a few movies/series have a lot of reviews.
Date and time of review
datetime
First, let’s create additional features based on datetime.
reviews["hour"] = reviews["datetime"].dt.hour
reviews["weekday"] = reviews["datetime"].dt.weekday
reviews["month"] = reviews["datetime"].dt.month - 1
Hour of review
plot_dt_per_type(
dataframe=reviews,
column="hour",
title="Hour distribution",
title_shift=1.0
)
People tend to publish reviews closer to night, with the absolute maximum around 20:00–23:00 and the absolute minimum around 04:00–07:00. However, I’m not sure whether the time zone was taken into account (all data is UTC+3) or whether these dates and times are actually spread across 11 time zones.
Let’s assume that the dates and times are local to the time zone where people were located when they left their reviews.
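If the timestamps turned out to all be stored as naive Moscow time (UTC+3), normalizing them would be straightforward with pandas. A toy sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical example: naive timestamps assumed to be Moscow time (UTC+3).
ts = pd.Series(pd.to_datetime(["2020-01-01 22:30", "2020-01-02 03:15"]))

moscow = ts.dt.tz_localize("Europe/Moscow")  # attach the assumed zone
utc = moscow.dt.tz_convert("UTC")            # convert to UTC

print(utc.dt.hour.tolist())  # [19, 0]
```

Without knowing the true storage convention, though, the hour-of-day distribution should be read with this caveat in mind.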
Weekday of review
plot_dt_per_type(
dataframe=reviews,
column="weekday",
title="Weekday distribution",
bins=7,
figsize=(10, 4),
)
People tend to publish more reviews on Sundays, while the differences between Mondays, Thursdays, Fridays, and Saturdays are not that significant.
Month of review
plot_dt_per_type(
    dataframe=reviews,
    column="month",
    title="Month distribution",
    bins=12,
    figsize=(12, 4),
)
People are prone to publish more reviews in the winter months (especially in January). This may be due to the number of holidays in Russia in January, and also to the cold weather.
Subtitle
85% of reviews have subtitles.
reviews["subtitle"].notna().mean()  # ~0.85
This field is not very interesting, so we are not going to analyze it further.
Review
We will look at the distribution of tokens for reviews to estimate the average length of review for sentiment classification in the future.
Distribution of Tokens
reviews["review"] = reviews["review"].str.replace("<p>", " ")
tokenizer = Tokenizer(max_length=None)
reviews["number_of_tokens"] = reviews["review"].parallel_apply(
    lambda review: tokenizer.tokenize(review, truncation=False)["input_ids"].shape[1]
)
Here we are using the tokenizer from the HuggingFace model, which I will show in subsequent articles.
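The Tokenizer class above is the author’s own wrapper. For illustration, counting tokens with a raw HuggingFace tokenizer could look like this; the checkpoint name is an assumption, since the article does not say which model is used:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint -- substitute the model actually used for classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")


def count_tokens(text: str) -> int:
    """Return the number of token ids produced for `text`, without truncation."""
    return len(tokenizer(text, truncation=False)["input_ids"])


print(count_tokens("A short example review."))
```

Note that the count includes special tokens such as [CLS] and [SEP], which also consume part of the model’s length budget.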
Analysis
reviews.groupby("type")["number_of_tokens"].describe(
percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round()
plot_per_type(
dataframe=reviews,
column="number_of_tokens",
title="Distribution of number of tokens in a review",
bins=20,
)
As we can see, some reviews are longer than 512 tokens, which is a problem, because BERT-like architectures usually have a maximum sequence length of 512 tokens.
Basically, we have two options in the future:
- Cut the longer texts off and only use the first/last 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
- Split reviews in multiple subtexts, classify each of them and combine the results back together (choose the class which was predicted for most of the subtexts for example). This option is obviously more computationally expensive.
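The second option can be sketched in a few lines. The helper names below are made up for illustration; a production version would also handle overlapping windows and probability averaging:

```python
from collections import Counter


def chunk_tokens(token_ids, max_len=512, stride=0):
    """Split a token id sequence into windows of at most `max_len` tokens.

    A non-zero `stride` makes consecutive windows overlap.
    """
    step = max_len - stride
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), step)]


def majority_vote(labels):
    """Pick the label predicted for most of the chunks."""
    return Counter(labels).most_common(1)[0][0]


chunks = chunk_tokens(list(range(1200)), max_len=512)
print([len(c) for c in chunks])              # [512, 512, 176]
print(majority_vote(["pos", "neg", "pos"]))  # pos
```

Each chunk is classified independently, and the per-chunk predictions are then combined, so a 1 200-token review costs three forward passes instead of one.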
Frequency distribution of words
A word frequency distribution is the frequency (number of occurrences) of each word (or phrase) in a dataset.
Knowing that our baseline solution will be based on TF-IDF, it is important to take a look at the top n-grams.
Review score
On this online platform users are allowed to leave feedback for others reviews. They can either upvote or downvote a review.
reviews.groupby("type")["review_score"].describe(
percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
).round(1)
plot_per_type(
dataframe=reviews[
(reviews["review_score"] >= -11) & (reviews["review_score"] <= 64)
],
column="review_score",
title="Distribution of review scores",
bins=20,
)
Overall, more reviews are rated positively, but some reviews are heavily downvoted and have a very low user score.
Score
Last, but not least, we have the score, which I extracted from the review texts before this analysis. Approximately 65% of reviews contain such a score.
plot_per_type(
    dataframe=reviews,
    column="score",
    title="Distribution of scores",
    bins=10,
    figsize=(10, 4),
)
As expected, people tend to leave high scores (me too, personally). When people don’t like a movie, they usually don’t include a quantitative estimate of their dislike, so low scores are underrepresented.
Interactions
Score and Sentiment
plot_catplot(
    y="score",
    x="sentiment",
    hue="type",
    data=reviews,
    title="Relationship between score and sentiment",
)
Clearly, score is a very good feature. The median score of negative reviews is very low (4 for movies and 3 for series), while positive reviews have the highest median score (9 for movies and 10 for series).
That is why I’ve extracted scores from the reviews (where it was possible): I don’t want them to influence the classifier.
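The article does not show how the scores were pulled out of the review texts. A hedged sketch of the idea with a regular expression, using a hypothetical English pattern (the real reviews and the real pattern on the platform will differ):

```python
import re

# Hypothetical pattern: many reviews end with a line like "8 out of 10".
SCORE_RE = re.compile(r"\b(10|[0-9])\s+out\s+of\s+10\b", re.IGNORECASE)


def extract_score(review: str):
    """Return the numeric score mentioned in the review, or None."""
    match = SCORE_RE.search(review)
    return int(match.group(1)) if match else None


print(extract_score("Loved every minute. 9 out of 10"))  # 9
print(extract_score("No explicit score here."))          # None
```

After extraction, the matched phrase would also be removed from the review text, so the classifier cannot simply read the score off the input.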
Number of tokens in review and Sentiment
plot_catplot(
    y="number_of_tokens",
    x="sentiment",
    hue="type",
    data=reviews,
    medianprops={},
    title="Relationship between the number of tokens in review and sentiment",
)
As we can see from the plot, the number of tokens (the length of the review) is roughly the same across sentiments. So using the length of the review directly as a feature is not an option: it would be useless.
Code to perform the research can be found here: