Streaming Wars with Sentiment Analysis Using the RoBERTa Model: Amazon Prime Video

Shrunali Suresh Salian
10 min read · Jun 5, 2023


The global market for online streaming is currently valued at over $500 billion and is projected to exceed the $1 trillion mark by 2027. This significant growth potential makes it a highly competitive and valuable market that companies are eager to be a part of.

The United States remains the largest market for every OTT player in the industry. Through Q3 of 2022, Netflix held the largest US market share at 21%, ahead of Amazon Prime Video at 19%. As of Q1 2023, however, Prime Video holds the largest share of any streaming platform in the US at 21%, while Netflix holds 20%.

In this article, we embark on a thorough analysis of the streaming powerhouse Amazon Prime Video, delving into its strategies, strengths, weaknesses, and the unique offerings it brings to the table.

After a prolonged period of anticipation, Amazon Prime Video has become the largest market share holder among streaming services in the United States. The streaming giant also holds the largest content library of any streaming service.

The article is based on the Kaggle dataset available for Amazon Prime Video. It’s important to note that the analysis and conclusions based on the dataset for Amazon Prime Video may not reflect the current trends and offerings of Amazon Prime Video. The dataset provides insights based on a specific time period and may not capture the most up-to-date information or changes in the streaming platform.

What are you most likely to watch on Amazon Prime?

For every six movies available on Amazon Prime Video, there is one TV show.

As of 2023, Amazon Prime Video has expanded its presence to over 200 countries worldwide, with the United States leading as the primary market, followed by India.

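import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# `showtime` is assumed to be the DataFrame loaded from the Kaggle dataset
count_data = showtime['type'].value_counts()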

# Create a horizontal bar chart
fig = go.Figure(data=go.Bar(
y=count_data.index,
x=count_data.values,
orientation='h',
marker=dict(color=['#1b2530', '#3EB8FF'])
))

# Set title and axis labels
fig.update_layout(
title='Content on Amazon Prime Video',
xaxis_title='Count',
yaxis_title='Type of Content', title_x = 0.5
)

# Show the plot
fig.show()
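The "six movies for every TV show" figure can be checked directly from the same counts. Here is a minimal sketch on toy data; the helper name `movies_per_show` is my own, and the `type` column with `MOVIE` and `SHOW` values is assumed to match the dataset.

```python
import pandas as pd

def movies_per_show(df, type_col="type"):
    """Return how many movies the catalogue holds for each TV show."""
    counts = df[type_col].value_counts()
    return counts["MOVIE"] / counts["SHOW"]

# Toy catalogue (illustrative only): six movies and one show
toy_catalogue = pd.DataFrame({"type": ["MOVIE"] * 6 + ["SHOW"]})
ratio = movies_per_show(toy_catalogue)
print(f"{ratio:.1f} movies per show")
```

On the real data, `movies_per_show(showtime)` should land close to the ratio quoted above if the dataset matches the chart.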

Who is dominating the Amazon market share?

Within 3 years of entering into the Indian market, Amazon has managed to find a niche for itself by producing content that’s more regional and appealing to the Indian audiences.

Amazon entered the Indian market in 2016, much later than the United Kingdom (2007) and Canada (2013). Even so, it’s safe to say that Amazon received a warm reception in India.

fig = px.pie(top_countries, values='content_produced', names='country_name',
title='Contribution of Content Produced by Top 4 Countries on Amazon Prime Video')

# Change the color palette
fig.update_traces(marker=dict(colors=['#1b2530', '#0f79af', '#3EB8FF', '#f5f5f1','#ffffff']))

# Set the text position and information to be displayed
fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

Do we really like movies that much, or is it just a myth?

While the rest of the world prefers short content, the Asian countries are in it for the long-format content.

Amazon demonstrates a keen understanding of its audience preferences by effectively targeting different regions. Asian countries such as South Korea, China, and Japan are recognized for their inclination towards consuming dramatic long-format content. In contrast, audiences in other countries tend to have a greater preference for movies on Amazon Prime Video.

country_order = showtime['production_country'].value_counts()[:11].index
data = showtime[['type', 'production_country']].groupby('production_country')['type'].value_counts().unstack().loc[country_order]
data['sum'] = data.sum(axis=1)
data_ratio = (data.T / data['sum']).T[['MOVIE', 'SHOW']].sort_values(by='MOVIE',ascending=False)[::-1]
# Name the index explicitly: its default name differs across pandas versions
data_ratio = data_ratio.rename_axis('country_code').reset_index()
data_ratio = data_ratio[data_ratio['country_code'] != 'No Data']
data_ratio['MOVIE'] = round(data_ratio['MOVIE'], 2)
data_ratio['SHOW'] = round(data_ratio['SHOW'],2)
fig = go.Figure()

# Add horizontal bar traces for MOVIE and SHOW
fig.add_trace(go.Bar(
y=data_ratio.country_code,
x=data_ratio['MOVIE'],
name='MOVIE',
orientation='h',
marker=dict(color='#1b2530'),
text=data_ratio['MOVIE'].map('{:.0%}'.format), # Add text as whole-number percentages (avoids float noise like 56.99999%)
textposition='inside', # Set text position inside the bars
textfont=dict(color='white') # Set text color
))

fig.add_trace(go.Bar(
y=data_ratio.country_code,
x=data_ratio['SHOW'],
name='SHOW',
orientation='h',
marker=dict(color='#3EB8FF'),
text=data_ratio['SHOW'].map('{:.0%}'.format), # Add text as whole-number percentages (avoids float noise like 56.99999%)
textposition='inside', # Set text position inside the bars
textfont=dict(color='white') # Set text color
))

# Set the layout
fig.update_layout(
title="Amazon Prime Video's Content Distribution by Country",
barmode='stack',
yaxis_title='Top 10 Countries',
xaxis=dict(showticklabels=False), title_x = 0.5 # Hide the x-axis tick labels
)

fig.show()

What kind of a person are you: Drama or Comedy?

Drama is the name of the game 🤓

Drama holds the leading position among genres on Amazon Prime Video, closely followed by comedy, thriller, and documentaries. The primary reason behind this trend is the balance between supply and demand. The largest markets for Amazon Prime Video, namely the United States and India, exhibit a strong appetite for content rich in drama and comedy. Conversely, British audiences show a greater inclination towards documentaries.

genre_distribution = pd.DataFrame(showtime.groupby('primary_genre')['type'].value_counts())
genre_distribution = genre_distribution.unstack().reset_index().fillna(0).drop(0)
genre_distribution['SUM'] = genre_distribution.sum(axis = 1, numeric_only = True) # skip the text genre column when summing
genre_distribution.columns = ['primary_genre', 'MOVIE', 'SHOW', 'total']
genre_distribution = genre_distribution.sort_values('total', ascending = False)

fig = go.Figure()
# Add vertical bar traces for MOVIE and SHOW
fig.add_trace(go.Bar(
x=genre_distribution['primary_genre'],
y=genre_distribution['MOVIE'],
name='MOVIE',
marker=dict(color='#1b2530'),
# text=df['MOVIE'].apply(lambda x: f'{x:.1f}%'),
# textposition='auto'
))

fig.add_trace(go.Bar(
x=genre_distribution['primary_genre'],
y=genre_distribution['SHOW'],
name='SHOW',
marker=dict(color='#3EB8FF'),
# text=df['SHOW'].apply(lambda x: f'{x:.1f}%'),
# textposition='auto'
))

# fig.update_traces(textposition='inside', textinfo='percent+label')

# Set the layout
fig.update_layout(
title="Amazon Prime Video's Content Distribution by Genre",
xaxis_title='Genre',
yaxis_title='Content on Amazon Prime Video',
barmode='stack', legend_title = 'Type of Content', title_x =0.5
)
fig.show()

For mature audiences only …

Looks like Amazon Prime Video is targeting mature audiences

R-rated titles make up the largest share of the rated content on Amazon Prime Video. PG-13 comes next, with roughly half as many titles as the R category.

rating_distribution = pd.DataFrame(showtime.groupby('age_certification')['type'].value_counts())
# rating_distribution = genre_distribution.unstack().reset_index().fillna(0).drop(0)
rating_distribution = rating_distribution.unstack().reset_index().fillna(0)
rating_distribution['SUM'] = rating_distribution.sum(axis = 1, numeric_only = True) # skip the text rating column when summing
rating_distribution.columns = ['age_certification','MOVIE','SHOW','Total']
rating_distribution = rating_distribution.sort_values('Total', ascending = False).drop(2)
fig = go.Figure()

# Add vertical bar traces for MOVIE and SHOW
fig.add_trace(go.Bar(
x=rating_distribution['age_certification'],
y=rating_distribution['MOVIE'],
name='MOVIE',
marker=dict(color='#1b2530'),
# text=df['MOVIE'].apply(lambda x: f'{x:.1f}%'),
# textposition='auto'
))

fig.add_trace(go.Bar(
x=rating_distribution['age_certification'],
y=rating_distribution['SHOW'],
name='SHOW',
marker=dict(color='#3EB8FF'),
# text=df['SHOW'].apply(lambda x: f'{x:.1f}%'),
# textposition='auto'
))

# fig.update_traces(textposition='inside', textinfo='percent+label')

# Set the layout
fig.update_layout(
title='Content Distribution by Age Rating Certification on Amazon Prime Video',
xaxis_title='Age Certification',
yaxis_title='Content on Amazon Prime Video',
barmode='stack', legend_title = 'Type of Content', title_x = 0.5
)

fig.show()

How old is the content that you are watching?

The evolution of Prime Video

Since 2015, the online streaming industry has experienced a significant surge and expansion.

history = pd.DataFrame(showtime.groupby('release_year')['type'].value_counts())
history = history.unstack().reset_index().fillna(0)
# history['total'] = history.sum(axis = 1)
history.columns = ['release_year','MOVIE','SHOW']
history = history[(history['release_year'] >= 2000) & (history['release_year'] <= 2021)]
fig3 = go.Figure()
fig3.add_trace(go.Scatter(
x=history['release_year'],
y=history['MOVIE'],
mode='lines',
name='MOVIE',
fill='tozeroy',
line=dict(color='#1b2530')
))

fig3.add_trace(go.Scatter(
x=history['release_year'],
y=history['SHOW'],
mode='lines',
name='SHOW',
fill='tozeroy',
line=dict(color='#3EB8FF')
))

# Set the layout
fig3.update_layout(
title='Content Trend on Amazon Prime Video over the Years',
xaxis_title='Release Year',
yaxis_title='Content on Amazon Prime Video', showlegend = True, title_x =0.5
)
fig3.show()

Who is Amazon targeting in your country?

Amazon’s target audience is adults in America & Europe and older kids in Asia.

Amazon Prime Video primarily targets adult audiences in America and Europe, tailoring its content offerings to cater to their preferences. However, in Asian regions, Amazon adopts a different approach by targeting older children as its audience. This strategic shift in targeting explains the composition of content on Amazon Prime Video, with R-rated content comprising the largest portion, followed by a significant presence of PG-13 content.

import plotly.express as px

total_count = demographic_data['count'].sum()
demographic_data['percentage'] = (demographic_data['count'] / total_count) * 100

fig = px.treemap(demographic_data, path=['production_country', 'target_ages'], values='percentage',
color='target_ages', color_discrete_sequence=['#1b2530', '#0f79af', '#3EB8FF', '#ffffff'])

fig.update_layout(title= "Amazon Prime Video's Country-Level Target Audience",
margin=dict(l=20, r=20, t=40, b=20), title_x = 0.5) # Adjust the margins as needed

fig.show()
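The treemap relies on a `demographic_data` table with `production_country`, `target_ages`, and `count` columns. One way such a table could be built is sketched below. The rating-to-audience mapping in `AGE_BUCKETS` and the helper name are my own assumptions for illustration, not necessarily the buckets used for the chart above.

```python
import pandas as pd

# Hypothetical mapping from age certification to an audience bucket
AGE_BUCKETS = {
    "G": "Kids", "TV-Y": "Kids", "TV-G": "Kids",
    "PG": "Older Kids", "TV-Y7": "Older Kids", "TV-PG": "Older Kids",
    "PG-13": "Teens", "TV-14": "Teens",
    "R": "Adults", "NC-17": "Adults", "TV-MA": "Adults",
}

def build_demographic_data(df):
    """Count titles per (production_country, target_ages) pair."""
    rated = df.dropna(subset=["age_certification"]).copy()
    rated["target_ages"] = rated["age_certification"].map(AGE_BUCKETS)
    rated = rated.dropna(subset=["target_ages"])  # drop ratings missing from the mapping
    return (rated.groupby(["production_country", "target_ages"])
                 .size()
                 .reset_index(name="count"))

# Toy rows (illustrative only); the unrated title is ignored
toy_titles = pd.DataFrame({
    "production_country": ["US", "US", "US", "IN", "IN"],
    "age_certification": ["R", "R", "PG", "PG", None],
})
toy_demographics = build_demographic_data(toy_titles)
print(toy_demographics)
```

The resulting frame has the three columns the treemap code expects, so the `percentage` column can be derived from `count` exactly as above.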

What are you watching on Amazon?

Can Amazon’s audience be called a sucker for dramatic content?

Upon analyzing the breakdown of content on Amazon Prime Video by country and genre, it becomes apparent that the platform offers a substantial amount of dramatic content, followed by comedy.

Interestingly, when examining specific preferences by country, it is observed that Indian audiences show a strong affinity for thriller and action genres, suggesting a preference for more intense and adrenaline-pumping content. On the other hand, British viewers exhibit a greater interest in documentaries, indicating a fascination for factual and informative programming.
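The country-level genre preferences can be read off a simple count of titles per country and genre. Below is a minimal sketch on toy data; the helper `top_genres_by_country` is hypothetical, and the `production_country` and `primary_genre` columns are assumed to match the ones used in the earlier snippets.

```python
import pandas as pd

def top_genres_by_country(df, n=2):
    """Return the n most frequent genres for each production country."""
    counts = (df.groupby(["production_country", "primary_genre"])
                .size()
                .reset_index(name="count"))
    return (counts.sort_values(["production_country", "count"],
                               ascending=[True, False])
                  .groupby("production_country")
                  .head(n)
                  .reset_index(drop=True))

# Toy rows (illustrative only)
toy_titles = pd.DataFrame({
    "production_country": ["IN"] * 6 + ["GB"] * 3,
    "primary_genre": ["thriller", "thriller", "thriller", "drama", "drama", "comedy",
                      "documentary", "documentary", "drama"],
})
top_one = top_genres_by_country(toy_titles, n=1)
print(top_one)
```

Counting titles measures what is on offer in each country, which is only a proxy for what audiences actually prefer.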

We might watch more movies but we like shows better…

Relationship between IMDb and TMDB scores

Despite movies comprising a significant portion of the content available on Amazon Prime Video, it is noteworthy that TV shows tend to receive higher ratings. This suggests that the quality and appeal of the TV shows offered on the platform are often regarded more favorably by viewers compared to the movies.

fig = px.scatter(showtime, x='imdb_score', y='tmdb_score', color='type',
color_discrete_map={'MOVIE': '#1b2530', 'SHOW': '#3EB8FF'},
hover_data=['title'])

fig.update_layout(title='IMDb Score vs TMDB Score',
xaxis_title='IMDb Score',
yaxis_title='TMDB Score',
legend_title='Type', title_x =0.5)

fig.show()
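The scatter plot shows the relationship visually; a quick numeric check is to compare mean scores by content type and measure how closely the two rating sources agree. This is a minimal sketch on toy data, with helper names of my own choosing and the column names assumed from the plot above.

```python
import pandas as pd

def score_summary(df):
    """Mean IMDb and TMDB score for movies and for shows."""
    return df.groupby("type")[["imdb_score", "tmdb_score"]].mean().round(2)

def score_correlation(df):
    """Pearson correlation between IMDb and TMDB scores, ignoring unrated titles."""
    scored = df.dropna(subset=["imdb_score", "tmdb_score"])
    return scored["imdb_score"].corr(scored["tmdb_score"])

# Toy rows (illustrative only)
toy_scores = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [6.0, 5.0, 8.0, 7.0],
    "tmdb_score": [6.2, 5.4, 8.1, 7.3],
})
summary = score_summary(toy_scores)
correlation = score_correlation(toy_scores)
print(summary)
print(round(correlation, 2))
```

If shows really are rated higher, the `SHOW` row of `score_summary(showtime)` should sit above the `MOVIE` row on both columns.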

Is it all a mind game 🤨 ?

Using a RoBERTa model from Hugging Face to categorize movies as Positive, Neutral, or Negative based on their descriptions

A RoBERTa model from Hugging Face gauges the sentiment of each title from its description. As a transformer model, RoBERTa reads words in context and tends to handle sarcasm better than VADER, a lexicon-based tool that scores sentiment from individual words.

The chart gives a sentiment-based assessment of each movie using the RoBERTa model. The model helps us understand what type of content Amazon produces and its audience enjoys.
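For readers who want to reproduce the labelling step, here is a minimal sketch of how descriptions could be scored with a Hugging Face pipeline. The checkpoint name is an assumption on my part (this article does not depend on that exact model), and the helper names and the `description` column are illustrative as well.

```python
import pandas as pd

# Assumption: a publicly available RoBERTa sentiment checkpoint, chosen for illustration
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

def build_classifier(model_name=MODEL_NAME):
    """Create a text-classification pipeline (downloads the weights on first use)."""
    from transformers import pipeline  # imported lazily; needs the transformers package
    return pipeline("sentiment-analysis", model=model_name)

def label_descriptions(df, classifier, text_col="description"):
    """Add 'sentiment' and 'sentiment_score' columns from a classifier callable."""
    texts = df[text_col].fillna("").astype(str).tolist()
    results = classifier(texts, truncation=True)  # truncate overly long descriptions
    labelled = df.copy()
    # Label names depend on the checkpoint; capitalised here to read Positive/Neutral/Negative
    labelled["sentiment"] = [r["label"].capitalize() for r in results]
    labelled["sentiment_score"] = [r["score"] for r in results]
    return labelled

# Usage on the real data (requires network access to fetch the model):
# showtime = label_descriptions(showtime, build_classifier())
```

Passing the classifier in as an argument keeps the labelling logic easy to test with a stand-in function before downloading a model.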

What’s with all the fuss: Positive, Negative and Neutral

31% of the content streaming on Amazon is categorized as Negative by the RoBERTa model based on its description.

According to the data, approximately 48% of the content on Amazon is categorized as neutral, indicating a lack of strong positive or negative sentiment. Furthermore, around 31% of the content is classified as negative, suggesting that a significant portion of the content elicits a negative sentiment or has negative connotations.

sentiment_counts = showtime['sentiment'].value_counts()

# Create the donut chart trace
fig = go.Figure(data=[go.Pie(
labels=sentiment_counts.index,
values=sentiment_counts.values,
hole=0.5, # Set the hole parameter to create a donut chart
marker=dict(colors=['#1b2530', '#0f79af', '#3EB8FF']), # Set custom colors for the slices
textinfo='label+percent', # Display labels and percentages
textposition='inside', # Set the position of the labels inside the slice
)])

# Set the layout
fig.update_layout(
title='Sentiment Distribution of Content on Amazon Prime Video',
showlegend=True, title_x = 0.5,
# Add annotations in the center of the donut pies.
annotations=[dict(text='Amazon Prime Video', x=0.50, y=0.5, font_size=15, showarrow=False)]
)
fig.show()
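The percentages shown on the donut chart can also be computed as plain numbers. A minimal sketch on toy data, assuming a `sentiment` column like the one used above; the helper name is my own.

```python
import pandas as pd

def sentiment_share(df, col="sentiment"):
    """Percentage of titles in each sentiment class, rounded to one decimal."""
    return (df[col].value_counts(normalize=True) * 100).round(1)

# Toy labels (illustrative only): 5 neutral, 3 negative, 2 positive
toy_labels = pd.DataFrame({
    "sentiment": ["Neutral"] * 5 + ["Negative"] * 3 + ["Positive"] * 2
})
share = sentiment_share(toy_labels)
print(share)
```

Running `sentiment_share(showtime)` on the labelled dataset gives the same split the chart displays.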

What kind of content are you consuming?

Sentiment Analysis based on content description on Amazon Prime Video

Genres such as thriller, crime, western, and horror often evoke a negative sentiment, judging by their content descriptions. This could be due to the nature of the themes, settings, and storylines typically associated with these genres. Elements like suspense, criminal activities, conflict, and fear are commonly depicted, which can contribute to a darker or more negative tone in the overall sentiment of the content within these genres.

genre_sentiment = showtime.groupby(['primary_genre', 'sentiment']).size().reset_index(name='count')
genre_sentiment = genre_sentiment[genre_sentiment['primary_genre']!= 'No Data']
genre_sentiment = genre_sentiment.sort_values('count', ascending = False)
colors = ['#1b2530', '#0f79af', '#3EB8FF', '#ffffff']

fig = px.sunburst(genre_sentiment, path=['primary_genre', 'sentiment'], values='count',
color_discrete_sequence=colors)

fig.update_layout(title='Genre vs Sentiments on Amazon Prime Video', title_x = 0.5)

fig.show()

Keep your sentiments in check…

Sentiments classified by Age certification ratings

It appears that a higher proportion of R-rated content on Amazon is classified as negative. This could be due to the nature of R-rated content, which often includes mature or explicit themes, intense violence, or disturbing elements. Such content may be more likely to evoke negative emotions or elicit a darker tone, leading to a higher classification of negativity.
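A row-normalised cross-tabulation is one way to compare the negative share across age ratings. This is a minimal sketch on toy data, assuming the `age_certification` and `sentiment` columns used earlier; the helper name is hypothetical.

```python
import pandas as pd

def sentiment_by_rating(df):
    """Share of each sentiment class within every age certification."""
    return pd.crosstab(df["age_certification"], df["sentiment"],
                       normalize="index").round(2)

# Toy rows (illustrative only)
toy_rated = pd.DataFrame({
    "age_certification": ["R", "R", "R", "R", "PG", "PG", "PG", "PG"],
    "sentiment": ["Negative", "Negative", "Negative", "Neutral",
                  "Neutral", "Neutral", "Positive", "Positive"],
})
rating_table = sentiment_by_rating(toy_rated)
print(rating_table)
```

Normalising by row matters here: R-rated titles are also the most numerous, so raw counts alone would overstate the difference.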

The most popular runtime for content on Amazon is approximately 90 minutes. This duration is commonly preferred by viewers, as it provides a balance between storytelling and engaging the audience without being too lengthy.
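The most common runtime can be estimated by grouping titles into fixed-width buckets and picking the fullest one. A minimal sketch on toy data, assuming a `runtime` column in minutes; the helper name and the 10-minute bucket width are my own choices.

```python
import pandas as pd

def most_common_runtime_bucket(df, width=10):
    """Return (start, end) in minutes of the most frequent runtime bucket."""
    runtimes = df["runtime"].dropna()
    bucket_starts = (runtimes // width) * width  # e.g. 92 -> 90 for width 10
    start = int(bucket_starts.value_counts().idxmax())
    return start, start + width

# Toy runtimes (illustrative only); the missing value is ignored
toy_runtimes = pd.DataFrame({"runtime": [88, 92, 95, 91, 45, 120, None]})
bucket = most_common_runtime_bucket(toy_runtimes)
print(bucket)
```

Bucketing smooths out minute-level noise, so the result is a range rather than a single value.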

IMDb and TMDb ratings for Amazon Prime Video content

IMDb rating

IMDb ratings are based on a scale of 1 to 10. They are calculated from the votes submitted by registered IMDb users, which IMDb combines into a weighted average rather than a simple mean.

TMDb rating

TMDb ratings also use a scale of 1 to 10, allowing decimal values. Similar to IMDb, TMDb ratings are determined by the votes and reviews provided by users of the platform. However, TMDb is known for having a more open rating system, where anyone can register and rate titles without any restrictions. This can result in a larger number of user ratings compared to IMDb, but it may also lead to a wider range of opinions and potentially less reliability.

Words most likely to make it to the title on Amazon Prime Video

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image # to load the mask image
from wordcloud import WordCloud, STOPWORDS

# Join all titles into one text corpus
showtime['title'] = showtime['title'].astype(str)
title_corpus = ' '.join(showtime['title'])
stopwords = set(STOPWORDS)

# Define a function to specify the text color
def amazon_color(word, font_size, position, orientation, random_state=None, **kwargs):
    return "#ffffff"

# Use the logo image as the shape of the word cloud
custom_mask = np.array(Image.open('logo.jpg'))
wc = WordCloud(
    stopwords=stopwords,
    mask=custom_mask, height=2000, width=4000, color_func=amazon_color)
# background_color='white',
wc.generate(title_corpus)

plt.figure(figsize=(16,8))
plt.imshow(wc, interpolation = 'bilinear')
plt.axis('off')
plt.show()

Most popular ACTOR on Amazon Prime

With this we come to the end of our analysis on Amazon Prime Video. I hope it was entertaining to read through the article, and that you enjoyed it thoroughly :)

Amazon excels in leveraging numbers to its advantage, targeting a wide audience base primarily in America, Europe, and India. The platform adeptly caters to movie enthusiasts by offering a diverse range of genres, which are broadly classified with neutral and negative sentiments. This positioning makes Amazon Prime Video a suitable choice for a mature audience seeking varied content options.

The project code is available on my GitHub: https://github.com/shrunalisalian/Streaming-Wars

Amazon Prime Video Dataset on Kaggle: https://www.kaggle.com/datasets/dgoenrique/amazon-prime-movies-and-tv-shows

In case you enjoyed reading this article, feel free to check out articles on Netflix, HBO Max, Disney+, Paramount+ and Apple TV+

Feel free to let me know if you have any suggestions. Thank You for reading!
