Analyzing Reviews of Alexa

Shreedhar Vellayaraj
Published in Geek Culture
Sep 11, 2021

In this blog, we will analyze and visualize the product reviews for Alexa available on Amazon. I found the dataset on Kaggle. You can download it from here.

The main objectives of the blog are:

  1. Importing the libraries and the data
  2. Visualizing the ratings using Matplotlib and Seaborn
  3. Creating a WordCloud
  4. Feature Engineering

Amazon Alexa, also known simply as Alexa, is a virtual assistant technology developed by Amazon, first used in the Amazon Echo smart speaker and the Echo Dot, Echo Studio, and Amazon Tap speakers developed by Amazon Lab126. The main competitor to Alexa is Google Assistant, the voice assistant behind the Google Home speakers.

Let’s dive into the dataset.

First, download the data from Kaggle. It is a .tsv (tab-separated values) file.

The only extra library that we need to install is WordCloud, which we use to see at a glance which words occur most often in our dataset.

To install WordCloud, run:

pip install wordcloud

This dataset consists of Amazon customer reviews, star ratings, review dates, product variants, and feedback for various Amazon Alexa products such as the Echo and the Echo Dot.

Our main goal is to perform sentiment analysis on the data and discover insights from the customer reviews.

Make sure to put the downloaded file amazon_alexa.tsv in the same folder as your project code.

Importing the Data

We are going to import the necessary libraries and our .tsv file into our Python code.

We use:

pandas - for data manipulation using data frames

numpy - for numerical and statistical analysis of data

matplotlib.pyplot - for data visualization

seaborn - for advanced (statistical) data visualization

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We’ll load the data using pandas and print the first five rows in the table. You can use df_alexa.tail() to print the last five rows in the table.

df_alexa = pd.read_csv('amazon_alexa.tsv', sep='\t')
df_alexa.head()
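To see what sep='\t' does without the real file, here is a minimal sketch that parses a made-up two-row sample from an in-memory string. Only the column names come from the dataset; the rows and the names sample_tsv and sample_df are invented for illustration.

```python
import io
import pandas as pd

# Made-up sample in the same tab-separated layout as amazon_alexa.tsv.
sample_tsv = (
    "rating\tdate\tvariation\tverified_reviews\tfeedback\n"
    "5\t31-Jul-18\tCharcoal Fabric\tLove my Echo!\t1\n"
    "2\t30-Jul-18\tBlack Dot\tStopped working, very disappointed\t0\n"
)

# sep='\t' tells pandas to split columns on tabs instead of commas,
# so the comma inside the second review stays part of the text.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)
print(sample_df.columns.tolist())
```

Because the fields are split on tabs, review text that contains commas does not break the column layout.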

Print the column names.

df_alexa.keys()
O/P:
Index(['rating', 'date', 'variation', 'verified_reviews', 'feedback'], dtype='object')

Let’s check out our verified_reviews column, as we’ll be doing a lot of analysis on that column.

df_alexa['verified_reviews']
O/P:
0 Love my Echo!
1 Loved it!
2 Sometimes while playing a game, you can answer...
3 I have had a lot of fun with this thing. My 4 ...
4 Music
...
3145 Perfect for kids, adults and everyone in betwe...
3146 Listening to music, searching locations, check...
3147 I do love these things, i have them running my...
3148 Only complaint I have is that the sound qualit...
3149 Good
Name: verified_reviews, Length: 3150, dtype: object

We’ve successfully imported our data. Now let’s visualize it. If the code feels a little scattered, note that at the end of every section I’ll put all the code used in that section in one place. You can skip it if you’re already clear on the code.

Complete code for the section (importing the data)

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_alexa = pd.read_csv('amazon_alexa.tsv', sep='\t')
df_alexa.head()
df_alexa.keys()
df_alexa['verified_reviews']

Visualizing the Data

In our feedback column, the positive reviews are numbered as ‘1’, and the negative reviews are numbered as ‘0’.

Print only the negative reviews

negative = df_alexa[df_alexa['feedback']==0]
negative
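Beyond listing the negative rows, it can help to know how many there are and what share of the total they make up. The sketch below shows one way to do that on a small made-up frame; toy is a hypothetical stand-in for df_alexa, and its values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_alexa with only the column we need.
toy = pd.DataFrame({"feedback": [1, 1, 0, 1, 0, 1, 1, 1]})

# Same boolean-mask filter as in the text: keep rows where feedback is 0.
toy_negative = toy[toy["feedback"] == 0]

# Count each class, then compute the fraction of negative reviews.
counts = toy["feedback"].value_counts()
negative_share = (toy["feedback"] == 0).mean()

print(counts)
print(f"negative share: {negative_share:.2%}")
```

Swapping toy for df_alexa should give the same summary for the real reviews.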

Let’s plot the positive and negative reviews using seaborn’s countplot.

sns.countplot(x='feedback', data=df_alexa)

Plotting the rating column from 1 to 5

sns.countplot(x='rating', data=df_alexa)

We can also visualize the ratings as a histogram with five bins.

df_alexa['rating'].hist(bins=5)
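To make the binning concrete, the sketch below uses NumPy's histogram function on a few made-up ratings (not taken from the dataset) to show the bin edges that bins=5 produces over the range 1 to 5. With integer ratings, each star value lands in its own bin.

```python
import numpy as np

# Made-up ratings for illustration only.
toy_ratings = np.array([1, 2, 3, 4, 5, 5, 5, 4])

# bins=5 splits the range [min, max] = [1, 5] into five equal-width intervals.
counts, edges = np.histogram(toy_ratings, bins=5)

print(edges)   # the six boundaries of the five bins
print(counts)  # how many ratings fall into each bin
```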

Let’s plot a bar chart on a 40 x 15 inch figure, with the variation on the x-axis and the rating on the y-axis.

plt.figure(figsize = (40,15))
sns.barplot(x = 'variation', y='rating', data=df_alexa, palette = 'deep')
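By default, seaborn's barplot draws the mean of the y values for each x category, so each bar here is the average rating of one variation. A minimal sketch of the same aggregation with pandas, on a made-up frame (toy and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df_alexa with two variations.
toy = pd.DataFrame({
    "variation": ["Black Dot", "Black Dot", "White Dot", "White Dot", "White Dot"],
    "rating": [5, 3, 2, 4, 3],
})

# Average rating per variation: the bar heights a default barplot would show.
mean_rating = toy.groupby("variation")["rating"].mean()

print(mean_rating)
```

Printing the numbers alongside the chart makes it easier to compare variations whose bars look almost identical.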

Complete code for the section (visualizing the data)

negative = df_alexa[df_alexa['feedback']==0]
sns.countplot(x='feedback', data=df_alexa)
sns.countplot(x='rating', data=df_alexa)
df_alexa['rating'].hist(bins=5)
plt.figure(figsize = (40,15))
sns.barplot(x = 'variation', y='rating', data=df_alexa, palette = 'deep')

Creating a WordCloud

WordCloud creates a visualization of the words in a string, drawing more frequent words in a larger size. The only requirement is that the data be in string format.

Let’s create a sample wordcloud. Remember to install wordcloud using pip install wordcloud

import matplotlib.pyplot as plt
from wordcloud import WordCloud
blog_word_cloud = 'Love Machine Learning a lot. This is a Medium blog for Machine and Deep learning'
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(blog_word_cloud))
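The idea behind the picture is word frequency. The sketch below counts words in the same sample string using only the standard library. It is a rough illustration of the counting step, not WordCloud's actual implementation, which also filters common stopwords by default.

```python
import re
from collections import Counter

blog_word_cloud = ('Love Machine Learning a lot. '
                   'This is a Medium blog for Machine and Deep learning')

# Lowercase the text and pull out runs of letters as words.
tokens = re.findall(r"[a-z]+", blog_word_cloud.lower())

# Count how often each word occurs; frequent words would be drawn larger.
word_counts = Counter(tokens)

print(word_counts.most_common(3))
```

Here 'machine' and 'learning' each occur twice, which is why they stand out in the sample cloud.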

Before applying the wordcloud to our dataset, we need to convert the reviews to a single string.

words = df_alexa['verified_reviews'].tolist()
len(words)
O/P:
3150

The output says that we have 3150 reviews in our dataset.

Let’s print out the list of reviews.

print(words)

Remember, Wordcloud can only display strings and not lists. So let’s convert the list into a string

string_from_words =" ".join(words)
len(string_from_words)
O/P:
419105

The combined string has 419,105 characters in it (len on a string counts characters, not words). We can visualize it using wordcloud to derive useful information.
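Note that len() applied to a string returns the number of characters. To count words, split the string first. A minimal sketch of the difference, using the first three reviews from the output shown earlier:

```python
# First three reviews from the verified_reviews output shown earlier.
toy_reviews = ["Love my Echo!", "Loved it!", "Music"]

joined = " ".join(toy_reviews)

num_chars = len(joined)          # characters, including spaces and punctuation
num_words = len(joined.split())  # whitespace-separated words

print(num_chars, num_words)
```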

from wordcloud import WordCloud
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(string_from_words))

We’ve successfully visualized our reviews in the form of a wordcloud. It’s pretty clear that the words ‘love’, ‘Alexa’, and ‘great’ are used a lot in the reviews, which suggests that the overall sentiment toward Alexa is positive.

Complete code for the section (creating a wordcloud)

words = df_alexa['verified_reviews'].tolist()
string_from_words =" ".join(words)
from wordcloud import WordCloud
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(string_from_words))

Feature Engineering

We can derive more insights from our data by engineering a few extra features from it, such as a word-count vector for each review or the length of each review. These let us inspect a specific review or find the longest and the shortest review. To do that we’ll use the scikit-learn (sklearn) library.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
alexa_countvectorizer = vectorizer.fit_transform(df_alexa['verified_reviews'])

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
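To make the idea concrete, here is a minimal sketch on two made-up reviews (not from the dataset). The names toy_reviews, toy_vectorizer, and toy_counts are invented for illustration. With the default settings, the text is lowercased and single-character tokens are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews for illustration.
toy_reviews = ["Love my Echo", "Love the sound, love the size"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)
print(toy_counts.toarray())  # one row per review, one column per word
```

Each row is one review and each column is one word of the vocabulary, so the second row holds a 2 in the 'love' and 'the' columns.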

word_count_array = alexa_countvectorizer.toarray()
alexa_countvectorizer.shape
O/P:
(3150, 4044)

The output (3150, 4044) says that the matrix has 3150 rows (one per review) and 4044 columns (one per unique word in the vocabulary that the count vectorizer learned).

Let’s print our first row within the array

word_count_array[0,:]
O/P:
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

We’ll visualize a row to understand it better. Let’s pick an arbitrary index, 18, and plot it.

index = 18
plt.plot(word_count_array[index, :])
df_alexa['verified_reviews'][index]

The review at index 18 says ‘We love the size of the 2nd generation echo. Still needs a little improvement on sound’.

Each spike on the chart marks a word from the vocabulary that occurs in this review, and its height is the number of times that word occurs.
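The plot shows where the non-zero counts sit but not which words they belong to. The sketch below maps the non-zero columns of one row back to their words. It uses a made-up two-review corpus, and the names vec, index_to_word, and present are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus; the second entry is loosely modeled on the review above.
toy_reviews = [
    "Love my Echo",
    "We love the size. Still needs a little improvement on sound",
]

vec = CountVectorizer()
toy_array = vec.fit_transform(toy_reviews).toarray()

# Invert the word -> column mapping so we can look words up by column index.
index_to_word = {int(idx): word for word, idx in vec.vocabulary_.items()}

# Collect every word with a non-zero count in the second review.
row = toy_array[1]
present = {index_to_word[int(i)]: int(row[i]) for i in np.nonzero(row)[0]}

print(present)
```

The same lookup on word_count_array[18] with the fitted vectorizer should list the words behind each spike in the chart.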

Let’s create a new column in our dataset to store the length (number of characters) of every review, and then visualize it.

df_alexa['length'] = df_alexa['verified_reviews'].apply(len)
df_alexa.head()

Our new column has been created. Let’s visualize the length of every review using the histogram.

df_alexa['length'].hist(bins=100)

From the above visualization, we can conclude that most reviews are under 500 characters long; however, there are some reviews that have more than 1500 characters.
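A histogram gives the overall shape; summary statistics put numbers on it. A minimal sketch on a made-up frame (toy and its three reviews are hypothetical) that summarizes the lengths and computes the share of reviews under 500 characters:

```python
import pandas as pd

# Hypothetical stand-in for df_alexa: two short reviews and one long one.
toy = pd.DataFrame({"verified_reviews": ["Good", "Love my Echo!", "x" * 600]})
toy["length"] = toy["verified_reviews"].apply(len)

# count, mean, min, quartiles, and max of the review lengths.
summary = toy["length"].describe()

# Fraction of reviews shorter than 500 characters.
share_under_500 = (toy["length"] < 500).mean()

print(summary)
print(share_under_500)
```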

Let’s look at the shortest review:

min_char = df_alexa['length'].min()
df_alexa[df_alexa['length'] == min_char]['verified_reviews'].iloc[0]
O/P:'😍'

And here is the longest review:

max_char = df_alexa['length'].max()
df_alexa[df_alexa['length'] == max_char]['verified_reviews'].iloc[0]
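An alternative to filtering on the minimum and maximum values is to ask pandas for the row labels directly with idxmin and idxmax. A minimal sketch on a made-up frame (toy is hypothetical); note that a single emoji such as 😍 counts as one character:

```python
import pandas as pd

# Hypothetical stand-in for df_alexa.
toy = pd.DataFrame({
    "verified_reviews": [
        "Love my Echo!",
        "😍",
        "Perfect for kids, adults and everyone",
    ]
})
toy["length"] = toy["verified_reviews"].str.len()

# idxmin / idxmax return the label of the first row with the min / max value.
shortest = toy.loc[toy["length"].idxmin(), "verified_reviews"]
longest = toy.loc[toy["length"].idxmax(), "verified_reviews"]

print(shortest)
print(longest)
```

Like .iloc[0] on the filtered frame, this returns only the first match when several reviews share the same length.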

Complete code for the section (feature engineering)

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
alexa_countvectorizer = vectorizer.fit_transform(df_alexa['verified_reviews'])
word_count_array = alexa_countvectorizer.toarray()
index = 18
plt.plot(word_count_array[index, :])
df_alexa['verified_reviews'][index]
df_alexa['length'] = df_alexa['verified_reviews'].apply(len)
min_char = df_alexa['length'].min()
df_alexa[df_alexa['length'] == min_char]['verified_reviews'].iloc[0]
max_char = df_alexa['length'].max()
df_alexa[df_alexa['length'] == max_char]['verified_reviews'].iloc[0]

These are some of the ways that we can use to analyze a given dataset.

You can get the complete version of the code here.
