Published in Analytics Vidhya

Get Rid of Fake Reviews with a Simple Technique!!

Photo by visuals on Unsplash

With the growing demand for online shopping, reviews have become a key part of our lives. Most of us check ratings and read reviews before buying a product online. For almost any product there are at least two competing brands, and that is where competition comes in.

Some parties misuse reviews to boost a product's popularity or to discredit a competing product by posting “fake reviews”, also known as “spam reviews”.

One may consider similar reviews as fake if

a. they are from different users on the same product

b. they are from the same user on different products

c. they are from different users on different products.


Any similarity metric, such as cosine similarity, can be used to find similar reviews. We then check the user ID and product ID of each pair of similar reviews.
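To make the metric concrete, here is a minimal, library-free sketch of cosine similarity on word counts. The function name `cosine_similarity_text` and the three sample reviews are made up for illustration and are not part of the original code.

```python
import math
from collections import Counter


def cosine_similarity_text(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample reviews
r1 = "battery life is amazing and lasts all day"
r2 = "battery life is amazing and lasts all week"
r3 = "the strap broke after two days"

print(round(cosine_similarity_text(r1, r2), 3))  # 0.875 (7 of 8 words shared)
print(round(cosine_similarity_text(r1, r3), 3))  # 0.0 (no words shared)
```

The first pair differs by a single word and scores close to 1, which is the kind of pair a threshold is meant to catch.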

The procedure is as follows:

  1. Remove the stop words.
  2. Remove words that are common in online shopping, such as bad, good, price, product, etc.
  3. Obtain root words using any stemming or lemmatization technique.
  4. Calculate the similarity score using cosine similarity.
  5. If the score is greater than a threshold, check whether the two reviews are from the same user or on the same product. If either condition is met, consider the review fake.


We will implement the above steps on the Amazon reviews dataset.

Perform stemming and remove stop words.

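from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
# Assumes each review is a plain string: split it, drop stop words,
# stem what is left, and rejoin so CountVectorizer can consume it
data['Review'] = data['Review'].apply(
    lambda x: ' '.join(ps.stem(y) for y in x.lower().split() if y not in stop_words)
)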
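Step 2 of the procedure (removing words that are common in online shopping) is not covered by the snippet above. A minimal sketch of that step, assuming whitespace tokenisation, could look like this. `DOMAIN_WORDS` and `remove_domain_words` are hypothetical names, and the word list only repeats the examples given in the procedure.

```python
# Hypothetical list: extend it with whatever words dominate your own dataset
DOMAIN_WORDS = {"good", "bad", "price", "product"}


def remove_domain_words(review: str, domain_words=DOMAIN_WORDS) -> str:
    """Drop shopping-specific filler words from a whitespace-tokenised review."""
    kept = [w for w in review.lower().split() if w not in domain_words]
    return " ".join(kept)


print(remove_domain_words("Good product but the zipper broke"))
# prints: but the zipper broke
```

On a data frame this could be applied with `data['Review'].apply(remove_domain_words)` before stemming, so that the list can hold plain, unstemmed words.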

Calculate cosine similarity.

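import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(stop_words='english')
tfd = cv.fit_transform(data.Review)  # sparse document-term count matrix
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
new_pd = pd.DataFrame(tfd.toarray(), columns=cv.get_feature_names_out())
matrix = cosine_similarity(new_pd)  # matrix[i][j]: similarity of reviews i and j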

Check the threshold and the required conditions.

for i in range(len(matrix)):
    for j in range(i):
        if(i!=j and matrix[i][j]>0.7 and raw_data.iloc[i,:]['PRODUCT_CATEGORY'] == raw_data.iloc[j,:].


Remove these rows from the data frame.
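Since the loop above is incomplete as shown, here is a minimal, library-free sketch of the check and the removal, following step 5 of the procedure. The keys `user_id` and `product_id`, the function name `flag_fake_reviews`, and the toy data are assumptions for illustration. Keeping the earlier review of each pair and dropping the later one is also a choice made here, not something the article specifies.

```python
def flag_fake_reviews(reviews, matrix, threshold=0.7):
    """Return indices of reviews that are near-duplicates of an earlier
    review and share its user or its product."""
    flagged = set()
    for i in range(len(reviews)):
        for j in range(i):  # j < i, so each pair is visited once
            if matrix[i][j] <= threshold:
                continue
            same_user = reviews[i]["user_id"] == reviews[j]["user_id"]
            same_product = reviews[i]["product_id"] == reviews[j]["product_id"]
            if same_user or same_product:
                flagged.add(i)  # assumption: keep earlier review j, drop i
    return flagged


# Hypothetical toy data
reviews = [
    {"user_id": "u1", "product_id": "p1"},
    {"user_id": "u2", "product_id": "p1"},
    {"user_id": "u3", "product_id": "p2"},
]
# Toy similarity matrix; in practice this comes from cosine_similarity
matrix = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.1],
    [0.8, 0.1, 1.0],
]

fake = flag_fake_reviews(reviews, matrix)
clean = [r for k, r in enumerate(reviews) if k not in fake]
print(sorted(fake))  # [1]
```

Review 1 is flagged because it is similar to review 0 and is on the same product. Review 2 is also similar to review 0, but it shares neither user nor product, so it is kept. With a pandas data frame, the flagged positions could be dropped with something like `raw_data.drop(index=raw_data.index[sorted(fake)])`.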



Removing fake reviews can improve sentiment analysis. It can also improve the results of topic modelling.

Complete Code

I have applied the above technique to the “Amazon Reviews” dataset and performed sentiment analysis and topic modelling. The full code can be accessed here.




Sai Durga Mahesh



Using Data Science to provide better solutions to real world problems