Predicting Monitor Prices with Data Scraped from Amazon.com

Ozan ERTEK
İstanbul Data Science Academy
5 min read · Dec 29, 2021
Photos by amazon.com and Wallpapersden.com

In this project, we used web scraping (BeautifulSoup) to collect the data. The aim is to predict a monitor's price from features such as its rating, its specific uses, and its brand. We pulled these features with BeautifulSoup and predicted prices with linear regression. You can visit the project repository HERE. The methodology of the project is given below.

  • Web scraping from Amazon.com (filtering by a minimum price so that only monitors are returned).
  • Defining the features.
  • Figuring out what to do with NaN values and monitor brands (adding dummy variables).
  • Choosing the best regression model.

For each step, I will explain what I did and why.

1) Web Scraping and Defining Features

For the first step, we need Python libraries such as NumPy, pandas, BeautifulSoup, and requests.

We can install them from a Jupyter notebook with the following command:

!pip install numpy pandas beautifulsoup4 requests

So, we now have all the libraries we need.

First, we open the URL of the results page that lists the product URLs ( www.amazon.com/…. )

Then, we need to build the URLs of all the results pages.
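The getAndParseURL helper used throughout the scraping code is never shown in the post; a minimal sketch of what it might look like, assuming it simply wraps requests and BeautifulSoup (the User-Agent header is my addition, since Amazon tends to reject requests without a browser-like one):

```python
import requests
from bs4 import BeautifulSoup

def getAndParseURL(url):
    # A desktop User-Agent makes Amazon less likely to block the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=headers)
    # Parse the raw HTML into a navigable BeautifulSoup tree.
    return BeautifulSoup(response.text, 'html.parser')
```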

html = getAndParseURL('https://www.amazon.com.tr/s?k=monitör&i=computers&rh=n%3A12601904031%2Cp_36%3A100000-&__mk_tr_TR=ÅMÅŽÕÑ&qid=1639244193&rnid=13736708031&ref=sr_nr_p_36_4')
page_links = []
for i in range(70):
    links = ('https://www.amazon.com.tr/s?k=monitör&i=computers&rh=n%3A12601904031%2Cp_36%3A100000-&page='
             + str(i + 1)
             + '&__mk_tr_TR=ÅMÅŽÕÑ&qid=1639244253&rnid=13736708031&ref=sr_pg_'
             + str(i + 1))
    page_links.append(links)

With this code, we build all 70 page URLs so that we can find every product link.

Next, we pull the product links from each of those pages.

products_link = []
for page in page_links:
    html = getAndParseURL(page)

    for b in html.find_all('a', {'class': 'a-link-normal a-text-normal'}):
        a = 'https://www.amazon.com.tr' + b['href']
        products_link.append(a)

So we now have all the product links. Just one step is left to get the product features: finding each feature's location in the HTML.

import time

productlist = []
for link in products_link:
    plink2 = getAndParseURL(link)
    results = plink2.find_all('div', {'id': 'centerCol'})

    for item in results:
        try:
            j = item.find('span', {'class': 'a-offscreen'}).text.replace('TL', '').replace('\xa0', '')
        except:
            j = ''
        try:
            r = item.find('span', {'id': 'acrCustomerReviewText'}).text
        except:
            r = ''
        try:
            y = item.find('td', {'class': 'a-span9'}).text
        except:
            y = ''
        try:
            h = item.find('a', {'id': 'bylineInfo'}).text
        except:
            h = ''
        time.sleep(5)  # pause between requests so we do not hammer the site
        product = {
            'prod_name': item.find('span', {'id': 'productTitle'}).text,
            'prod_comp': h,
            'prod_price': j,
            'prod_rating': r,
            'prod_feature': y,
        }
        productlist.append(product)

These lines collect all the product features, which we then save to a CSV file.

The function below creates the DataFrame and writes it out.

def output(productlist):
    productsdf = pd.DataFrame(productlist)
    productsdf.to_csv('output1.csv', index=False)
    print('Saved to CSV')

Our dataset is now ready to use. After the web-scraping process, it looks as below.
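As a quick sanity check, the saved CSV can be reloaded with pandas; a sketch using two made-up rows in the same shape the scraper produces (all values here are hypothetical):

```python
import pandas as pd

# Two made-up rows in the same shape the scraper produces (hypothetical values).
productlist = [
    {'prod_name': 'Monitor A', 'prod_comp': 'Brand: ASUS',
     'prod_price': '1.499,00', 'prod_rating': '85', 'prod_feature': 'Gaming'},
    {'prod_name': 'Monitor B', 'prod_comp': 'Brand: Viewsonic',
     'prod_price': '2.250,50', 'prod_rating': '40', 'prod_feature': 'Work'},
]
pd.DataFrame(productlist).to_csv('output1.csv', index=False)

# Reload and inspect the dataset.
df = pd.read_csv('output1.csv')
print(df.shape)
print(list(df.columns))
```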

2) Figuring Out What to Do with NaN Values and Monitor Brands (Adding Dummy Variables)

Our dataset has many NaN values, as well as distinct values with the same meaning (such as "Gaming" and "Gaming.", or "Work" and "Work, Education" in the same column). We have to find a solution for this, and that solution is to reduce the Specific Uses column (prod_feature) to a small set of unique values such as Gaming, Personal, Education, and Work. The code looks like this:

for i in range(len(df['prod_feature'])):
    if df['prod_feature'][i] == 'Video Editing':
        df['prod_feature'][i] = 'Video processing'
    elif df['prod_feature'][i] == '76 Hz, 70 Hz':
        df['prod_feature'][i] = 'Gaming'
    elif df['prod_feature'][i] == 'VGA, HDMI':
        df['prod_feature'][i] = 'Gaming'
    elif df['prod_feature'][i] == '24 İnç':
        df['prod_feature'][i] = 'Personal'

…and so on for all the unwanted unique values. The second step is to fill the NaN values in the prod_price and prod_rating columns. First we have to convert the dtype to float, because after scraping every column is of type object.

df['prod_price'] = df['prod_price'].str.replace('.', '', regex=False)   # drop the thousands separator
df['prod_price'] = df['prod_price'].str.replace(',', '.', regex=False)  # decimal comma -> decimal point
df['prod_price'] = df['prod_price'].astype(float)

Then we fill the missing prices with random values centered around the column mean (≈ 4500), which is more informative than filling with a single constant. For prod_rating we fill NaNs with its mean value, 112.

random = np.random.randint(3000, 6000, size=2000)
random = pd.Series(random)  # mean value ≈ 4500
df['prod_price'] = df['prod_price'].fillna(random)
df['prod_rating'] = df['prod_rating'].fillna(112)
# treat extreme prices (> 8000) as outliers and replace them with the mean
df['prod_price'] = df['prod_price'].apply(lambda x: 4500 if x > 8000 else x)

The third step is to redesign the brand column: if a brand appears only a few times in the dataset, it should be relabelled as "Brand : Other".

For this we need two queries:

query = df['prod_comp'].unique()
query2 = df['prod_comp'].value_counts()
for i in range(len(df)):
    brand = df['prod_comp'][i]
    # value_counts() is indexed by brand name, so we can look up the count directly
    if query2[brand] < 3:
        df['prod_comp'][i] = 'Brand : Other'

With this step, any brand that has fewer than three products is relabelled as Brand : Other.
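The same relabelling can be done without loops by mapping each row to its brand's frequency; a sketch on a hypothetical prod_comp column:

```python
import pandas as pd

# Hypothetical brand column in the same shape as prod_comp.
df = pd.DataFrame({'prod_comp': ['Brand: ASUS'] * 5
                               + ['Brand: Viewsonic'] * 4
                               + ['Brand: Rare'] * 2})

# Map every row to how often its brand occurs, then relabel the rare brands.
counts = df['prod_comp'].map(df['prod_comp'].value_counts())
df.loc[counts < 3, 'prod_comp'] = 'Brand : Other'
print(df['prod_comp'].value_counts())
```

This avoids the chained-assignment warnings that row-by-row updates trigger in pandas.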

So our dataset is ready for machine learning. Now:

3) Choosing the Best Regression Model

First, we create dummy variables for all the categorical columns:

df = pd.get_dummies(df, columns=['prod_comp', 'prod_feature'])

Now we choose X and y (the independent and dependent variables for the linear regression model):

X = df.iloc[:,2:33]
y = df.iloc[:,1]

The X variables are all the dummy columns; y is the product price (the target).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

y_pred = regressor.predict(X_test)
y_pred

df_pred = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_pred.head()
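The r² score on the held-out test set can be computed directly with regressor.score; a sketch on synthetic data of the same size (n = 320), since the scraped dataset is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded features (31 columns, n = 320).
rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(320, 31)).astype(float)
y = X @ rng.normal(size=31) + rng.normal(scale=2.0, size=320)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
regressor = LinearRegression().fit(X_train, y_train)

# score() returns the coefficient of determination (r^2) on held-out data.
r2 = regressor.score(X_test, y_test)
print(round(r2, 2))
```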

At this point our model is quite bad (r² = 0.13 with all independent variables), so we need something more powerful. Let's look at the graph first.

plt.figure(figsize=(25, 15))
sns.set(font_scale=3)
sns.regplot(x=y_test, y=y_pred)
plt.xlabel('Actual_Values')
plt.ylabel('Predicted_Values')
plt.title('Actual_Values vs Predicted_Values')

In this graph, we can see that our features and dataset (n = 320) are not sufficient: the spread of the values is too high.

So we need to select the independent variables that correlate best with price.

Therefore, we should look at a heatmap of the features:

df_corr = df.corr(method='spearman')
df_corr.head()
df_brand = df_corr.iloc[:, 0:21]
df_feature = df_corr.iloc[:, 21:31]  # the price and rating columns are cut out
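The heatmap itself can be drawn with seaborn; a sketch on a small hypothetical numeric frame, using the same Spearman correlation as above:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical numeric frame standing in for the dummy-encoded dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=['prod_price', 'prod_rating',
                           'prod_feature_Gaming', 'prod_feature_Work'])

df_corr = df.corr(method='spearman')
plt.figure(figsize=(8, 6))
sns.heatmap(df_corr, annot=True, cmap='coolwarm', center=0)
plt.title('Spearman correlation of features')
plt.tight_layout()
```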

From the correlation matrix, we choose as X the columns with the best correlation with price for our regression model:

X_new = df.loc[:, ['prod_comp_Brand: ASUS', 'prod_comp_Brand: Viewsonic',
                   'prod_feature_Multimedia', 'prod_feature_Gaming', 'prod_feature_Work']]
y = df.iloc[:, 1]

After selecting these features, the best model turned out to be OLS:

r² = 0.45

The F-statistic is significant.

The t-values for all independent variables are significant (p < 0.05 for every coefficient).

By the way, I tried other linear methods as well, but none gave a better solution. All the methods I tried are on my GitHub HERE. Thank you for reading.

Hope to see you again in my next article…

Ozan ERTEK
