Predicting Laptop Prices Using ML

Andy Guozhen
Published in Analytics Vidhya
9 min read · Jun 18, 2021

Building an end-to-end web application that predicts the prices of laptops.

Photo by XPS on Unsplash

Initial motives

As a beginner in machine learning and data science, I wanted to put my skills to use and dig deeper into the field. After watching countless YouTube videos and taking several Coursera courses, I decided to showcase what I’ve learned by building a web application that predicts laptop prices. I’ve broken the project down into 8 parts so that other beginners (like me) can follow along. Feel free to explore the project files in my GitHub repo.

This article covers:

1. Obtaining Laptop Prices dataset

2. Basic data exploration

3. Feature Engineering

4. Exploratory Data Analysis (EDA)

5. Data Preprocessing

6. Modelling

7. Building web interface

8. Deployment of web application

Obtaining Laptop Prices Dataset

For this project, I’ve obtained my dataset (by Ionas Kelepouris) from Kaggle. This dataset contains 1300 rows of data and 12 columns (features) that we could focus on to build our prediction model.

Basic Data Exploration

After loading the dataset via Pandas, we can see a list of laptops and specs that are associated with each laptop.

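import pandas as pd

# Read the Kaggle CSV using the ISO-8859-1 encoding (plain hyphens in the name)
df = pd.read_csv('laptops.csv', encoding='ISO-8859-1')
df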

Looking at the dataset, some columns such as ScreenResolution and Cpu mix text and numbers, while other features are purely numerical or purely textual. The mixed columns will need to be parsed and engineered later.

To avoid complications and error-prone predictions, features that will not be used, such as “Unnamed: 0”, “Company” and “Product”, are removed from the dataset. (An Apple laptop running Windows, for example, is not a combination that makes sense for price prediction.)

Next, we check for missing values in the dataset, since they would cause errors during analysis or modeling later.

df.isnull().sum()
The output shows that no column has missing values.
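The two cleanup steps above (dropping unused columns, then counting missing values) can be sketched on a tiny frame. The three rows below are invented for illustration and are not taken from the Kaggle dataset; only the column names follow the article.

```python
import pandas as pd

# Tiny invented sample that mimics the layout of the raw dataset
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Company": ["Apple", "HP", "Acer"],
    "Product": ["MacBook Pro", "250 G6", "Aspire 3"],
    "Ram": ["8GB", "8GB", "4GB"],
    "Price": [1300.0, 575.0, 400.0],
})

# Drop the columns that will not be used as features
cleaned = sample.drop(columns=["Unnamed: 0", "Company", "Product"])

# Count missing values per column; all zeros means there is nothing to impute
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

If any count were non-zero, those rows would need to be dropped or imputed before modeling.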

Feature Engineering

We would now extract and reorganize our data to better understand the underlying factors that contribute to the price of laptops.

If we take a look at the ScreenResolution column, some laptops appear to have touchscreen capabilities. Since touchscreen laptops tend to be more expensive than those without, a TouchScreen feature is added to mark laptops with this capability.

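# Default every laptop to "No", then flag the ones whose resolution text mentions a touchscreen
df['TouchScreen'] = "No"
df.loc[df['ScreenResolution'].str.contains('Touchscreen'), 'TouchScreen'] = "Yes"
df.tail(10)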

We then replace each value in the ScreenResolution column with just its resolution (for example 1920x1080), extracted using regular expressions. I find regular expressions incredibly useful for extracting and filtering alphanumeric values.

# Keep only the "<width>x<height>" part of the text
df['ScreenResolution'] = df['ScreenResolution'].str.extract(r'(\d\d\d\d?x\d\d\d\d?)', expand=True)
df.tail(10)

We then apply the same process to the Cpu, Ram, and Weight features. The goal is to strip any units and words that are not essential for the analysis later.
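As a minimal sketch of the unit stripping described above, here is how Ram and Weight could be converted to numbers. It assumes values of the form "8GB" and "1.37kg"; the sample rows are invented, and the article's own Cpu simplification is not reproduced here.

```python
import pandas as pd

# Invented sample values in the assumed "<number><unit>" format
sample = pd.DataFrame({
    "Ram": ["8GB", "16GB", "4GB"],
    "Weight": ["1.37kg", "2.1kg", "1.86kg"],
})

# Strip the unit suffix and convert to a numeric dtype
sample["Ram"] = sample["Ram"].str.replace("GB", "", regex=False).astype(int)
sample["Weight"] = sample["Weight"].str.replace("kg", "", regex=False).astype(float)

print(sample.dtypes)
```

Converting to numeric dtypes at this stage means these columns can later be plotted and fed to a model without further parsing.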

Now comes the most tedious part of feature engineering: the Memory feature. On closer inspection, the Memory column contains several storage types (SSD, HDD, SSHD, and Flash Storage). We need to create 4 additional columns, one per storage type, and extract each capacity individually. Additional processing is needed for laptops with a dual configuration that uses the same storage type twice (e.g. 256GB SSD + 512GB SSD). This can be done with the same regular-expression approach shown above.

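# Pull out the HDD part of the Memory text, e.g. '500GB HDD' or '1TB HDD'
df['HDD'] = df['Memory'].str.extract(r'(\d\d\d?GB\sHDD|\dTB\sHDD|\d\.0TB\sHDD)', expand=True)
# Keep only the capacity, then express terabytes in gigabytes ('1TB' -> '1000')
df['HDD'] = df['HDD'].str.extract(r'(\d\d\d?|\dTB|\d\.0TB)', expand=True)
df['HDD'] = df['HDD'].str.replace(r'(TB|\.0TB)', '000', regex=True)
# Laptops without an HDD get a capacity of 0
df['HDD'] = df['HDD'].fillna(0)
df.head(30)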
Row 5 reflects the engineered result for HDD memory.
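For dual configurations such as 256GB SSD + 512GB SSD, one possible alternative to chained extracts is to split the text on "+" and add up the capacities per storage type. The sketch below is not the article's code: the helper name `capacity_by_type`, the storage-type labels, and the sample strings are assumptions made for illustration.

```python
import re

import pandas as pd

# Storage-type labels assumed to appear verbatim in the Memory text
MEMORY_TYPES = ["SSD", "HDD", "SSHD", "Flash Storage"]


def capacity_by_type(memory: str) -> dict:
    """Return the total capacity in GB per storage type for one Memory string."""
    totals = {name: 0 for name in MEMORY_TYPES}
    # A dual configuration lists its drives separated by "+"
    for part in memory.split("+"):
        match = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)\s+(.+)", part.strip())
        if match is None:
            continue
        size, unit, kind = match.groups()
        gigabytes = float(size) * (1000 if unit == "TB" else 1)
        kind = kind.strip()
        if kind in totals:
            totals[kind] += int(gigabytes)
    return totals


# Invented sample strings covering single and dual configurations
sample = pd.DataFrame({"Memory": [
    "256GB SSD",
    "1TB HDD",
    "256GB SSD + 512GB SSD",
    "128GB SSD + 1.0TB HDD",
    "32GB Flash Storage",
]})

capacities = pd.DataFrame(sample["Memory"].map(capacity_by_type).tolist(), index=sample.index)
result = sample.join(capacities)
print(result)
```

Summing per type means two drives of the same kind collapse into a single capacity, which matches the one-column-per-type layout described above.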

Having those memory configurations handled, I’ve decided to drop the GPU column entirely as it contains a high variability of GPUs. Intel GPUs are integrated GPUs, Nvidia GPUs are discrete while AMD GPUs are either integrated or discrete. Labeling and classifying each would take significant effort and time which may or may not contribute to the modeling process later on.

Our fully feature engineered dataset would look like this:

df

We then save the dataset into a new .csv file and prepare for the Exploratory Data Analysis process.

Exploratory Data Analysis (EDA)

Using our feature-engineered dataset, we can now plot graphs and compute tables to visualize how each feature relates to the variability of laptop prices.

Using the barplot function from seaborn (with Matplotlib handling the figure), we can test our hypotheses or initial opinions on how some features affect laptop pricing. Here’s an illustration of plotting a barplot for the feature TypeName (type of laptop):

import matplotlib.pyplot as plt
import seaborn as sns

plt.subplots(figsize=(10,5))
sns.barplot(x='TypeName',y='Price',data=df)

From the barplot above, we can confirm that, on average, workstation and gaming laptops are priced higher than other types of laptops. This is to be expected, as these laptops often have better spec configurations (better CPU, more memory, etc.) to meet the demands of professional users. Notebooks and netbooks have lower prices due to their low-powered configurations.

Laptops with more RAM also tend to be priced higher:

plt.subplots(figsize=(20,10))
sns.barplot(x='Ram',y='Price',data=df)

Barplots of screen size show no consistent price pattern:

plt.subplots(figsize=(10,10))
sns.barplot(x='Inches',y='Price',data=df)

Plotting bar graphs of the Cpu feature shows some interesting results. In general, higher-powered processors should be priced higher than lower-powered ones. Prices for Intel processors generally follow this pattern (Xeon > i7 > i5 > i3), and the same principle applies to AMD CPUs (Ryzen > A series > E series). The barplot obtained from the dataset shows otherwise.

plt.subplots(figsize=(20,10))
sns.barplot(x='Cpu',y='Price',data=df)

The plot shows Intel M series laptops having a higher average price than i3 and i5 laptops. This calls for a closer look to understand why it happens.

Using the pandas .groupby() method, we can compute the averages of the features for i5 and Intel M laptops.

standby_df = df.loc[df['Cpu'].str.contains('Intel M |Intel i5')]
# Average only the numeric columns; text columns cannot be averaged
standby_df.groupby(['Cpu']).mean(numeric_only=True)

On average, Intel M laptops have more RAM, weigh less, and have more SSD storage than i5 laptops. These features contribute to the higher prices shown for Intel M series laptops.

Data Preprocessing

In this section, we will relabel and convert categorical features into numerical features. This is essential for training, because the models used here only accept numerical inputs.

To start, we identify the features that are non-numerical (object type) and compute their cardinalities (the number of distinct categories in each feature).

# Print every column of type object, with its unique values and cardinality
for i in df.columns:
    if df[i].dtype == 'object':
        print("%-10s\n%-200s\n%-10d\n" % (i, df[i].unique(), df[i].nunique()))

Since the TouchScreen feature has only 2 categories, we can encode it with label encoding (one-hot encoding would work too). Using scikit-learn’s LabelEncoder, the values present in TouchScreen (‘No’, ‘Yes’) are encoded as 0 and 1.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(df['TouchScreen'].unique())
# Check what the encoded values would be
print(le.fit_transform(df['TouchScreen'].unique()))

Label encoding can also be applied to features with high cardinality. After label encoding the Cpu feature, the encoded values and the original categories they correspond to are recorded so that the same mapping can be used at prediction time.
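One way to record that mapping is to read it back from the fitted encoder. The sketch below uses a handful of invented Cpu labels; `cpu_mapping` is a name chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Invented Cpu labels in the simplified form used after feature engineering
cpu_values = ["Intel i5", "Intel i7", "AMD Ryzen", "Intel i3", "Intel i5"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(cpu_values)

# classes_ is sorted, and each class's position is its encoded value
cpu_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(cpu_mapping)
print(encoded.tolist())
```

Note that the codes follow alphabetical order, not processor performance, so a label-encoded Cpu column carries no built-in ranking.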

Other features with slightly lower cardinality were encoded with one-hot encoding. With pandas’ get_dummies() function, a new column is created for each category to indicate its presence.

Here’s an example illustrating the effects of one-hot-encoding on the TypeName feature:
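A minimal sketch of that effect, using four invented rows rather than the real dataset:

```python
import pandas as pd

# Invented rows; only the column names follow the article
sample = pd.DataFrame({
    "TypeName": ["Notebook", "Gaming", "Ultrabook", "Notebook"],
    "Price": [500.0, 1500.0, 1200.0, 650.0],
})

# One new 0/1 column per category; the original TypeName column is removed
encoded = pd.get_dummies(sample, columns=["TypeName"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the TypeName_* columns, marking the category it belongs to.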

After applying one-hot encoding to the TypeName and OpSys features, we use manual encoding for high-cardinality features whose categories have a known order.

We can use a Python dictionary together with the pandas .map() method to encode each category according to its magnitude/order. The code snippet shown below encodes the ScreenResolution feature in order of pixel count.

Screen_Res_dict = {'1366x768': 1, '1440x900': 2, '1600x900': 3, '1920x1080': 4, '1920x1200': 5, '2160x1440': 6,
                   '2304x1440': 7, '2256x1504': 8, '2560x1440': 9, '2400x1600': 10, '2560x1600': 11, '2880x1800': 12,
                   '3200x1800': 13, '3840x2160': 14}
OH_df['ScreenResolution'] = OH_df.ScreenResolution.map(Screen_Res_dict)
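The ranks in a dictionary like the one above could also be derived instead of typed by hand. The sketch below is an alternative, not the article's method: it multiplies width by height and ranks the distinct pixel counts, using four invented resolution values.

```python
import pandas as pd

# Invented sample of already-cleaned resolution strings
resolutions = pd.Series(["1920x1080", "1366x768", "3840x2160", "1920x1080"])

# Split "<width>x<height>" into two integer columns (labelled 0 and 1)
dims = resolutions.str.extract(r"(\d+)x(\d+)").astype(int)
pixel_count = dims[0] * dims[1]

# Dense ranking: the smallest pixel count gets 1, ties share a rank
rank = pixel_count.rank(method="dense").astype(int)
print(pd.DataFrame({"resolution": resolutions, "pixels": pixel_count, "rank": rank}))
```

Ranks computed this way depend on which resolutions are present, so the mapping would still need to be saved for use at prediction time.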

Having gone through all the necessary preprocessing steps, we can now save the new dataset into a .csv file for modeling later.

Modeling

After loading the preprocessed .csv dataset, we identify our dependent variable (Price) and store it in a separate data frame as the target.

We can then split the dataset for training and validating the models we are going to apply. Roughly a third (33%) of the data is held out to test our ML models.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.33, random_state=0)
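For completeness, here is a small self-contained sketch of separating the target from the features before splitting. The six rows are invented, and `features` is a name introduced here; the article passes its own `df` and `target`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, already-numeric sample standing in for the preprocessed dataset
data = pd.DataFrame({
    "Ram": [4, 8, 8, 16, 16, 32],
    "Weight": [2.2, 1.9, 1.4, 2.5, 1.3, 2.8],
    "Price": [400.0, 650.0, 900.0, 1400.0, 1800.0, 2600.0],
})

# Keep the dependent variable apart so it never leaks into the features
target = data["Price"]
features = data.drop(columns=["Price"])

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=0
)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, so every model below is compared on the same held-out rows.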

Next, we begin to train and validate the performance for different models. The two main metrics used for estimating the performance of our models would be the R-squared score and the Mean absolute error (MAE). In general, we want to achieve a higher R-squared score and lower MAE score with our models.
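To make the two metrics concrete, the sketch below computes them by hand on four invented price pairs and checks the result against scikit-learn. MAE is the average absolute difference between predicted and actual prices, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented actual and predicted prices
y_true = np.array([500.0, 1000.0, 1500.0, 2000.0])
y_pred = np.array([600.0, 900.0, 1400.0, 2100.0])

# MAE: average absolute error, in the same unit as the price
mae_manual = np.mean(np.abs(y_true - y_pred))

# R-squared: share of the variance in y_true explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE: %.1f" % mae_manual)
print("R2 score: %.3f" % r2_manual)
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

Here every prediction is off by 100, so the MAE is 100, while the R-squared of 0.968 reflects that those errors are small relative to the spread of the prices.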

RandomForestRegressor with different numbers of leaf nodes:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# Train a random forest with the given leaf-node limit, print its R2 score and return its MAE
def get_mae_random(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model1 = RandomForestRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model1.fit(X_train, y_train)
    model1_preds = model1.predict(X_test)
    print("R2 score: %.2f" % (r2_score(y_test, model1_preds)))
    mae = mean_absolute_error(y_test, model1_preds)
    return mae

for max_leaf_nodes in [5, 10, 20, 50, 100, 200, 300, 500]:
    mae = get_mae_random(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print("Max leaf nodes: %d \t\t MAE: %d\n" % (max_leaf_nodes, mae))

Training with DecisionTreeRegressor:

from sklearn.tree import DecisionTreeRegressor

# Same evaluation as above, using a single decision tree
def get_mae_decision(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model1 = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model1.fit(X_train, y_train)
    model1_preds = model1.predict(X_test)
    print("R2 score: %.2f" % (r2_score(y_test, model1_preds)))
    mae = mean_absolute_error(y_test, model1_preds)
    return mae

for max_leaf_nodes in [5, 10, 20, 50, 100, 200, 300, 500]:
    mae = get_mae_decision(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print("Max leaf nodes: %d \t\t MAE: %d\n" % (max_leaf_nodes, mae))

Linear Regression Model:

from sklearn.linear_model import LinearRegression

model2 = LinearRegression()
model2.fit(X_train, y_train)
preds = model2.predict(X_test)
r2score = r2_score(y_test, preds)
MAE = mean_absolute_error(y_test, preds)
print("R2 score: %.2f" % (r2score))
print("MAE: %d" % (MAE))

XGBoost Model:

from xgboost import XGBRegressor

model3 = XGBRegressor()
model3.fit(X_train, y_train)
preds3 = model3.predict(X_test)
r2score = r2_score(y_test, preds3)
MAE = mean_absolute_error(y_test, preds3)
print("R2 score: %.2f" % (r2score))
print("MAE: %d" % (MAE))

Having the highest R-squared score and lowest MAE score among all models, the XGBoost model will be selected as our final ML model. We can save the trained model by using joblib.
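Saving and reloading with joblib could look roughly like the sketch below. To keep it self-contained, a small LinearRegression on invented numbers stands in for the trained XGBoost model, and the file name `demo_model.joblib` is made up; `joblib.dump` and `joblib.load` work the same way for other scikit-learn style estimators.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on invented data (the article's final model is an XGBRegressor)
X = np.array([[4.0], [8.0], [16.0]])
y = np.array([400.0, 800.0, 1600.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    # Serialize the fitted model to disk, then load it back
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)
    restored_pred = restored.predict(X)

print(restored_pred)
```

The reloaded model should return the same predictions as the original, which is what lets the web app skip training and load the saved file at startup.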

Web App Deployment

Having literally no experience with HTML and CSS, I had to rely on freeCodeCamp’s beginner-friendly videos to construct my web app. After hours of debugging positioning errors and struggles with HTML and CSS, I’ve finally got myself a functional web interface.

With that out of the way, the backend of the web app can be handled using Flask and Python. By retrieving the HTML form data through Flask’s GET/POST handling and storing it in an array, we can load the XGBoost model we previously trained and start predicting the prices of laptops.

import joblib
import numpy as np

model = joblib.load('XGB_model')

# Predict the price of a laptop from one row of encoded form values
def predict(self, features):
    arrToPredict = np.array([features])
    self.totalPredicted = model.predict(arrToPredict)
After making sure that all the tests pass and the web app works as intended, we can push the project files to a GitHub repo and deploy the web app with Heroku. Here’s a great article by Naivedh Shah explaining the ins and outs of deploying your ML web app on Heroku.

Conclusion

I hope you’ve gained some useful insights into the basics of creating and deploying an ML web app. Don’t stress if it doesn’t work out on the first try; this project took me more than 3 months to complete :)

With that said, I’m eternally grateful to Krish Naik and freeCodeCamp for providing the necessary knowledge for me to accomplish this project. It has been (and will still be) a great pleasure learning something new from you guys.

Don’t forget to like the article if you enjoyed it :) Let’s connect on:

LinkedIn: linkedin.com/in/andy-foo-guo-zhen-791a58174

Github: https://github.com/AndyFooGuoZhen

Hi! I'm a machine learning enthusiast currently studying computer science at Iowa State University.