Predicting Smartphone Prices

Jason Rahimi
INST414: Data Science Techniques
12 min read · May 15, 2024

Our report centers on a practical question in the smartphone market: “Based on a set of specifications that a consumer would want a smartphone to have, what is the price that a consumer can expect to pay for a smartphone with the features they desire and need?” The question matters because it addresses a common dilemma for consumers who want to buy a new smartphone but are unsure how much their desired features will cost.

Our primary stakeholder is the everyday consumer who is in the market for a new smartphone. This group is broad: it includes people who want the latest features for gaming as well as people who want multiple cameras for photography. Each subgroup has its own preferences and budget constraints, which makes our research useful to a wide range of buyers.

The relevance of this question stems from the rapidly evolving smartphone market, where new models are introduced frequently, each with new or improved features that can significantly affect pricing. Consumers often struggle to keep pace with these changes, which makes informed purchasing decisions difficult. Our report aims to clarify the cost associated with various smartphone specifications by analyzing how price relates to features such as the number of cameras, battery capacity, processing power, and 5G compatibility.

The actionable insight provided by our analysis is a comprehensive guide that assists consumers in understanding how different specifications influence the overall cost of a smartphone. This guide will enable consumers to identify which features are worth investing in, based on their personal needs and financial limitations, and will ultimately assist them in selecting a smartphone that offers the best value for their money.

The Data

When addressing the question of how smartphone features correlate with price, the ideal dataset would be comprehensive and detailed, capturing a range of specifications that influence consumer decisions. The key attributes that we determined essential for this analysis include the price, availability of 5G technology, processor speed, number of rear cameras, screen size, refresh rate, and availability of fast charging. Each of these factors plays a pivotal role in determining the value and appeal of a smartphone, influencing both performance and user experience.

For our study, we acquired a dataset from Kaggle, a platform that hosts a vast array of data science projects and datasets. The data originated from Smartprix, a comparison site that aggregates prices and specifications of smartphones and other electronic devices from various retailers. This dataset is particularly useful because it compiles information from multiple sources, providing a broad view of the market and the variety of smartphones available to consumers.

Data Cleaning

After obtaining the dataset, our initial exploration showed that it contained 980 rows and 25 columns covering a wide range of specifications and features. To focus our analysis on the most influential factors, we dropped columns that were redundant or less relevant to our research question. We removed the front camera specifications because most smartphones have a single front camera, so the variable carries little information. We removed the processor brand because it had little bearing on our research question and, once one-hot encoded, it made the predictions from all of our regression models less accurate. We removed the screen resolution because the wide variety of resolutions would make smartphones difficult to compare. Finally, we removed rows with null values, since null entries can bias or distort results.
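The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not our project code: the tiny dataframe and its column names (processor_brand, num_front_cameras, resolution_width, and so on) are assumptions made up for the example.

```python
import pandas as pd

# Tiny stand-in for the Smartprix data; all column names are hypothetical.
phones = pd.DataFrame({
    "price": [15999, 24999, None, 54999],
    "processor_brand": ["snapdragon", "dimensity", "exynos", "bionic"],
    "num_front_cameras": [1, 1, 1, 1],
    "resolution_width": [1080, 1080, 720, 1170],
    "battery_capacity": [5000, 4500, 5000, None],
    "num_rear_cameras": [3, 2, 2, 3],
})

# Drop the columns judged redundant or unhelpful for the research question.
dropped_columns = ["processor_brand", "num_front_cameras", "resolution_width"]
cleaned = phones.drop(columns=dropped_columns)

# Remove every row that still contains a null value.
cleaned = cleaned.dropna().reset_index(drop=True)

print(cleaned.shape)  # rows and columns left after cleaning
```

Dropping columns before calling dropna() avoids discarding rows whose only missing value sits in a column that is being removed anyway.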

Data Preprocessing

Figure: Boxplot for Price
Figure: Boxplot for Battery Capacity

We also conducted an outlier analysis to ensure that anomalous values would not skew the results. We drew boxplots for the numerical fields most likely to contain significant outliers, such as price and battery capacity, to see which outliers were present. We then used the pandas nlargest() method to find the indices of those rows and removed them.
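A minimal sketch of this outlier removal is shown below. The data, the column names, and the number of rows treated as outliers (two prices, one battery capacity) are assumptions for illustration; in practice the counts would be read off the boxplots.

```python
import pandas as pd

# Hypothetical sample with two extreme prices and one extreme battery capacity.
phones = pd.DataFrame({
    "price": [15999, 24999, 19999, 650000, 54999, 480000],
    "battery_capacity": [5000, 4500, 5000, 4400, 22000, 5000],
})

# nlargest(k) returns the k largest values; .index gives their row labels.
price_outliers = phones["price"].nlargest(2).index
battery_outliers = phones["battery_capacity"].nlargest(1).index

# Combine both sets of row labels and drop them in a single step.
rows_to_drop = price_outliers.union(battery_outliers)
trimmed = phones.drop(index=rows_to_drop)

print(trimmed)
```

Collecting the labels first and dropping once keeps the indices valid; dropping after each nlargest() call would also work, since drop() uses labels rather than positions.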

To handle categorical data in our dataset, specifically the operating system (OS) column, we used one-hot encoding. One-hot encoding transforms the OS column into several binary columns, one for each OS type. Regression models need numeric inputs, so this transformation lets them use the OS information directly, which should lead to more accurate predictions.
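As a small illustration of what one-hot encoding does to the OS column, the sketch below uses pandas' get_dummies(). The column name "os" and the sample values are assumptions for the example.

```python
import pandas as pd

# Hypothetical sample; "os" stands in for the operating system column.
phones = pd.DataFrame({
    "price": [15999, 24999, 89999],
    "os": ["android", "android", "ios"],
})

# Replace the "os" column with one 0/1 column per distinct OS value.
encoded = pd.get_dummies(phones, columns=["os"], dtype=int)

print(encoded)
```

Each row has a 1 in exactly one of the new os_* columns, so no ordering between operating systems is implied.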

With cleaning and preprocessing complete, our dataset offers a solid basis for our analysis. We can now examine how each feature affects the pricing of smartphones and construct predictive models that estimate price points based on desired features, offering consumers insight into the cost-effectiveness of their potential purchases.

In addition to supporting our initial query, this dataset lays the groundwork for future explorations into trends and shifts within the smartphone market. Such analyses could provide forecasts on how prices might evolve as new technologies emerge, aiding consumers and industry stakeholders in anticipating changes and making informed decisions. This insight is particularly valuable in an industry known for its rapid innovation and frequent product launches.

By leveraging this data, our study aims not only to illuminate current pricing strategies but also to empower consumers to make choices that align with exactly what they are looking for. Through this research, we contribute to a more transparent marketplace where consumers can readily assess the trade-offs between different smartphone features and their associated costs.

Methods

In our project, we used a selection of statistical methods and machine learning models taught in modules 6 and 7 of our course. We chose methods that fit our research question, which focuses on how specific features affect smartphone pricing. Below, we explain the rationale behind each method and how it was applied in our study.

The linear regression model, a staple in predictive analytics, was one of our primary analytical tools. This model assumes a direct linear relationship between independent variables (such as screen size, processor speed, and number of cameras) and the dependent variable, which in our case was the price of the smartphone. We selected linear regression for its simplicity and efficiency in estimating relationships where changes in one or more independent variables are expected to directly influence the dependent variable. This allowed us to establish a baseline understanding of how each feature contributes to pricing, providing clear, interpretable insights into which features are most valuable to consumers. Furthermore, this model served as the foundation for our analysis, helping us to quickly identify trends and assess the strength of correlations between features and pricing.

Logistic regression is a classification algorithm: it estimates the probability that an observation belongs to a class and is best suited to binary outcomes. It is not designed for continuous targets. When it is fit directly on prices, it treats each distinct price in the training data as a separate class, so it can only predict a price it has already seen. We included it anyway because we wanted to test how it performs when used to predict smartphone prices.

To further improve our accuracy and manage the complex interplay of features in our data, we employed the random forest regression model. This ensemble method builds multiple decision trees and combines their results (for regression, by averaging them) to provide a more accurate and stable prediction. Random forests are known for being robust against overfitting, which made the model a good fit for our data. It was particularly useful for identifying the most influential features in smartphone pricing, offering insights beyond what simpler models could provide. It also helped us understand how combinations of features interact to influence pricing, and it can capture the non-linear relationships that are often present in datasets like ours.

We assessed the accuracy of our regression models using root mean squared error (RMSE) and mean absolute error (MAE), two performance metrics from module 7 of our course. RMSE is the square root of the average squared difference between predicted and actual values; because the errors are squared, large misses weigh more heavily. MAE is the average of the absolute errors, a straightforward measure of error magnitude that ignores direction. Both metrics are expressed in the same unit as the price, and lower values indicate more accurate predictions. We used them to compare our models and to tune them.
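To make the two metrics concrete, here is a small worked example with NumPy. The three actual and predicted prices are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up actual and predicted prices for three phones.
actual = np.array([10000.0, 20000.0, 30000.0])
predicted = np.array([12000.0, 18000.0, 34000.0])

errors = predicted - actual  # 2000, -2000, 4000

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

print(round(mae, 2))   # 2666.67
print(round(rmse, 2))  # 2828.43
```

RMSE comes out higher than MAE here because the single 4,000 error is weighted more heavily once it is squared. scikit-learn provides the same calculations through mean_absolute_error() and mean_squared_error() in sklearn.metrics.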

The application of these methods provided a rigorous analytical framework that adhered to the academic standards of our course while addressing the practical needs of smartphone consumers. By integrating these methods, we were able to offer another way for consumers to make informed decisions when buying smartphones.

Analysis

Figure: Linear Regression Line Plot

To answer the question we posed, we first split our dataset into train and test sets using train_test_split(). We then developed our first model, a linear regression model. After training the model and predicting prices for the test set, we found an RMSE of 14283.87 and an MAE of 9321.46, both in rupees, the unit of the price column. We believe these values show that the model performed fairly well. We also put the test set prices and the predicted prices into a dataframe and plotted each row on a line plot to further examine the accuracy of the predictions. A few predicted prices are not close to the actual prices, but the predictions generally follow the prices in the test set.
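The workflow described above can be sketched with scikit-learn as follows. The synthetic data, the two feature names, the 80/20 split, and the random seeds are all assumptions for illustration; they are not taken from our dataset or notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "ram_capacity": rng.integers(4, 17, n),
    "screen_size": rng.uniform(6.0, 7.0, n),
})
price = (3000 * features["ram_capacity"]
         + 10000 * features["screen_size"]
         + rng.normal(0, 2000, n))

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    features, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predicted = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, predicted))
mae = mean_absolute_error(y_test, predicted)

# Side-by-side table of actual and predicted prices, ready for a line plot.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": predicted})
print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}")
```

Calling comparison.plot() would draw both series on one line plot, similar to the figure in this section.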

Figure: Logistic Regression Line Plot

The second supervised learning model we developed was a logistic regression model. Before training the model, we changed the “solver” hyperparameter, which determines the optimization algorithm used during training, from its default value of ‘lbfgs’ to ‘liblinear’, a solver that is considered a good choice for small datasets such as ours. After training the model and predicting prices for the test set, we found an RMSE of 15854.39 and an MAE of 8407.26. Compared with the linear regression model, the RMSE is higher but the MAE is lower, which suggests broadly similar accuracy with a few larger misses. Plotting the predicted and actual prices on a line plot supports this: the model’s predictions follow the actual prices fairly closely.
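The sketch below illustrates how a logistic regression model behaves when the target is a price. It is a toy example under stated assumptions: the data is made up, and the OneVsRestClassifier wrapper is our addition for the sketch (it makes the one-binary-model-per-price scheme explicit) rather than something described in the analysis above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up training data: [RAM in GB, storage in GB] and integer prices.
X_train = np.array([[4, 64], [4, 64], [6, 128], [6, 128], [8, 256], [8, 256]])
y_train = np.array([9999, 9999, 17999, 17999, 29999, 29999])

# Two new phones, neither of which matches a training row exactly.
X_new = np.array([[5, 96], [12, 512]])

# Each distinct price becomes a class label; one binary model is fit per class.
clf = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clf.fit(X_train, y_train)
predicted = clf.predict(X_new)

# Every prediction is one of the prices seen during training.
print(predicted)
print(set(predicted.tolist()) <= set(y_train.tolist()))
```

Because the model chooses among known price labels instead of interpolating, its predictions for unusual phones can land far from the true price, which is one possible explanation for the larger misses noted above.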

Figure: Random Forest Regression Model Accuracy Line Plot

The third supervised learning model we developed was a random forest regression model. To find the max_depth value likely to give the most accurate predictions, we wrote a for loop that tests the model with max_depth set to each value from 2 to 40. We stored the resulting RMSE and MAE values in a dataframe and visualized them on a line plot. The plot showed that the model’s performance changed very little once max_depth was greater than 6.

Figure: Random Forest Regression Line Plot

To pick the best value for max_depth, we wrote code that dynamically selects the max_depth value that gives the model the lowest RMSE. For the visualizations, the selected max_depth value was 12; with it, the model’s RMSE was 9961.52 and its MAE was 5402.94. The accuracy of the model’s predictions is also shown on the line plot. Overall, the random forest regression model was the most accurate, with the lowest RMSE and MAE of the three models.
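A minimal sketch of the max_depth search described in this section is shown below. The synthetic data, feature names, number of trees, and random seeds are assumptions for illustration, so the selected depth will not match the value of 12 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "ram_capacity": rng.integers(4, 17, n),
    "refresh_rate": rng.choice([60, 90, 120, 144], n),
})
price = (3000 * features["ram_capacity"]
         + 100 * features["refresh_rate"]
         + rng.normal(0, 2000, n))

X_train, X_test, y_train, y_test = train_test_split(
    features, price, test_size=0.2, random_state=42
)

# Fit one forest per max_depth value from 2 to 40 and record its errors.
rows = []
for depth in range(2, 41):
    forest = RandomForestRegressor(n_estimators=20, max_depth=depth, random_state=0)
    forest.fit(X_train, y_train)
    predicted = forest.predict(X_test)
    rows.append({
        "max_depth": depth,
        "rmse": np.sqrt(mean_squared_error(y_test, predicted)),
        "mae": mean_absolute_error(y_test, predicted),
    })
results = pd.DataFrame(rows)

# Select the depth with the lowest RMSE.
best_depth = int(results.loc[results["rmse"].idxmin(), "max_depth"])
print(best_depth)
```

One design note: choosing max_depth by its score on the test set tends to make the reported error slightly optimistic. A separate validation split or cross-validation would give a cleaner estimate.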

Insights

To understand how the predictions made by the models can influence a consumer’s decision on what type of smartphone to buy, we developed two dataframes: one that represents a smartphone used primarily for gaming and another that represents a smartphone used primarily for photography.

The gaming smartphone was specified as follows: it has 5G, no NFC, and no IR blaster; an 8-core processor running at 3.6 GHz; a 6,000 mAh battery with fast charging; 16GB of RAM and 500GB of storage; a 6.8-inch screen with a 240Hz refresh rate; 1 rear camera; and the Android operating system. We consider these the optimal specifications for a gaming smartphone because they cover the features that matter most for gaming: high battery capacity, a fast processor, a large screen, and a fast refresh rate. Each of our models predicted a different price for this phone. The linear regression model predicted 80872.25 rupees, the logistic regression model predicted 59999 rupees, and the random forest regression model predicted 96705.03 rupees. Given the performance of this smartphone, we think the random forest prediction is likely the best of the three and that a consumer can expect to pay around 96705.03 rupees for this type of smartphone.

The photography smartphone was specified as follows: it has 5G, no NFC, and no IR blaster; an 8-core processor running at 3.0 GHz; a 5,000 mAh battery with fast charging; 12GB of RAM and 1,000GB of storage; a 6.8-inch screen with a 60Hz refresh rate; 3 rear cameras; and the Android operating system. We consider these the optimal specifications for a photography smartphone because they provide enough processing power for computational photography, which is used to enhance images, along with a large amount of internal storage and 3 rear cameras. Each of our models predicted a different price for this phone. The linear regression model predicted 149643.32 rupees, the logistic regression model predicted 214990 rupees, and the random forest regression model predicted 112693.39 rupees. Given this phone’s capabilities, we think the random forest prediction is likely the best of the three and that a consumer can expect to pay around 112693.39 rupees for this type of smartphone.
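The sketch below shows one way to turn the two phone profiles into single-row inputs for a trained model. It uses synthetic training data and only a subset of the specifications, with hypothetical column names, so the predicted values are not comparable to the prices reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data; column names are hypothetical.
rng = np.random.default_rng(1)
n = 300
train = pd.DataFrame({
    "has_5g": rng.integers(0, 2, n),
    "processor_speed": rng.uniform(1.8, 3.6, n),
    "ram_capacity": rng.choice([4, 6, 8, 12, 16], n),
    "refresh_rate": rng.choice([60, 90, 120, 144, 240], n),
    "num_rear_cameras": rng.integers(1, 5, n),
})
price = (8000 * train["has_5g"]
         + 9000 * train["processor_speed"]
         + 2500 * train["ram_capacity"]
         + 60 * train["refresh_rate"]
         + 3000 * train["num_rear_cameras"]
         + rng.normal(0, 3000, n))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(train, price)

# One row per phone profile, using the specifications from the text.
profiles = pd.DataFrame(
    [
        {"has_5g": 1, "processor_speed": 3.6, "ram_capacity": 16,
         "refresh_rate": 240, "num_rear_cameras": 1},
        {"has_5g": 1, "processor_speed": 3.0, "ram_capacity": 12,
         "refresh_rate": 60, "num_rear_cameras": 3},
    ],
    index=["gaming", "photography"],
)

# Keep the columns in the same order the model was trained on.
profiles = profiles[train.columns]

predicted = pd.Series(model.predict(profiles), index=profiles.index)
print(predicted.round(2))
```

Matching the column order to the training data matters because the model identifies features by their column name and position; a missing or reordered column would either raise an error or silently mix up the inputs.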

Figure: Predicted Smartphone Type Prices Using Linear Regression

The bar plot above represents the prices predicted by the linear regression model. There is a noticeable difference between the two smartphone types: the photography phone is predicted to cost roughly 69,000 rupees more than the gaming phone.

Figure: Predicted Smartphone Type Prices Using Logistic Regression

The bar plot above represents the prices predicted by the logistic regression model. The gap between the two smartphones is larger here than with the linear regression model, at roughly 155,000 rupees.

Figure: Predicted Smartphone Type Prices Using Random Forest Regression

The bar plot above represents the prices predicted by the random forest regression model. The gap between the two predicted prices, roughly 16,000 rupees, is much smaller than with the linear regression and logistic regression models.

Limitations

Although our analysis uses well-established models, it is held back by several limitations. One is the limited set of features in the dataset; the models could be developed further with more of them. For example, a “material” feature could capture whether each smartphone is built from plastic or aluminum, since the cost of materials likely correlates with the price of a smartphone. A rating for camera quality would also be useful: the number of rear cameras says nothing about the quality of each camera, and higher quality cameras could correlate with a higher price. Another limitation is that the “price” feature is expressed in rupees rather than U.S. dollars, so the predicted prices must be converted before they are meaningful to consumers who pay in dollars.

Code for our analysis:

https://github.com/JasonRahimi2/INST414_Project

The dataset used for our analysis:

Appendix:

Figure: Appendix
