US University Rank Regression

Dominic Graziano
INST414: Data Science Techniques
2 min read · Dec 3, 2023

Data and Collection

This project builds a regression model to predict a US university's rank from attributes such as tuition and enrollment. I found a dataset on Kaggle that includes each institution's name, rank, yearly tuition, and enrollment numbers. A regression like this would help rank universities from known data, and rank in turn makes some colleges more marketable and appealing than others. The code is stored in a Jupyter Notebook, and I used Pandas, NumPy, and Scikit-Learn to run the regression in Python.

Data Cleaning

This dataset needed little cleaning. The main preparation step was splitting the dataframe into x (the input attributes, tuition and enrollment) and y (the target attribute, rank) and converting both to NumPy arrays.
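The split into x and y can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`institution`, `rank`, `tuition`, `enrollment`) and every value in the toy dataframe are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle dataset.
# Column names and values are invented for illustration only.
df = pd.DataFrame({
    "institution": ["College A", "College B", "College C", "College D"],
    "rank": [1, 2, 3, 4],
    "tuition": [60000, 55000, 40000, 30000],
    "enrollment": [7000, 12000, 30000, 45000],
})

# x holds the input attributes, y holds the target (rank).
feature_cols = ["tuition", "enrollment"]
X = df[feature_cols].to_numpy()
y = df["rank"].to_numpy()

print(X.shape, y.shape)
```

Keeping the feature names in a list makes it easy to add more attributes later without touching the rest of the code.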

Analysis

After splitting the data into x and y, I split each further into training and test sets. I found that a test size of 0.175 with a random state of 0 gave the most accurate regression. With the model fit, I generated predicted values from the input data and added them to the dataframe as a predicted rank column. I also created a new column containing the difference between each college's predicted and actual rank.
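A rough sketch of these steps is below. The post does not say which regressor was used, so `LinearRegression` is an assumption here, as are the column names and the randomly generated stand-in data. Only the `test_size` and `random_state` values come from the write-up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset); column names are hypothetical.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "rank": np.arange(1, n + 1),
    "tuition": rng.integers(10_000, 60_000, size=n),
    "enrollment": rng.integers(1_000, 50_000, size=n),
})

X = df[["tuition", "enrollment"]].to_numpy()
y = df["rank"].to_numpy()

# Split settings taken from the post: test size 0.175, random state 0.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.175, random_state=0
)

# Assumed model: ordinary least-squares linear regression.
model = LinearRegression().fit(X_train, y_train)

# Predict a rank for every row and store it next to the actual rank.
df["predicted_rank"] = model.predict(X)
df["rank_difference"] = df["predicted_rank"] - df["rank"]

print(df.head())
```

A positive `rank_difference` means the model placed a college lower in the ranking (a larger rank number) than its actual position.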

Limitations

The main limitation is the lack of data feeding the prediction. With only two attributes, the model produces inaccurate predictions of where each university ranks. I also computed the mean squared error and R-squared, both of which showed that the model was not very accurate.
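The two metrics can be computed with Scikit-Learn along these lines. As before, this is a sketch on synthetic data with an assumed `LinearRegression` model, and scoring on the held-out test set is also an assumption, so the printed numbers say nothing about the real results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 40
X = np.column_stack([
    rng.integers(10_000, 60_000, size=n),  # hypothetical tuition
    rng.integers(1_000, 50_000, size=n),   # hypothetical enrollment
])
y = np.arange(1, n + 1)  # hypothetical ranks

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.175, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Score the model on the held-out test rows.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R-squared: {r2:.3f}")
```

The mean squared error is in squared rank units, so taking its square root gives a typical miss in rank positions. An R-squared near zero or below it means the model does no better than always predicting the average rank.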

GitHub Link
