Finding similar colleges and universities based on a user’s geographic and academic requirements

Elliott Bauer
INST414: Data Science Techniques
Mar 31, 2024

For my Module 3 assignment, I decided to create a program that could be useful for incoming college students. A question that can be answered by measuring similarity between data points is: “What are the top 10 colleges or universities most similar to a school that an incoming student already finds appealing?” Answering this question would show future students and their families which other schools they should look into. For one, there are never guarantees when applying, so it is good to have other options to pursue if necessary. It could also make them aware of schools that they had not previously considered, or that they did not think were a realistic match.

The data came from a very comprehensive dataset, which I simplified so that it only showed the columns I intended to use. The full dataset has over a thousand columns of academic, geographical, demographic, and other miscellaneous data for every university in the United States. The columns I chose to focus on were the city, state, latitude, longitude, and SAT (Scholastic Aptitude Test) average, as well as the HBCU (Historically Black College and University) and PBI (Predominantly Black Institution) columns in case a user wanted to filter on those. Most fields were strings, except for the latitude and longitude columns, which contained floats with several decimal places corresponding to a specific location. I collected this dataset from the U.S. Department of Education’s College Scorecard and downloaded it as a CSV.

I measured similarity between data points based on geographical location and standardized test averages. For example, if a user entered “Tulane University” into the input box, they would likely receive a list of schools with high SAT scores in the New Orleans, Louisiana area, with a few options in Texas, Alabama, and Mississippi as well. The SAT averages of the results are within 200 points of the average of the school that the user entered (so an SAT average of 1400 would return results in the range of 1200 to 1600). Below is a screenshot of the output when a user enters “Clemson University.”

As you can see, 10 results are returned, most of which are in South Carolina (with the exception of the University of Georgia). NaN values appear for schools that do not have an SAT average. This could be for a variety of reasons, but the most likely is that they do not require SAT scores for admission. Another thing I kept in mind was that entering the school name could get tricky, as the input box is case- and symbol-sensitive (in other words, if a user has a dash or a space in the wrong place, the program will not be able to deliver the desired output). To mitigate this issue, I added a cell above the input code.
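A lookup cell of this kind could be sketched roughly as follows. This is not the notebook's actual code: the column names (`UNITID`, `INSTNM`, `CITY`) are assumed to follow the College Scorecard CSV, the ID values are placeholders, and the function name `names_in_city` is hypothetical. The sketch also assumes the city match should ignore case and stray spaces.

```python
import pandas as pd

# Tiny stand-in for the College Scorecard CSV (assumed column names).
# The UNITID values are placeholders, not real identifiers.
schools = pd.DataFrame({
    "UNITID": [1001, 1002, 1003],
    "INSTNM": [
        "University of Maryland-College Park",
        "George Washington University",
        "American University",
    ],
    "CITY": ["College Park", "Washington", "Washington"],
})

def names_in_city(df, city):
    """Return the ID and exact stored name of every school in a city."""
    # Compare case-insensitively and ignore leading/trailing spaces.
    mask = df["CITY"].str.casefold() == city.strip().casefold()
    return df.loc[mask, ["UNITID", "INSTNM"]]

print(names_in_city(schools, "College Park"))
```

The user can then copy the exact name or the ID from the printed rows into the search cell.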

In this lookup cell, a user can enter the city of the college they want to look up. For example, the image above shows a user entering “College Park” to see exactly how the name of the University of Maryland-College Park should be written. They can then copy the name, type it out, or copy the UNITID column, a unique numerical identifier that each school has, into a different cell below to obtain the same results.

To compute similarity, I used Euclidean distance, as I figured it was the most logical method considering I was working with geographical data. As a student at the University of Maryland, I wanted to determine the top 10 schools most similar to it. The results were as follows:

  • University of Maryland-College Park in College Park, Maryland
  • George Washington University in Washington, DC
  • American University in Washington, DC
  • St. John’s College in Annapolis, Maryland
  • Washington Adventist University in Takoma Park, Maryland
  • Montgomery Beauty School in Silver Spring, Maryland
  • Bennett Career Institute in Washington, DC
  • The Catholic University of America in Washington, DC
  • Pontifical John Paul II Institute for Studies on Marriage and Family in Washington, DC
  • Pontifical Faculty of the Immaculate Conception in Washington, DC
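To make the ranking concrete, here is a minimal sketch of one way the SAT window and the Euclidean distance could be combined. It is an illustration under assumptions, not the notebook's exact code: the column names (`INSTNM`, `LATITUDE`, `LONGITUDE`, `SAT_AVG`) are assumed to match the College Scorecard CSV, the function name `similar_schools` is hypothetical, and the toy rows are invented. Schools with no SAT average are kept, since NaN rows do appear in the results above.

```python
import numpy as np
import pandas as pd

def similar_schools(df, name, sat_window=200, top_n=10):
    """Rank schools by straight-line lat/lon distance from the named school."""
    match = df[df["INSTNM"] == name]
    if match.empty:
        return None  # the caller can report "College Not Found"
    target = match.iloc[0]

    # Keep schools within the SAT window, plus schools with no SAT average.
    sat_gap = (df["SAT_AVG"] - target["SAT_AVG"]).abs()
    candidates = df[(sat_gap <= sat_window) | df["SAT_AVG"].isna()].copy()

    # Euclidean distance in degrees of latitude and longitude.
    candidates["distance"] = np.sqrt(
        (candidates["LATITUDE"] - target["LATITUDE"]) ** 2
        + (candidates["LONGITUDE"] - target["LONGITUDE"]) ** 2
    )
    return candidates.sort_values("distance").head(top_n)

# Invented toy rows, purely for illustration.
toy = pd.DataFrame({
    "INSTNM": ["Target U", "Near U", "Far U", "Low SAT U", "No SAT Institute"],
    "LATITUDE": [39.0, 38.9, 34.0, 39.0, 39.0],
    "LONGITUDE": [-77.0, -77.1, -118.0, -76.9, -77.05],
    "SAT_AVG": [1400, 1350, 1450, 1000, np.nan],
})

print(similar_schools(toy, "Target U", top_n=3)[["INSTNM", "distance"]])
```

The entered school sits at distance zero, so it ranks first, just as the University of Maryland-College Park does in the list above. Note that treating degrees as flat coordinates is only an approximation, because a degree of longitude covers less ground than a degree of latitude away from the equator.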

As you can see, the first five schools listed are much larger, more mainstream schools than the last five. The career goals of people attending the last five are likely to be much more niche, as those schools appear to be geared towards careers in cosmetology or religion. This is the main issue that a user may encounter: people looking for schools to apply to are often not looking for schools like the ones at the bottom of the list. A missing SAT average appears to be an indicator of the type of school, so for future work I may try removing the null values from the data to see how the results change. Another issue that could inconvenience a user is the exact formatting required for the input. Any small error in the name will trigger a “College Not Found” output, which could get frustrating.

The biggest limitation of my data is that its scope may be too narrow for some people. Geographical location and academic similarity are just a small part of the equation for many future students. They likely care about many other aspects, such as majors and minors, athletics, clubs, Greek life, campus, school size, and much more. Many factors go into choosing the right school, so this code is only the tip of the iceberg. It is a useful tool, but students should look into many other things beyond it.
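As a rough sketch of the two fixes discussed above (dropping schools with no SAT average and making the name input more forgiving), something like the following could work. None of this is in the current notebook: the helper names `normalize`, `find_school`, and `drop_missing_sat` are hypothetical, the column names are assumed to match the College Scorecard CSV, and the rows are invented.

```python
import re

import numpy as np
import pandas as pd

def normalize(name):
    """Lowercase a name and collapse punctuation and spaces into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.casefold()).strip()

def find_school(df, user_input):
    """Look up a school while ignoring case, dashes, and extra spaces."""
    key = normalize(user_input)
    match = df[df["INSTNM"].map(normalize) == key]
    return match.iloc[0] if not match.empty else None

def drop_missing_sat(df):
    """Remove schools that report no SAT average."""
    return df.dropna(subset=["SAT_AVG"])

# Invented rows, purely for illustration.
schools = pd.DataFrame({
    "INSTNM": ["Example State University-Main Campus", "Example Beauty School"],
    "SAT_AVG": [1200, np.nan],
})

hit = find_school(schools, "example state university - main campus")
print(hit["INSTNM"])
print(len(drop_missing_sat(schools)))
```

Whether dropping every school without an SAT average is the right call is an open question, since test-optional schools would disappear from the results along with the niche ones.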

Below is a link to the notebook in my GitHub repository:

https://github.com/elliottbauer99/INST414/blob/main/Module%203%20Assignment.ipynb
