Identification of Ideal Locations for Student Accommodation

A data science approach to identifying the best locations around Nottingham Trent University for starting a student accommodation business

Aug 17, 2021

This project is part of the IBM data science course, which I completed using Folium, the Foursquare API, Scikit-learn and Pandas, plus other supporting libraries such as Matplotlib and NumPy. The complete Jupyter notebook is available here on NBViewer for a fully rendered experience. The notebook can also be viewed on my GitHub, although some figures may not render properly there.

Business Problem

Identify ideal locations around Nottingham Trent University (NTU) such that the locations have access to a wide range of amenities, are safe and are in close proximity to the university (within a 5 km radius).

In this project, I use the Foursquare API to explore neighbourhoods around NTU and recommend the best locations for investing in student accommodation. The solution will be useful for business owners choosing where around NTU to offer accommodation to university students. The two main factors taken into account are the availability of facilities around each location and the number of crimes reported there. Students usually prefer to live close to their campus, so the locations are restricted to a 5 km radius of NTU.

Background

Nottingham is a city in the Midlands region of central England, in the United Kingdom. There are around 65,000 students at Nottingham’s two universities, the University of Nottingham and Nottingham Trent University. With such a large student population, Nottingham is one of the most vibrant cities in the UK. According to the QS Best Student Cities 2019, Nottingham is ranked the 6th best city in the UK for students and 48th in the world. Because of the large influx of students at the beginning of each academic year, university halls of residence cannot accommodate all of them. In addition, most students prefer private student accommodation because of affordable rent and wider choice in terms of location and housemates. This creates a business opportunity to invest in private accommodation for university students that maximises safety and access to amenities while minimising the distance to the university.

Datasets

Postcode data for Nottingham: The geolocation dataset for NG postcodes in Nottinghamshire was downloaded from here. The dataset was cleaned and narrowed down to only the necessary fields before being used with the Foursquare API.
Nottingham crime data: Crime data is publicly available and was downloaded from data.police.uk. To keep the problem simple, I computed the total number of crimes reported for each location. This information was then matched to each location in the postcode data.

Data wrangling and feature selection

Data often requires further cleaning and processing, depending on the problem at hand, so wrangling the data into a final, analysis-ready set is an essential step in data science. This section covers how the data was cleaned, processed, organised and modified to prepare it for analysis.

Postal code data

The postcode data for Nottingham produced a table with 49 features and 37,461 entries. However, many values were missing, and not all of the features are needed to recommend the best locations. The features of interest are ['Postcode', 'In Use?', 'Latitude', 'Longitude', 'District', 'Postcode district', 'Ward', 'LSOA Code']. The 'In Use?' feature shows whether the postcode is currently in use. Entries where this field was 'No' were discarded, and the feature itself was then dropped because all remaining entries had the same value.
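The filtering step above can be sketched in pandas as follows. This is a minimal illustration on a toy table, not the notebook's exact code: the rows, ward names, LSOA codes and the extra 'Altitude' column are made up, and it assumes the 'In Use?' field holds the strings 'Yes' and 'No'.

```python
import pandas as pd

# Toy stand-in for the raw postcode table (the real one has 49 columns).
# All values below are illustrative, not taken from the real dataset.
raw = pd.DataFrame({
    "Postcode": ["NG1 5AA", "NG1 5AB", "NG11 8NS"],
    "In Use?": ["Yes", "No", "Yes"],
    "Latitude": [52.955, 52.956, 52.912],
    "Longitude": [-1.150, -1.151, -1.185],
    "District": ["Nottingham"] * 3,
    "Postcode district": ["NG1", "NG1", "NG11"],
    "Ward": ["Ward A", "Ward A", "Ward B"],
    "LSOA Code": ["E01000001", "E01000001", "E01000002"],
    "Altitude": [30, 31, 40],  # example of a column that is not needed
})

keep = ["Postcode", "In Use?", "Latitude", "Longitude", "District",
        "Postcode district", "Ward", "LSOA Code"]

postcodes = (
    raw[keep]                        # keep only the features of interest
    .loc[raw["In Use?"] == "Yes"]    # discard postcodes no longer in use
    .drop(columns="In Use?")         # constant after filtering, so drop it
    .reset_index(drop=True)
)
```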

Postcodes are divided into two parts: (i) the outcode, e.g. NGxx, and (ii) the incode, e.g. XXX. Postcodes that are very similar lie close to each other and do not add enough variation to the data. Hence, similar postcodes were grouped under an area code, and a new feature called 'Area_code' was added to the dataset. The area code is composed of the outcode and the first character of the incode, for example NG1 5 or NG11 8. This completed the cleaning of the postcode data, which was later used with the crime data. The cleaned table with the new Area_code feature is shown below.
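One way to derive such an area code is sketched below. The helper name area_code is hypothetical, and the sketch assumes every postcode is written with a single space between the outcode and the incode.

```python
def area_code(postcode: str) -> str:
    """Return the outcode plus the first character of the incode."""
    # Assumes the form "<outcode> <incode>", e.g. "NG1 5AA".
    outcode, incode = postcode.strip().upper().split()
    return f"{outcode} {incode[0]}"


examples = ["NG1 5AA", "NG1 5AB", "ng11 8ns"]
codes = [area_code(p) for p in examples]
```

With a pandas data frame, the same helper could be applied to the postcode column with Series.map to create the new feature.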

Crime data

To measure the safety of each area, I used data on the crimes reported there. To simplify the process, I used only the latest data available at the time of writing (August 2021), which covers June 2021. The dataframe consisted of 12 features and 12,592 entries. The features were ['Crime ID', 'Month', 'Reported by', 'Falls within', 'Longitude', 'Latitude', 'Location', 'LSOA code', 'LSOA name', 'Crime type', 'Last outcome category', 'Context']. There was no postcode information, but the LSOA code can be mapped back to the postal area. 'LSOA code' and 'Crime type' were extracted to compute the total number of crimes reported for each LSOA code.
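Counting crimes per LSOA code amounts to a group-by. The sketch below uses made-up rows and an assumed output column name, 'Total crimes'; dropping rows without an LSOA code is also an assumption rather than something stated above.

```python
import pandas as pd

# Illustrative rows only; the real table has 12 columns and 12,592 entries.
crimes = pd.DataFrame({
    "LSOA code": ["E01000001", "E01000001", "E01000002", None],
    "Crime type": ["Burglary", "Vehicle crime", "Burglary", "Robbery"],
})

crime_counts = (
    crimes.dropna(subset=["LSOA code"])  # rows without an LSOA cannot be mapped
    .groupby("LSOA code")
    .size()                              # one row per crime, so size == count
    .rename("Total crimes")
    .reset_index()
)
```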

The next stage was to combine this information with the postcode data so that the final table also includes the total number of crimes reported for each postcode area. The crime dataframe was merged with the postcode data on the LSOA code present in both data frames. A screenshot of the final dataframe is shown in the figure below.
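A hedged sketch of the merge is shown below. The two key columns are spelled differently in the two tables ('LSOA Code' and 'LSOA code'), so the join names both. The choice of a left join, which keeps postcodes that have no recorded crimes, is an assumption, and all values are made up.

```python
import pandas as pd

postcodes = pd.DataFrame({
    "Postcode": ["NG1 5AA", "NG1 5AB", "NG11 8NS"],
    "LSOA Code": ["E01000001", "E01000001", "E01000003"],
})
crime_counts = pd.DataFrame({
    "LSOA code": ["E01000001", "E01000002"],
    "Total crimes": [2, 1],
})

merged = postcodes.merge(
    crime_counts,
    how="left",             # assumption: keep every postcode
    left_on="LSOA Code",    # key name in the postcode table
    right_on="LSOA code",   # key name in the crime table
).drop(columns="LSOA code")  # drop the duplicated key column
```

Postcodes whose LSOA has no reported crimes end up with a missing 'Total crimes' value, which would need to be filled or dropped before scoring.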

The total crimes feature was normalized using the min-max method. The normalized values were then converted into categorical scores from 1 to 8. I found that simply categorising the values as 'Low', 'Medium' and 'High' made it difficult to find an optimum number of clusters with the elbow method, because there were only three categories. So, I divided the normalized values, which range from 0 to 1, into 8 intervals, each mapped to a score.
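The normalisation and scoring can be sketched as below. Min-max scaling maps each value x to (x - min) / (max - min). The use of eight equal-width intervals is an assumption, since the exact boundaries are not given above, and the crime counts are made up.

```python
import numpy as np
import pandas as pd

total = pd.Series([1, 10, 28, 60, 120, 199], name="Total crimes")

# Min-max normalisation to the range [0, 1].
normalized = (total - total.min()) / (total.max() - total.min())

# Eight equal-width intervals (assumed): edges at 0, 0.125, ..., 1.
edges = np.linspace(0, 1, 9)
score = pd.cut(
    normalized,
    bins=edges,
    labels=list(range(1, 9)),  # 1 = safest, 8 = most crimes
    include_lowest=True,       # so that a value of exactly 0 gets score 1
).astype(int)
```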

University data

I decided to work with Nottingham Trent University because of its central location. Its postcode was taken from the university website and its coordinates were obtained using a geocoder API. Based on the latitude and longitude of the university, all the area codes within a 5 km radius were identified and appended to the university data frame. Finally, this data frame was merged with the combined dataframe of postcodes and crime data. A snapshot of the merged table is shown below.
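A radius filter of this kind can be sketched with the haversine (great-circle) formula. This is one possible way to compute the distance, not necessarily the one used in the notebook. The NTU coordinates are approximate and the two area-code locations are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


NTU = (52.958, -1.154)  # approximate coordinates, assumed for illustration

# (area code, latitude, longitude); illustrative values only
areas = [("NG1 5", 52.955, -1.150), ("NG16 2", 53.020, -1.300)]

nearby = [code for code, lat, lon in areas
          if haversine_km(NTU[0], NTU[1], lat, lon) <= 5.0]
```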

Exploratory Data Analysis

So far, I have discussed how the data was cleaned and processed and how new features were created. In this section, I explore the dataset before proceeding with the cluster analysis.

The descriptive statistics show that the mean number of crimes reported was 28.17 with a standard deviation of 35.81; the minimum was 1 and the maximum was 264. The distribution is shown as a box plot in the figure below.

Distribution of the number of crimes reported for each LSOA code

The box plot shows that the largest values are outliers. They lie far from the rest of the distribution, which indicates a disproportionately high number of reported crimes in those areas. These areas are not a good choice because of safety concerns.
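For reference, a box plot with default Matplotlib settings draws as outliers the points that lie more than 1.5 times the interquartile range above the third quartile (or below the first). The sketch below reproduces that rule on made-up counts; it illustrates the rule and does not reproduce the figure.

```python
import pandas as pd

# Made-up crime counts with one extreme value.
crimes_per_lsoa = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = crimes_per_lsoa.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey's rule for high outliers

outliers = crimes_per_lsoa[crimes_per_lsoa > upper_fence]
```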

Next, I explored the 20 areas with the highest and the 20 areas with the lowest number of reported crimes in Nottinghamshire. The figure below shows a horizontal bar chart of the 20 safest and the 20 most crime-ridden area codes in Nottingham.

20 most and least crime ridden areas in Nottingham

As we can see, each of the top 2 most crime-ridden locations has more than twice as many reported crimes as the 4th most crime-ridden location. The remaining area codes decline gradually, so these areas have comparable levels of criminal activity. With area safety in mind, I discarded the top two locations, each of which had more than 200 crimes reported in a month.

As discussed in the previous section, the reported crime numbers were normalized and then categorised into scores from 1 to 8, where 1 represents the safest areas and 8 the areas with the most criminal activity. The number of locations that fall into each category is shown in the figure below.

Area categorisation based on the number of reported crimes

The figure shows that most regions in Nottinghamshire fall into the safer categories (scores 1–5), with the majority in the bottom half of the scale (scores 1–4), which suggests that Nottingham is mostly a safe place.

I used the Folium library to visualise the geospatial data, i.e. the location of each area on the map. The visualisation helps to (i) validate the locations, (ii) check the area coverage and (iii) see where each area lies in relation to the university. The figure below shows a map of Nottingham Trent University and the surrounding areas.

Geospatial map with NTU at the center of the map. A red circle with a radius of 5 km is drawn to show the area of interest around NTU

The figure shows locations around NTU where each blue circle indicates an area code.

A geospatial heat map based on the number of crimes reported in each location in Nottingham

The heatmap above, by contrast, is based on the number of crimes reported. It shows that there are small pockets with a very high number of reported crimes. The figure gives a good idea of how the settlement areas are spread out and where the most crimes were reported.

After removing the two most crime-ridden areas and narrowing the locations down to those within 5 km of NTU, 59 area codes remained. All further analyses were conducted on these locations to find out which would offer the best options to a business owner.

Results

I used the Foursquare API to explore venues around each location and to make further recommendations on which locations are best for opening student accommodation. The availability of venues is a key factor in a student’s life, so the more venues a location has, the better a choice it is for students. The Foursquare API returned 152 unique venue categories. The top 20 venues are shown in the figure below.

The top venues cover a wide range of categories. To perform cluster analysis based on the crime scores and venues, these features were encoded with one-hot encoding. With one-hot encoding, each categorical value becomes a new column that holds a binary value of 1 or 0.
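One-hot encoding of venue categories can be sketched with pandas as below. The venue rows are made up, and averaging the encoded rows per area code (to get the share of each category in an area) is an assumed follow-up step that is not described above. The crime score could be encoded in the same way.

```python
import pandas as pd

# Illustrative venues; the real data has 152 unique categories.
venues = pd.DataFrame({
    "Area_code": ["NG1 5", "NG1 5", "NG1 5", "NG7 2"],
    "Venue category": ["Pub", "Coffee Shop", "Pub", "Park"],
})

# One column per category, holding 1 where the venue has that category.
onehot = pd.get_dummies(venues["Venue category"], dtype=int)
onehot.insert(0, "Area_code", venues["Area_code"])

# Assumed aggregation: share of each category among an area's venues.
features = onehot.groupby("Area_code").mean()
```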

Cluster analysis

K-means clustering was used to cluster the locations based on the venue categories and crime score of each location. To find the optimal number of clusters, I used the elbow method. It showed that the optimal value of k was 5, beyond which the error no longer decreased as sharply, as shown in the figure below.

Elbow method to find optimal number of clusters
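The elbow method can be sketched with scikit-learn as below: fit K-means for a range of k and record the inertia (the within-cluster sum of squared distances), then look for the k after which the decrease flattens. The data here is synthetic, with three well-separated groups, so its elbow sits at k = 3 rather than at the k = 5 found for the real features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three tight groups of 20 points each.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centres])

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares

# Plotting inertias against k (e.g. with Matplotlib) gives the elbow curve.
```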

I used this value of k as the number of clusters for the areas within 5 km of NTU. The clusters were visualised using the Folium library. The map below shows the colour-coded clusters of areas around NTU.

5 colour coded clusters of Nottingham area around NTU

Although we can visualise the clusters, it is not apparent from visual inspection alone whether one cluster is a better choice than the others. To compare the clusters, I computed the number of venues in each cluster. This gives a good idea of which clusters have good amenities in close proximity, i.e. within 500 metres of the cluster members. The figure below shows that clusters 1 and 3 have a significantly higher number of amenities in close proximity. These two clusters are the best candidates for further investigation as locations for student accommodation.

Cluster comparison based on the number of venues and crime occurrence
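Counting venues per cluster is a join followed by a group-by. The sketch below uses made-up cluster labels and venue names, and the column names 'Cluster' and 'Venue' are assumptions.

```python
import pandas as pd

# Cluster label assigned to each area code (illustrative values).
areas = pd.DataFrame({
    "Area_code": ["NG1 5", "NG1 6", "NG7 2", "NG11 8"],
    "Cluster": [1, 1, 3, 0],
})

# One row per venue found near an area code (illustrative values).
venues = pd.DataFrame({
    "Area_code": ["NG1 5"] * 3 + ["NG1 6"] * 2 + ["NG7 2"] * 4 + ["NG11 8"],
    "Venue": [f"venue_{i}" for i in range(10)],
})

venues_per_cluster = (
    venues.merge(areas, on="Area_code")  # attach each venue's cluster label
    .groupby("Cluster")["Venue"]
    .count()
)
```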

The postcode areas within each of these clusters are listed in the table below.

These postcode areas would be the best locations around NTU in which to consider establishing a student accommodation business.

Conclusions and Discussions

In this project, I used cluster analysis of area codes within 5 km of Nottingham Trent University to identify places that could be suitable for establishing a student accommodation business. I used publicly available postcode data to extract location information and created a new feature, Area_code, that groups closely situated postcodes, as well as new features from the crime data. I used the crime data to extract the number of crimes reported for each location and prepared a new data frame from it. Combined, these two data frames gave a good overview of the safety of each area, and geospatial maps helped to visualise the locations. Another important factor in the suitability of a location is the availability of amenities in close proximity. I used the Foursquare API to extract venues around each area and performed cluster analysis based on the available amenities and the safety of the location. The cluster analysis resulted in 5 clusters, of which two (cluster 1 and cluster 3) showed a significantly higher number of amenities. I then listed all the area codes within these two clusters. I propose that the locations in these two clusters offer the best choices for business owners who want to attract a higher number of students, as they offer better connectivity and safety.

As a note for future improvement, dividing the crime data into different categories of crime and adding transport facilities and housing costs for each area could improve the recommendations further.