Exploration of clustering techniques over geographical and other dimensions
This post explains attempts to define more homogeneous neighborhoods for a project about predicting gentrification in Philadelphia. For more context on the project as a whole, see this post.
Neighborhoods are an important characteristic of cities, many with evolving subcultures and lifestyles. Computationally, defining the boundaries of a neighborhood can be very difficult. Measures over predefined spatial boundaries can misrepresent the data, a pitfall known as the Modifiable Areal Unit Problem (MAUP). Instead, clustering techniques can be used to delineate more internally homogeneous regions for analysis. Clustering methods like agglomerative hierarchical clustering and K-means, paired with Principal Component Analysis for dimensionality reduction on various housing characteristics, have been shown to significantly improve hedonic analyses of housing prices.
The objective for clustering in this model is to define more price-homogeneous neighborhoods, in the hope of increasing the predictive capacity of the average neighborhood price feature. The goal of clustering is to maximize in-cluster homogeneity, which translates to minimizing the within-cluster sum of squared errors:

WSS = Σ_c Σ_{p ∈ c} (price_p − mean_c)²

where c is a cluster, p ranges over every parcel in that cluster, and mean_c is the average price of cluster c.
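As a concrete illustration, this error can be computed directly from parcel prices and cluster labels. The sketch below is in Python for readability (the project's own clustering was done in R), and all names are illustrative:

```python
def within_cluster_ss(prices, labels):
    """Within-cluster sum of squared errors over parcel prices."""
    clusters = {}
    for price, label in zip(prices, labels):
        clusters.setdefault(label, []).append(price)
    total = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total

# Two tight price clusters give a low error; merging them inflates it.
prices = [100, 110, 500, 510]
print(within_cluster_ss(prices, [0, 0, 1, 1]))  # 100.0
print(within_cluster_ss(prices, [0, 0, 0, 0]))  # 160100.0
```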
Agglomerative hierarchical and DBSCAN clustering algorithms were compared to the baseline performance of the predetermined neighborhoods.
In clustering algorithms, the choice of distance measure is crucial. Geographical distance matters because we want neighborhoods that are compact. Market value is the third dimension to take into consideration when defining distance. When combining the market value distance and the physical distance between parcels, both measures were normalized to a scale from 0 to 1. This effectively treats the gap between the lowest- and highest-priced homes as equal to the spatial distance between parcels at opposite ends of the city.
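A minimal sketch of this normalize-and-mix step, assuming precomputed pairwise distance matrices. Python is used here only for illustration, and the weighting parameter `alpha` is a stand-in analogous to ClustGeo's mixing parameter:

```python
def min_max_normalize(matrix):
    """Scale all entries of a distance matrix to [0, 1]."""
    flat = [d for row in matrix for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(d - lo) / span for d in row] for row in matrix]

def combined_distance(d_geo, d_price, alpha=0.5):
    """Convex combination of normalized geographic and price distances."""
    g = min_max_normalize(d_geo)
    v = min_max_normalize(d_price)
    n = len(g)
    return [[alpha * g[i][j] + (1 - alpha) * v[i][j]
             for j in range(n)] for i in range(n)]
```

Because both matrices are rescaled to [0, 1] first, the extremes of each dimension contribute equally before `alpha` tilts the balance.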
Agglomerative hierarchical clustering, implemented in R with the ClustGeo package, takes the number of desired clusters as a parameter and outputs a cluster assignment for each input point. At the start of its execution, the algorithm treats each point as its own cluster. Clusters are then successively merged with other clusters or points based on proximity. This iterative merging traces out a tree-like structure. The tree, or dendrogram, is “cut” at the height that yields the desired number of clusters.
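The merge loop can be sketched in a few lines. This is a naive average-linkage version on a precomputed distance matrix, purely illustrative: the actual analysis used R's ClustGeo, and real implementations record the full dendrogram and cut it, rather than merging until k clusters remain.

```python
def agglomerative(dist, k):
    """Naive average-linkage agglomerative clustering on a precomputed
    distance matrix; merge closest pairs until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(
            ((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    return clusters
```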
The ClustGeo package allows two distance matrices D0 and D1 to be passed in, along with a mixing parameter α that describes the fraction with which each distance matrix is taken into consideration.
In this case, one distance matrix is the market value distance and the other is the physical distance. After running a hyperparameter sweep over the mixing parameter α and the number of desired clusters K, α=0.1, weighting market value more heavily than distance, was consistently found to minimize the within-cluster sum of squared errors.
With α chosen, the number of clusters K can be selected. For each K, the average price of the associated neighborhoods is calculated. The error shown below is the average within-cluster sum of squared errors.
As is intuitive, the error decreases as the number of clusters increases: each cluster gets smaller and is more likely to contain similarly priced houses. One common heuristic for choosing the optimal K is to identify the “elbow”, or point of greatest deceleration, in the error graph. In this case, that elbow occurs at K=8.
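The “greatest deceleration” heuristic can be made concrete as the largest second difference of the error curve. A small sketch with made-up error values (not the project's actual numbers):

```python
def elbow_k(ks, errors):
    """Pick K at the 'elbow': the interior point where the error curve
    decelerates the most, i.e. the largest second difference."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop before this K minus drop after it.
        bend = (errors[i - 1] - errors[i]) - (errors[i] - errors[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical error curve: steep drops that flatten out after K=8.
print(elbow_k([2, 4, 6, 8, 10, 12], [100, 70, 50, 20, 18, 17]))  # 8
```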
Requiring the number of clusters as input can be counterintuitive, as the true optimal number of clusters may not be known. The Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, algorithm does not fix the number of clusters in advance. Instead, it determines the clusters from two parameters: an epsilon distance ε and a minimum point count mp. First, core points are assigned as those with at least mp other points within a radius ε; these points form the basic clusters. Then, border points are assigned as those with at least one core point within a radius ε. All other points are noise and are not assigned to a cluster.
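A minimal, self-contained DBSCAN sketch on 2-D points, for illustration only (here mp counts the point itself; library implementations such as R's dbscan or scikit-learn differ in details like this):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 2-D points with Euclidean distance.
    Returns one label per point; -1 marks noise."""
    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seed = neighbors(i)
        if len(seed) < min_pts:        # not a core point (self included)
            labels[i] = -1             # provisionally noise
            continue
        cluster += 1                   # i is core: start a new cluster
        labels[i] = cluster
        queue = list(seed)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reachable from a core
            if labels[j] is not None:  # becomes a border point
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # j is also core: keep expanding
                queue.extend(nbrs)
    return labels
```

With two dense groups and one isolated point, the isolated point ends up labeled -1 while each group forms its own cluster.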
None of the explored pairings of the hyperparameters ε and mp produced an average house price error close to the baseline of simply using the existing neighborhoods.
With the optimal hyperparameters set for both agglomerative and DBSCAN clustering, I took a sample of 10,000 parcels, performed the clustering, and compared the results. To extrapolate neighborhoods from the cluster assignments on this small sample, I calculated the geographic centroid of each cluster and used a Voronoi diagram over the centroids to display the neighborhoods. This way, the cluster shapes extend beyond the bounds of the sampled points used to create the clustering. The resulting Voronoi diagrams are shown below.
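Assigning any location to the Voronoi cell of its nearest cluster centroid is what extrapolates the neighborhoods beyond the sampled parcels. A small Python sketch of that step, with illustrative names rather than the project's actual code:

```python
def centroids(points, labels):
    """Geographic centroid of each cluster of (x, y) points."""
    sums = {}
    for (x, y), lab in zip(points, labels):
        sx, sy, n = sums.get(lab, (0.0, 0.0, 0))
        sums[lab] = (sx + x, sy + y, n + 1)
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}

def voronoi_assign(point, cents):
    """A point belongs to the Voronoi cell of the nearest centroid."""
    x, y = point
    return min(cents, key=lambda lab: (cents[lab][0] - x) ** 2
                                    + (cents[lab][1] - y) ** 2)
```

Nearest-centroid assignment and Voronoi cell membership are the same rule, so this extends each cluster's footprint to unsampled parcels.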
Neither clustering algorithm was able to create neighborhoods more homogeneous than the pre-existing ones. A couple of factors may explain this result, all boiling down to the measurement of distance and scale. There is no well-defined way to compare the two distance measures, market value and geographic distance. While both must be taken into account to make homogeneous, geographically compact neighborhoods, the tradeoff between them is unclear. Even if we knew that geographical proximity is more important than price proximity, it would be hard to encode that importance on a consistent scale.

Scale may also not be consistent across different areas of the city. The density of parcels in Center City is much higher than in the suburbs, which makes the epsilon distance parameter in DBSCAN hard to define globally. Lower epsilon values would discriminate against the more spread-out suburbs, creating more core points in Center City; higher epsilon values would be too permissive and miss smaller clusters of similarly priced parcels. While data-driven clustering is a known problem in geospatial analytics, the relationship between house price and distance must be studied more carefully before a standard scale can be used in these clustering algorithms.