Measuring Urban Similarities in Los Angeles using Open Data, ArcGIS, and Sklearn
Finding similar places
After exploring ways to measure happiness in cities (A Look at Spatial Happiness in Cities, using Tweepy, Text Blob, and ArcGIS), I continued to look for ways to use spatial data to glean important information about our urban surroundings. I was intrigued when a friend mentioned that, as an urban planner, she spent a lot of time identifying places similar to her project sites to serve as precedents. A precedent study is usually conducted as one of the first steps in an urban planning project. The goal is to identify applicable ideas from similar projects. If we could quantify the similarity between places, that would help urban planners find new and similar locations to serve as precedents. But the uses are not limited to urban planning. It would also help businesses and real estate. A retailer may want to find areas like those it has already succeeded in. A real estate agent may be interested in quickly identifying areas that clients might like.
I decided to quickly build a proof of concept for such a tool. Since my friend’s use case piqued my interest, I began by thinking about the aspects of place that would define similarity for her precedent searches. When looking for precedent examples, urban planners want to find places that have similar urban structures (street network, built structure, land-use types, population density) and demographics (income, culture, age). In this article, I’ll walk through the methods and decisions I settled on to quantify neighborhood similarity, and demonstrate how data science methods can deliver helpful insights about our cities.
Measuring a Neighborhood
To identify similarities between neighborhoods, we must define a neighborhood. People know what a neighborhood is, but a consistent definition is elusive. For this project, I borrowed the 10-minute walk (1/2-mile radius) frequently used in urban planning and created a 1/2 SQMI grid to serve as my neighborhood unit. It is somewhat crude but is fine-grained enough to derive some helpful insights.
The next question to answer: what features should be similar for two neighborhoods to be considered similar? Do we care if both places have similar types of restaurants? Do we care about how the streets are laid out, e.g., in a grid or not? How about the politics of a place? Or something else? In this case, urban structure and demographics are the most important features to measure. If a place has a similar structure and similar people, it is similar enough to use as a high-quality precedent. If you are looking for similarity in real estate, e.g., to identify places where someone may want to buy a home, the important features will likely be different. Maybe you would care about school quality or park space. I don’t know, so if you work in real estate, please let me know!
Having decided to focus on similarities in structure and demographics, I set out to collect the needed data. To keep the workload manageable and avoid potential data inconsistencies, I focused my tool on Los Angeles County only. It’s the most populous county in the country and has a very deep catalog of open GIS data; it is a good place to start.
Urban Structure Data
Based on my experience with urban planning, I selected four features that speak to urban structure: population density, intersection density, parcel density, and commercial land use density. Places that differ significantly in any of these features will likely feel structurally different from each other. Population density speaks to urban or rural character, intersection density speaks to active vs automotive transportation priorities, parcel density speaks to urban scale, and commercial density speaks to local function.
To engineer the structural features, I pulled census and open urban data from the LA County Geohub into ArcGIS. Population density was derived with ESRI’s enrichment tool from population estimates in the 2018 American Community Survey (ACS), parcel density was created from the LA County Tax Assessor’s Parcel dataset, intersection density was derived from the LA County Countywide Address Management System (CAMS) street dataset, and commercial land use density was pulled from a subset of the Assessor’s Parcel dataset. All features were analyzed on the 1/2 SQMI grid.
To assess the demographic similarity between neighborhoods, I settled on four features that I thought exemplified cultural and experiential similarities between neighborhoods: median household income (MHHI), average age, the percent of the local population identified as white, and the percent of the local population identified as Hispanic. Income significantly affects how a neighborhood behaves; therefore, controlling for MHHI is important in finding neighborhoods that behave similarly. Like MHHI, age can dictate how a neighborhood feels; a neighborhood with a lot of 20-year-olds feels very different from one filled with families or retirees. Finally, race can impact a neighborhood’s behavior and culture. I included two race features because neighborhood culture is affected both by the size of different populations and, especially, by the control each may exert over the community. A neighborhood with many white people and few minorities will feel different from one with an equal distribution of cultures, and both of those neighborhoods will feel different from a majority-minority neighborhood. This is a very complex relationship that is hard to fully distill from data but is important enough to neighborhood similarity to include.
All demographic features were pulled from the ESRI enrichment tool in ArcGIS and summarized to the 1/2 SQMI grid.
Now that the structural and demographic features are selected and derived for each 1/2 SQMI neighborhood in the LA County grid, it’s time to talk about exactly how to quantify similarity. Broadly, there are two common methods used in data science to understand the similarity between samples: clustering algorithms, which identify groups of similar samples in a dataset, and distance measures, which quantify the multi-dimensional distance between samples. Clustering is best if we are looking to group neighborhoods by similarity, and distance measures are better if we want a way to quantify how similar or dissimilar any two neighborhoods are. Both have their advantages, but for my friend’s situation, where we want to find the N most similar neighborhoods, distance measures will work better.
There are two common methods for measuring distance-based similarity: Euclidean distance and cosine distance. You probably met Euclidean distance in geometry class. Using the Pythagorean theorem, you can measure how far point A is from point B; we can extend that same concept to multiple dimensions. Cosine distance is another mathematical approach. It compares the direction of two feature vectors rather than their length, so it works best when the magnitude of the features may not be important. In our case, the magnitude is likely quite important. A neighborhood with an MHHI of $120,000 is very different from one with an MHHI of $50,000; size matters. With this in mind, we will use Euclidean distance to measure similarity.
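To make the contrast concrete, here is a minimal sketch of both measures using NumPy. The two "neighborhoods" and their values are made up for illustration, and the helper names are hypothetical; the second neighborhood is simply the first scaled up by 2.4, so the two vectors point in the same direction but differ in magnitude.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points in any number of dimensions."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors; ignores length."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical (MHHI, population density) pairs: hood_b is hood_a times 2.4.
hood_a = [50_000, 5_000]
hood_b = [120_000, 12_000]

print(euclidean_distance(hood_a, hood_b))  # large: the magnitudes differ
print(cosine_distance(hood_a, hood_b))     # effectively zero: same direction
```

Cosine distance treats these two places as essentially identical, while Euclidean distance registers the income gap, which is the behavior we want here.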
Building the tool
Now that we have engineered our features and settled on a method for quantifying the similarity between samples, we can begin building the tool itself! For simplicity, I built the tool in a Jupyter Notebook, but I intend to move it into an interactive Dash/Plotly web app in the future.
The tool should work as follows:
- The user selects a 1/2 SQMI neighborhood that they would like to find similar neighborhoods for.
- The tool returns the N most similar 1/2 SQMI neighborhoods in LA County, in list and map form.
First, let’s take a look at our data and features to get an idea of what we are dealing with. I’ll load the data as a geopandas dataframe and then generate distribution histograms for each feature.
Looking at our urban form structural features, we can immediately see that all our structural data is skewed to the right, meaning that there are far more small values than large ones. Interestingly, this tends to hold true for almost any urban characteristic and is one of the only hard and fast laws of cities. For more on this phenomenon, check out Zipf’s law and Geoffrey West’s book Scale.
Looking at our demographic features, we notice that MHHI and age are far more normally distributed than the structural features were. This makes sense, since we are dealing with data about humans, and in biology we expect to find normal distributions. However, the race features exhibit some of the skewness we observed in the structural features, which may indicate a structural aspect to race distributions. All of the features have significant numbers of zero instances. This can be explained by neighborhood areas without regular populations. There are a few communities in LA County that are specialized industrial areas, like Vernon and the City of Commerce, that might account for some portion of the zero instances, and swaths of low-density areas in the far north of the county that could account for more. We will keep them in the dataset, despite their appearance as outliers, because they represent true on-the-ground conditions and there is no requirement for normality in this analysis.
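Besides reading histograms, one way to put a number on the skew is the pandas `skew` method. The sketch below uses synthetic stand-in columns, not the LA County data: a lognormal column imitates a right-skewed density feature and a normal column imitates average age. The column names and distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins, NOT the LA County data.
df = pd.DataFrame({
    "pop_density": rng.lognormal(mean=8.0, sigma=1.0, size=2_000),
    "avg_age": rng.normal(loc=37.0, scale=6.0, size=2_000),
})

# Sample skewness per column; values well above 0 indicate a right skew.
skew = df.skew()
print(skew.round(2))

# A right-skewed feature also has its mean pulled above its median.
print(df["pop_density"].mean() > df["pop_density"].median())
```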
Scaling and weighting
Before we compute our similarities, we need to scale our data. Since we plan to use Euclidean distances to quantify similarity, we need to be mindful of differences in scale between our features. Let’s use the pandas describe method to get a quick understanding of how our features differ in scale.
By inspecting the descriptive statistics of each feature, we can see that their ranges vary vastly from one another. This is a problem when using Euclidean distances because the formula assumes consistent units in all dimensions. To illustrate this, let’s assume we are only finding the similarity between 2D points with a population density and an age value. Sample 1 has a population density of 7000 and an age of 60, and Sample 2 has a population density of 7500 and an age of 36. Using the formula described earlier, the Euclidean distance is √(500² + 24²) ≈ 500.58. You’ll notice that if we disregarded age completely, we would have found a distance of 500. The two results are nearly identical, even though we know that a place with an average age of 60 would be quite different from one with an average age of 36, because the scale of the population density feature is so much larger. It completely washes out the important differences in age. Thus, we must scale our data so that all our dimensions have consistent units. We will use scikit-learn’s StandardScaler to do this.
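The arithmetic in this example, and the effect of standardizing, can be checked with a short sketch. The two samples come from the text; the three extra rows are hypothetical values added only so that the scaler has a distribution to work with.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The two samples from the text: (population density, average age).
sample_1 = np.array([7000.0, 60.0])
sample_2 = np.array([7500.0, 36.0])

raw = np.linalg.norm(sample_1 - sample_2)
density_only = abs(sample_1[0] - sample_2[0])
# The 24-year age gap adds well under one unit to the raw distance.
print(round(float(raw), 2), density_only)

# Hypothetical extra rows so there is a distribution to standardize.
X = np.array([
    [7000.0, 60.0],
    [7500.0, 36.0],
    [1200.0, 41.0],
    [15000.0, 33.0],
    [300.0, 52.0],
])
X_scaled = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

# Per-feature gap between the first two samples, in standard deviations.
scaled_gap = np.abs(X_scaled[0] - X_scaled[1])
print(scaled_gap)
```

With these made-up rows, the age gap between the two samples is far larger than the density gap once both are expressed in standard deviations, so age is no longer washed out.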
The tool also takes a weights input for MHHI and ethnicity. This is because, once you scale your data, you can control the impact each variable has on your analysis. After scaling, each feature will have the same impact on similarity as every other feature. This may be what you want, but sometimes, as in our case, there are a few features that you want to make sure are heavily considered. I felt that dissimilarity in MHHI and ethnicity might, in terms of neighborhood behavior, overpower similarities present in other features. Two places with similar commercial and population densities, for example, would behave similarly only if their ethnicities and MHHI were similar as well. If they were structurally similar but had different ethnicities and incomes, I would not expect them to behave all that similarly. To account for this, we can choose to weight income and ethnicity more heavily than other variables. At present, I have given them both a 5X weight, meaning that a one-unit difference in either can only be offset by at least a five-unit difference in an unweighted variable.
Once our inputs are scaled and weighted, we can calculate Euclidean distance similarities. First, we need the input neighborhood for which we want to find similar neighborhoods. Since this is built in a Jupyter Notebook, I just take the neighborhood of interest as the Object ID referencing the selected neighborhood in the data.
Then we compute the Euclidean distances between the specified neighborhood and all other 1/2 SQMI neighborhoods in the dataset. To do this, we use scikit-learn’s paired_distances function.
The tool returns the 100 neighborhoods most similar to the one selected. With this data in hand, we can use geopandas to plot them on a map with only a few lines of code.
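One way such a weighting step could look is sketched below: standardize every column, then multiply the chosen columns by their weight. The function name `scale_and_weight`, the column names, and the values are all hypothetical, and this is not necessarily how the notebook implements it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_and_weight(df, weights=None):
    """Standardize every column, then multiply selected columns by a weight."""
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(df), index=df.index, columns=df.columns
    )
    for column, weight in (weights or {}).items():
        scaled[column] = scaled[column] * weight
    return scaled

# Hypothetical column names and values, for illustration only.
features = pd.DataFrame({
    "pop_density": [7000.0, 7500.0, 1200.0, 15000.0],
    "mhhi": [50_000.0, 120_000.0, 64_000.0, 81_000.0],
    "pct_hispanic": [0.62, 0.18, 0.45, 0.30],
})
weighted = scale_and_weight(features, weights={"mhhi": 5, "pct_hispanic": 5})

# Weighted columns now have five times the spread of unweighted ones.
print(weighted.std(ddof=0))
```

Because Euclidean distance squares each difference, multiplying a scaled feature by a weight w multiplies its contribution to the squared distance by w². That is worth keeping in mind when tuning the weights.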
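As a rough sketch of this step, under the assumption that the scaled and weighted features sit in a DataFrame indexed by object ID: `paired_distances` compares row i of one array with row i of another, so the target row is repeated once per neighborhood. The helper `most_similar`, the column names, and the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import paired_distances

def most_similar(scaled, target_id, n=10):
    """Rank every row of `scaled` by Euclidean distance to row `target_id`."""
    X = scaled.to_numpy(dtype=float)
    target = scaled.loc[target_id].to_numpy(dtype=float)
    # Repeat the target so that each neighborhood is paired with it.
    Y = np.tile(target, (len(X), 1))
    dist = paired_distances(X, Y, metric="euclidean")
    return pd.Series(dist, index=scaled.index, name="sim").sort_values().head(n)

# Hypothetical scaled and weighted features, indexed by object ID.
scaled = pd.DataFrame(
    {"pop_density": [0.1, 0.2, -1.5, 2.0], "mhhi": [1.0, 1.1, -4.0, 0.9]},
    index=pd.Index([101, 102, 103, 104], name="OID"),
)
result = most_similar(scaled, target_id=101, n=3)
print(result)
```

The selected neighborhood always comes back first with a distance of zero, so in practice you would drop that first row or request N + 1 results.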
On the map, you can quickly see the areas most similar to the selected neighborhood. You can further interrogate the results by looking at the returned data frame.
Looking at the similarity data frame, we can see how well our Euclidean distance similarities work. At the top of the image, you can see the feature values for our input neighborhood, and below, the top 10 most similar neighborhoods. The first is the selected neighborhood itself, so it should be ignored. But starting with the second, we can see how our similarity values, “sim”, line up with the similarity of our structural and demographic features. The most similar, OID 3228, with a Euclidean distance similarity of 2.68, is a little less dense but has a fairly similar intersection density and is extremely similar in terms of MHHI, ethnicities, and age. When we inspect the output, we can see that, as designed, the tool weighted income and ethnicity more than the structural features. In the future, weights would likely be best determined by the user. But overall, it seems to have done a good job finding similar neighborhoods!
This was certainly just a proof of concept, but it provides a valuable starting place for further work. In the future, I think that the feature set can be expanded to include far more structural and demographic features. The tool could be packaged in a web app to improve usability and enable the user to tune the analysis for their specific use. It will also be important to expand the tool beyond the LA Region. However, further analysis will have to be undertaken to determine how well one can compare neighborhoods across diverse regions.
As always, I love getting feedback. I hope to start a discussion about how we can better measure and analyze our cities!
Please feel free to comment on this article, send me an email, or connect with me on LinkedIn. I’d love to hear your comments and suggestions!