Studying Austin Gentrification

By Srikar Nalluri, Vamsi Ghorakavi, and Harrison Keane

Cities across the US are experiencing incredible growth due to economic forces and social urbanization. A side effect of this growth is gentrification, a process in which middle to upper income residents move into newly developed areas and displace current lower income residents, either by directly taking apartments or by increasing the cost of living. Even though gentrification is easily recognized by walking around and observing the level of development and change, it is incredibly hard to quantify numerically.

For this project, we try to predict gentrification in the Austin area purely based on numerical data obtained from the Census’ American Community Survey. Along the way, we explore patterns observed in the data, tackle the challenge of defining a numerical basis for gentrification, and attempt to create a model that predicts gentrification two years in advance. Although in-depth exploration showed patterns that suggest the population changes associated with gentrification, we were unable to find any signals in our data that acted as precursors to these patterns. While we were unable to model this phenomenon numerically, we have included next steps that would incorporate both relative and qualitative features to address our shortcomings.


Background

Austin, Texas is the fastest growing major city in the United States. While cities generally desire growth, Austin’s sustained status as one of the fastest growing cities in the country since the 1990s has made the city less affordable for residents. In the ten year period between 2006 and 2016, home prices in Austin increased by 65%, which is more than in any other city in the country. From 2017 to 2018, Austin also experienced the largest increase in cost of living of any major city in the US. The income required to live “comfortably” in Austin increased 33.92 percent year over year.

In addition to an increase in housing prices, neighborhoods in Austin are experiencing a shift in demographics. Historically the enforcement of racial segregation through city zoning and mortgage redlining forced minority and lower income neighborhoods to be located in the east side of the city. These neighborhoods are especially vulnerable to rapid increases in pricing as areas close to the central city become more desirable to higher income populations. This process of increasing housing prices and shifting demographics has come to be known as gentrification.

Last year, the City of Austin worked with the University of Texas to study gentrification in Austin. The landmark study by the UT Center for Sustainable Development & the Entrepreneurship and Community Development Clinic identified neighborhoods in Austin that were at risk of experiencing gentrification, the severity of that risk, and possible ways to mitigate the impact.


Project Summary

The goal of our project is to investigate the factors associated with gentrification and use them as the basis for a model to predict whether a neighborhood will experience gentrification. While existing research identifies areas currently experiencing gentrification, our approach is novel in that it attempts to predict areas that may be at risk in the future without manually labeling them as such based on expert domain knowledge. The subgoals of our project involve defining gentrification or using alternate features as a proxy for gentrification. The targets we attempted to train on included rent as a percent of income and median gross rent. Our project is relevant both as a potential social indicator for legislation to curb the negative side effects of gentrification and as a potential investment tool for predicting areas with large increases in rent or land value.


Data Collection

We collected our data from APIs provided by the United States Census Bureau and the Department of Housing and Urban Development. The first step was to sign up for a developer key from the Census Bureau, which grants access to their API endpoint. After we were approved for access, we began collecting data from the 5-year American Community Survey (ACS). To get the ACS data through the API, we must specify the level of detail we want through parameters in our search queries. Since we are looking at neighborhoods within Austin, we chose to obtain data at the most detailed level we could, the census tract level. The GEOID also specifies the area we wish to cover using a standardized federal government code known as a FIPS code. In this case, we used the FIPS code corresponding to Travis County, Texas.

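from census import Census
from us import states
import censusdata as cd
import numpy as np
import pandas as pd

# CENSUS_API_KEY is the developer key issued by the Census Bureau.
years = range(2009, 2017)
# features.csv lists the selected ACS variable codes in its KEY column.
feature_keys = [str(x) for x in pd.read_csv('features.csv')['KEY']]

for year_ in years:
    c = Census(CENSUS_API_KEY, year=year_)
    features = ['NAME'] + feature_keys
    # B25034_011E is only requested for 2015 and later.
    if year_ < 2015 and 'B25034_011E' in features:
        features.remove('B25034_011E')
    # Request every census tract in Travis County (county FIPS 453), Texas.
    data = pd.DataFrame(c.acs5.get(features, geo={
        'for': 'tract:*',
        'in': 'state:{} county:453'.format(states.TX.fips)}))
    # Build the tract GEOID from the state, county, and tract codes.
    data['geoid'] = data['state'] + data['county'] + data['tract']
    data = data.set_index('geoid')
    data.to_csv('census_data_feature_selected' + str(year_) + '.csv')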

The datasets contain many different categories of features that are relevant to the study of residential gentrification. These include variables related to:

  • Housing (Values, Asking Prices, Homeownership, Rent)
  • Income
  • Demographics (Age, Race, Gender, Tenure, etc.)
  • Transportation (Commute Times, Car Ownership)
  • Amenities (Parks, Public Safety, Schools, etc.)

In order to find the best features to train on, we found the most effective (although tedious) method was to read through variables available and collect the aggregates of each feature (i.e. only the total median income rather than median income by gender, age, etc…).

Once we reduced the obscene number of features in the ACS survey to roughly 20 per year, we were able to start working on cleaning the data. We dropped three census tracts from Travis County due to their lack of information: 9800 and 2319 were zoned for the airport, while 1606 was dedicated to a State Supported Living Center. We replaced any null or invalid (negative) values with the mean of that feature. For latitude and longitude, we used the interpreted center of the shapefiles of each tract provided by the census data portal.
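The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration rather than our exact code: the function name `clean_tracts`, the toy column `median_rent`, and the way tracts are labeled in the index are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def clean_tracts(df, drop_tracts):
    """Drop the given tracts, then replace null or negative values with the column mean."""
    cleaned = df.drop(index=drop_tracts, errors="ignore").astype(float)
    # Negative values (such as large negative placeholder codes) are treated as invalid.
    cleaned = cleaned.mask(cleaned < 0)
    # Fill every missing entry with the mean of the valid values in its column.
    return cleaned.fillna(cleaned.mean())

# Toy data only; the index labels are hypothetical tract identifiers.
toy = pd.DataFrame(
    {"median_rent": [900.0, -666666666.0, 1100.0, np.nan, 500.0]},
    index=["11", "10", "23.04", "18.46", "9800"],
)
cleaned = clean_tracts(toy, drop_tracts=["9800", "2319", "1606"])
print(cleaned)
```

Note that the mean is computed only after the invalid entries are masked, so the placeholder values do not distort it.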

Once the data had been collected and cleaned, we could begin our Exploratory Data Analysis and attempt to find the best target to serve as a proxy for gentrification.


Data Exploration

Gentrification is a process that occurs over time. Since the most consistent data (in terms of variable naming and time resolution) was the ACS data from 2010–2016, we decided to analyze patterns on a yearly basis over this period. This is in contrast to other work that uses the true 10-year census data from 1990 and 2000 to predict 2010 values (thus only three points in time).

We decided to first visualize some of the features defined as indicators of gentrification by overlaying them on a map of Austin. In order to do this, we needed to connect to a different Census API called TIGER. This API provides shapefiles that define the geographic areas tied to a census tract or block group number. Once we had the shapefiles, we were able to feed the data into a plotting tool called Social Explorer to create heatmaps of features.

One of the most interesting features to us was the Rent Represented as a % of Gross Income. The generally accepted rule of thumb tends to be that rent should be roughly a third of your income. Using a heatmap of Austin with this feature, it is clear that this rule is widely disregarded.

Furthermore, we wanted to understand how this feature changed over time. For example, would a gradual increase in the rent-income ratio represent increasing rent or decreasing income? Conversely, would a decrease in the ratio represent higher income or lower rent? Finally, would a rapid change in the % of income used for rent represent a new population moving in, since it would be unusual for the majority of people in a census tract to all experience a large increase in income? To explore these questions, we made several plots of the % of income used for rent over time for individual census tracts, comparing each tract with the mean and showing the standard deviation across all census tracts.
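The comparison behind these plots can be sketched as below: for each year, compute the mean and standard deviation across all tracts, then express one tract's value relative to them. The function name `tract_vs_citywide` and the numbers are hypothetical and only illustrate the calculation.

```python
import numpy as np

def tract_vs_citywide(ratio_by_year, tract_row):
    """Return the per-year mean, per-year std, and the chosen tract's z-score per year."""
    data = np.asarray(ratio_by_year, dtype=float)  # shape: (n_tracts, n_years)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    z = (data[tract_row] - mean) / std
    return mean, std, z

# Made-up rent-income percentages for three tracts over three years.
demo = [
    [30.0, 32.0, 28.0],
    [40.0, 38.0, 26.0],
    [50.0, 50.0, 30.0],
]
mean, std, z = tract_vs_citywide(demo, tract_row=1)
print(mean, std, z)
```

A tract whose z-score moves sharply from one year to the next is changing faster than the rest of the city, which is the kind of pattern we looked for in the plots.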

For example, here is Census Tract 11, representing downtown Austin. Here the rent as a percentage of income is fairly low over time. Domain knowledge suggests that this is not due to low rent (it is downtown after all), but more likely due to the income level. Except for the sharper increase from 2013–2014, the changes from year to year are also gradual, suggesting that the population remains rather homogeneous in terms of income level from year to year.

Comparing this census tract with Census Tract 10 (East Cesar Chavez, East I35), you can see far more dramatic changes in the rent-income ratio. For example, from 2015–2016, this census tract, which normally used more than 35% of income on rent, dropped dramatically to less than 30%. This drop also follows a gradual increase in rent. We believe the gradual increase in the rent-income ratio could be caused either by the existing population having to use more of their income on rent (until they were displaced by a different population that could afford that rent) or by legislation preventing rent price increases.

A third example is census tract 18.46, which is across from the Domain (an area in Austin of recent rapid growth). Here you can see that until 2013, most residents were using over 50% of their income on rent. Yet in three years between 2013 and 2016, this number dropped dramatically to 25%, suggesting either more housing supply, or a higher income population moving in.

Finally, this is a comparison between the rent value and the rent-income ratio of Census Tract 23.04, both standardized by the sample mean for that feature. Here, the pattern is both dramatic and unintuitive. From 2015–2016, you see a sharp increase in the rent price and a decrease in the rent-income ratio. Since Rent-Income Ratio = Rent / Income, an increase in rent should increase the ratio if income stays the same. If the ratio decreases instead, it is a clear sign that income must have increased dramatically, due to a higher income population moving in. This is supported by the fact that a set of new developments opened their apartments in 2015, suddenly increasing both the rent price of that area and the income of its new residents (assuming that those who can move into a brand new apartment will tend to have higher income than those who cannot).
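A small worked example makes the arithmetic concrete. The figures below are invented for illustration and are not taken from Tract 23.04; they only show that the ratio can fall while rent rises, as long as income rises faster.

```python
def rent_income_ratio(monthly_rent, monthly_income):
    """Share of income spent on rent."""
    return monthly_rent / monthly_income

# Hypothetical "before" population: lower rent, lower income.
before = rent_income_ratio(1000, 2500)

# Hypothetical "after" population: rent up 30%, income more than doubled.
after = rent_income_ratio(1300, 5200)

print(f"before: {before:.2f}, after: {after:.2f}")
```

Here rent rises from 1000 to 1300 while the ratio falls from 0.40 to 0.25, which is only possible because income grew much faster than rent.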

Another feature we looked at was the percentage of housing units occupied by renters. In neighborhoods where fewer people own their own homes, residents may be more vulnerable to displacement.

In addition to visualizing data related to housing prices and income, we also visualized the change in demographics for census tracts. We wanted to look for tracts that stood out from the others or closely correlated with the rent visualization to provide a stronger indication of gentrification occurring. Once again, we compare the change in features over time of Census Tract 11 (Downtown) to Census Tract 23.04 (Lakeshore/Riverside).

As can be seen for downtown, the Hispanic population makes up approximately 10% with very little change over time. Comparing this to Lakeshore/Riverside, we see a gradual decrease in the proportion of Hispanic people until 2014, after which it sharply drops off. To emphasize the scale, Tract 23.04 went from over 80% Hispanic to roughly 40%.

If you recall the graph comparing the rent-income ratio to the rent value, tract 23.04 also experienced sharp changes from 2014–2016. While we are not suggesting causality, we believe this correlation is a strong indicator of new development that is displacing current populations.

A clear challenge of this task was the sheer number of possible features relative to the small number of observations: only 215 usable census tracts, compared with at least 102 features that we selected from a set of over 300, which was itself selected from a total set of 20,000 per year. We have clearly hit the curse of dimensionality. The first thing we explored was how much information was actually in all of our features and whether it would be possible to do a low rank approximation (although it would not be a very interpretable model). By standardizing the features and performing SVD, we were able to plot the ordered singular values of the transformation. As that plot shows, there is still a lot of information contained in at least the top 20–40 directions, and thus we do not believe that a low rank approximation would achieve any significant improvement.
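The singular value check can be sketched as follows. This is an illustrative version on synthetic data with the same shape as our feature matrix (215 tracts by 102 features); the function name `singular_value_spectrum` and the 90% cutoff are assumptions for the example, not values from our analysis.

```python
import numpy as np

def singular_value_spectrum(X):
    """Standardize each column, then return singular values and cumulative energy."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    s = np.linalg.svd(Z, compute_uv=False)  # returned in descending order
    energy = np.cumsum(s**2) / np.sum(s**2)
    return s, energy

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(215, 102))  # synthetic stand-in, not census data
s, energy = singular_value_spectrum(X_demo)

# Number of directions needed to capture 90% of the squared singular values.
k90 = int(np.searchsorted(energy, 0.90) + 1)
print(k90)
```

If only a handful of directions were needed to reach the cutoff, a low rank approximation would be attractive; a slowly decaying spectrum like the one we observed suggests it is not.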

Due to this high dimensionality (and also after the failure of simpler models), it became apparent that we would have to use models that are well equipped to deal with large feature spaces such as support vector machines and random forests.

Finally, we singled out 2016 and used a pairplot to check for dependencies between the census features we were using. In this intermediary pairplot, you can clearly see linearly dependent features that had to be removed, and others that had to be transformed due to skew.


Modeling

As explained earlier, the high dimensionality of the data pushed us toward models suited to large feature spaces, so we opted for XGBoost. To use rent as a proxy for gentrification, we tried to predict rent prices in 2016 using only data from 2014 and prior. With this model, we achieved an RMSE of 234.5 on the rent value data. Unfortunately, upon further inspection we found that the standard deviation of the data was 298.95, and the residuals were in fact linear with respect to the target. The linear residual plot suggested that the model had learned nothing, as the error was directly related to the difference between the target and the mean. Thus, our model simply predicted a value near the average of the training set, which is equivalent to having learned nothing.
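To see why residuals that are linear in the target are a warning sign, consider a model that always predicts the mean. The sketch below uses synthetic rents (not our data) to show that such a baseline has an RMSE equal to the standard deviation of the target, and residuals that fall on a line of slope 1 against the target. The last line compares the two figures reported above, assuming both were computed on the same tracts.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1000.0, scale=300.0, size=215)  # synthetic rents, illustration only

# A "model" that ignores every feature and always predicts the mean.
baseline_pred = np.full_like(y, y.mean())
residuals = y - baseline_pred
baseline_rmse = np.sqrt(np.mean(residuals**2))

# Slope of the residuals plotted against the target.
slope = np.polyfit(y, residuals, 1)[0]

# Ratio of the reported RMSE to the reported standard deviation.
reported_ratio = 234.5 / 298.95

print(baseline_rmse, y.std(), slope, round(reported_ratio, 3))
```

The closer a model's RMSE is to the standard deviation of the target, and the closer its residual slope is to 1, the more it behaves like this mean-only baseline.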

Furthermore, this model found that the most important feature for predicting rent in 2016 was the rent in 2014, with a feature importance of 0.68, whereas most other features were below 0.1. Thus, we believe this model was disregarding most of the census data and focusing instead on the time series of previous prices.

In addition to regression, we believed that recasting the problem as a classification problem might result in better performance. To do so, we used various rates of change in rent (i.e. 2014–2016 and 2010–2016, each thresholded at increases of > 25%, > 50%, and > 100%) to separate the tracts into two classes. For example, here is a visual of the rent-income ratio in 2010 (x-axis) vs. 2016 (y-axis), where the orange markers are major changes, and blue is roughly the same.
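The labeling step can be sketched like this. The function name `label_large_increases` and the rent values are hypothetical; only the three thresholds come from the description above.

```python
import numpy as np

def label_large_increases(rent_start, rent_end, threshold):
    """Label a tract 1 if its relative rent increase exceeds the threshold, else 0."""
    rent_start = np.asarray(rent_start, dtype=float)
    rent_end = np.asarray(rent_end, dtype=float)
    pct_change = (rent_end - rent_start) / rent_start
    return (pct_change > threshold).astype(int)

# Made-up rents for four tracts.
rent_2014 = [800, 1000, 1200, 900]
rent_2016 = [1250, 1100, 1600, 1900]

labels_25 = label_large_increases(rent_2014, rent_2016, 0.25)
labels_50 = label_large_increases(rent_2014, rent_2016, 0.50)
labels_100 = label_large_increases(rent_2014, rent_2016, 1.00)
print(labels_25, labels_50, labels_100)
```

Raising the threshold shrinks the positive class quickly, which matters with only a couple hundred tracts to begin with.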

Once we had split the data into two classes, we tried models such as SVM, XGBClassifier, Random Forests, and standard Logistic Regression. We believe that, due to the small dataset size, almost all of these models performed poorly under 3-fold cross-validation with average ROC AUC as the metric, in some cases performing no better than a random classifier and occasionally scoring below 0.5.
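For readers less familiar with the metric, ROC AUC can be read as the fraction of (positive, negative) pairs that a model ranks in the right order. The from-scratch sketch below is only meant to illustrate that reading; it is not the scoring routine we used, and the function name `roc_auc` and the scores are made up.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
auc_good = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])      # mostly correct ranking
auc_constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])   # no information
auc_inverted = roc_auc(labels, [0.9, 0.6, 0.65, 0.2])  # mostly wrong ranking
print(auc_good, auc_constant, auc_inverted)
```

A score near 0.5 means the ranking carries no information, and a score below 0.5 means the ranking is worse than chance on that fold, which can happen through noise alone when each fold contains few positive tracts.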


Conclusion

While the exploratory data analysis seemed promising, we were unable to find a model that reliably predicted rent price increases based purely on census data. Upon analyzing all attempted models, it seemed that no information was really learned from the census data; thus, we must have been missing the true predictors of gentrification.

One major issue we have realized is that all of our models are trained on isolated observations. That is to say, each census tract only contains data that pertains to itself, and there is no spatial context for the model to learn. The value of a location depends heavily on how it compares with its surroundings. For example, rent in one area may be reasonable compared to what is immediately around it but ridiculously high when compared to somewhere further away. Without this relative spatial context, our models cannot tell whether a tract's rent is lagging behind its surroundings or rising in isolation. For example, gentrification may result in a census tract's rent increasing rapidly when compared to its neighbors.

Additionally, our models lack any form of social context. Since we only used census data, we did not have information on when new companies planned to move into town, or on changes in legislation that either encouraged or prohibited development in certain areas. Since gentrification tends to be a result of human planning and development, this limits the effectiveness of our modeling. Without social context, our model would miss market factors such as Apple announcing a massive new office campus in North Austin.

To account for this, we believe that including spatial relative features could be promising, such as each feature compared with the average of the neighboring tracts. Additionally, we could include amenities and employer information, such as where the locations of major companies are and the type of work near each tract. It could be possible to find the cost of living within each census tract to map how locally dependent a tract is (i.e. does it depend on local affordable companies, or are there large chains such as Starbucks and Whole Foods).
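One way such a relative feature could be computed is sketched below. We did not implement this, so everything here is hypothetical: the function name `relative_to_neighbors`, the tract labels, the rents, and the adjacency map (which in practice might be derived from the tract shapefiles).

```python
def relative_to_neighbors(values, neighbors):
    """Ratio of each tract's value to the mean value of its neighboring tracts."""
    relative = {}
    for tract, value in values.items():
        adjacent = [values[n] for n in neighbors.get(tract, []) if n in values]
        if adjacent:  # skip tracts with no known neighbors
            relative[tract] = value / (sum(adjacent) / len(adjacent))
    return relative

# Toy example: three mutually adjacent tracts with made-up rents.
rents = {"A": 1500.0, "B": 1000.0, "C": 1000.0}
adjacency = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

rel = relative_to_neighbors(rents, adjacency)
print(rel)
```

A value well above 1 marks a tract that is expensive relative to its surroundings, and tracking how that ratio changes from year to year would give the model the spatial context it currently lacks.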

Finally, our goal seemed similar to anomaly detection due to the nature of the gentrified tracts versus the normal tracts. We were unable to get traditional anomaly detection algorithms working due to the time constraints of our project and the dimensionality of our feature space. To fix this, we could include data from other cities to get more data on anomalies; however, we would need a way to adjust for characteristics of each city and how each city experiences growth.

In conclusion, our goal was challenging. While it would have been easier to perform purely retrospective analysis, we attempted to predict gentrification two years in advance by holding out data for the two year window preceding our attempted prediction year. Thus, even though our EDA showed potential patterns and correlations in the data, in the end it seemed there were few predictive signals in the features we selected from the census data. Moving forward, our next steps would be to address the shortcomings in our model by accounting for spatial and social context as well as broadening the scope of our data.


References

[1] “Record-setting Austin home prices could ‘normalize’ in 2017,” KVUE, 08-Dec-2016.

[2] J. Anderson, “The Cost of Living Is Quickly Rising in These 20 US Cities,” GOBankingRates, 03-Dec-2018. [Online]. Available: https://www.gobankingrates.com/making-money/economy/cities-cost-living-rising-fastest/

[3] H. Way, E. Mueller, and J. Wegmann, “Uprooted: Residential Displacement in Austin’s Gentrifying Neighborhoods and What Can Be Done About It,” rep., Sep. 2018.

[4] K. Steif, A. Mallach, M. Fichman, S. Kassel, “Predicting gentrification using longitudinal census data”, Urban Spatial Analysis. [Online]. Available: http://urbanspatialanalysis.com/portfolio/predicting-gentrification-using-longitudinal-census-data/
