Using regression models to predict per capita and median household income in NYC

Published in Alien Status, Apr 6, 2018 (11 min read)

Before we delve into the nitty-gritty of this blog, we should mention that it is a requirement for our Data2020 midterm project. That said, we also realized that the topic is useful for all those big dreamers out there looking for a break in the Big Apple, so we hope everyone reads and enjoys it.

Project Introduction

Well, in a way, we’re doing this because we have to! But this is a good medium (no pun intended) for applying what we have learned in the Master’s program that we are all enrolled in.

We are attempting to develop robust regression models to predict the household median income and per-capita income using advanced regression techniques such as, but not limited to, cross-validated ridge and lasso, best subset selection, linear regression, generalized linear regression and multi-level models. Since the data consists of census tract spatial data, one of us aliens was extremely excited and created a bunch of cool visualizations.

So, keep reading and look out for those visualizations; they start popping up very soon!

Exploratory Statistics

We were provided two data sets. NYC Census Data consisted of census data such as gender, race, income, unemployment etc. sorted by census tract. NYC Location Data consisted of spatial data in the form of block codes, latitudes, longitudes and county, sorted by county.

Knowing what each data set represented, we set out to make sense of the data. We thought that the geography of NYC might be a factor in determining income levels, so we plotted the distributions of our desired outputs, i.e. income per capita and median income, by county and by borough. As can be seen in Fig 1 and 2, the distributions are exactly the same for county and borough, implying a one-to-one relationship between the two. Hence, we decided to drop the borough column and plot the distributions of income per capita and median income together, as can be seen in Fig 3.

Fig. 1 — Distributions of Income Per Capita by County and Borough

Fig 2 — Distributions of Median Income by County and Borough

Fig 3 — Distributions of Income Per Capita (solid) and Median Income (dashed) by County

The next step was to explore whether gender affects income level in NYC. When we fit a simple linear regression of median income and income per capita on the population of men and of women, there is no clear distinction: the fits look the same, as can be seen in Fig 4. This is because we are using the raw number of people in each gender category as a predictor, which mostly reflects the size of the census tract. Operating under the assumption that there is, unfortunately, a significant discrepancy between wages for men and women, the only way gender can be used to look at income is through the proportions of men and women (that is, by creating new variables that represent men/totalPop and women/totalPop).

Fig 4 — Fitted Linear Regression for Median Income and Income Per Capita for Women (Red) and Men (Green)

We took this one step further and regressed median income and income per capita on the fraction of men and women in the total population, which resulted in more distinct fits, as can be seen in Fig 5. This was consistent with our assumption, since both per capita and median household income increased as the proportion of men went up. The regression line for women is the exact mirror image, because the two proportions sum to 100%. The two predictors therefore carry the same information, and we should use one or the other but not both. However, looking at the plots and the R-squared values, we can also see that a simple linear regression does not capture the data very well.

Fig 5 — Fitted Linear Regression for Median Income and Income Per Capita for Women (Red) and Men (Green) as a Fraction of Total Population
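The men/totalPop and women/totalPop regressions described above can be sketched in a few lines. This is only an illustration: the five tracts are made-up numbers, not rows from the NYC data, and `fit_line` is a hypothetical helper written for this post rather than the function we actually used. It shows why the two fits are mirror images: the slopes are equal and opposite, and the R-squared values are identical.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Made-up census tracts (illustrative values only).
tracts = [
    {"Men": 2100, "Women": 1900, "TotalPop": 4000, "Income": 61000},
    {"Men": 1800, "Women": 2200, "TotalPop": 4000, "Income": 48000},
    {"Men": 2600, "Women": 2400, "TotalPop": 5000, "Income": 58000},
    {"Men": 1500, "Women": 1700, "TotalPop": 3200, "Income": 52000},
    {"Men": 2900, "Women": 2500, "TotalPop": 5400, "Income": 70000},
]

# Proportions instead of raw counts, so tract size no longer dominates.
men_share = [t["Men"] / t["TotalPop"] for t in tracts]
women_share = [t["Women"] / t["TotalPop"] for t in tracts]
income = [t["Income"] for t in tracts]

_, slope_men, r2_men = fit_line(men_share, income)
_, slope_women, r2_women = fit_line(women_share, income)

print(f"men:   slope={slope_men:.0f}, R^2={r2_men:.3f}")
print(f"women: slope={slope_women:.0f}, R^2={r2_women:.3f}")
```

Because the two shares sum to one, including both in the same model would make the predictors perfectly collinear, which is why only one of them should be kept.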

We decided to delve further and see whether racial groups had an effect on income levels. Following a similar approach to the one for gender, we fit simple linear regressions of median income and income per capita with the proportion of each race as the predictor variable, and noticed that the fitted income levels differ distinctly between races. For example, the fitted lines for white and Hispanic people slope in clearly opposite directions. Coincidentally, those two groups have the steepest slopes and the largest R-squared values.

Fig 6 — Fitted Linear Regression for Income Per Capita and Median Income by Proportion of Races

After looking at demographic trends we wanted to see what we’d find once we represented the data geographically. We created a few simple plots of the geographical dataset to see what we were working with.

Fig 7 — Plot geographical dataset to see data as per state and county

We noticed that the geographical data extends much further than the analytic dataset, including parts of New Jersey and counties we didn’t even know. Our first step was to reduce the geographical dataset to only our counties of interest.

Fig 8 — Plot geographical data for data points in NY only

As you can see in this plot, the census tracts in the geographic dataset are defined as points rather than polygons, so the dataset is unique by point, not by census tract. To fix this and aggregate the dataset to census tract level (mimicking the format of the analytic dataset), we polygonized each set of points so that each census tract is represented as an individual geometric object in geographic space (within a system of Euclidean coordinates).
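To give a feel for what "polygonizing" a set of points can mean, here is a minimal sketch that groups points by census tract and wraps each group in its convex hull. The convex hull is an assumption made for illustration; it is one simple way to turn points into a polygon and may differ from the method behind our maps (real tracts are not necessarily convex). The names `convex_hull` and `polygonize`, and the toy coordinates, are hypothetical.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


def polygonize(rows):
    """Group (tract_id, lon, lat) rows by tract and return one hull per tract."""
    by_tract = defaultdict(list)
    for tract_id, lon, lat in rows:
        by_tract[tract_id].append((lon, lat))
    return {tract_id: convex_hull(pts) for tract_id, pts in by_tract.items()}


# Toy coordinates, not real NYC longitudes and latitudes.
rows = [
    ("tract_a", 0.0, 0.0), ("tract_a", 1.0, 0.0), ("tract_a", 1.0, 1.0),
    ("tract_a", 0.0, 1.0), ("tract_a", 0.5, 0.5),
    ("tract_b", 2.0, 0.0), ("tract_b", 3.0, 0.0), ("tract_b", 2.5, 1.0),
]
polygons = polygonize(rows)
print(polygons["tract_a"])
```

After this step there is one geometric object per tract, so the result can be joined to the analytic dataset on the tract identifier.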

Our next step was to create plots that would further explore trends in the data. For example, we looked at the relationship between unemployment, race and income.

In the plots below, the color of the polygon represents the unemployment level, the size of the dot represents the proportion of the given racial group in a census tract, and the color of the dot represents the income level. All of the plots follow the same scale.

Even at first glance, certain trends become clearly visible. For example, the red dots are mostly concentrated in Manhattan, but the sizes of the dots change between the plots, indicating a very large white population, a smaller Hispanic population and a significantly smaller African-American population there. The Hispanic population appears very large in Queens, Brooklyn and, most significantly, the Bronx; all of these areas have mostly yellow and orange dots, indicating much lower median household income. Lower-income black households seem to be concentrated in the Bronx, Kings and Queens. There appears to be a concentration of black households with mid (orange) to high incomes in the area around JFK.

Seeing as both dot sizes and colors are not randomly scattered around the map, but seem to accumulate in certain areas we can tell that there seems to be a county-dependent relationship between race and income.

Fig 9 — Plots showing unemployment and income levels according to race (White, Black, Hispanic), overlaid onto the map of NY

To explore these visualizations interactively for each of the above demographics, use the following links:

You can click on the individual dots or census points to find out more. The visualization is overlaid on top of a simple map — press the square button on the top left below the zoom in/zoom out options to switch to street view (OpenStreetMap) or topographic map (Esri.WorldTopoMap).

Following our exploration of geographical trends for median household income, we decided to see how they relate to income per capita. The plots below illuminate the relationship slightly. The differences between dot sizes are much larger for income per capita, which is consistent with Fig 3, which compared the distributions of median and per capita income by county: the median incomes were centered around higher values and their distributions were more spread out.

Fig 10 — Median Income and Income Per Cap by County

In order to efficiently compare the different variables and represent them graphically, we decided to look at majority values for each census tract. For each group of variables that sums to 100% (race, job types, commute types, and employment types), we created a single variable that records which category occurs most often.
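The majority variable described above boils down to picking, for each tract, the column with the largest share within a group. A minimal sketch, assuming percentage columns named as below (`Walk` in particular is an assumed column name, and the two tracts are invented for illustration):

```python
# Column groups whose shares sum to (roughly) 100% within a tract.
# The column names are assumptions for this sketch.
RACE_COLUMNS = ["Hispanic", "White", "Black", "Asian"]
COMMUTE_COLUMNS = ["Drive", "Transit", "Walk", "WorkAtHome"]


def majority_category(row, columns):
    """Return the name of the column with the largest value in this row.

    Ties go to the column listed first.
    """
    return max(columns, key=lambda name: row[name])


# Invented tracts, not rows from the NYC data.
tracts = [
    {"CensusTract": "tract_a", "Hispanic": 62.0, "White": 10.5, "Black": 22.0,
     "Asian": 5.5, "Drive": 20.0, "Transit": 60.0, "Walk": 15.0, "WorkAtHome": 5.0},
    {"CensusTract": "tract_b", "Hispanic": 8.0, "White": 70.0, "Black": 4.0,
     "Asian": 18.0, "Drive": 10.0, "Transit": 35.0, "Walk": 45.0, "WorkAtHome": 10.0},
]

for tract in tracts:
    tract["MajorityRace"] = majority_category(tract, RACE_COLUMNS)
    tract["MajorityCommute"] = majority_category(tract, COMMUTE_COLUMNS)
    print(tract["CensusTract"], tract["MajorityRace"], tract["MajorityCommute"])
```

Note that "majority" here really means the most common category: the winning share can be below 50% when a tract is mixed.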

Fig 11 — Income Per Capita levels by the most occurring groups (racial, commute)

Once again we can see that there are clearly areas that are mostly populated by a particular racial group, with the highest-income area, Manhattan, mostly populated by white people. Most other areas had significantly lower per capita income (smaller dots). The Bronx has a Hispanic majority, upper Queens has a large Asian population, Brooklyn and lower Queens have a large African American population, and Staten Island is dominated by lower-income white people.

The commute graph is somewhat intuitive: census tracts closest to central Manhattan have a majority of walkers, probably because they work in the Manhattan area. The next “wave” of census tracts has a majority of people using transit, and the census tracts furthest from Manhattan have a majority of people who drive to work.

Fig 12 — Income Per Capita by the most occurring groups (job type, employment type)

The job and employment type graphs are somewhat less diverse. The job graph shows that the majority of census tracts throughout our area of interest have people working in management, business, science, and arts. Some census tracts in the Bronx, Queens, and Brooklyn had a majority of people working in service jobs and then some office jobs. Production and construction majorities were much less visible.

Lastly, employment type seemed monopolized by private work, with a single census tract having a majority of self-employment and one other a majority of public work.

Model Selection

Having done our exploratory work, we decided to drop the borough column and convert the county column into dummy variables. However, our data set still had over 25 predictor variables with high potential for collinearity. Hence, we plotted a correlation visualization (probably the only pretty looking visualization in the model selection section) to help us drop those predictors that are highly correlated to other predictors, which can be viewed in Fig 13.

Based on this, we decided to drop CensusTract, Men, Women, Citizen, White, TotalPop, IncomeErr, IncomePerCapErr, ChildPoverty, Service, Transit, Professional, and PublicWork.
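The kind of check behind this pruning can be sketched as follows: compute the pairwise correlations and list the pairs whose absolute correlation exceeds a cutoff. The 0.8 cutoff, the helper name `highly_correlated_pairs` and the synthetic data are all assumptions for illustration; we chose our drops by eye from the correlation plot.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.8):
    """List (name_i, name_j, correlation) for pairs with |correlation| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic data: Men is built from TotalPop, Unemployment is independent.
rng = np.random.default_rng(0)
total_pop = rng.uniform(1000, 8000, size=200)
men = 0.48 * total_pop + rng.normal(0, 100, size=200)
unemployment = rng.uniform(2, 20, size=200)

X = np.column_stack([total_pop, men, unemployment])
names = ["TotalPop", "Men", "Unemployment"]

pairs = highly_correlated_pairs(X, names, threshold=0.8)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

For each flagged pair, one of the two columns is a candidate to drop; which one to keep is a judgment call about interpretability.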

We split up our data set into a 70% training set, which we would use to train our models, and 30% testing set, which would be held out and used solely to test our predictions. Our metric for comparing models is going to be the Root Mean Squared Error (RMSE) between predicted values and test values. This is a good measure to check for performance of the model on the held out test set, thereby seeing how well the model predicts on values that it hasn’t seen before.
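For readers who have not met it, RMSE is the square root of the average squared difference between predicted and actual values, so it is in the same units as income (dollars). Below is a minimal sketch of a 70/30 split and the RMSE calculation. The helper names and the seed are assumptions for illustration, not our exact code.

```python
import math
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rows = list(range(100))  # stand-in for 100 census tracts
train, test = train_test_split(rows, test_fraction=0.3)
print(len(train), len(test))

# Every prediction is off by exactly 2, so the RMSE is 2.
example_rmse = rmse([10, 20, 30, 40], [12, 18, 32, 38])
print(example_rmse)
```

The important discipline is that the model only ever sees `train` while it is being fit; `test` is touched once, to compute the score.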

We then attempted several combinations of Principal Component Analysis (PCA) and subset selection to develop the best set of features.

After several attempts using linear, Gamma, Poisson, and cross-validated ridge and lasso models, we soon (not so soon) realized that PCA does not work well, as it gives us high RMSE values. This makes sense because we had already removed the highly collinear predictors, so there was little redundancy left for PCA to compress. Keeping only the leading components then risks discarding variation in the predictors that is informative about income, in the test set as well as the training set.

Fig 14 — PCA to predict median income and income per capita; very high RMSE as compared to later model
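One way PCA can hurt prediction is easy to reproduce on synthetic data: principal components are ranked by how much predictor variance they explain, not by how well they predict the response, so a low-variance direction that matters for the outcome gets thrown away. The sketch below is a toy principal component regression built on NumPy's SVD. It illustrates that mechanism under made-up data and is not a diagnosis of what happened in our models; `pcr_predict` is a hypothetical helper.

```python
import numpy as np


def pcr_predict(X_train, y_train, X_new, n_components):
    """Regress y on the first n_components principal components of standardized X."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    Z = (X_train - mean) / std
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    components = vt[:n_components].T
    design = np.column_stack([np.ones(len(y_train)), Z @ components])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    new_scores = ((X_new - mean) / std) @ components
    return coef[0] + new_scores @ coef[1:]


def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Synthetic predictors: x1 and x2 are strongly correlated, and the response
# depends on their small difference, which is a low-variance direction.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])
y = 5.0 * (x1 - x2) + 0.1 * rng.normal(size=n)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

rmse_by_k = {
    k: rmse(y_test, pcr_predict(X_train, y_train, X_test, k)) for k in range(1, 5)
}
for k, score in rmse_by_k.items():
    print(f"{k} component(s): test RMSE = {score:.3f}")
```

With all four components the fit is equivalent to ordinary least squares on the standardized predictors, so any gap between the first and last rows is the cost of discarding components.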

For our best subset, we chose “Hispanic, Black, Asian, Income, IncomePerCap, Poverty, Office, Construction, Production, Drive, WorkAtHome, Employed, PrivateWork, Unemployment, County” and worked with this subset for all future models. With the subset in hand, we proceeded to fit our different models: Linear Regression (Linear), Linear Regression with Gamma errors (Gamma), Linear Regression with Poisson errors (Poisson), Cross Validated Lasso (Lasso) and Cross Validated Ridge (Ridge).
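"Cross validated" here means that the penalty strength is chosen by k-fold cross-validation on the training data rather than fixed in advance. The sketch below shows the idea for ridge, which has a closed-form solution. Lasso swaps the squared penalty for an absolute-value penalty, has no closed form, and is omitted here. The data are synthetic, and the helper names, the grid of penalties and the five folds are assumptions for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta


def cv_rmse(X, y, lam, k=5, seed=0):
    """Average validation RMSE of ridge with penalty lam over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # everything not in this fold
        intercept, beta = ridge_fit(X[train], y[train], lam)
        pred = intercept + X[fold] @ beta
        errors.append(np.sqrt(np.mean((y[fold] - pred) ** 2)))
    return float(np.mean(errors))


def choose_lambda(X, y, lambdas, k=5):
    """Return the penalty with the lowest cross-validated RMSE, plus all scores."""
    scores = {lam: cv_rmse(X, y, lam, k) for lam in lambdas}
    return min(scores, key=scores.get), scores


# Synthetic training data with a known linear signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 10.0 + X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_lambda, scores = choose_lambda(X, y, lambdas)
print("best lambda:", best_lambda)
```

On real data the predictors would normally be standardized first, since the penalty treats all coefficients on the same scale.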

Interestingly, we realized that the same model did not perform the best for both predictions.

RMSE Scores for Median Income

Based on the RMSE scores for Median Income, Lasso performs the best with a score of 10,976.22 (5% of the highest median income). This is where splitting the data and using the RMSE proved useful. Before we split the data, the Gamma and Poisson models looked like our best models because their pseudo R-squared values (we had to manually calculate a pseudo adjusted R-squared) were the highest, at around 0.9 for both. But after splitting the data into training and testing sets, we noticed that the Gamma and Poisson models performed the worst on the held-out test set. Hence, we had a vivid case of our models overfitting to the training data.

Fig 16 — Coefficient plot (left) and predicted vs actual plot (right) for Lasso with best subset selection

Hence, Cross Validated Lasso is our best model for predicting Median Income.

RMSE Scores for Income Per Capita

Based on the RMSE for Income Per Capita, we find that the Linear Regression has the lowest RMSE, with a value of 7,572.28 (16% of the highest income per capita value). In fact, it even performs well in the fancy diagnostic plots.

Fig 18 — Diagnostic plots and predicted vs actual for Linear with best subset selection

Let’s take a closer look at the diagnostic plots. In the Residuals vs Fitted plot, we can see that the residuals meet the regression assumptions very well. There is a linear relationship between the predictors and the outcome, with a slight deviation towards the end. This is not surprising, as we have far fewer data points at the extreme right of the plot. Moreover, the residuals do not seem to have a pattern and are scattered randomly around the centre, suggesting that we are not missing any important predictors. In the QQ-plot, the points fall almost on a straight line, so the residuals are approximately normally distributed. In the Scale-Location plot, we see an almost horizontal line with equally (randomly) spread points, so the residuals have roughly equal variance across the range of fitted values. Lastly, in the Residuals vs Leverage plot, no point lies outside the Cook’s distance contours, so we can conclude that there is no influential point in this case. Hence, this model performs very well on the diagnostic plots too!
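The quantities behind these four plots can be computed directly. The sketch below fits ordinary least squares on synthetic data and returns fitted values, residuals, leverage (the diagonal of the hat matrix), standardized residuals and Cook's distance. The function name `ols_diagnostics` and the data are assumptions for illustration, and the 0.5 cutoff is only a common rule of thumb for flagging influential points.

```python
import numpy as np


def ols_diagnostics(X, y):
    """Return fitted values, residuals, leverage, standardized residuals, Cook's distance."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add an intercept column
    p = design.shape[1]
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    residuals = y - fitted
    # Hat matrix H = Q Q^T for a reduced QR of the design, so diag(H) = row sums of Q^2.
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q ** 2, axis=1)
    mse = residuals @ residuals / (n - p)
    standardized = residuals / np.sqrt(mse * (1.0 - leverage))
    cooks = standardized ** 2 * leverage / (p * (1.0 - leverage))
    return fitted, residuals, leverage, standardized, cooks


# Synthetic data that satisfies the linear model assumptions.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

fitted, residuals, leverage, standardized, cooks = ols_diagnostics(X, y)

# Rule-of-thumb flag for influential observations.
influential = np.flatnonzero(cooks > 0.5)
print("max Cook's distance:", float(cooks.max()))
print("flagged observations:", influential.tolist())
```

Plotting `residuals` against `fitted`, the sorted `standardized` values against normal quantiles, the square root of their absolute values against `fitted`, and `standardized` against `leverage` gives the four panels discussed above.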

Hence, a Log Transformed Linear Regression is our best model for predicting Income Per Capita.