Correlate… ALL THE THINGS: Part nulla
This is the background/definitions post for a four-part series examining the interesting relationships that turn up when you correlate every characteristic/attribute of fine-grained (Census Tract) geographies in the mySidewalk database for a city/metro and then pick out the ones worth a closer look.
Outline
- Part I addresses my hometown — Omaha (the Omaha–Council Bluffs metropolitan area) because it’s relatively small (fast to calculate) and I feel comfortable pretending to be an expert on it
- Part II addresses NYC because it’s large, diverse, dense, and should be interesting to analyze
- Part III addresses San Francisco for all the same reasons as NYC but in different ways and on the opposite coast
- Part IV will use correlations for every tract in the United States as a baseline to observe interesting differences in how variables interact across the 3 regions examined previously
Background and definitions


Census Tract — A small, relatively permanent statistical subdivision of a county or equivalent entity that is updated by local participants prior to each decennial census as part of the Census Bureau’s Participant Statistical Areas Program.
Correlation Coefficient — A quantitative measure of the statistical relationship between two random variables or two sets of observed data values. For this series, we will be using Pearson’s r as our correlation coefficient. It ranges from -1 to 1: 1 indicates a total positive correlation (the two variables rise and fall together in a perfectly linear way), 0 indicates no linear correlation (knowing one variable tells you nothing about the straight-line trend of the other, though a non-linear relationship may still exist), -1 indicates a total negative correlation (one variable falls in a perfectly linear way as the other rises), and values in between indicate the relative strength and direction of the correlation.
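To make the definition concrete, here is a minimal sketch of Pearson’s r in plain Python. It is only an illustration, not the code behind this series: the function name `pearson_r` and the five-tract sample values are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each observation from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of cross-products, and the two sums of squares.
    cross = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical values for five tracts (invented for illustration).
income = [30, 45, 60, 75, 90]
rising = [2 * v + 10 for v in income]   # perfectly linear, increasing
falling = [100 - v for v in income]     # perfectly linear, decreasing

# A symmetric curve: clearly related to x, yet with no linear trend.
x = [1, 2, 3, 4, 5]
curved = [(v - 3) ** 2 for v in x]

r_rising = pearson_r(income, rising)
r_falling = pearson_r(income, falling)
r_curved = pearson_r(x, curved)
```

The last case is the one to remember when reading the rest of the series: a coefficient near 0 rules out a straight-line relationship, not every relationship.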
Characteristics — A fancy way to say “data observed for a particular geography”. Our application contains ~350 different characteristics. (That number will likely date this article, as the dataset is literally growing exponentially; that is, we are geometrically increasing the rate at which data is added, which would be incredibly exciting if I weren’t one of the poor schmucks who has to make sure your pretty charts about age group distribution still load in less than 150 milliseconds, no matter what.) Of those, 213 were well populated and appropriate for correlation in the Omaha–Council Bluffs metro study area, across 282 census tracts, yielding 22,578 pairwise correlations.
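The 22,578 figure is just the number of distinct pairs you can form from 213 characteristics, where each pair is counted once and nothing is paired with itself. A quick sketch to check the arithmetic follows; the `characteristic_N` names are placeholders, since the real characteristic names are not listed in this post.

```python
from itertools import combinations
from math import comb

n_characteristics = 213  # figure quoted above for the Omaha-Council Bluffs metro

# Hypothetical placeholder names standing in for the real characteristics.
names = [f"characteristic_{i}" for i in range(n_characteristics)]

# Every unordered pair (a, b) with a != b, each counted exactly once.
pairs = list(combinations(names, 2))
pair_count = len(pairs)

# Closed form for the same count: n * (n - 1) / 2.
closed_form = n_characteristics * (n_characteristics - 1) // 2

print(pair_count)
```

Each of those pairs gets one correlation coefficient, computed across the 282 tracts.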


Normalization — “The creation of shifted and scaled versions of statistics, where the intention is that these normalized values allow the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences” (via Wikipedia). In this context, the gross influences are most typically the large universes/populations from which the measurements were drawn. To facilitate better correlation analysis, I’ve normalized as many characteristics as possible, an improvement over the methodology in Part I.
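As a rough illustration of what normalizing by a universe can look like, the sketch below divides a raw count by the population it was drawn from, turning it into a share. This is an assumption about the general idea, not a description of the exact method used in this series; the function name, the characteristic, and every number are invented for the example.

```python
def normalize_by_universe(counts, universes):
    """Convert raw counts to shares of their universe.

    Returns None for a tract whose universe is zero, since a share
    is undefined there.
    """
    shares = []
    for count, universe in zip(counts, universes):
        shares.append(count / universe if universe else None)
    return shares


# Hypothetical data for three tracts (made up for illustration).
bachelors_count = [200, 1000, 50]        # residents holding a bachelor's degree
population_25_plus = [1000, 4000, 0]     # the universe those counts come from

shares = normalize_by_universe(bachelors_count, population_25_plus)
print(shares)
```

Without this step, a big tract would look "high" on nearly every count-based characteristic at once, and those characteristics would all correlate with each other simply because they all track population size.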