Correlate… ALL THE THINGS: Part nulla

This is the background/definitions post for a four-part series examining the relationships that emerge when you correlate every characteristic/attribute of fine-grained (Census Tract) geographies in the mySidewalk database for a city/metro and then pick out the interesting ones.

Outline

Background and definitions

“Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.” — Image and hilarity credit to xkcd.com

Census Tract — A small, relatively permanent statistical subdivision of a county or equivalent entity that is updated by local participants prior to each decennial census as part of the Census Bureau’s Participant Statistical Areas Program.

Correlation Coefficient — A quantitative measure of some type of correlation and dependence, meaning statistical relationships between two or more random variables or observed data values. For this series, we will be using Pearson’s r, which measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1: 1 indicates a total positive correlation (the points fall exactly on an upward-sloping line), 0 indicates no linear correlation (the variables may still be related in some nonlinear way), and -1 indicates a total negative correlation (the points fall exactly on a downward-sloping line). Values in between indicate the relative strength and direction of the correlation.
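To make the -1/0/1 endpoints concrete, here is a minimal sketch of Pearson’s r in plain Python. The function name `pearson_r` and the toy numbers are made up for illustration; this is the textbook formula, not necessarily how the series computes it.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of cross-products and sums of squares (the 1/n factors cancel).
    cross = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


# Toy data (invented): points on a rising line, a falling line, and a parabola.
perfect_positive = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
perfect_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
no_linear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The last case is the one worth remembering: y is completely determined by x (it is x squared), yet r comes out at zero because the relationship is not linear.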

Characteristics — A fancy way to say “data observed for a particular geography”. Our application contains ~350 different characteristics, 213 of which were well populated and appropriate for correlation in this Omaha-Council Bluffs metro area of study. Across 282 census tracts, that yields 22,578 pairwise correlations. (The ~350 figure will likely date this article, as the dataset is literally growing exponentially; that is, we are increasing the rate at which data is added geometrically. That would be incredibly exciting if I weren’t one of the poor schmucks who has to make sure your pretty charts about age group distribution still load in less than 150 milliseconds, no matter what.)
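If you are wondering where 22,578 comes from, it is the number of unordered pairs you can pick from 213 characteristics. A quick sanity check, with placeholder characteristic names that are purely hypothetical:

```python
import math
from itertools import combinations

n_characteristics = 213

# "213 choose 2": each pair is counted once, and nothing is paired with itself.
n_pairs = math.comb(n_characteristics, 2)

# The same count by brute force, using placeholder (hypothetical) names.
names = [f"characteristic_{i}" for i in range(n_characteristics)]
pairs = list(combinations(names, 2))

print(n_pairs, len(pairs))
```

Both numbers should agree with the figure quoted above, since 213 × 212 / 2 = 22,578.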

via xkcd

Normalization — “The creation of shifted and scaled versions of statistics, where the intention is that these normalized values allow the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences” (via Wikipedia). The gross influence here is most typically the size of the universe/population from which the measurements were drawn: a raw count often says more about how many people live in a tract than about anything else. To facilitate better correlation analysis, I’ve normalized as many characteristics as possible, an improvement over the methodology in Part I.
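One common way to remove the effect of universe size is to turn each count into a share of its universe. The sketch below assumes that approach; the post does not spell out the exact method used, and the tract labels, field names, and numbers are invented for illustration.

```python
def normalize_by_universe(count, universe):
    """Return count as a share of its universe, or None if the universe is empty."""
    if universe == 0:
        return None  # no meaningful share when nobody was measured
    return count / universe


# Invented example: tract A has more bike commuters in absolute terms,
# but tract B has the higher share once worker totals are accounted for.
tracts = [
    {"tract": "A", "bike_commuters": 50, "workers": 5000},
    {"tract": "B", "bike_commuters": 40, "workers": 800},
]

shares = {
    t["tract"]: normalize_by_universe(t["bike_commuters"], t["workers"])
    for t in tracts
}
```

Correlating the shares rather than the raw counts keeps big tracts from looking related to everything simply because they are big.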