racist data destruction?

M Carlisle
12 min read · Jun 13, 2019


a Boston housing dataset controversy

and an experiment in data forensics

Early in my data science training, my cohort encountered an industry-standard learning dataset of median prices of Boston houses in the mid-1970s, based on various social and ecological data about the area.

Our source of this dataset was Kaggle, although it ships in several standard toolkits, including scikit-learn and TensorFlow.
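For readers who want to follow along, here is a minimal loading sketch, assuming a scikit-learn version that still ships load_boston (the Kaggle copy carries the same data, with the column in question named black rather than B):

```python
# Minimal sketch: load the Boston housing data from scikit-learn and
# put it in a DataFrame so the columns are easy to inspect.
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target  # median home value, the dependent variable

print(df.shape)             # (506, 14)
print(df["B"].describe())   # the column this post is about
```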

Out of all the columns in this set, one very glaringly stuck out at me, for what I hope, dear Reader, are obvious reasons.

I’ve looked, and I can’t find anyone who will talk about that column in a technical fashion. There is plenty of commentary about the column, but not about how the already-questionably-included data has effectively been destroyed, leaving it unusable in its current form: an unfortunate artifact of past assumptions about society.

So I did what any reasonable researcher would do. I went back to the source.

census.gov

Hedonic pricing of Boston housing, 1970s-style

Hedonic pricing is a technique for modeling the prices of goods and services by features both internal and external to the good or service. It postulates that this kind of modeling is possible, and that marginal changes in an identified feature will have a corresponding effect on the price of the underlying.

The Boston housing data set was ostensibly compiled by (the grad students and/or assistants of) David Harrison Jr. (Harvard) and Daniel L. Rubinfeld (National Bureau of Economic Research) for analysis in the paper “Hedonic housing prices and the demand for clean air” (Journal of Environmental Economics and Management 5, 81–102 (1978), referred to hereafter as HHP). This report discusses features of air quality that may have affected the median prices in the 1970s housing market of the Boston Standard Metropolitan Statistical Area (SMSA).

A hedonic housing price can take air quality as a feature, for example: more pollution would intuitively reduce the price of a house, all other parameters remaining fixed. It appears reasonable, under this assumption, that one could analyze multiple features of air pollution to see how prices change as each component of pollution changes.
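To make that "marginal change" idea concrete, here is a toy sketch. This is not Harrison and Rubinfeld's model: the feature names and numbers are invented, and the point is only that in a linear hedonic model each fitted coefficient is the marginal effect of its feature on price.

```python
# Toy hedonic model: price as a linear function of features, so each
# fitted coefficient estimates the marginal effect of its feature
# with the others held fixed. All data here is simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(42)
n = 500
rooms = np.random.uniform(3, 9, n)        # an "internal" feature
pollution = np.random.uniform(1, 10, n)   # an "external" feature
price = 15 + 8 * rooms - 2 * pollution + np.random.normal(0, 2, n)

model = LinearRegression().fit(np.column_stack([rooms, pollution]), price)
print(model.coef_)  # roughly [8., -2.]: one more unit of pollution knocks ~2 off the price
```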

My questions come in when other variables are presented as “standard” to a model like this, outside of the scope of the intended analysis.

Here are the features presented by Harrison and Rubinfeld, allegedly from “a common specification” (p. 98) of hedonic pricing of housing, with the type of variable, description, source of the data, and the transformation applied to the data in the pricing model (summary of Table IV, pp. 96–97):

Dependent:

MV: median value of owner-occupied homes. (1970 US Census)

Already we should have questions. This median value of owner-occupied homes: are we to accept that each entry in the dataset represents a tract of land? Yes. (We’ll address the source later.)

Structural:

RM: average number of rooms in owner units. (1970 US Census)

AGE: proportion of owner units built prior to 1940. (1970 US Census)

These seem reasonable.

Neighborhood:

B: Black proportion of population. (1970 US Census)

Waaaaaaaait a minute.

For pricing.

… OK.

I’ll swallow that for a minute and get through the rest of them before I start in.

LSTAT: Proportion of population that is lower status = 1/2 × (proportion of adults without some high school education + proportion of male workers classified as laborers). (1970 US Census)

CRIM: Crime rate by town. (FBI (1970))

Oh. Well. We’ll talk later. I already have a topic for the rest of this post.

ZN: Proportion of a town’s residential land zoned for lots greater than 25,000 square feet. (Metropolitan Area Planning Commission (1972))

INDUS: Proportion nonretail business acres per town.
(Vogt, Ivers, and Associates)

TAX: Full value property tax rate ($/$10,000).
(Massachusetts Tax-payers Foundation (1970))

PTRATIO: Pupil-teacher ratio by town school district.
(Mass. Dept. of Education (1971–1972))

CHAS: Charles River dummy: = 1 if tract bounds the Charles River;
= 0 otherwise. (1970 US Census Tract maps)

Well, that’s a lot of neighborhood features I have Feelings about.

Accessibility:

DIS: Weighted distances to five employment centers in the Boston region. (Schnare)

RAD: Index of accessibility to radial highways. (MIT Boston Project)

These feel reasonable.

Air pollution:

NOX: Nitrogen oxide concentrations in pphm. (TASSIM)

PART: Particulate concentrations in mg/hcm³. (TASSIM)

Finally, at the bottom, we get the two variables whose marginal effect the paper is allegedly about.

But let’s get back to that “B” column.

“B”

Just in case you’ve somehow gotten this far without paying attention, the column in question is called “B”:

B: Black proportion of population. (1970 US Census)

This is already offensive to 2019 eyes (and hopefully 1975 eyes). But let’s try to give the authors the benefit of some doubt that I’m missing something historical, or otherwise relevant, that would warrant the inclusion of this field in the “common specification” of hedonic house pricing.

Perhaps this is merely a column of data, extracted from the 1970 US Census, that has positive historical value. Maybe I’m being overly sensitive.

I will let the paper speak for itself.

HHP, p. 96

Oh. No, forget it. This actually is a parameter in the model to modulate house pricing for systemic racism.

When I say “systemic racism” here, I mean this mathematically. This is a term, in a statistical model to predict housing prices, that accounts for racism as a factor in pricing. If this is used to predict, or even influence future models by its very existence, then systemic racism will continue to be a pricing factor.

We must unpack racist mathematics in order to dismantle its effect.

What does the commentary on B say? We split the sentences apart.

  • “At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.”
  • “However, market discrimination means that housing values are higher at very high levels of B.”
  • “One expects, therefore, a parabolic relationship between proportion Black in a neighborhood and housing values.”

I must reach further back.

“In spite of the lack of any systematic evidence supporting the self-segregation hypothesis, it is difficult to dismiss. The problem lies in the fact that it is virtually impossible to determine conclusively the role of self-segregation as long as traces of white community antagonism toward black efforts to leave the ghetto remain.”

— Kain and Quigley, Housing Markets and Racial Discrimination: A Microeconomic Analysis (National Bureau of Economic Research, 1975)

According to Kain and Quigley, there is no systematic evidence for the notion that a “self-segregation” hypothesis drives a decline in price due to a rising demographic shift. There is no evidence that there exists a critical proportion (all other variables somehow remaining fixed, riiiight) after which society relabels a neighborhood “ghetto” and a “ghetto premium” kicks into housing prices, flipping the price decline into an increase as the proportion of Black residents in the neighborhood continues to increase.

Kain & Quigley, Ch 3, p. 68

Nor is there evidence that this non-evidenced relationship is parabolic.

Yet here we are, reading this description in the paper that generated the dataset now shipping in scikit-learn and TensorFlow.

But wait.

The entire reason I started investigating this issue was that the data in this column is not even clean data.

The technical problem is that the column “black” contains the parabolic transformation given in the paper. Let’s go to the description in the Kaggle dataset.

Only non-invertibly transformed (read: partially destroyed) data exists in the publicly-available dataset.

Invertible vs non-invertible functions

At this point, I must spend a small amount of time d̶o̶i̶n̶g̶ ̶t̶h̶e̶ ̶i̶m̶p̶o̶s̶s̶i̶b̶l̶e̶ holding my tongue and Speaking Only Mathematics. The given transformation of this data is quadratic: we have a function that, when graphed, shows a parabola in the percentage domain [0,1].

The problem with this kind of transformation is, it’s non-invertible.

A real-valued function is called invertible on its domain and range if it is injective; that is, if each output value (in the range) comes from exactly one input value. In the functional notation, this is

f is invertible if and only if f(x₁) = f(x₂) implies x₁ = x₂ (that is, f is one-to-one onto its range).

For a much more intuitive invertible function, think about the process “put on socks, then put on shoes”. The inverse is “take off shoes, then take off socks”. You’ve “undone” the actions you took, and regained the original input (bare feet) from the original output (be-socked, be-shoed feet).

You can even notice how an inversion is done, in general: reverse the order of the steps, and for each step, reverse the action.

(Hey, if it works, don’t knock the pedagogy.)

All non-constant linear functions are invertible. Some common linear functions used to transform data include the following (a quick sketch of each, with its inverse, follows the list):

  • shifting: Adding a constant value shifts a column. This transformation can be easily inverted by subtracting the same constant back off.
  • scaling: Multiplying by a constant scales a column. This transformation can be easily inverted by dividing the same constant back out.
  • normalization: Scaling by the sum of all values in a column, to get percentages; only makes sense for values that are all the same sign. The transformed values lie in [0,1].
  • min-max scaling: Shifting by the minimum of all values in the column, then dividing by the value range (max - min, different from the range of a function… hey, I didn’t write the vocab). The transformed values lie in [0,1].
  • standardization: Shifting by the mean, then scaling by the standard deviation.
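Here is the quick sketch promised above, on a made-up column of values, checking that each of these transformations can be undone exactly:

```python
# Sketch: the linear transformations above, each followed by its inverse.
# The column of values is made up.
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0])

c = 3.0                                   # shifting
assert np.allclose((x + c) - c, x)

k = 2.5                                   # scaling
assert np.allclose((x * k) / k, x)

s = x.sum()                               # normalization (all values same sign)
assert np.allclose((x / s) * s, x)

lo, span = x.min(), x.max() - x.min()     # min-max scaling
assert np.allclose(((x - lo) / span) * span + lo, x)

mu, sd = x.mean(), x.std()                # standardization
assert np.allclose(((x - mu) / sd) * sd + mu, x)

print("all five transformations inverted exactly")
```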

Some non-linear invertible functions often used in data transformation are:

  • Exponential and logarithmic functions are each other’s inverses.
  • odd power (y = x, y = x³, y = x⁵, …)
  • even power (x > 0) (y = x², y = x⁴, …)
  • trigonometric functions restricted to an interval on which they are one-to-one, such as the sine function on [−π/2, π/2].

Some non-invertible transformations include:

  • even power (positive and negative x)
  • trigonometric functions over a full period or more

The transformation we are given on the column “black” is the quadratic y = (x − 0.63)², where x is the Black proportion of the population (with an extra scaling of 1000 in the currently-available sets: B = 1000(x − 0.63)²).

With domain 0 ≤ x ≤ 1, the parabola is noninvertible for 0.26 ≤ x ≤ 1, as each 0 ≤ y ≤ 0.1369 has two x values.
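A quick numerical check of that claim, using the (x − 0.63)² form above:

```python
# Verify: on [0, 1], y = (x - 0.63)**2 takes the same value at x = 0.26
# and x = 1.0, so every y <= 0.1369 has two preimages in the domain.
f = lambda x: (x - 0.63) ** 2
print(f(0.26), f(1.0))  # both approximately 0.1369 (up to floating-point noise)
print(f(0.0))           # 0.3969, the largest value on [0, 1]
```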

Harrison and Rubinfeld appear to have decided on a threshold of 63% at which to switch the regime of price decline to price increase (i.e. a so-called “ghetto threshold”). This column does not contain the original data, which means that for any value of 0 ≤ y ≤ 0.1369 (136.9 for the scaled-up-by-1000 version of B in the given data sets), there are two possible values of x, the actual proportion given in the 1970 US Census, that it could refer to: one below the “ghetto threshold” of 63%, and one above.

And we cannot know which from this dataset, meaning we do not have the full story.

To show that this is an actual problem, and that points in this dataset do in fact fall into this situation: of the 506 rows in the Boston housing set, 36 have a value less than 136.9 in the column “B”.
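This count is easy to reproduce (same load_boston assumption as before):

```python
# Count the ambiguous rows: a transformed B below 136.9 means there are
# two possible original proportions, one on each side of the 63% threshold.
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
print((df["B"] < 136.9).sum())  # 36, per the count above
```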

Row 103 contains the first point to question in this manner:

row 103 of the Boston housing data set

with transformed B = 70.8. Let’s first scale that down by 1000 to B = 0.0708. This corresponds, via the set-valued inverse x = 0.63 ± √y (since a single input y may map back to multiple values of x),

to an original B of x = 0.364. Or, it could be x = 0.896.
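In code, the set-valued inverse is just the two branches of the square root, and row 103’s value comes back as exactly that pair:

```python
# Set-valued inverse of B = 1000 * (x - 0.63)**2: the two candidate
# original Census proportions for a given transformed value.
import math

def invert_b(b_scaled):
    r = math.sqrt(b_scaled / 1000.0)
    return 0.63 - r, 0.63 + r

low, high = invert_b(70.8)            # row 103
print(round(low, 3), round(high, 3))  # 0.364 0.896
```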

Do you drop these 36 rows out of 506, and ignore the problem? It’s only 7.115% of the data, folks. I know you’ve dropped more for less.

Do you think someone is trying to hide this data, for better or for worse?

You tell me, data scientists. I’m delving further.

There’s a rule to be accepted here about the publishing of data: publish only original data. Any transformation should be applied inside a model, and noted. If you apply a non-invertible transformation and publish only the transformed data, the original data is unrecoverable.

Aside from the culturally problematic aspects of our column, this is the technical problem with “black”. We cannot recover the original data.

Or can we?

Back to the source

We actually have the original source of the data: the 1970 U.S. Census Tracts.

In particular, Boston’s data. Harrison and Rubinfeld pulled tract-level data from the Boston SMSA.

HHP, p. 85

The original B data we seek sits in the first few rows of Table P-1, General Characteristics of the Population, on pp. 14–53 of the PDF (P-1 — P-40 of the original document).

There are more than 506 tracts listed. This is not a concern; what is a concern is that the Boston dataset does not list the original tract numbers. Thus, we have to hunt for which tracts in the Boston metro area were listed.

Our plan is relatively straightforward (a code sketch of the matching step follows the list):

  1. Note every entry of the row “Percent Negro” of these 40 pages with a value of at least x = 0.26 (on the non-invertible portion of the parabola).
  2. Compute the two possible inverse x for each y ≤ 0.1369 in the Boston housing set.
  3. If one of the two values of x is found for this y, then we know which x corresponds to this y. Note it.
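Here is the sketch of that matching step. The census_percents list below is a stand-in for the values read off Table P-1, not real data, and the tolerance is a guess at how coarsely the published percentages were rounded.

```python
# Sketch of the matching plan: for each ambiguous transformed B, compute
# its two candidate original proportions and look for either one among
# the "Percent Negro" values noted from Table P-1.
# census_percents below is a placeholder, not real Census data.
import math

def candidates(b_scaled, threshold=0.63):
    """The two possible original proportions for a transformed B value."""
    r = math.sqrt(b_scaled / 1000.0)
    return threshold - r, threshold + r

def match(b_scaled, census_percents, tol=0.005):
    low, high = candidates(b_scaled)
    for p in census_percents:
        if abs(p - low) <= tol:
            return "low", p
        if abs(p - high) <= tol:
            return "high", p
    return None  # no match: the point stays ambiguous

census_percents = [0.364, 0.51, 0.92]   # placeholder values
print(match(70.8, census_percents))     # ('low', 0.364)
```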

There are 41 entries (if I noted correctly) in Table P-1 with B ≥ 0.26.

If we can match the 36 different y ≤ 0.1369 in the Boston housing data set to 36 of these 41 original values, then we can faithfully reconstruct the original HHP data.

[… does analysis, bangs head against things …]

Preliminary analysis finds 20 of the 36 points in the original data.
[Actual analysis to come once a writeup is complete.]

[Boston row #, lower %, higher %, which was found if found]. 7 ‘low’, 13 ‘high’ found, out of 36 possible.

Why only 20? This is not a probabilistic problem. The data exists, we are examining it, this should be a quick matchup, we should be done here. This is precalculus. This is algebra. Statistics should not come into play here.

(Those pesky s-words.)

Do you think it’s possible that, in doing quadratic calculations for this paper, Harrison and Rubinfeld’s team made mistakes on half of the suspected points? Do you think it’s possible that, since they were clearly not concerned with the original column B and its effect on their clean-air calculations, they didn’t bother to check as far as all 40 pages down the Census?

Do you think this data might be even more suspect than we thought?

Do you think?

But really, why did you do this?

Why not just drop “B” altogether? It sucks and the Census sucks.

There are arguments to be made that the US Census, with its ability to quickly collect, store, and electronically manipulate data on minority populations, is perpetually detrimental to those populations.

However, data is data. The potential to point out systemic imbalance between populations (a little ANOVA’ll do ya), and to offer an avenue for investigating potential discriminatory practices which lead to and sustain such imbalance, is also present in this data. I’ll close by posing a couple of deeper questions.

  • I didn’t query the lower B numbers against the Census for accuracy. That should happen.
  • Is it possible to line up 1970 tract numbers to current locations? There is a correspondence document between 1960 and 1970 Census tract labels, but I have not delved further into this. I’d like to see how the housing prices have evolved over the decades, and compare these to population demographics, to see shifts alongside prices over time.
  • By doing so, can we systematically measure how much value the Black community of Boston has lost, financially, due to this imbalance, relative to the average value of a home in all of the Boston SMSA? (I am, of course, assuming a loss; the data, and anyone else with the requisite research, can correct me if I’m somehow wrong about this.)
  • Couple this research with the well-documented practice of redlining, which influenced the continued concentration of disenfranchised communities in less-affluent locations. The mid-1970s, when the referenced materials in this post were written, was the heyday of Michael Dukakis’s attempts to weed out redlining in Massachusetts.

In addition to the social concerns here, there is the abstract issue mentioned earlier that should always be on the mind of a data scientist:

Are you sure your data makes sense?
