Cleaning Arlington Income Data with Pandas

Raven Welch
6 min read · Jun 12, 2020

As someone who has grown up with a fair amount of privilege, I know it can be difficult to really understand the effects poverty can have on just about everything. In part because of that, I wanted to look at how median income is distributed throughout Arlington County, along with the graduation and dropout rates for our high schools and the percentage of students who move on to college. To do any of that, though, I have to get the data first.

My first stop: https://data.census.gov/. From here, I downloaded CSVs containing nine years of income data (2010–2018), organized by census tract in Arlington.

With the data now in hand, it was time for the tricky part: cleaning up and reorganizing everything so that it was usable. The only problem was that I wasn’t quite sure how I wanted the data structured, so it was time to play around with it. I started with just the 2018 data, so I could get a good feel for things before including all nine years.

import pandas as pd

CensusData_df = pd.read_csv('data/CensusData_2010_2018/ACSST5Y2018.S2001_data_with_overlays_2020-06-05T134933.csv', header = 1)
The joys of government data

Well, it’s a start! I had also downloaded a GeoJSON file containing polygons representing the tract boundaries for mapmaking, so I knew I would need the Tract IDs (in the “id” column) to be just the last six characters. Plus, 242 columns was far too unreadable, so I transposed the dataframe to help clean things up a bit.

CensusData_df = pd.read_csv('data/CensusData_2010_2018/ACSST5Y2018.S2001_data_with_overlays_2020-06-05T134933.csv', header = 1)
CensusData_df["id"] = CensusData_df["id"].str.strip().str[-6:]
CensusData_df = CensusData_df.transpose()
CensusData_df.columns = CensusData_df.iloc[0]
CensusData_df.drop(axis = 1, index = CensusData_df.index[:2], inplace = True)
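Side note: since those trimmed IDs eventually need to match the tract codes stored in the GeoJSON, it’s worth a quick peek at what the boundary file actually contains. Something roughly like the snippet below would do it; the file name and the “TRACTCE” property key are placeholders for however the downloaded GeoJSON happens to be structured.

import json

# Rough sketch: list the tract codes stored in the boundary GeoJSON so they
# can be compared against the six-character IDs created above.
# The file name and the "TRACTCE" property key are placeholders.
with open("data/arlington_tracts.geojson") as f:
    tract_geojson = json.load(f)

geojson_tract_codes = {feature["properties"].get("TRACTCE")
                       for feature in tract_geojson["features"]}
print(list(geojson_tract_codes)[:5])

With that sanity check out of the way, back to the census table itself.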

Fantastic, this is a lot easier to read, though there are still far too many rows for my liking. Next, I needed to figure out which rows to drop… and it turns out that at least half of them were just margins of error. A lot were also split by male and female, and while that would be interesting data to check out, it wasn’t quite my goal for today. Percentages, and data covering only people with a college education, weren’t what I was looking for either. What I wanted was the median income for each tract.

A small glance at the extremely large number of rows

To help clear out all of these unwanted rows, I created a for loop that would look at each row, and then add it to the “Index_To_Remove” list if it either contained a word in “Index_Keywords,” or did not contain the word Median.

Index_To_Remove = []
Index_Keywords = ["Margin", "EDUCATIONAL", "Percent", "Female!!", "Male!!"]

# Flag every row that is a margin of error, an educational/percent/sex breakdown,
# or anything that isn't a median figure
for index_item in CensusData_df.index:
    for item in Index_Keywords:
        if item in index_item or "Median" not in index_item:
            Index_To_Remove.append(index_item)

Once that was done, I created a new dataframe equal to the original census data with all of the rows in “Index_To_Remove” dropped. Two rows were left: one with median yearly earnings for everyone, and one with median yearly earnings only for people with full-time jobs. I dropped the one covering everyone, on the assumption that it includes a lot of teenagers and college students who only work part time. I also renamed “id” to “TractID” and the one remaining column to “Median Income (USD).”

MedianIncome_cd_df = CensusData_df.drop(Index_To_Remove).tail(1)
Median_Census_Income_2018 = MedianIncome_cd_df.transpose().reset_index().rename(columns = {"Estimate!!Total!!Population 16 years and over with earnings!!FULL-TIME, YEAR-ROUND WORKERS WITH EARNINGS!!Median earnings (dollars) for full-time, year-round workers with earnings": "Median Income (USD)", "id" : "TractID"})
Delightfully reduced data!

Now that I was satisfied, I moved on to applying this to ALL 9 files. After importing Pandas and NumPy, I was ready to go. However, the new question was: how do I import all nine files without copying and pasting the file paths nine times? The answer: a tuple, a list comprehension, and a for loop. With those, I added all nine sets of data to a dictionary, with the keys being the year each dataset was from.

import pandas as pd
import numpy as np

Last_Digit_Of_Year = (0,1,2,3,4,5,6,7,8)
files = ["ACSST5Y201" + str(i) + ".S2001_data_with_overlays_2020-06-05T134933.csv" for i in Last_Digit_Of_Year]
RawCensusData_Dict = {}
for file in files:
    year = file[7:11]
    filepath = "data/CensusData_2010_2018/" + file
    RawCensusData_Dict[year] = pd.read_csv(filepath)

Now we’re back to square one. Since I’m now cleaning nine different dataframes, I’ll be using a lot of for loops to clean each one, starting with this one, which repeats some of the steps I originally tested out on just the 2018 data.

for year in RawCensusData_Dict:
    internal_df = pd.DataFrame(RawCensusData_Dict[year].values)
    internal_df.columns = internal_df.iloc[0]
    internal_df.drop(internal_df.index[0], inplace = True)
    internal_df["id"] = internal_df["id"].str.strip().str[-6:]
    internal_df.rename(columns = {"id":"TractID"}, inplace=True)
    internal_df = internal_df.transpose()
    internal_df.columns = internal_df.iloc[0]
    internal_df.drop(axis = 1, index = internal_df.index[:2], inplace=True)
    RawCensusData_Dict[year] = internal_df

After the initial cleaning and reorganizing, it was time to take that code from earlier and refit it. However, during this process I discovered that a lot of rows simply had (X) for the values, so I replaced those with NaN and dropped any row where all of the columns equaled NaN. Additionally, I added a prefix to each column name so that the year was easy to see when I began joining the dataframes together.

CleanerCensusData_Dict = {}
Index_Keywords = ["Margin", "EDUCATIONAL", "Percent", "Female!!", "Male!!"]

for year in RawCensusData_Dict:
    Index_To_Remove = []
    for index_item in RawCensusData_Dict[year].index:
        for item in Index_Keywords:
            if item in index_item or "Median" not in index_item:
                Index_To_Remove.append(index_item)
    # MI_CD_DF = Median Income Census Data, Data Frame
    MI_CD_DF = RawCensusData_Dict[year].drop(Index_To_Remove)
    MI_CD_DF.replace("(X)", np.nan, inplace=True)
    MI_CD_DF.dropna(axis = 0, how = 'all', inplace=True)
    CleanerCensusData_Dict[year] = MI_CD_DF.transpose().add_prefix(year + "_")

Uh oh… it turned out that only the 2017 and 2018 data contained median income for those with full-time jobs. With that knowledge, I realized I would have to use the median income for everyone, regardless of how much they actually worked.

2012 vs 2017 census data
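If you want to see exactly which years carry that full-time column, a quick loop over the cleaned dictionary makes the gap obvious; this is just a sketch built on the dictionary from above.

# Rough sketch: show, per year, which columns mention full-time workers.
# Only the 2017 and 2018 frames return anything.
for year, df in CleanerCensusData_Dict.items():
    full_time_cols = [col for col in df.columns if "FULL-TIME" in col]
    print(year, full_time_cols)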

With that in mind, I dropped the full-time-only columns as I joined all of the dataframes together.

Index_TractID = list(CleanerCensusData_Dict["2014"].index)
CombinedCensusData = pd.DataFrame(index = Index_TractID)

for year in CleanerCensusData_Dict:
    CombinedCensusData = CombinedCensusData.join(CleanerCensusData_Dict[year])

CombinedCensusData.drop(columns = [CombinedCensusData.columns[10], CombinedCensusData.columns[8]], inplace = True)
All of the years, together at last

I’m almost there! All that’s left to do is make sure the index has an easy-to-understand name and give all of the columns much shorter names.

Number_of_Columns = list(range(0, len(CombinedCensusData.columns)))
Col_Name = None
New_Col_Name = None

for i in Number_of_Columns:
    Col_Name = CombinedCensusData.columns[i]
    New_Col_Name = "Median Income (" + CombinedCensusData.columns[i].split("_")[0] + ")"
    CombinedCensusData = CombinedCensusData.rename(columns={Col_Name: New_Col_Name})

CombinedCensusData.index = CombinedCensusData.index.rename("TractID")
Finally… some clean data

Fantastic! Now this data is ready to use, though it won’t be very readable until I create a map with it. I’ll still need to collect and organize the information for high schools to compare graduation rates, and after a bit of thinking, I’d definitely love to add statistics on race for each tract as well. The census data also included how many people fell into different income brackets, so I’m interested in looking into that at a later date too, though perhaps not as part of this project.
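Before moving on, I’ll want this table somewhere the mapping step can find it later; something as simple as the line below would do (the output path is just a placeholder).

# Rough sketch: save the cleaned table so the mapping step can load it later.
# The output path is just a placeholder.
CombinedCensusData.to_csv("data/Arlington_MedianIncome_2010_2018.csv")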

Unfortunately, with this dataset only the 2010 numbers are actual counts; everything else is an estimate. Because of that, I don’t know how accurate these numbers really are, and likely won’t until the 2020 census data is released. While that gives me a reason to be excited for the 2020 census, it does make me wonder: how accurate is this data? How trustworthy is it?
