Visualizing climate data for Ann Arbor, MI

Nathan W. Doctor — Sun, 21 Jun 2020 00:59:36 GMT

Using a NOAA dataset, we’ll write some Python code that draws a line graph of the record high and record low temperatures for Ann Arbor, Michigan, for each day of the year over the period 2005–2014. Then we’ll overlay a scatter plot of the 2015 highs and lows that broke the ten-year (2005–2014) record high or record low.

The data comes from a subset of the National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). The GHCN-Daily comprises daily climate records from thousands of land surface stations across the globe.

The stations the data comes from are shown on the map below.

To start, let’s take a look at the data.

The ID column represents the station ID where the temperature was collected. The Element column represents whether the temperature recorded was a maximum or minimum temperature. The Data_Value column represents the temperature in tenths of a degree Celsius.

We’re working with a pretty clean dataframe here, but we’ll need to make some minor adjustments to get everything into a more suitable format.

The point of the exercise is to compare data from 2005–2014 to 2015. Since there is no leap day in 2015, the leap days in the earlier years will have to be removed. We’ll also split the Year-Month-Day format into two columns, with the Month-Day in one and the Year in the other. Lastly, I’m American, so I like Fahrenheit. Data in tenths of a degree Celsius will be converted to degrees Fahrenheit.
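Here is a minimal sketch of those three adjustments. The sample rows, the station names, and the Date column name are made up for illustration, and Year and Month_Day are just the names I picked for the two new columns.

```python
import pandas as pd

# Tiny made-up sample; station IDs and the "Date" column name are assumptions.
raw = pd.DataFrame({
    "ID": ["STATION_A", "STATION_A", "STATION_B", "STATION_B"],
    "Date": ["2008-02-29", "2015-01-05", "2012-07-04", "2015-07-04"],
    "Element": ["TMAX", "TMIN", "TMAX", "TMAX"],
    "Data_Value": [56, -155, 322, 300],  # tenths of a degree Celsius
})

def clean(df):
    """Split the date, convert to Fahrenheit, and drop leap days."""
    out = df.copy()
    dates = pd.to_datetime(out["Date"])
    out["Year"] = dates.dt.year
    out["Month_Day"] = dates.dt.strftime("%m-%d")
    # tenths of a degree C -> degrees C -> degrees F
    out["Data_Value"] = out["Data_Value"] / 10 * 9 / 5 + 32
    # 2015 has no February 29, so leap days cannot be compared.
    keep = out["Month_Day"] != "02-29"
    return out.loc[keep].drop(columns="Date").reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

The -155 entry (tenths of a degree Celsius) works out to about 4.1 °F, which is the same conversion behind the January 5, 2015 value discussed further down.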

Next, let’s split the dataframe into two dataframes, one for data from 2005–2014 and one for data from 2015.

Looks good to me. Let’s just check to make sure.

The shapes of the split dataframes match up with the cleaned version. And just to be extra cautious, I’ll check that the 2005–2014 dataframe and the 2015 dataframe each have all the right years.
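A rough sketch of the split and the two sanity checks, assuming a cleaned dataframe with a Year column (the rows and the names df_05_14 and df_15 are hypothetical):

```python
import pandas as pd

# Hypothetical cleaned frame; only the columns needed for the split are shown.
cleaned = pd.DataFrame({
    "Year": [2005, 2010, 2014, 2015, 2015],
    "Month_Day": ["01-01", "01-01", "01-02", "01-01", "01-02"],
    "Element": ["TMAX", "TMIN", "TMAX", "TMAX", "TMIN"],
    "Data_Value": [41.0, 12.2, 39.2, 37.4, 8.6],
})

df_05_14 = cleaned[cleaned["Year"].between(2005, 2014)]
df_15 = cleaned[cleaned["Year"] == 2015]

# Check 1: the row counts of the two pieces add up to the cleaned frame.
rows_match = len(df_05_14) + len(df_15) == len(cleaned)

# Check 2: each piece contains only the years it should.
years_05_14 = sorted(df_05_14["Year"].unique().tolist())
years_15 = sorted(df_15["Year"].unique().tolist())
print(rows_match, years_05_14, years_15)
```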

Now, I’ll create one final dataframe for all the maximum and minimum temperatures.

The Max_05_14 column will represent the maximum recorded temperature for each day (01–01, 01–02, etc.) from 2005 to 2014.
The Min_05_14 column will represent the minimum recorded temperature for each day from 2005 to 2014.
The Max_15 column will represent the maximum recorded temperature for each day in 2015.
The Min_15 column will represent the minimum recorded temperature for each day in 2015.
The Max_15_Higher_Prev_10_Years column will represent the maximum recorded temperature for each day in 2015, but only if it is higher than the maximum temperature from 2005–2014.
The Min_15_Lower_Prev_10_Years column will represent the minimum recorded temperature for each day in 2015, but only if it is lower than the minimum temperature from 2005–2014.
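The six columns above could be built along these lines. This is a sketch, not necessarily how the original notebook did it: the helper name daily_records, the sample rows, and the assumption that the Element column holds the GHCN codes TMAX and TMIN are mine.

```python
import pandas as pd

def daily_records(df_05_14, df_15):
    """Build the per-day record table from the two split frames."""
    def extreme(df, element, how):
        grouped = df.loc[df["Element"] == element].groupby("Month_Day")["Data_Value"]
        return grouped.max() if how == "max" else grouped.min()

    rec = pd.DataFrame({
        "Max_05_14": extreme(df_05_14, "TMAX", "max"),
        "Min_05_14": extreme(df_05_14, "TMIN", "min"),
        "Max_15": extreme(df_15, "TMAX", "max"),
        "Min_15": extreme(df_15, "TMIN", "min"),
    })
    # Keep the 2015 value only where it beats the 2005-2014 record; NaN elsewhere.
    rec["Max_15_Higher_Prev_10_Years"] = rec["Max_15"].where(rec["Max_15"] > rec["Max_05_14"])
    rec["Min_15_Lower_Prev_10_Years"] = rec["Min_15"].where(rec["Min_15"] < rec["Min_05_14"])
    return rec

# Made-up rows for two days of the year.
df_05_14 = pd.DataFrame({
    "Month_Day": ["01-05", "01-05", "01-05", "01-06", "01-06"],
    "Element": ["TMAX", "TMIN", "TMIN", "TMAX", "TMIN"],
    "Data_Value": [45.0, 5.0, 10.0, 50.0, 8.0],
})
df_15 = pd.DataFrame({
    "Month_Day": ["01-05", "01-05", "01-06", "01-06"],
    "Element": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "Data_Value": [30.0, 4.1, 52.0, 9.0],
})

rec = daily_records(df_05_14, df_15)
print(rec)
```

Using where() leaves NaN wherever 2015 did not break a record, which is handy later because a scatter plot simply skips those points.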

The dataframe has 365 rows, each one representing a day of the year. So, the entry of 60.08 in row 01–01 in column Max_05_14 means the highest temperature recorded on January 1 from 2005 to 2014, at any of the various stations around Ann Arbor, was 60.08 °F.

The NaN entries signify that, in 2015, there was no recorded high or low that was higher or lower, respectively, than the corresponding entries for 2005–2014. The entry of 4.1 °F on January 5, 2015, is, indeed, lower than the lowest recorded temperature from 2005–2014 of 5.00 °F. Accordingly, that entry will be included.

Now, let’s get to the main point of this: plotting the data.

At last, I’m able to visualize what we’re working with, but this still needs some work. First, the legend is unnecessarily large and comes too close to the record low line. It may even be blocking some of the 2015 record-low dots. Second, while I understand exactly what the [0, 50, 100, 150, 200, 250, 300, 350] values on the x-axis represent, and the audience should be able to get it as well, this is certainly not ideal. The goal of a chart like this is for the viewer to be able to make sense of it in as little time as possible. The more they’re looking at the data and the less they’re reading, the better. Third, it may look nice to have the area between the two lines filled in. Why not try that out? Finally, there are a lot of unnecessary lines, and all the black on the chart is a bit too harsh. Let’s tone that down a bit.
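To make those four fixes concrete, here is a sketch in matplotlib. The temperature curves are synthetic stand-ins (a cosine, not the Ann Arbor data), and the colors, sizes, and labels are my own choices rather than the ones in the original chart.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the 365-row record table; every value is invented.
days = pd.date_range("2015-01-01", "2015-12-31")
x = np.arange(len(days))
seasonal = -np.cos(2 * np.pi * x / 365)
rec_high = 65 + 30 * seasonal
rec_low = 25 + 35 * seasonal
broken_x = np.array([4, 50, 300])       # days where 2015 set a new low
broken_low = rec_low[broken_x] - 3

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, rec_high, color="firebrick", alpha=0.6, label="Record high, 2005-2014")
ax.plot(x, rec_low, color="steelblue", alpha=0.6, label="Record low, 2005-2014")
# Fill the band between the two record lines.
ax.fill_between(x, rec_low, rec_high, color="gray", alpha=0.15)
ax.scatter(broken_x, broken_low, s=12, color="navy", label="2015 broke record low")

# Month names instead of day-of-year numbers on the x-axis.
month_starts = [i for i, d in enumerate(days) if d.day == 1]
ax.set_xticks(month_starts)
ax.set_xticklabels([days[i].strftime("%b") for i in month_starts])

# Soften the frame and keep the legend small and out of the way.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
for side in ("left", "bottom"):
    ax.spines[side].set_color("gray")
ax.tick_params(colors="gray")
ax.set_ylabel("Temperature (°F)", color="gray")
ax.legend(loc="upper left", frameon=False, fontsize=8)
fig.tight_layout()
```

Call fig.savefig() with a file name, or plt.show() in an interactive session, to see the result.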

And that’s it. Thanks for reading.

Hypothesis Testing: Are university towns more resilient than non-university towns to recession?

Nathan W. Doctor — Sat, 20 Jun 2020 05:21:09 GMT


For an early project, I sought to use Python to examine whether university towns are more resilient to economic downturn than non-university towns. More specifically, I asked: are housing prices in university towns less affected by recession?

To start, a university town is a city which has a high percentage of university students compared to the total population of the city.

The hypothesis is that housing prices in such cities will be less affected by recession, mostly because we should expect similar numbers of students, staff, and other workers connected to university life to live in such towns regardless of the economic outlook.

To get a list of university towns, I simply used Wikipedia, which maintains a list of college towns in the United States. For a spreadsheet on housing prices across the United States, I used Zillow, which included data from 1996–2020 in City_Zhvi_AllHomes.csv. And lastly, I used data from the U.S. Department of Commerce, Bureau of Economic Analysis (BEA) to figure out when exactly the ‘Great Recession’ of 2007–2009 started and when the recession reached its bottom, i.e., the quarter within the recession that had the lowest GDP. This was necessary because I sought to compare housing prices at the start of the recession to prices at the bottom.

To match the format of the list of university towns from Wikipedia to the list of all cities on Zillow, I would need to clean the text file derived from Wikipedia a bit.
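As a sketch of that cleanup, suppose the text file marks each state with a trailing "[edit]" and lists towns as "Town (University)" with optional footnote markers. That layout is an assumption on my part, as are the sample lines and the function name parse_university_towns.

```python
import re

# Assumed layout of the Wikipedia-derived text file; sample lines for illustration.
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Tuscaloosa (University of Alabama)
Michigan[edit]
Ann Arbor (University of Michigan)
East Lansing (Michigan State University)[2]
"""

def parse_university_towns(text):
    """Return (state, town) pairs with universities and footnotes stripped."""
    rows = []
    state = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("[edit]"):
            # State header: remember it for the town lines that follow.
            state = line[: -len("[edit]")].strip()
            continue
        # Drop everything from the first "(" or "[" to the end of the line.
        town = re.sub(r"\s*[\(\[].*$", "", line)
        rows.append((state, town))
    return rows

towns = parse_university_towns(sample)
print(towns)
```

The (state, town) pairs can then be matched against the state and city columns in the Zillow file.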

Not the most elegant solution here, but at least it works.

Next, to get the start of the recession, let’s load data from the BEA and find the recession’s start. A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
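That definition translates fairly directly into code. The sketch below uses invented GDP figures with generic quarter labels, not BEA data, and it assumes the recession "starts" at the first of the two declining quarters and "ends" at the second of the two growing quarters. The function name find_recession is hypothetical.

```python
def find_recession(gdp):
    """gdp: list of (quarter_label, value) pairs in chronological order.

    Returns (start, bottom, end) labels, or None if no recession is found.
    """
    labels = [label for label, _ in gdp]
    values = [value for _, value in gdp]

    # Start: first quarter of two consecutive quarters of decline.
    start = None
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i + 1] < values[i]:
            start = i
            break
    if start is None:
        return None

    # End: second quarter of two consecutive quarters of growth.
    end = len(values) - 1  # fall back to the last quarter if growth never resumes
    for j in range(start + 3, len(values)):
        if values[j] > values[j - 1] and values[j - 1] > values[j - 2]:
            end = j
            break

    # Bottom: the quarter with the lowest GDP inside the recession window.
    bottom = min(range(start, end + 1), key=lambda k: values[k])
    return labels[start], labels[bottom], labels[end]

# Invented figures, purely to exercise the function.
sample = [("Q1", 100), ("Q2", 101), ("Q3", 100), ("Q4", 98),
          ("Q5", 97), ("Q6", 98), ("Q7", 99), ("Q8", 100)]
print(find_recession(sample))  # ('Q3', 'Q5', 'Q7')
```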

As you can see, we’ll need to clean this dataframe a bit.

Now, let’s convert the housing data from Zillow to quarters.
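One way to do the conversion is to average the three monthly columns in each quarter. The sketch below assumes the monthly columns are labeled "YYYY-MM" and uses labels like "2008q1" for the quarters. Both conventions, the averaging, and the prices are illustrative assumptions rather than details of the Zillow file.

```python
import pandas as pd

# Made-up monthly prices for two hypothetical towns.
monthly = pd.DataFrame(
    {
        "2008-01": [100.0, 200.0],
        "2008-02": [110.0, 210.0],
        "2008-03": [120.0, 220.0],
        "2008-04": [90.0, 190.0],
        "2008-05": [90.0, 180.0],
        "2008-06": [90.0, 170.0],
    },
    index=["Town A", "Town B"],
)

def to_quarters(df):
    """Average monthly columns into one column per quarter."""
    dates = pd.to_datetime(df.columns, format="%Y-%m")
    labels = [f"{d.year}q{d.quarter}" for d in dates]
    out = df.copy()
    out.columns = labels
    # Transpose so the duplicate quarter labels can be grouped, then flip back.
    return out.T.groupby(level=0).mean().T

quarterly = to_quarters(monthly)
print(quarterly)
```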

Next, we’ll create new data showing the decline or growth of housing prices between the recession start and the recession bottom.
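A simple way to express that change is a price ratio: the price at the recession start divided by the price at the bottom. Defining the ratio in that direction is my reading of the setup, and the column names and prices below are placeholders.

```python
import pandas as pd

# Placeholder quarterly prices; "start" and "bottom" stand in for the real quarter labels.
quarterly = pd.DataFrame(
    {"start": [200.0, 300.0], "bottom": [180.0, 240.0]},
    index=["Town A", "Town B"],
)

# A ratio above 1 means prices fell; the closer to 1, the smaller the loss.
price_ratio = quarterly["start"] / quarterly["bottom"]
print(price_ratio)
```

Here the hypothetical Town A (ratio of about 1.11) lost less value than Town B (1.25).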

And finally, we’ll run a t-test comparing the university town values to the non-university town values, and report whether we can reject the null hypothesis (that the two groups are the same) along with the p-value of the test.

The function will return the tuple (different, p, better), where different=True if the t-test gives p<0.01 (we reject the null hypothesis) and different=False otherwise (we cannot reject the null hypothesis). The variable p should be equal to the exact p-value returned from scipy.stats.ttest_ind(). The value for better should be either “university town” or “non-university town”, depending on which has the lower mean price ratio (which is equivalent to a reduced market loss).
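A minimal sketch of such a function is below. The name run_ttest and the sample ratios are made up; the only parts taken from the description above are the call to scipy.stats.ttest_ind(), the 0.01 threshold, and the shape of the returned tuple.

```python
import numpy as np
from scipy import stats

def run_ttest(university_ratios, other_ratios, alpha=0.01):
    """Return (different, p, better) for two samples of price ratios."""
    # nan_policy="omit" skips towns with missing prices.
    result = stats.ttest_ind(university_ratios, other_ratios, nan_policy="omit")
    p = float(result.pvalue)
    different = bool(p < alpha)
    # A lower mean ratio means a smaller loss between start and bottom.
    if np.nanmean(university_ratios) < np.nanmean(other_ratios):
        better = "university town"
    else:
        better = "non-university town"
    return different, p, better

# Invented ratios, chosen so the two groups are clearly separated.
university = [1.02, 1.03, 1.01, 1.04, 1.02, 1.03]
other = [1.10, 1.12, 1.09, 1.11, 1.13, 1.10]
different, p, better = run_ttest(university, other)
print(different, p, better)
```

Note that ttest_ind() assumes equal variances by default; passing equal_var=False would run Welch’s t-test instead.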

As we can see, there is a difference between the mean price ratios of university towns and non-university towns. As the p-value is less than .01, we can reject the null hypothesis (that there is no significant difference between university towns and non-university towns). In other words, university towns are, indeed, less affected by recession.

Hypothesis Testing: Are university towns more resilient than non-university towns to recession? was originally published in Analytics Vidhya on Medium.
