Introduction to U.S. GDP Data
As a data science student, I’m encouraged to look at the world around me and find a piece of data that I find interesting. Since my first economics class in high school, I’ve found the geopolitical trade forces that govern the world inherently fascinating. The world has morphed into an impossibly complex system of trade agreements, interest payments, and cooperation between people and nations. I want to take a high-level look at an omnipresent metric that’s used by the media, politicians, and others as a bellwether of economic health:
The quarterly rate of change of Gross Domestic Product or GDP rate.
This blog (and likely subsequent ones) is going to plot the GDP of the U.S. plus its 50 states and the District of Columbia since the end of Q1 2005, so we can explore the clean data set and make observations about some overall trends in the U.S. over the past 13 years.
FANTASTIC DATA AND WHERE TO FIND IT
All of the data used in this first blog will be sourced from the Bureau of Economic Analysis (BEA), which falls under the U.S. Department of Commerce. The BEA is responsible for compiling GDP on a quarterly basis and makes its findings and any subsequent revisions available for public use. While the GDP rate is the headline, the accompanying data set provides a trove of information that illustrates the different inputs into GDP and buckets them into categories (such as industry- or region-specific), so the public at large can analyze the data and make conclusions about the economy as a whole. Here’s a link to their primer if you’d like to take a look. My initial move will be to clean the data sourced from the BEA’s website and explore the data by showing the change in the GDP.
THE INITIAL REVIEW
I was able to download the “Quarterly Gross Domestic Product (GDP) by State” CSV file from the Regional Economic Accounts section of their website, which included all of the available components as well as the broader statistics. I loaded the file into Python and turned it into a list of dictionaries.
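As a rough illustration of that loading step, here is a minimal sketch using the standard library's csv.DictReader. The column names and values in the sample are hypothetical stand-ins, not the actual layout of the BEA file, and this is not necessarily how the original load was written.

```python
import csv
import io


def load_rows(file_obj):
    """Read CSV text from an open file object into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(file_obj)]


# Tiny in-memory stand-in for the downloaded file (hypothetical columns and values).
sample_csv = io.StringIO(
    "GeoName,Description,2005:Q1,2005:Q2\n"
    "Alabama,All industries total,100.0,101.5\n"
    "Alaska,All industries total,50.0,49.0\n"
)
rows = load_rows(sample_csv)
```

With a real download, the same function could be called inside `with open(path, newline="") as f:`, where `path` points at wherever the CSV was saved. Note that every value comes back as a string at this stage.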
This initial data set contained 5,763 entries. Knowing that I needed to pare down the data and after some light exploring, I found that the data needed to be cleaned in the following ways:
- The quarterly GDP values needed to be coerced into usable numbers (floats) so I could later calculate a quarter’s GDP rate.
- The non-state data needed to be filtered out, including regional information and erroneous designations.
- Entries that did not have “All industries total” as their Description needed to be removed.
- Only entries that include the Component “Real GDP by state (millions of chained 2009 dollars)” should be in scope. This field in particular will give me the raw amount of GDP for a jurisdiction in a given quarter.
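The four cleaning rules above can be sketched roughly as follows. "Description" and "Component" are the field names mentioned in this post; the "GeoName" key, the quarter column labels, and the helper names are assumptions made for illustration only.

```python
TARGET_DESCRIPTION = "All industries total"
TARGET_COMPONENT = "Real GDP by state (millions of chained 2009 dollars)"


def to_float(value):
    """Coerce a CSV string to a float; return None when it is not numeric."""
    try:
        return float(value.replace(",", ""))
    except (AttributeError, ValueError):
        return None


def clean(rows, jurisdictions, quarter_columns, name_key="GeoName"):
    """Keep only in-scope rows and convert their quarterly values to floats."""
    cleaned = []
    for row in rows:
        if row.get(name_key) not in jurisdictions:
            continue  # drops regions and other non-state designations
        if row.get("Description") != TARGET_DESCRIPTION:
            continue
        if row.get("Component") != TARGET_COMPONENT:
            continue
        entry = dict(row)
        for col in quarter_columns:
            entry[col] = to_float(row.get(col))
        cleaned.append(entry)
    return cleaned


# Hypothetical sample rows: only the first one passes all three filters.
sample_rows = [
    {"GeoName": "North Dakota", "Description": "All industries total",
     "Component": TARGET_COMPONENT, "2005:Q1": "30,000.5"},
    {"GeoName": "Plains", "Description": "All industries total",
     "Component": TARGET_COMPONENT, "2005:Q1": "900000"},
    {"GeoName": "North Dakota", "Description": "Mining",
     "Component": TARGET_COMPONENT, "2005:Q1": "1000"},
    {"GeoName": "North Dakota", "Description": "All industries total",
     "Component": "Some other component", "2005:Q1": "105.2"},
]
cleaned_rows = clean(sample_rows, {"North Dakota"}, ["2005:Q1"])
```

Passing the allowed jurisdictions in as a set keeps the state filter explicit: the 50 states, DC, and the U.S. total would go in that set, and everything else falls out.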
After cleaning, I was left with exactly 52 entries (50 states + DC + U.S. overall). My last step before exploring the data was to reformat the GDP data within each jurisdiction, so I could more easily manipulate the data set in the future. This included adding a “Quarterly GDP” key whose value is a list of dictionaries, each containing the quarter end and raw GDP.
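Here is one way that reshaping might look. The “Quarterly GDP” key comes from the description above; the inner keys "quarter_end" and "gdp", the column labels, and the function name are hypothetical choices for this sketch.

```python
def add_quarterly_gdp(entry, quarter_columns):
    """Return a copy of entry with its quarter columns folded into one list."""
    reshaped = {k: v for k, v in entry.items() if k not in quarter_columns}
    reshaped["Quarterly GDP"] = [
        {"quarter_end": col, "gdp": entry[col]} for col in quarter_columns
    ]
    return reshaped


# Hypothetical cleaned entry with two quarters of data.
flat_entry = {"GeoName": "North Dakota", "2005:Q1": 100.0, "2005:Q2": 110.0}
nested_entry = add_quarterly_gdp(flat_entry, ["2005:Q1", "2005:Q2"])
```

Keeping the quarters in an ordered list means each figure can be reached by position, which makes comparing one quarter with the one before it straightforward.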
Breaking out the data set in this way allowed me to more easily reference each figure and, using Python, calculate the GDP rate on a quarter-by-quarter basis.
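A sketch of that calculation, assuming the GDP rate is taken as the simple percent change from one quarter to the next (the BEA headline figure is annualized, and this post does not say which form was used). The keys "quarter_end", "gdp", and "rate" are hypothetical names for illustration.

```python
def add_gdp_rates(quarters):
    """Return a new list where each quarter also carries its percent change.

    The first quarter has no prior value, so its rate is None.
    """
    result = []
    previous = None
    for quarter in quarters:
        current = quarter["gdp"]
        if previous is None or current is None or previous == 0:
            rate = None
        else:
            rate = (current - previous) / previous * 100
        result.append({**quarter, "rate": rate})
        previous = current
    return result


# Hypothetical values: up 10 percent, then down 10 percent.
sample_quarters = [
    {"quarter_end": "2005:Q1", "gdp": 100.0},
    {"quarter_end": "2005:Q2", "gdp": 110.0},
    {"quarter_end": "2005:Q3", "gdp": 99.0},
]
rated_quarters = add_gdp_rates(sample_quarters)
```

Returning None for a missing or zero prior value avoids a division error and leaves gaps visible instead of silently filling them.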
Now that I have a usable data set, we can begin to explore it. First, a picture of how the U.S.’s GDP rate has performed overall:
Here’s what all 52 lines look like on one graph… It’s kind of beautiful in an “I threw paint against the wall and here’s what happened” kind of way.
Jokes aside, even looking at this data for a moment, I start to notice some interesting outliers. For instance, see the purple line that sticks out after the financial crisis, around Q4 2011 to Q1 2012? That’s… North Dakota?
My first question is what would cause a state with a GDP equivalent to Croatia’s* to have the three best growth periods in the U.S. since the financial crisis so far?
*See PS for a map of the U.S. and their similar country equivalents
WHY DO THIS?
As I said earlier on, I find this information fascinating. It’s widely used, complex, relevant data that businesses and academics use to better understand the world. Entire academic papers could be (and I suspect have been) written about the flaws or issues with using the GDP rate as a metric of economic health. Not to mention that most media outlets have covered these topics ad nauseam. But for now… it’s what I have… and I’m starting my journey toward trying to understand the world a little better through data science, and maybe someone else learns something along the way too.
Plus, as we get deeper into linear algebra and machine learning, new kinds of questions and ways to look at data will become available, which has the potential to open the door to new ideas, questions and opportunities for learning.
I have a few ideas for the next project with this data set. Now that it’s in a usable form, I’m thinking I’ll look at what has happened to different states and industries since the financial crisis and ask questions like: Which have recovered or even grown? Which have struggled? Was there any indication or trend before the crisis of that struggle?
Then I could see future plans looking at different data sources to take a deeper dive into the trends found through this analysis, such as what caused North Dakota’s growth spike at the end of 2011 and into 2012.
P.S. Here’s a chart of U.S. states and countries with similar GDP.
P.P.S. G+J ^-^ 11/07/2015