Let Data Tell the Story


Almost every time I tell friends and family I’m attending a data science bootcamp, they immediately ask, “What is data science?” According to Wikipedia, “Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.” If I gave them Wikipedia’s definition, I’d likely receive many questions about the answer itself, so instead of directing them to Wikipedia, I like to defer to Flatiron School’s definition on Learn.co.

Flatiron School describes data science as the intersection of programming skills, math and statistics, and business intelligence. Programming skills are necessary to collect data, business intelligence is essential to understand the data’s relevance, and math and statistics are important tools used to build predictive models. Through the use of these tools, data can drive business decisions.

Data can only drive business decisions if we allow it to tell a story. It is human nature to become stuck on an idea and then search for data that supports our preconceived notions. In college, I took a class called Healthcare in America, where instead of allowing data to tell stories, I was accustomed to telling stories based on theories pulled from a variety of papers and books. I distinctly remember learning about the extraordinary amount the United States spends on healthcare. When writing about it, I would state it as a fact based on prior readings and maybe search for a paper with a few numbers to support the idea. Data served as a supporting factor, rather than as a focal point.

Once I started Flatiron School’s data science bootcamp and realized that even minimal analysis has powerful potential, I decided to revisit healthcare spending data and let it tell a story.

In the Organisation for Economic Co-operation and Development’s (OECD) iLibrary, I found healthcare spending data broken down by country. I was intrigued to see what insights the dataset would offer and was excited to find a downloadable CSV file.

Using the Pandas library, I read the CSV file and created a list of dictionaries to begin sorting through the data.

The first two dictionaries within my list of dictionaries after reading the CSV file and using Pandas’ to_dict().
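
A minimal sketch of this step is shown here. It assumes the file has Location, Time, and Value columns. Only Location is named in this post, so the other two column names, the helper name load_records, and every sample value are placeholders for illustration.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the downloaded OECD file.
# "Time" and "Value" are assumed column names; all values are made up.
SAMPLE_CSV = """Location,Time,Value
AAA,2000,100.0
AAA,2017,150.0
BBB,2000,200.0
BBB,2017,260.0
"""


def load_records(csv_source):
    """Read a CSV (file path or file-like object) into a list of row dictionaries."""
    df = pd.read_csv(csv_source)
    # orient="records" yields one dictionary per row, keyed by column name
    return df.to_dict(orient="records")


records = load_records(io.StringIO(SAMPLE_CSV))
print(records[:2])  # the first two dictionaries in the list
```

With a real file, the same function could be called with the path to the CSV instead of the in-memory sample.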

At this point, it was time to begin analysis. I needed to decide how to guide the data to tell a meaningful, accurate story. As with any sort of spending, it is interesting to look at those in the extremes — in this case, the countries that spend the most on healthcare. As an initial step, I broke the data into two groups to help show any changes over time. I filtered one group to show total expenditure in 2000 and I filtered the other to show total expenditure in 2017.
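
Splitting the rows by year can be sketched with a list comprehension. The key names Time and Value, the helper filter_by_year, and the variable name two_thousand_list are assumptions, and the rows are made-up placeholders rather than OECD figures.

```python
# Placeholder rows; "Time" and "Value" are assumed key names, values are made up.
records = [
    {"Location": "AAA", "Time": 2000, "Value": 100.0},
    {"Location": "AAA", "Time": 2017, "Value": 150.0},
    {"Location": "BBB", "Time": 2000, "Value": 200.0},
    {"Location": "BBB", "Time": 2017, "Value": 260.0},
]


def filter_by_year(rows, year, year_key="Time"):
    """Keep only the rows whose year matches the requested year."""
    return [row for row in rows if row[year_key] == year]


two_thousand_list = filter_by_year(records, 2000)
two_thousand_seventeen_list = filter_by_year(records, 2017)
```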

I was eager to determine which countries spent the most on healthcare in 2000 and the extent to which their healthcare spending increased between 2000 and 2017 (inflation aside, I figured I’d unfortunately see large increases). Using nlargest from Python’s heapq module, I derived a list of the ten highest healthcare spending values in 2000 and a list of the dictionaries associated with those values.
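
One way to sketch this step is to pass a key function to nlargest so that it returns the full dictionaries, from which the values can then be read. This may differ from how the two lists were originally built. The Value key and the sample rows are assumptions, and the sample keeps only two rows instead of ten so the result is easy to check.

```python
from heapq import nlargest

# Placeholder 2000 rows; "Value" is an assumed key name, values are made up.
two_thousand_list = [
    {"Location": "AAA", "Time": 2000, "Value": 100.0},
    {"Location": "BBB", "Time": 2000, "Value": 200.0},
    {"Location": "CCC", "Time": 2000, "Value": 50.0},
]

TOP_N = 2  # the analysis in this post uses 10

# nlargest returns the rows with the largest "Value", in descending order
top_rows = nlargest(TOP_N, two_thousand_list, key=lambda row: row["Value"])
top_values = [row["Value"] for row in top_rows]

print(top_values)
```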

Next, I iterated through and filtered the two_thousand_seventeen_list() so the remaining dictionaries within the list were only those for which the ‘Location’ value was found in two_thousand_countries().
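
A sketch of that filter is shown here, treating two_thousand_seventeen_list and two_thousand_countries as plain variables. The set type, the Value key, and the sample rows are assumptions made for illustration.

```python
# Locations of the top spenders in 2000 (placeholder codes).
two_thousand_countries = {"AAA", "BBB"}

# Placeholder 2017 rows; values are made up.
two_thousand_seventeen_list = [
    {"Location": "AAA", "Time": 2017, "Value": 150.0},
    {"Location": "BBB", "Time": 2017, "Value": 260.0},
    {"Location": "CCC", "Time": 2017, "Value": 70.0},
]

# Keep only the 2017 rows for countries that were top spenders in 2000.
two_thousand_seventeen_list = [
    row
    for row in two_thousand_seventeen_list
    if row["Location"] in two_thousand_countries
]
```

Using a set for the country codes keeps each membership check cheap, although with ten countries a list would work just as well.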

At this point, I wanted to see my data and chose to do so with a histogram (strictly speaking, a grouped bar chart, with one pair of bars per country). After defining my buckets and corresponding values, I used Plotly to graph it.

Healthcare spending in USD per capita in 2000 and 2017 for the countries with the highest spending in 2000

No surprise: I immediately noticed that the United States was the country with the highest healthcare spending in both 2000 and 2017, and also the country with the greatest increase in healthcare spending during that time period.

I was also curious what the same sort of histogram would look like when measuring healthcare spending as a percentage of GDP, rather than in USD per capita. Fortunately, I was easily able to reuse the above code. I simply changed the CSV file path from which the data was pulled and used the exact same code to create the below histogram.
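
Reusing the same steps for a second file is easiest when the CSV source is a function parameter. The sketch below is one hypothetical way to package it. The function name compare_top_spenders, the Time and Value column names, and both sample datasets are assumptions. A country with no row for the later year gets None, so it can be flagged instead of silently dropped.

```python
import io
from heapq import nlargest

import pandas as pd


def compare_top_spenders(csv_source, n=10, base_year=2000, later_year=2017):
    """Return (location, base_value, later_value) for the top-n spenders in base_year."""
    records = pd.read_csv(csv_source).to_dict(orient="records")
    base_rows = [r for r in records if r["Time"] == base_year]
    later_values = {r["Location"]: r["Value"] for r in records if r["Time"] == later_year}
    top_rows = nlargest(n, base_rows, key=lambda r: r["Value"])
    # .get() returns None when a country has no row for the later year
    return [(r["Location"], r["Value"], later_values.get(r["Location"])) for r in top_rows]


# Placeholder data standing in for the two OECD files; all values are made up.
PER_CAPITA_CSV = """Location,Time,Value
AAA,2000,100.0
AAA,2017,150.0
BBB,2000,200.0
BBB,2017,260.0
"""
SHARE_OF_GDP_CSV = """Location,Time,Value
AAA,2000,9.0
BBB,2000,7.0
BBB,2017,8.0
"""

per_capita = compare_top_spenders(io.StringIO(PER_CAPITA_CSV), n=2)
share_of_gdp = compare_top_spenders(io.StringIO(SHARE_OF_GDP_CSV), n=2)
```

Only the argument changes between the two calls, which mirrors swapping the file path while leaving the rest of the code untouched.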

Healthcare spending as a percent of GDP for 2000 and 2017 for the countries with the highest spending in 2000 *No BRA data provided for 2017

Yet again, the United States is towering over other big spenders. Before looking at this data, I knew the United States has a highly inefficient healthcare system and spends more than its peers on healthcare. By sorting through and visualizing this data, the extent to which the United States spends more on healthcare than its peers is clear. The data told a story.

And with the amazing power of data, there is so much yet to be told. There are various ways to further analyze this data set, plus additional data sets can be introduced. From here, it would be interesting to integrate data surrounding healthcare outcomes and healthcare equity. Do the countries that spend the most on healthcare necessarily have better healthcare outcomes and more equitable healthcare access? If the data were grouped based on healthcare models, would there be any evident trends?


Quite frankly, I could keep listing additional steps that can be taken to further analyze this data, as well as related data. What’s important to recognize is that each of these steps serves as a guide to allow the data to tell a story. Though there are no groundbreaking conclusions from my look into healthcare spending, it tells a story and goes beyond simply stating “in 2017, the United States spent $2,000 per capita more than any other nation.” It serves as the introduction, as the blueprint, for analysis. From here, there are innumerable additional ways in which data can be gathered and analyzed to continue the story.