Using BigQuery To Analyze Healthcare Data
The cloud makes collaboration easier
Today we wanted to discuss using cloud tools that are available to everyone to analyze a medical data set.
In particular will be using the Kaggle data set for medicare providers. This has information on diagnosis-related groups’ average costs, hospital locations, and interesting facts about providers and their service quality.
These data sets provide you the ability to answer questions about which states have the highest quality of service (based on the metrics they provide), the highest average costs, and the demographics of the patient bases.
Since healthcare costs tend to be one of the largest focuses, we are going to look at which states have the largest percentage of costs that are above average. We will be using SaturnCloud.io as a platform for our Jupyter Notebook and BigQuery in order to query the data.
Tools We Will Use
SaturnCloud.io allows you to easily spin up a VM that already has Jupyter Notebook installed. But it is really much more than just that.
Using SaturnCloud.io makes it easy to share your work with other teammates without having to deal with the hassle of making sure they have all the correct libraries installed. Also, for those of you who have had the issue of running Jupyter Notebooks locally and run out of memory, it also allows you to spin up VMs with the memory and ram required and only pay for what you use.
There are also a few other bells and whistles that really mitigate some of the complexities of work typically done from a DevOps perspective. That’s what makes tools like this great. In many ways, until you work in a corporate environment where even sharing a basic Excel document can become cumbersome, it is almost hard to explain to younger data scientists why this is so great.
Due to where the data is being stored, we will also need to interact with BigQuery. This is a Google Cloud service that allows you to query data. In this case, the Medicare data is open to everyone but is stored on Google Cloud.
As with most cloud services, this makes it convenient to manage as long as Google keeps this data public. It is really easy to pull the data and run queries without having to download a CSV and load it into your own database.
Before we get started, you can also check out the Jupyter Notebook here.
Connecting to BigQuery
For step one, we will be retrieving the
cms_medicare public data set. In order to do so, you will need to set up Google API credentials (see instructions here). You will notice that the credentials don’t so much require an API key as much as the .JSON file that has the key, as seen below.
Once you have set up your credentials, you can then create a reference to a client. This will be your access point.
To get an understanding about what you are working with, you can actually print the list of tables in the
cms_medicare data set. Using the code below you can iterate through the tables so you know what you are working with.
Now once you have pulled the data set, you actually have access to several tables. You can run the print statement above to get a list of all the tables.
For this analysis, we will be using the
inpatient_charges_2014 data set. This has information on the provider and the diagnosis-related group (DRG) code, which isn’t 100% on the diagnosis as is an ICD code, but it is what the data set contains. We say that because we will be comparing DRG codes across states. This isn’t really a fair comparison because there could be all sorts of specifics lost in the roll up that could lead to higher or lower costs. We are doing this for the sake of showing how you could do future work.
Start Your Analysis by Asking a Question
To start this analysis, let’s just ask a basic question. Like stated earlier, we want to know which states consistently charge more per DRG definition than other states.
What states consistently have higher-than-average prices?
In order to do so, we will need to calculate the average price at the granularity of state. This can be done both using pandas as well as SQL. (Personally, I believe SQL is usually better built for this kind of work, so that’s what we’ll use.) We can write a query like the one below.
Using this query as a subquery, we can join that both to the
hospital_general_info and then back to the original table. This will provide the original cost as well as the state information. This will allow us to compare the cost of
drg_definitions as well as group by state.
Then we can use a case statement to calculate what percentages of DRG definitions are over the average costs. This can be seen in the query below in the
Using this query, you can now figure out which states have the highest percent of DRG definitions that charge more than the average. Below you can see us pulling the top few, and you will notice that the highest states are Alaska, California, and Vermont.
Data Visualization — Mapping Healthcare Costs
This is a great example of somewhere where you can use a simple data visual to distill the information into a concise graphic. Looking at numbers and state codes is hard. It becomes hard to really get what is going on. Instead, let’s use the Ploty library to map it out.
With this map, you can better see which states are the biggest culprits of high costs. Instead of being stuck staring at numbers, you now have a clear picture of which states charge more than the average.
Again, this isn’t at the granularity of ICD codes, so a lot of information is lost. But this is a good generalization.
From here, we would look into the various high-cost states and see if there were specific DRG codes that were consistently over at the provider level. But for now, we will wrap up this post before it gets too long.
If you would like, you can check out the Jupyter Notebook here. You might even want to pose you own questions about the costs of different medical procedure categories, the level of service at different providers, and other conclusions you can draw using this data set.