Inspecting a CDC Chronic Disease Dataset

Recently, I’ve taken on a personal project to apply the Python and machine learning I’ve been studying. Since I have an interest in population health, I decided to start by focusing on a 15-year population health dataset I found on Kaggle. The dataset comes from the US Centers for Disease Control and Prevention (CDC) and covers chronic disease indicators. In this blog series, I want to demonstrate what is in the dataset through exploration. Later on, I’ll go into more of the data visualization. At this time, I’m not sure I see an opportunity for actual machine learning with this dataset alone.

According to the overview on Kaggle, the limited contextual information provided with this dataset notes that the indicators are collected at the state level from 2001 to 2016, and that there are 202 indicators. Using a Jupyter notebook and pd.read_csv() on the file, I found 403,984 rows with 34 columns, or attributes. Running the .info() method, the second column in the output below shows that we have some missing data. The last column shows the different data types: some are numerical, such as integers and floating-point values, while others are objects containing strings of characters.
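As a minimal sketch of the loading step (using a tiny two-row stand-in built in memory, since the real Kaggle CSV isn’t reproduced here — the column names and values below are just illustrative):

```python
import io
import pandas as pd

# A tiny stand-in for the real Kaggle CSV; same loading pattern,
# hypothetical rows. Note the empty DataValue field on the second row.
csv_text = """YearStart,Topic,DataValue
2015,Diabetes,9.1
2016,Asthma,
"""
df = pd.read_csv(io.StringIO(csv_text))
df.info()  # prints the dtype and non-null count for every column
```

The non-null counts reported by .info() are how the missing data below was spotted.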

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403984 entries, 0 to 403983
Data columns (total 34 columns):
YearStart 403984 non-null int64
YearEnd 403984 non-null int64
LocationAbbr 403984 non-null object
LocationDesc 403984 non-null object
DataSource 403984 non-null object
Topic 403984 non-null object
Question 403984 non-null object
Response 79323 non-null object
DataValueUnit 374119 non-null object
DataValueType 403984 non-null object
DataValue 297817 non-null object
DataValueAlt 273666 non-null float64
DataValueFootnoteSymbol 188019 non-null object
DatavalueFootnote 187853 non-null object
LowConfidenceLimit 246819 non-null float64
HighConfidenceLimit 246819 non-null float64
StratificationCategory1 403984 non-null object
Stratification1 403984 non-null object
StratificationCategory2 79323 non-null object
Stratification2 79323 non-null object
StratificationCategory3 79323 non-null object
Stratification3 79323 non-null object
GeoLocation 401416 non-null object
ResponseID 79323 non-null object
LocationID 403984 non-null int64
TopicID 403984 non-null object
QuestionID 403984 non-null object
DataValueTypeID 403984 non-null object
StratificationCategoryID1 403984 non-null object
StratificationID1 403984 non-null object
StratificationCategoryID2 79324 non-null object
StratificationID2 79324 non-null object
StratificationCategoryID3 79323 non-null object
StratificationID3 79323 non-null object

Let’s understand what each column is about. While some of the column names are relatively self-explanatory, I used set(dataframe['ColumnName']) to better understand the unique categorical data. Here are some examples:

Topic: The 400k+ rows of data are grouped into the following 17 categories. There is a corresponding column called TopicID that simply gives an abbreviated label for each.

{'Alcohol',
'Arthritis',
'Asthma',
'Cancer',
'Cardiovascular Disease',
'Chronic Kidney Disease',
'Chronic Obstructive Pulmonary Disease',
'Diabetes',
'Disability',
'Immunization',
'Mental Health',
'Nutrition, Physical Activity, and Weight Status',
'Older Adults',
'Oral Health',
'Overarching Conditions',
'Reproductive Health',
'Tobacco'}
Distribution of the data by topic: Diabetes, Chronic Obstructive Pulmonary Disease, and Cardiovascular Disease are the top three topics.
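The distribution shown in the chart comes straight out of value_counts(). A sketch on a few toy rows (the real dataframe has 400k+ of them):

```python
import pandas as pd

# Toy stand-in rows; only the Topic column matters for the count.
df = pd.DataFrame({'Topic': ['Diabetes', 'Diabetes',
                             'Cardiovascular Disease', 'Asthma']})
topic_counts = df['Topic'].value_counts()  # one row per topic, sorted by count
print(topic_counts)
# To reproduce the bar chart: topic_counts.plot(kind='bar')
```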

Question: Within each topic, there are a number of questions; a corresponding column, QuestionID, gives each one a label that we’ll use later. These questions are the 202 unique indicators for which the dataset has values, and we’ll analyze them further.

{'Activity limitation due to arthritis among adults aged >= 18 years',
'Adults aged >= 18 years with arthritis who have taken a class to learn how to manage arthritis symptoms',
'Adults with diagnosed diabetes aged >= 18 years who have taken a diabetes self-management course',
'Alcohol use among youth',
'Alcohol use before pregnancy',
'All teeth lost among adults aged >= 65 years',
'Amount of alcohol excise tax by beverage type (beer)',
...
 'State child care regulation supports onsite breastfeeding',
'States that allow stronger local tobacco control and prevention laws',
'States with strong polices that require retail licenses to sell tobacco products',
'Taking medicine for high blood pressure control among adults aged >= 18 years with high blood pressure',
'Television viewing among high school students',
'Timeliness of routine health care checkup among women aged 18-44 years',
'Visits to dentist or dental clinic among adults aged >= 18 years',
'Visits to dentist or dental clinic among adults aged >= 18 years with diagnosed diabetes'}
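The 202-indicator figure can be checked with nunique(), and the same call grouped by Topic shows how the questions split across categories. A sketch on three toy rows (question text abbreviated):

```python
import pandas as pd

# Toy Topic/Question pairs standing in for the real 400k+ rows.
df = pd.DataFrame({
    'Topic': ['Arthritis', 'Arthritis', 'Alcohol'],
    'Question': ['Activity limitation due to arthritis',
                 'Arthritis self-management class',
                 'Alcohol use among youth'],
})
n_unique = df['Question'].nunique()                    # unique indicators overall
per_topic = df.groupby('Topic')['Question'].nunique()  # indicators per topic
print(n_unique)
print(per_topic)
```

On the real dataframe, n_unique should come back as 202.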

DataSource: Given that we have so many indicators, I’m not surprised that there are 33 data sources. However, the following histogram shows that the majority of the data comes from two sources: BRFSS, the CDC’s Behavioral Risk Factor Surveillance System, and NVSS, the National Vital Statistics System.

{'ACS 1-Year Estimates',
'ACS 5-Year Estimates',
'AEDS',
'ANRF',
'APIS',
'ASHW',
'BRFSS',
'Birth Certificate, NVSS',
'CDC Breastfeeding Report Card',
"CDC Children's Food Environment State Indicator Report",
'CDC School Health Profiles',
'CMS CCW',
'CMS Part A Claims Data',
'Current Population Survey',
'Death certificate',
'HRSA, UDS',
'InfoUSA; USDA',
'Legal Research',
'NSCH',
...
CDC’s BRFSS and NVSS make up almost the entirety of this dataset
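To quantify "almost the entirety," value_counts(normalize=True) gives each source’s share of the rows rather than a raw count. A sketch on a toy column (the proportions below are illustrative, not the real ones):

```python
import pandas as pd

# Toy DataSource column; in the real data BRFSS and NVSS dominate.
df = pd.DataFrame({'DataSource': ['BRFSS'] * 3 + ['NVSS'] * 2 + ['APIS']})
share = df['DataSource'].value_counts(normalize=True)  # fraction of rows per source
print(share)
```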

DataValueUnit: Values in DataValue come in the following units, including percentages, dollar amounts, years, and cases per 1,000 (or 10,000, 100,000, and so on).

{' ',
'$',
'%',
'Number',
'Number ',
'Years',
'cases per 1,000',
'cases per 1,000,000',
'cases per 10,000',
'cases per 100,000',
'gallons',
nan,
'pack sales per capita',
'per 100,000',
'per 100,000 residents'}
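Notice that the set above contains near-duplicates ('Number' vs 'Number ') plus a lone blank string and nan. If we ever group by unit, it would be worth normalizing these first; a sketch:

```python
import numpy as np
import pandas as pd

# The unit strings seen above, including the trailing-space variant.
units = pd.Series(['%', 'Number', 'Number ', ' ', np.nan])

# Trim whitespace, then treat the now-empty strings as missing.
clean = units.str.strip().replace('', np.nan)
print(set(clean.dropna()))
```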

DataValueType: The following categories are insightful, showing that there are age-adjusted numbers as well as raw numbers, which helps when we want to compare data across states.

{'Adjusted by age, sex, race and ethnicity',
'Age-adjusted Mean',
'Age-adjusted Prevalence',
'Age-adjusted Rate',
'Average Annual Age-adjusted Rate',
'Average Annual Crude Rate',
'Average Annual Number',
'Commercial host (dram shop) liability status for alcohol service',
'Crude Prevalence',
'Crude Rate',
'Local control of the regulation of alcohol outlet density',
'Mean',
'Median',
'Number',
'Per capita alcohol consumption',
'Percent',
'Prevalence',
'US Dollars',
'Yes/No'}

DataValue vs DataValueAlt: DataValue appears to be the column that will be the target in our future analysis. Using the .head() method, we can see that this column stores numerical values as string objects, while DataValueAlt is numerical float64. Later on, I want to use the pandas pivot_table method, which requires numerical data. As a result, I will be using DataValueAlt for the analysis down the line.
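To illustrate why the float column matters, here is pivot_table on a toy state-by-year frame (the state abbreviations and numbers are made up; only the mechanics carry over):

```python
import pandas as pd

# Toy rows: the numeric DataValueAlt column can be pivoted directly.
toy = pd.DataFrame({
    'LocationAbbr': ['CA', 'CA', 'NY'],
    'YearStart': [2015, 2016, 2015],
    'DataValueAlt': [9.1, 9.4, 8.7],
})

# pivot_table aggregates (mean by default) into a state-by-year grid;
# combinations with no rows come out as NaN.
pt = toy.pivot_table(values='DataValueAlt',
                     index='LocationAbbr', columns='YearStart')
print(pt)
```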

Stratification and Stratification Category related columns: There are 12 columns related to stratifications, which are subgroups within each indicator such as gender, race, and age. StratificationCategory1 contains gender, overall, and race; Stratification1 holds the values within each category, such as the specific races; and ID columns such as StratificationID1 give the corresponding labels.
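The category-to-values relationship can be seen with a groupby. A sketch on toy rows (the exact label spellings, e.g. 'Race/Ethnicity', are assumptions; check them against set() on the real column):

```python
import pandas as pd

# Toy rows pairing each stratification category with its values.
df = pd.DataFrame({
    'StratificationCategory1': ['Overall', 'Gender', 'Gender', 'Race/Ethnicity'],
    'Stratification1': ['Overall', 'Male', 'Female', 'Hispanic'],
})
values_by_cat = df.groupby('StratificationCategory1')['Stratification1'].unique()
print(values_by_cat)
```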

Missing Data:

Using matplotlib and seaborn below to produce a heatmap, it’s easy to see where there is data, where it is missing, and how much is missing. Yellow represents missing data. The columns are the dataframe’s attributes, and the vertical axis spans the 400k rows of data.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.figure(figsize=(16, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Large portions of missing data are related to Stratification Categories 2 and 3 and the related Stratification 2 and 3 columns. The horizontal axis consists of the column names.
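The heatmap gives a visual read; the same per-column missing fraction can be computed directly with df.isnull().mean(). A sketch on a toy frame with one mostly-missing column:

```python
import numpy as np
import pandas as pd

# Toy frame: one fully populated column, one that is 75% missing.
df = pd.DataFrame({
    'Topic': ['Diabetes', 'Asthma', 'Cancer', 'Alcohol'],
    'StratificationCategory2': [np.nan, np.nan, np.nan, 'Gender'],
})

# isnull() gives booleans; mean() turns them into the fraction missing.
missing_frac = df.isnull().mean().sort_values(ascending=False)
print(missing_frac)
```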

In the heatmap, Response and the columns related to StratificationCategory 2/3 and Stratification 2/3 are less than 20% populated. While StratificationCategory1 and Stratification1 appear to have potentially useful data, let’s confirm what is actually in 2 and 3. For each stratification column, I follow a similar approach:

df['StratificationCategory2'].count()

df_strat2cat = []
for i in df['StratificationCategory2']:
    if not pd.isnull(i):
        if i != ' ':
            df_strat2cat.append(i)

As an example, the .count() call returned about 79k non-null entries for this column. I wanted to see what was actually in there, so I set up a for loop to go through each element in the specific Stratification 2 or 3 column and append any value that is neither null nor a blank space to a new list called df_strat2cat. Surprisingly, this resulted in an empty list: the 79k "non-null" entries were all blank spaces. After repeating this with the other stratification columns, I dropped this set of columns.
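The same check can be done without an explicit loop, using vectorized string operations (shown here on a toy series of blanks and NaN standing in for the real column):

```python
import numpy as np
import pandas as pd

# Toy column: nothing but blank spaces and NaN, like StratificationCategory2.
s = pd.Series([' ', np.nan, ' ', np.nan])

# Keep only values that are non-null AND non-blank after stripping.
non_blank = s[s.notnull() & (s.str.strip() != '')]
print(len(non_blank))  # → 0, confirming there is nothing usable
```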

df_new = df.drop(['Response', 'ResponseID',
                  'StratificationCategory2', 'StratificationCategory3',
                  'Stratification2', 'Stratification3',
                  'StratificationCategoryID2', 'StratificationCategoryID3',
                  'StratificationID2', 'StratificationID3'], axis=1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403984 entries, 0 to 403983
Data columns (total 24 columns):
YearStart 403984 non-null int64
YearEnd 403984 non-null int64
LocationAbbr 403984 non-null object
LocationDesc 403984 non-null object
DataSource 403984 non-null object
Topic 403984 non-null object
Question 403984 non-null object
DataValueUnit 374119 non-null object
DataValueType 403984 non-null object
DataValue 297817 non-null object
DataValueAlt 273666 non-null float64
DataValueFootnoteSymbol 188019 non-null object
DatavalueFootnote 187853 non-null object
LowConfidenceLimit 246819 non-null float64
HighConfidenceLimit 246819 non-null float64
StratificationCategory1 403984 non-null object
Stratification1 403984 non-null object
GeoLocation 401416 non-null object
LocationID 403984 non-null int64
TopicID 403984 non-null object
QuestionID 403984 non-null object
DataValueTypeID 403984 non-null object
StratificationCategoryID1 403984 non-null object
StratificationID1 403984 non-null object
dtypes: float64(3), int64(3), object(18)

With df_new, the seaborn heatmap shows minimal yellow and mostly purple. If we wanted to go further, we could fill in the missing data, but at this time I’ll leave that additional work for a later stage.
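For reference, one simple way that filling could look is median imputation on the numeric column. A sketch on a toy series (the numbers are made up, and the median is just one of several reasonable choices):

```python
import numpy as np
import pandas as pd

# Toy numeric column with a gap, standing in for DataValueAlt.
s = pd.Series([9.1, np.nan, 8.7], name='DataValueAlt')

# fillna with the column median; the median of [9.1, 8.7] is 8.9.
filled = s.fillna(s.median())
print(filled)
```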

To recap, I imported the CSV file into a dataframe using pandas, then used various approaches to better understand the data within each column, since the dataset came with very limited contextual information. The group of Stratification 2 and 3 columns were not useful, so they were removed. In the next post, we’ll take the resulting dataframe and dig into the relationships among specific indicators.