Parsing JSON files using Python

Take a look at an easy guide on how to use Python to read JSON files.

Published in

Analytics Vidhya

11 min readDec 25, 2019

Death touches each one of us, but it simply is not a pleasant topic to talk about! But, contrary to that, in today’s world of data and analytics, we should be discussing death because using the mind-boggling amount of data that we generate everyday, we can look at the topic of death with a positive lens. What if medical practitioners could reduce the death count using data and help us live happily and enjoy our life? Could you imagine the impact it can have on all of us?

Medical deaths due to diseases have a higher chance of prevention and cure if treated correctly on time. Accurate information about different diseases affecting majority of the people in a given area would be invaluable for medical practitioners. They could advise people on the preventive measures, mark out a plan to research on vaccines targeted to prevent those diseases as well as formulate new drugs to cure patients. All of you would agree that in today’s world of data explosion, collecting accurate data would not be a problem!

In this article, we will investigate a similar data set which tells us about the leading causes of death in New York city since 2007. Year-wise trends from 2007 can help measure the outcomes of new medicines/vaccines and its effect on the death rate. It will also help us shortlist diseases which are becoming more prominent and need urgent attention. There are quite a few questions we could answer using this data set, including:

Does the death count vary based on sex?
What are the major cause of deaths per sex?
What are the major causes of death race-wise?
What is the year-wise pattern of causes of death?
What are the major causes of death in young people?

Enough of talking! Now, let us get started with the actual work. The data we have is a JSON file. What exactly is JSON and how are we going to parse it in Python? Let us peek at the beginning of our JSON file:

{'meta': {'view': {'id': 'jb7j-dtam',
   'name': 'New York City Leading Causes of Death',
   'attribution': 'Department of Health and Mental Hygiene (DOHMH)',
   'averageRating': 0,
   'category': 'Health',
   'createdAt': 1384963296,
   'description': 'The leading causes of death by sex and ethnicity in New York City in since 2007. Cause of death is derived from the NYC death certificate which is issued for every death that occurs in New York City.\n\n</p>\nReport last ran: 11/18/2018</br>\nRates based on small numbers (RSE > 30) as well as aggregate counts less than 5 have been suppressed in downloaded data</p>\nSource: Bureau of Vital Statistics and  New York City Department of Health and Mental Hygiene',

Yup, this looks confusing, but we will decode it using the below road map:

Understanding JSON
Reading from a JSON file and extracting it in a Data Frame
Cleaning the data
Analyzing data to get answers to our questions.

Understanding JSON

What is JSON? Why do we use JSON?

Before we jump into the actual definition of JSON, let us look at an example. Let us say that two people, one person speaking Hindi only and the other Chinese only, want to communicate with each other. How can they talk if they don’t understand each others languages? They would need a translator! But, wouldn’t it be easier and faster if they knew a common language? In the same way, say suppose JavaScript and Python want to communicate. They don’t know each others language, but if they know one common language, they can send and read messages. Also, they cannot read a dump of text messages, they need formatted messages with some grammar and syntax. A simple language independent format of organized messages which both languages can read is JSON. Both JavaScript and Python have built-in JSON libraries to convert JSON into their respective programming languages which they understand.

JSON objects are human-readable lists of name/value pairs. JSON data looks much like a dictionary would in Python, key:value pairs. This format encodes data structures like lists and dictionaries as strings to ensure that machines can read them easily. It can have nested name/value pairs as seen below for ‘address’.

{
"data": [
 {
  "id": "A001",
  "name": "John"
  "address": {
   "city": "Los Angeles",
   "state": "California",
   "zip": "90001"},
  "position": "Manager",
 "department": "Sales",
 "number of years": "10" 
},
 {
  "id": "A002",
  "name": "Ana"
  "address": {
   "city": "San Diego",
   "state": "California",
   "zip": "91911"},
  "position": "Analyst",
  "department": "Marketing",
 "number of years": "2" }
]
}

JSON is easier to parse, quicker to read and write. JSON syntax is based on JavaScript, but is in text format. Now that we know the basics of JSON and its structure, let us jump into our source JSON and read it in Python.

Reading from a JSON File and Extracting it in a Data Frame

Exploring the JSON file: Python comes with a built-in package called json for encoding and decoding JSON data and we will use the json.load function to load the file.

import json
file = open("NY_cause_of_death.json")
ny = json.load(file)

Even if Python has an in-built library, we still need to know how to find the data we need. In our JSON file, you will see two top level keys:

meta: This usually refers to information about the data itself. This key has other nested key: view.
data: This is the second key containing the actual data. (You can search this key in the big JSON file.)

ny.keys()
> dict_keys(['meta', 'data'])

When you explore the data key in the file, you will see a list-of-lists containing the actual data, but without column names as shown below.

'data': [['row-t4eu.wiij.n7je',
   '00000000-0000-0000-CAA9-9C2FF9B97733',
   0,
   1565986487,
   None,
   1565986487,
   None,
   '{ }',
   '2016',
   'Essential Hypertension and Renal Diseases (I10, I12)',
   'Male',
   'Other Race/ Ethnicity',
   None,
   None,
   None],
  ['row-v3af.mvmp-b2z4',
   '00000000-0000-0000-AA10-AA914EA7CA64',
   0,
   1565986487,
   None,
   1565986487,
   None,
   '{ }',
   '2016',
   'All Other Causes',
   'Female',
   'Non-Hispanic White',
   '2416',
   '172.4',
   '102.4'],
  ['row-iuzu~fmzm-zkwx',
   '00000000-0000-0000-0A32-492179ED2C6E',
   0,
   1565986487,
   None,
   1565986487,
   None,
   '{ }',
   '2016',
   'Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)',
   'Female',
   'Non-Hispanic White',
   '156',
   '11.1',
   '7.2'],

So, before we explore how to extract the data, let us see how we can access the column names.

Getting header information (column names): Keeping in mind the key:value syntax, as you scroll down to the meta.view.columns key in the JSON file, you can see many items nested in the values. Items have key:value pairs starting with ‘id’ and ending with ‘flags’ keys.

"columns" : [ {
        "id" : -1,
        "name" : "sid",
        "dataTypeName" : "meta_data",
        "fieldName" : ":sid",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "id",
        "dataTypeName" : "meta_data",
        "fieldName" : ":id",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "position",
        "dataTypeName" : "meta_data",
        "fieldName" : ":position",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "created_at",
        "dataTypeName" : "meta_data",
        "fieldName" : ":created_at",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }

If you observe the above code snippet, you can see that the ‘fieldName’ key has values like sid, id, position, created_at. As you open and scroll down the JSON file, you will see other values like year, leading_cause, sex, etc. Looking at these values, we can say that ‘fieldName’ is the relevant key containing information about the column names that we need. We just have to extract the ‘fieldName’ key from each item chunk in ‘columns’. Let us look at the code for extracting the column names.

import ijson#Specify the file we want to extract data from
filename = "NY_cause_of_death.json"
with open(filename, 'r') as f:
    #Extract each item from the meta.view.columns list
    objects = ijson.items(f, 'meta.view.columns.item') 
    columns = list(objects) #Convert generator to list

We will use the ijson library which iteratively parses the json file instead of reading it all in at once. This is slower than directly reading the whole file in, but it enables us to work with large files that can’t fit in memory.

After opening the file, we will use the ijson.items() method to extract a list from the path of meta.view.columns from the JSON file. Here, meta.view.columns.item indicates that we will extract each individual item in the meta.view.columns list. The ‘items’ method will return a generator and we can use the list method to convert it into a Python list. The first item of the list ‘columns’ looks as given below:

{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data', 'fieldName': ':sid', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}

Each item in ‘columns’ list is a dictionary that contains information about each column. You can double check this is the above snippet of the JSON file. As discussed, ‘fieldName’ key is important for us. Let us look at it in detail.

column_names = [col["fieldName"] for col in columns][':sid',
 ':id',
 ':position',
 ':created_at',
 ':created_meta',
 ':updated_at',
 ':updated_meta',
 ':meta',
 'year',
 'leading_cause',
 'sex',
 'race_ethnicity',
 'deaths',
 'death_rate',
 'age_adjusted_death_rate']

Perfect, we were correct about the ‘fieldName’ key and we can see that it has the columns that we need! Now, we can move to extracting the data for these columns.

Extracting Data: Let us select the columns that we would need for our analysis.

good_columns = [
    'year',
 'leading_cause',
 'sex',
 'race_ethnicity',
 'deaths',
 'death_rate',
 'age_adjusted_death_rate']

As discussed above, the top-level key ‘data’ has the actual data locked in a list-of-lists. We can now use the columns extracted in ‘good_columns’ to fetch data only for columns we need.

data = []#Extract data
with open(filename, 'r') as f:
    objects = ijson.items(f, 'data.item')
    for row in objects:
        selected_row = []
        for item in good_columns:
            selected_row.append(row[column_names.index(item)])
        data.append(selected_row)

The ijson.items() method will return each item in the data key using data.item path.
The first for loop will loop through each item in data key giving us a row-wise list.
Once we have the row-data, we now want to extract data only for the ‘good_columns’. Hence, we will use another for loop to go through each column data in the selected row. We will find the position for our selected columns using: column_names.index(item) where item will be from ‘good_columns’.
row[column_names.index(item)] will give us the data for the ‘good_columns’ in each row, which we will append to our list ‘selected row’. In this way, we will select data for each columns of ‘good_columns’ in that row.
Once this is done, we will follow the same procedure for each row and append our final data which will be a list-of-lists to ‘data’. We will then convert it into a data-frame for easier analysis.

data = pd.DataFrame(data, columns=good_columns)

Awesome! Our data from the JSON file is ready in a data frame for analysis.

We can now move on to data cleaning and analysis to answer our questions.

Cleaning the Data

This is an important step before we go ahead with our analysis as clean data (without null values, correct text, correct data types) is a prerequisite for any calculations and inferences. We can work on the below points. For details you can refer to the notebook.

Change None, null and ‘.’ values in ‘deaths’ column to 0 and convert it to numeric.
‘Sex’ column has different values: ‘F’ and ‘Female’, ‘M’ and ‘Male’. We will convert all values to ‘Female’ and ‘Male’.
Change ‘.’ values in ‘death_rate’ column to 0 and convert it to numeric.
‘leading_cause’ is an important column as it is giving us important information about the death, but it has some numbers which we will remove to clean the column values.

Analyzing Data to get Answers to our Questions

For detailed analysis, you can refer the notebook.

Does the death count vary based on sex?

We can see that the number of deaths in males and females are almost equal with female count slightly higher.

Are number of deaths for each sex varying across years?

Both male and female death counts from 2007–2016 have been more than 25000 each year. Female death count was highest in 2008, while male death count was in 2016.
There is a dip in the female death count from 2009–2014, but after that it has increased again. Male death count has increased consistently from 2009 on wards.

What are the major cause of deaths per sex?

Male and Female leading causes of death are diseases of heart, malignant cancer. Number of female deaths due to these two causes has been higher than males. High counts maybe due to improper lifestyle of junk food, chemical exposure, habits, overall lifestyle and stress.
Assault and Parkinson’s disease are affecting males only. Males have a higher chance of death due to accidents except drug poisoning, chronic liver disease, HIV, suicide, mental disorders — accidental poisoning.
Females have a higher chance of dying due to Alzheimer’s disease, cerebrovascular disease, respiratory diseases, hypertension and renal diseases, influenza and pneumonia, nephritis, septicemia.

What are the major causes of death race-wise?

Along with diseases of heart, cancer, diabetes, influenza and pneumonia, some of the prominent causes of death for different races are:

Non-Hispanic White: Accidents except drug poisoning, Alzheimer’s disease, cerebrovascular disease, respiratory diseases, intentional self-harm, mental disorders - accidental poisoning and septicemia.
Non-Hispanic Black: Prominent causes are assault, cerebrovascular disease, respiratory diseases, hypertension and renal diseases.
Hispanic: accidents except drug poisoning, cerebrovascular disease, respiratory diseases, HIV, mental disorders due to accidental poisoning.
Asian and Pacific Islander: Cerebrovascular disease.

What is the year-wise pattern of causes of death?

As seen above the leading cause of death in all buckets of three years was diseases of heart followed by cancer.
Deaths due to Alzheimer’s disease, cerebrovascular disease, liver disease, respiratory diseases, diabetes, hypertension and renal diseases, mental disorders - accidental poisoning, Parkinson’s disease, perinatal conditions have increased over the years. More research on how to develop drugs to avoid or cure these diseases can be undertaken.
Deaths due to assault, congenital malformations, deformations, HIV, nephritis and hepatitis have decreased over the years. This information can be useful to see if new drugs are effective in reducing the count.
Parkinson’s disease has been the major cause of deaths in 2014–2017 bracket which did not cause death in earlier years.

What are the common causes of death in young people?

Across years, Hispanic, non-Hispanic black and Asian males are dying due to heart diseases which is quite alarming. Reasons for this could be consumption of more processed foods, lack of exercise due to sedentary lifestyle, obesity, smoking, etc. Prevention is the best cure and medical practitioners can provide guidance to prevent.
More number of young Hispanic males are getting affected among all the races.More research needs to be done on why more Hispanic males are getting affected.

Conclusion

JSON is a simple language-independent format of organized messages which different languages can read. Most of the languages have libraries to parse JSON.
JSON data looks much like a dictionary would in Python, with key:value pairs.
We can read any JSON file by looking at the top level keys and extract the column names and data using the json or ijson library.