alpharithms
Published in

alpharithms

Tabulate JSON Data in Python Using Pandas

Trying to tabulate JSON into a human-friendly format? Is your JSON data nested? In any case, the Pandas library can make short work of converting this common data format into a more reader-friendly presentation when needed!

data tabulate json python

Tabulating data is the process in which different segments are aligned in vertical columns and horizontal rows. This is a common presentation format for things like tables, spreadsheets, calendars, and business reports. Tabulating JSON data in Python can be easy when using the Pandas DataFrame object to do the heavy lifting.

JavaScript Object Notation (JSON) is a standardized data format common among web-based languages, Representational State Transfer (REST) APIs, and inter-application data messaging. Practically, the ability of JSON to be tabulated depends on the structure of the data. In this article, we’ll take a look at how to quickly tabulate JSON in Python using Pandas and also some cases where results might be poor.

Setup: Sample JSON Data

Before we begin tabulating JSON-formatted data let’s first create some sample data. We will create records of 25 random employees using the create_random_people_data function available via Github. We will then save this data to a local file named sample.json via the following code:

import json

# Generate data for 25 people
people = create_random_people_data(25)

# Write the output date as a json-formatted file
with open("sample-data.json", 'w')as file:
file.write(json.dumps(people))

This saves a file with the following data (some records removed for brevity):

[
{
"id": 1,
"first_name": "Mary",
"last_name": "Smith",
"location": {
"City": "New York",
"State": "New York"
},
"age": 32
},
...
{
"id": 25,
"first_name": "Pat",
"last_name": "Smith",
"location": {
"City": "Denver",
"State": "Colorado"
},
"age": 32
}
]

Tabulate JSON Data

To tabulate our JSON data we will use the Pandas library and make use of the DataFrame object. To approach this, the from_dict method can be used in conjunction with the json.load method as such:

import pandas as pd
import json

# open file
with open('sample.json', 'r')as file:
df = pd.DataFrame.from_dict(json.load(file))

# print first 5 records via head() method
print(df.head())

id first_name last_name location age
0 1 Mary Smith {'City': 'New York', 'State': 'New York'} 32
1 2 John Jones {'City': 'Los Angeles', 'State': 'California'} 18
2 3 Pat Smith {'City': 'Denver', 'State': 'Colorado'} 32
3 4 John Jones {'City': 'Los Angeles', 'State': 'California'} 42
4 5 Mary Jacobs {'City': 'Denver', 'State': 'Colorado'} 42

This is a good start however there are some issues. The id field should likely be used to index our data — this is a minor concern. The more problematic issue is that our location data is still shown as originally formatted, now reflecting Python’s dict format (from being parsed via json.load). To address these issues, we will use the normalize_json and set_index methods of the DataFrame as such:

import pandas as pd
import json

# open file
with open('sample.json', 'r')as file:

# load as dataframe using json_normalize method
df = pd.json_normalize(json.load(file))

# set the index to employee id
df.set_index(['id'], inplace=True)

# print first 5 records
print(df.head())

# output
first_name last_name age location.City location.State
id
1 Mary Smith 32 New York New York
2 John Jones 18 Los Angeles California
3 Pat Smith 32 Denver Colorado
4 John Jones 42 Los Angeles California
5 Mary Jacobs 42 Denver Colorado

Notice that Pandas has allowed us to read the nested JSON data accurately, creating appropriate column names location.City and location.State where appropriate.

Note: It’s always a good idea to check for missing values in Pandas prior to deeper analysis. Check out this article for some tips on handling missing data.

Parsing Lists in JSON

An interesting case of parsing JSON with Pandas, especially when tabulated views are prioritized, is that of lists. Imagine the data above had an entry for hobbies as such:

{
"id": 2,
"first_name": "Pat",
"last_name": "Williams",
"location": {
"City": "New York",
"State": "New York"
},
"age": 55,
"hobbies": [
"singing",
"painting",
"programming",
"fishing"
]
}

Here we see a list format for data in the hobbies field. If we parse this data using the pd.normalize_json function as before, we get the following output:

first_name last_name  age                                         hobbies  \
id
1 John Jacobs 60 [singing]
2 Pat Williams 55 [singing, painting, programming, fishing]
3 Mary Jacobs 18 [singing]
4 Alice Jones 24 [singing, painting, fishing, dancing, running]
5 Pat Anderson 24 [programming, fishing, singing]

Here we see a string-formatted version of our list-formatted data for the hobbies field for each random person. At first glance, this might seem like a parsing error. This is not the case however and merely how a list of values would be presented in a column format.

Note: the last two columns location.City and location.State have been removed here for brevity.

These issues would likely need to be addressed on a case-by-case basis whereby one instructs Pandas on how to display such data. For example, would it be ideal to have a hobby_1, hobby_2, hobby_3, hobby_4, and hobby_5 column? What if someone were to list more than 5 hobbies? Would you limit the number of hobbies a person could list? Would you simply ignore any hobbies listed beyond 5?

A generalized approach for formatting list data would be to remove the brackets and simply convert to CSV-format data as such:

# map a function to join all list items via a single ',' character
df = df.applymap(lambda x: ",".join(x) if type(x) is list else x)

This would convert the hobbies column into the following format:

first_name last_name  age                                   hobbies  \
id
1 John Jacobs 60 singing
2 Pat Williams 55 singing,painting,programming,fishing
3 Mary Jacobs 18 singing
4 Alice Jones 24 singing,painting,fishing,dancing,running
5 Pat Anderson 24 programming,fishing,singing

Whether this would be an ideal format would be a decision made by the developer. Personally, I feel dropping the brackets would allow for more flexible use among other tabulated formats — such as the case where the DataFrame was exported as a CSV-formatted file. Additionally, the purpose of brackets may not be well-recognized among non-programmer types in cases where data is intended to be exported and used by more general consumers such as business analysts.

Final Thoughts

Tabulating JSON in Python is straightforward when one leverages the power of Pandas. The normalize_json the method helps ensure that parsing nested JSON results in appropriate column names whereby nested keys are converted to dot-addressable column names.

The handling of certain data like lists can introduce rigidity into the approach one takes in parsing nested JSON. As such, many cases of parsing nested JSON into tabulated formats will require a process of data inspection followed by preferred case-by-case approaches. Such is the life of a programmer!

The article Tabulate JSON Data in Python Using Pandas first appeared on the alpharithms website and has been republished here with permission.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Zack West

Zack West

Entrepreneur, programmer, designer, and lifelong learner. Can be found taking notes from Mother Nature when not hammering away at the keyboard.