Analyze JSON Data Like a Pro Without Writing a Single Line of Code

A step towards simplifying JSON data analysis

Avi Chawla
Geek Culture


Photo by Ferenc Almasi on Unsplash

Motivation

JSONs (JavaScript Object Notation) are commonly used as a lightweight data-interchange format. What makes them super useful is that they are easy for humans to read and write.

Being a text-based format makes JSON language-independent and compatible with many programming languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.

Moreover, simple JSONs are usually easy for machines to parse and generate. This makes JSON a good choice for exchanging data between systems that might have different architectures, which is why it is commonly used for sending and receiving data in API requests.

Now, while JSONs are widely used, there are many situations where parsing them, especially in Python, can get challenging.

For instance, JSON borrows its syntax from JavaScript, and its flexible, deeply nested structures do not map neatly onto the flat, tabular form that most Python analysis workflows expect. Special characters and nested data structures, in particular, can be awkward to handle in Python.

A JSON with a nested data structure is shown below:

JSON with Nested Data Structure (Image by Author)
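For illustration, here is a minimal, made-up snippet of what such a nested record might look like when loaded in Python (the field names are hypothetical, chosen only to mimic the structure above):

import json

# A simplified, hypothetical event record with nested objects and a nested list
raw = '''
{
  "id": "123",
  "type": "PushEvent",
  "actor": {"login": "octocat", "url": "https://api.github.com/users/octocat"},
  "repo": {"name": "octocat/hello-world"},
  "payload": {"commits": [{"sha": "abc", "message": "initial commit"}]}
}
'''

event = json.loads(raw)
print(event["repo"]["name"])        # nested access: 'octocat/hello-world'
print(event["payload"]["commits"])  # a list nested inside an object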

The Problem

When analyzing real-world JSON datasets, you are likely to run into complex, nasty-looking, nested JSONs.

Writing plain Python code to flatten them can be time-consuming, especially when your objective is to just run some quick analysis.
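To give a rough sense of the boilerplate involved, here is a minimal recursive flattening sketch (this is not the approach used later in this post, just an illustration of the kind of code you end up writing):

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts/lists into a single-level dict."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.update(flatten(value, new_key, sep=sep))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{i}" if parent_key else str(i)
            items.update(flatten(value, new_key, sep=sep))
    else:
        items[parent_key] = obj
    return items

# e.g. flatten({"repo": {"name": "octocat/hello-world"}})
# -> {"repo.name": "octocat/hello-world"}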

The Solution

To this end, let's discuss a potential solution: Gigasheet, which, in my opinion, is a revolutionary alternative for JSON data analysis.

I gave a detailed overview of transitioning from Pandas to "No-code Pandas" using Gigasheet in the blog below.

The focus of this blog is, however, slightly different.

In this blog, I will briefly demonstrate how analyzing JSON with Python can get complicated and how Gigasheet makes it easy to analyze JSON data.

Additionally, the two biggest advantages that come with a tool like Gigasheet are:

  1. You don’t have to write any code.
  2. You don’t have to worry about run-time or memory optimization.

That’s amazing. Let’s begin 🚀!

Dataset

For demonstration purposes, I will use the publicly available GitHub Archive JSON data.

A glimpse of this dataset is shown below:

An overview of the JSON data (Gif by Author).

As you can see, the dataset is complex, involving nested data structures. Moreover, each top-level entry of this dataset spans roughly 40–100 lines.

The objective is to find the number of Push, Fork, and Watch Events on each repository.

Python

First, let’s solve this problem using Python.

Let’s load the data using the JSON module.

import json

file = "github-events.json"

# The file is newline-delimited JSON: one event object per line
data = []
with open(file, "r") as f:
    for line in f:
        data.append(json.loads(line))

Every entry in this JSON represents an event. The value of the type key indicates the kind of event, which in our case should be one of [“PushEvent”, “ForkEvent”, “WatchEvent”].

Next, let’s filter the data.

filter_list = ["PushEvent", "ForkEvent", "WatchEvent"]

data_filtered = [event for event in data if event["type"] in filter_list]

The next step is to update the count corresponding to each event for every repository.

The repository is determined by the event[“repo”][“name”] value.

We can create the final event-count map by iterating over the filtered data and incrementing the count of the corresponding event type for each repo.

This is demonstrated below:

count_dict = {}

for event in data_filtered:
    repo = event["repo"]["name"]

    # Initialize the per-repo counters the first time a repo is seen
    if repo not in count_dict:
        count_dict[repo] = {"PushEvent": 0,
                            "ForkEvent": 0,
                            "WatchEvent": 0}

    event_type = event["type"]
    count_dict[repo][event_type] += 1

The count_dict dictionary holds the count of Push, Fork, and Watch Events for each repo.
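For instance, to quickly sanity-check the result, we could print the most active repositories (a small illustrative snippet):

# Show the five most active repos by total event count
top = sorted(count_dict.items(),
             key=lambda kv: sum(kv[1].values()),
             reverse=True)[:5]
for repo, counts in top:
    print(repo, counts)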

That was quite a bit of work, wasn't it?

We had to repeatedly inspect the data to check whether we were selecting the right fields, and we had to create a dictionary with the corresponding keys to track the count of each event type.

Next, let’s look at how we can do this using Gigasheet.

Gigasheet

Here, all we have to do is upload the JSON file.

Gigasheet will automatically flatten it to a tabular format for better visualization. Isn’t that cool?

There's also a video by Gigasheet in which they process over 150 GB of JSON data stored as one large list. Gigasheet automatically identified this list-like JSON structure and generated a single CSV file for easier processing.

This JSON-to-CSV conversion takes away all the heavy lifting of iterating through the dataset, aligning the fields to flatten, and so on.
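To get a sense of what this automation replaces, here is a rough sketch of the manual equivalent in Python using pandas' json_normalize (Gigasheet does not document its internals, so this is purely illustrative, and the resulting column names depend on the data):

import json
import pandas as pd

# Read the newline-delimited events and flatten nested keys into dotted columns
with open("github-events.json", "r") as f:
    events = [json.loads(line) for line in f]

df = pd.json_normalize(events)            # e.g. columns like repo.name, actor.login
df.to_csv("github-events-flat.csv", index=False)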

Once the JSON is uploaded, Gigasheet also displays the number of rows and columns.

Rows and Columns in a Flattened JSON (Image by Author).

After opening the dataset, we first filter it on the type of event and display only the columns we need.

This is shown below:

Next, we switch to the pivot mode and create a cross-tabulation of repo and type column, as shown below:

Pivoting the filtered dataset (Gif by Author)
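In pandas terms, this pivot step corresponds roughly to a cross-tabulation of the flattened columns. Here is a sketch that continues from the json_normalize example above (the repo.name column name is an assumption based on that flattening):

# df is the flattened DataFrame from the json_normalize sketch above
filter_list = ["PushEvent", "ForkEvent", "WatchEvent"]
subset = df[df["type"].isin(filter_list)]

# Rows: repository name, columns: event type, values: counts
pivot = pd.crosstab(subset["repo.name"], subset["type"])
print(pivot.head())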

And we are done!

Conclusion

In this blog, I demonstrated how you can leverage Gigasheet to analyze a JSON file easily, without having to flatten it yourself.

In Python, we had to manually load the data, filter it, select the relevant fields, and aggregate the results to obtain the event statistics. The same operation in Gigasheet, however, was simple and straightforward.

With Gigasheet, we saved time, wrote no code, and also avoided flattening the JSON. It automatically created a tabular representation of the nested JSON file for better readability.


Besides being a go-to JSON analysis tool, Gigasheet can do much more.

For instance, Gigasheet can easily handle large datasets that you may struggle with in Pandas. I covered this in one of my earlier blogs, which you can read below:

Moreover, if you work with Excel, Gigasheet can be a potential solution for you too. As you may know, the biggest issue with Excel is its maximum row limit (just over a million rows), which makes it unsuitable for large-scale data analytics projects.

However, Gigasheet makes working with large datasets an absolute breeze.

I wrote a detailed article recently on the transition from Pandas to Gigasheet, which you can read here:

As always, thanks for reading!


👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.