What’s Wrong With Python Pandas?
Watch out for these problems, or raise issues and fix them!
Complex data and use-cases lead to complex problems.
In my previous blog, I nudged you to get started with pandas and showed why it is important to get a good hold of it before moving on to machine learning. But there are a few things you need to watch out for. When I was using Pandas to deal with real-time data and build a reporting platform, these issues were show-stoppers. Keep an eye out for them; any one of them can become a deal-breaker if you are looking at Pandas to build analytics pipelines or platforms.
Where It All Started
As I mentioned, I was trying to incorporate Pandas into a data pipeline to create reports out of it. Some of the common use-cases include:
- Selecting specific columns from Dataframe
- Grouping
- Constructing pivot tables
There is one common attribute to all these use-cases: once the Dataframe is constructed, the data needs to be sent to the front-end via an API. The closest possible thing you will look for is converting the dataframes to dictionaries, and that’s Problem 1.
1. Converting Dataframes into Dictionaries
We are using the same sales data that we used in the previous blogs for this part.
One of the best things about Dataframes is their out-of-the-box methods to convert data into required formats (CSV, JSON etc.). to_dict is one such method to transform them into a Python dictionary.
import pandas as pd

df = pd.read_csv(PATH_TO_DATA)
df.head()
Now, as per our use case, let’s use the to_dict method to create a dictionary. But the excitement is lost the moment you see the result.
df.to_dict()
The to_dict method comes with several different orientations; by default it is the dict orientation. For each column, it goes through each index and creates a dict with the corresponding value.
For our requirement, the records orient seems to be a good fit.
df.to_dict(orient="records")
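To make the difference concrete, here is a minimal sketch with a made-up two-row frame (the real sales data has more columns, but the shapes of the outputs are the same):

toy = pd.DataFrame({"Country": ["Africa", "Asia"], "Total": [100, 200]})

toy.to_dict()
# {'Country': {0: 'Africa', 1: 'Asia'}, 'Total': {0: 100, 1: 200}}

toy.to_dict(orient="records")
# [{'Country': 'Africa', 'Total': 100}, {'Country': 'Asia', 'Total': 200}]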
2. Grouped DataFrames into Dictionaries
We saw how easy it was to convert a Dataframe into an API response. Similarly, you can convert any dataframe, even grouped ones, into a response for an API endpoint.
Let’s do some simple grouping.
grouped_df = df.groupby("Country").sum()
Going by the hunch we got from using the default to_dict method, if we use orient="records" we end up with… OOPS!!! That doesn’t seem right!
'records' : list like [{column -> value}, … , {column -> value}]
As the documentation says, orient="records" takes only the columns into consideration, leaving out the index rows that hold the country names from the groupby, which makes the resulting dict meaningless.
At this point, we have to get clear on what our final dict should look like. You might think it is obvious, but for Pandas, it’s not.
Structure 1 - Country wise
{"Africa": {"Total": 5678, "Quanitity": 6789}}
{"Index":{"Column": "value"}}
Looking it up in the Pandas documentation, orient="index" suits the requirement.
Structure 2 - All the Totals and Quantities
{'Total': {'Africa': 460000, ...}, "Quantity":{"Africa":..}}
{"Column":{"Index": "value"}}
More Columns More Problems
As you add more columns to your grouping, the Pandas index stacks up into a MultiIndex and the dict keys become tuples instead of str, making the result practically unusable as an API response. None of the orients like dict, records, or index can save us this time.
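A sketch with a hypothetical second grouping column shows the tuple keys, and the usual escape hatch of reshaping the frame before converting, which sidesteps the orients entirely:

toy["Region"] = ["South", "East"]
multi_toy = toy.groupby(["Country", "Region"]).sum()

multi_toy.to_dict(orient="index")
# {('Africa', 'South'): {'Total': 100}, ('Asia', 'East'): {'Total': 200}}
# Tuple keys are not JSON-serializable, so the API response breaks.

# Workaround: turn the MultiIndex back into plain columns first
multi_toy.reset_index().to_dict(orient="records")
# [{'Country': 'Africa', 'Region': 'South', 'Total': 100}, ...]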
3. JSON into Dataframes
So far we have seen data being loaded from CSV files, which means for each key there is going to be exactly one value. Since this section needs more complicated nested data, I have used my Medium blog stats fetched via an API (nested dictionaries and lists). It is common to hit these roadblocks in real-world scenarios when we load data from external APIs.
Pandas becomes a huge pain when we deal with data that is deeply nested. Each blog’s data is under a key called node, and the statistical and author information are under the nested keys virtuals and author respectively.
Let’s see what happens on loading this JSON into a Dataframe. It becomes obvious that we can’t utilize this Dataframe in any way, since there are no real columns or indexes to play around with.
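For illustration, assume the API payload looks roughly like this (the field names are my reconstruction of the shape described above, not Medium’s exact schema):

raw = [
    {"node": {"title": "Blog A",
              "author": {"name": "user1"},
              "virtuals": {"totalClapCount": 120, "readingTime": 4.1}}},
]

pd.DataFrame(raw).head()
#                                                 node
# 0  {'title': 'Blog A', 'author': {'name': 'user1...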
Pythonic Solution
A pythonic solution to this problem is to loop through each item in the JSON and use the flattened items to create a Dataframe.
import json

import pandas as pd

# Read the raw JSON and unwrap each blog from its "node" key
with open(PATH_TO_DATA) as json_data:
    data = json.load(json_data)

data = [d["node"] for d in data]
df = pd.DataFrame(data)
df.head()
The result looks much better than what we had previously. But we are still losing important information to nested dictionaries: the keys author and virtuals are still nested.
For example, we cannot solve the following use cases:
- Group blogs by the user.
- Find the average number of claps.
- Find the average reading time for blogs written by each user.
We can’t achieve these unless we flatten down to the level of the author and virtuals keys.
Relying on the pythonic way becomes complicated, since it means you are looping over all keys at all levels.
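A minimal sketch of that looping, assuming every nested value is either a scalar or another dict (lists would need yet more handling):

def flatten(d, parent_key="", sep="."):
    # Walk every key at every level, building dotted paths as we go
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

# data already holds the unwrapped "node" dicts from the previous snippet
df = pd.DataFrame([flatten(node) for node in data])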
Using json_normalize
Pandas provides a method called json_normalize that normalizes (flattens) dictionaries at all levels.
import json

import pandas as pd

# Read data into a dictionary
with open(PATH_TO_DATA) as json_data:
    data = json.load(json_data)

# Normalize using json_normalize (pd.json_normalize in pandas >= 1.0;
# older versions import it from pandas.io.json)
df = pd.json_normalize(data)
df.head()
Though it might have sounded like a feasible solution, it comes with its own set of problems.
1. Column Names — The column names are the paths to keys in the nested dictionaries. To access an element of the Dataframe, you need to know the entire path. This becomes a huge pain when the nesting goes N levels deep and you have to do grouping on the same.
df["node.virtuals.previewImage.imageId"]
2. Missing Customizations
- What if you want to stop normalizing at a particular level of the nesting?
- What if you want to ignore certain keys while normalizing? (e.g., leave previewImage out of the normalization)
- What if you want to normalize only specific keys?
- What if you want to start normalizing at a particular level? (e.g., create a Dataframe from virtuals alone)
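(Newer Pandas releases do add a max_level argument to json_normalize, which covers the first point, but the rest still call for hand-rolled code.) Extending the earlier flatten sketch with two of these knobs, a depth limit and an ignore list, could look like this; all the names are illustrative:

def flatten(d, parent_key="", sep=".", max_depth=None, ignore=(), depth=0):
    items = {}
    for key, value in d.items():
        if key in ignore:            # e.g. skip previewImage entirely
            continue
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        go_deeper = max_depth is None or depth < max_depth
        if isinstance(value, dict) and go_deeper:
            items.update(flatten(value, new_key, sep, max_depth, ignore, depth + 1))
        else:
            items[new_key] = value
    return items

# Start at a particular level: build a frame from virtuals alone
df = pd.DataFrame([flatten(d["node"]["virtuals"], ignore=("previewImage",)) for d in data])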
Pandas is an excellent package with various levels of optimization, enabling people to build amazing data pipelines. But there are a few corner cases where it doesn’t fit the purpose, like the ones I faced when trying to incorporate it into a real-time reporting system.
This doesn’t mean you have to hate Pandas for life. These are some of the areas where you can contribute to Pandas. There are 600 open bugs and 230 good first issues, a great supportive community, and it is super easy to get started.
Did you like what you read? Hold the “clap” button and give me a shoutout on Twitter.