Pandas Pivot Tables & Assigning Unique IDs

Kat Hernandez
4 min readApr 3, 2022
Photo by Pam Menegakis on Unsplash

Yesterday I had a hankering for chocolate chip cookies. My usual recipe results in a sweet, crunchy cookie. Yesterday though, I downed my butter portion and added frothed milk until the consistency of the two ingredients was smooth. I substituted white for brown sugar. I used baking mix instead of my usual baking powder + salt + flour combo. The treats emerged from the oven as cakey, less-sugary cookies. They contrasted perfectly with the rich dark chocolate chunks I threw in. Yes, they’re delicious.

We think of chocolate chip cookies as an American staple; they’re everywhere. At the local gas station, you can find at least two varieties. Yet, the chocolate chip cookie didn’t exist until it was invented in New England in the 20th century. A staple so ingrained in our culture, it’s bound to have a few different ways to create. Like my switching up a usual recipe, I’d like to focus attention on another 20th-century invention: the Pivot Table. While they’re often created with Excel, my datasets for work often have over 100,000 rows of data. Pandas Pivot Tables handle datasets of this type with more ease.

In fitting with baking, the simulated dataset is a baking competition between 8 bakers. Each baker bakes one cookie that can be either chocolate chip or sugar and will be classified as soft or crunchy and given a rating on a scale of 1–10. It can be observed that I have recorded the cookie’s rating in the dataset, but created a “BakerId” using NumPy. In a real-world scenario, Hanna and Delta with ratings of “4” may be upset seeing the results amongst the other bakers:

Hanna and Delta’s “4” Rating may not be received well when seen here

Unique Id Assignment

The assignment for BakerId is where we create a method to mask the baker’s identity. Typically, np.random.randint is the function referenced to come up with a random whole-number assignment, but np.random.choice allows for the “replace=False”, or without replacement, to be called. This means the BakerId’s we created won’t be duplicated for more than one baker. We can also see we allowed the random sample to be between 0 and 10 and the “size=8” is the sample of random integers we want to create.

Pivot Table Creation

To examine the baking competition results further, look at the pivot table created using the inserted code below:

Baker ID’s mask comparison and index aggregates ratings

By assigning the BakerId, the Baker who received the “4” rating will not know which one of the bakers also made the chocolate chip crunch cookie that received a much higher rating of “8.” We simply used “BakerId” in the columns= selection. What’s more, is while we can see that the favorite cookie overall was sugar crunch, the soft chocolate chip cookie was on average higher rated. This was created by specifying np.mean in the aggregate function and the margins=True to include column subtotals on the far right column labeled “All”.

The pivot table now has a multi-level index, which is why we can see the “Type” aggregated first and the “Style” aggregated second. In Excel, you would have to drag the Row order around to change the data in a similar fashion. When you have thousands of rows of data, Excel takes a considerable amount of work on your computer’s part to accomplish this. Similar to this small scale-here, your table populates instantaneously:

With a few switches to the code, we can easily see that crunch-style cookies are typically favorably rated, while soft cookies vary much more on whether they are highly rated on not-depending on the cookie type.

By providing unique IDs on datasets of all sizes, NumPy provides a means to identify sensitive information without compromise. This isn’t just valuable to hide a baking fail. Infinite business cases call for information to be masked in a way to protect the data.

I’m guilty of getting large Excel files and the moment I receive them I’m clicking that “Insert> Pivot Table” in Excel. I’m just hoping for that quick analysis that won’t take more than a few minutes of my day. Doing so, I’ve had many computer freezes and many restarts with unsaved work lost. So let me not do the same actions over and expect different results. I’ll take my time to switch out how I create, whether it be blending frothed milk in place of butter for a recipe, or the code in a Jupyter notebook I write for a pivot table. Like my new chocolate chip cookie recipe, I hope that whatever I create, it can be just a tiny bit better than anything I’m accustomed to creating before.

More content at PlainEnglish.io. Sign up for our free weekly newsletter. Follow us on Twitter and LinkedIn. Join our community Discord.

--

--

Kat Hernandez

Evangelist of Clear Analytics. Emancipator of Tech. Fancies Dangerous Trail Races. https://www.linkedin.com/in/katrecruiter/