Baby Pandas!
I apologize (but only a little) for the click-bait title. Other than the image above, there will be no images of pandas, the bears.
Instead, I will be showcasing how I used Pandas, the Python library, along with some PyPlot and Seaborn, to analyze my baby’s feedings.
Welcome to the first entry in the Learn By Doing series!
The purpose of this post is not to give a full-blown tutorial on Pandas and other topics, but rather to show how I used this library and its tools to perform analysis of a specific data set, namely my baby’s bottle feedings.
While I won’t paste all of my code in this post, I will highlight some interesting and/or useful concepts and will provide the link to my GitHub repository at the end.
Lastly, we have been tracking our baby’s feeds with the Baby Tracker application, available for Android and IOS.
Here we go!
What is Pandas?
Pandas (not the bear), is a data manipulation and analysis library for the Python programming language, providing its own unique data structure, the Data Frame.
More information can be found at the Pandas homepage, including installation instructions.
Loading Data
First things first, we need to load the data into a Data Frame. Fortunately, the application exports its data in a CSV format, one of two formats readily supported by Pandas via a built-in function.
Pandas.read_csv()
- Load the csv-formatted data into a Data Frame using the Pandas.read_csv() function.
- There is also a Pandas.read_json() function for the JSON data format.
Filtering Data
We also use the application to track other things such as breast pump volume, sleep times, and diaper status among other things. Since I am only interested in bottle feedings, I need to filter out everything else.
Filtering by column values
- Filter by values in the RecordCategory and RecordSubCategory columns. I only wanted to look at Feeding and Bottle categories, respectively.
- Filtering by column values is as simple as indexing the Data Frame with a Boolean expression, as shown in Figure 2.
Splitting Strings
Looking at the data, I spot an issue that needs immediate attention. The amount of milk consumed is embedded in the Details column along with some informational but non-numeric text. I need to get the numerical amount into its own column in order to do any analysis on it.
Splitting using Series.str()
Luckily, Series objects, which are the columns of the DataFrame, have String methods built in, including replace and split. I use Series.split() to isolate the feed amount.
Working with DateTime
Before doing any more data manipulation, let’s visualize the data as it is currently so we can inform and adjust our next steps.
We first downloaded the tracker application when our baby was born, but we did not start regularly recording their feedings until we brought them home from the hospital, hence the gap in recordings. This gap in usage is not particularly useful so I will filter it out.
Filtering by a DateTime column
Here are the steps I took to exclude this gap.
- Convert the new start date from String to DateTime.date data type.
- Filter the data set by comparing values in the FeedDate column against the converted start date using a Boolean expression
- Note: The filtering above can also be achieved by casting the FeedDate column as a string using the Series.astype() method and comparing against a String date. The code below results in the same filtered data set.
Great, now let’s take a look at some statistics.
Grouping and Aggregating Data
To get some meaningful information out of this data set, I needed to calculate some statistics such as means, minimums, and maximums.
This was accomplished using DataFrame.groupby() in conjunction with DataFrame.aggregate() as detailed below.
DataFrame.groupby()
This function splits data into groups based on a given criteria. In this case that grouping criteria is the FeedDate value.
DataFrame.aggregate()
This function applies operations to one or more columns. For this study, I used max, min, mean, and sum aggregations
The result of using the above two functions is a new DataFrame object with the aggregated data.
Merging DataFrames
Notice in the previous section that I created a separate DataFrame for each statistical measurement (df_mean, df_sum, etc.). To simplify further work, I merged these DataFrames together using the DataFrame.merge() method.
DataFrame.merge()
Merge DataFrame or named Series objects with a database-style join.
Below I merge using the common column of FeedDate.
Cumulative and Moving Window Calculations
While the mean, min, and max aggregations in the previous two charts have their merit, our baby’s progress is not quite as obvious as I would like.
I wanted to smooth out the variation by implementing a cumulative average for long-term progress and a moving window average for short-term trends.
The DataFrame.expanding() and DataFrame.rolling() functions, respectively, made both simple to implement.
DataFrame.rolling()
This function provides functionality for rolling window calculations, calculations executed on a subset of the data in which the subset of data, or window, changes with each calculation. I used this feature to compute a 7-day average for each date.
DataFrame.expanding()
This function is similar to a rolling window calculation, except that only one side of the window changes. I used it to compute a cumulative average of feed amounts for each date. One edge of the window remained static at the first date (July 28, 2018), while the other edge is expanded to one additional date for each calculation.
Misc. Plots
With Pandas, I was able to effectively summarize our baby’s feeding trends and overall progress. That’s all for this episode!
Link to analysis scripts.
Note: Analysis code can be found on GitHub.