With the end of week three, I am already a quarter of the way through the General Assembly Data Science Immersive bootcamp. This week, the cohort again covered a mix of statistics (t-tests, chi-squared tests of independence, Cohen’s d, and more), along with more pandas and SQL.
Over this past week, I encountered a tricky problem. A dataframe had a column named order_id, which contained repeated values (see left). The objective was to create a sub_id column that numbered the line(s) within each order_id. I solved this by looping over the rows with an if-else statement that checked whether the order_id of the previous row matched the order_id of the current row. This worked, but was time-consuming to write.
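For anyone curious, a loop along those lines might look roughly like the sketch below. The sample data and the item column are made up for illustration, since the real dataframe isn't reproduced here, and the sketch assumes that rows belonging to the same order_id sit next to each other.

```python
import pandas as pd

# Hypothetical sample data; the real dataframe is not reproduced in this post.
df = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 103, 103],
    "item": ["a", "b", "c", "d", "e", "f"],
})

sub_ids = []
previous_order = None
for order in df["order_id"]:
    if order == previous_order:
        # Same order as the previous row: continue the count.
        sub_ids.append(sub_ids[-1] + 1)
    else:
        # A new order starts: restart the count at one.
        sub_ids.append(1)
    previous_order = order

df["sub_id"] = sub_ids
print(df)
```

If the rows were not already ordered by order_id, the dataframe would need sorting first for this comparison with the previous row to make sense.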
The problem set solution, by contrast, used a custom function, and it was beautiful. The custom function is applied to a dataframe grouped by order_id, so the dataframe is split up into one group per order_id. Working one order_id group at a time, the function creates an array of sequential whole numbers from zero up to one less than the number of rows in that group, adds one to each element so that the numbering starts at one, and finally fills the sub_id column with the array. This process is repeated for each order_id.
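The steps above can be sketched as follows. This is my own reconstruction rather than the actual problem set solution: the function name number_lines and the sample data are hypothetical, and I have used transform on the grouped order_id column as one way of applying a function per group.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real dataframe is not reproduced in this post.
df = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 103, 103],
    "item": ["a", "b", "c", "d", "e", "f"],
})

def number_lines(group):
    """Return 1, 2, ..., n for a group containing n rows."""
    # np.arange(n) gives 0 .. n-1; adding one shifts it to 1 .. n.
    return np.arange(len(group)) + 1

# Split by order_id, apply the function to each group, and combine the
# per-group results back into a single column aligned with the original rows.
df["sub_id"] = df.groupby("order_id")["order_id"].transform(number_lines)
print(df)
```

As far as I can tell, pandas also offers a built-in that gives the same numbering, df.groupby("order_id").cumcount() + 1, which is worth knowing about when no custom logic is needed.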
This approach is known as split-apply-combine. While utilising groupby objects effectively has been tricky, it’s something I intend to work on. At the very least, I have the foundation to apply what I learned from this solution to future problems. And should working out a custom function prove too time-consuming, I can always fall back on my original solution.
The important thing, as always, is to be able to solve the problem one way or another.