Using GTD Productivity Method to Understand Data Science Lifecycles like CRISP-DM
Ever fallen down a rabbit hole of productivity — seeking to be the most effective person possible? If you’re like me, you’ve read all the books:
- Cal Newport’s Deep Work
- Daniel Kahneman’s Thinking Fast and Slow
- Daniel Levitin’s The Organized Mind
- Stephen Covey’s The 7 Habits of Highly Effective People
- David Allen’s Getting Things Done, and many more…
Perhaps you’ve even downloaded all the apps…
- OmniFocus, etc.
Well, I’m not here to bring you out of the hole. Rather, I’ll offer you a new way to apply the theoretical framework of that last book, David Allen’s Getting Things Done, to understanding the lifecycle of data science.
Having just started studying Data Science, like any new student (or former student learning to learn again…), I’ve found the quickest way to grasp a new concept is to see a similar theory applied in a familiar context. (Sidebar — shoutout to all my Georgetown University SFS Hoyas who graduated with the solid skills of…understanding paradigms and paradoxes :-P). So at the beginning of my course I read the second lesson about the “Data Science Process”, which entailed the following steps: Business Understanding & Domain Knowledge, Data Mining, Data Cleaning, Data Exploration, Feature Engineering, Predictive Modeling, and Data Visualization. That list sounded awfully similar to the steps required in the GTD process. Before getting too carried away, though, I’ll start with the basics of the GTD workflow.
Getting Things Done Productivity Workflow
Let’s start with what I (think I) know, the process from David Allen’s Getting Things Done:
- 1. Capture
- 2. Clarify
- 3. Organize
- 4. Reflect
- 5. Engage
Capture
Essentially, this step requires establishing a systematic method for capturing any possible input into what Allen terms your “in-tray”. In the broader world of productivity, an in-tray receives dozens, if not hundreds or thousands, of inputs daily. Things like birthday reminders, utility bills, office projects, pet issues, side-hustle development, weddings, laundry…all of these are inputs requiring some sort of output. To capture everything and parse through the in-tray effectively, he recommends first establishing an efficient environment for managing the load.
For example, my personal system (don’t judge me here — it still requires a lot of optimization!) includes: a small pocket notebook and Google Keep for notes on the go, a medium-sized Moleskine notebook for longer or more complex thoughts and notes, and a corner spot on the desk in my room for physical items.
Clarify
After establishing and implementing a systematic, consistent way of capturing inputs, we need to figure out how we’re going to handle all this stuff. Specifically, we first need to determine whether an input is actionable or not. Call it an “if statement”, but let’s not get ahead of ourselves. Perhaps later I’ll write some Python code for the GTD method. If an item is non-actionable: eliminate, incubate, or file it. If an item is actionable (or, ‘Else’): separate out multistep projects, do anything that will take less than two minutes, and delegate or defer the rest. The key part of this process is determining the “next action” required to complete each task or project. This equips us to implement the organization phase.
For example, in my notebook/Google Keep I have a grocery list and blog ideas, in my Moleskine a list of stuff to buy and assignments to follow up on, and on my desk I have a Pepco bill and a USPS letter. So when I clarify, I’ll open the USPS letter (since it takes less than two minutes) to determine the next action: the contents turn out to be all coupons, so I can eliminate it. The blog ideas I’ll incubate, since they require marination and aren’t currently actionable. The grocery list I’ll file digitally so that I can grab it when I’m at the store. Assignments I’ll separate, since they require breaking down; I’ll probably end up delegating them to my calendar.
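Since the clarify step really is just that “if statement”, here’s a rough Python sketch of it. The dictionary keys and function name are my own invention; the categories and the two-minute rule come straight from the method:

```python
def clarify(item):
    """Decide the next action for a single in-tray item (GTD clarify step)."""
    if not item["actionable"]:
        # Non-actionable: eliminate, incubate, or file for reference
        return item.get("fate", "eliminate")
    # Actionable ("else"): multistep work becomes a project,
    # quick wins get done immediately, the rest is delegated or deferred
    if item.get("multistep"):
        return "separate into project"
    if item.get("minutes", 0) < 2:
        return "do it now"
    return "delegate or defer"

print(clarify({"actionable": False}))                    # eliminate
print(clarify({"actionable": True, "minutes": 1}))       # do it now
print(clarify({"actionable": True, "multistep": True}))  # separate into project
```

The USPS letter above would hit the “do it now” branch; the blog ideas would come back non-actionable with a fate of “incubate”.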
Organize
Organize means putting everything where it belongs. The method specifically recommends having “containers” for each item, or using “lists” to sort through, process, and organize everything. I know where your mind is going, but let’s stay grounded here for just a few more lines; we’re not finished organizing. The other important part of organization is establishing contexts for each item. Assigning contexts allows us to stay flexible amidst the distractions of daily life and remain productive.
For example, my grocery store list is stored in my Google Keep container to which I’ve assigned the grocery store location context — activated by the app when I’m at the grocery store.
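In Python terms, containers with contexts could be sketched as nothing fancier than a dictionary of lists; all the container names, contexts, and items below are just illustrative:

```python
# Each "container" is a list of items tagged with the context
# in which those items become actionable (GTD organize step).
containers = {
    "groceries": {"context": "@grocery_store", "items": ["milk", "eggs"]},
    "errands":   {"context": "@out",           "items": ["mail package"]},
    "deep_work": {"context": "@desk",          "items": ["draft blog post"]},
}

def actionable_now(current_context):
    """Return every item whose container matches where I am right now."""
    return [item
            for c in containers.values() if c["context"] == current_context
            for item in c["items"]]

print(actionable_now("@grocery_store"))  # ['milk', 'eggs']
```

This mirrors what Google Keep’s location reminders do automatically: filter every container down to the ones relevant in the current context.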
Reflect
This step should be pretty straightforward. Establish a periodic basis on which to accomplish three things: 1) perform what I’ve decided to call a “capture dump”, in which your in-trays are emptied and organized again (an iteration of sorts, but “get your head out of the cloud”…), 2) review your system, analyzing whether it’s working effectively and completely to achieve your priorities, and make modifications if necessary, and finally 3) complete, clean, and clear your mental space to begin the process again.
For example, every Sunday I’ll gather my pocket notebook, my Moleskine, and Google Keep on my phone, sit at my desk with the physical items in the corner, and look backwards: 1) capture any items on lists not crossed out, 2) review my system and notice that it didn’t capture everything in the first place (e.g., the random post-its and whiteboard notes…), and 3) clear the old lists and start fresh.
Engage
Sounds like a lot of work to do before actually doing any work! Engage is simply the step for those actions. The point of all the preceding steps, though, was to build a system we can trust, so that we no longer spend time hunting for things where they were supposed to be, and can instead re-harness that energy to be effective and “Get Things Done”. The method breaks this step down into lists within lists within lists…while that makes me happy as hell, it’s probably not the most effective way to summarize or communicate. Suffice it to say that Engage has three models for action: 1) actions in the moment, 2) daily work, and 3) reviewing your own work. While I won’t go into detail on each model, I’ll extract a few key points from those lists (sick of the puns yet?).
- Moment: using the context, resources (time, energy), priority criteria
- Daily: Predefined work, new work, and defining work
- Reviewing own work: principles, vision, goals, focus areas, current projects, current actions
Applying the selected models (again, seriously?!) while engaging with each item will help you decide how to be most productive in each context.
GTD and Data Science
If you haven’t already seen the natural applicability of the method, and my puns haven’t helped, then I’ll make the connections explicit here. The GTD methodology has a lot of similarities to the CRISP-DM methodology, but that’s a topic for another day. For now let’s just look generally at how Data Science (referencing Python here) fits within each step of this productivity method (and how it doesn’t). I promise this is my last list!
Capture
Perhaps the easiest step in which to see the similarities — the capture step in Data Science requires gathering every available source of data in order to make use of it in the analysis process. We do have to narrow our capture methods, since our tools are limited to processing certain types of data (for now). With the help of data engineers, we establish a systematic method for collecting the data to be analyzed. Then we set ourselves up for success with an efficient environment for processing the data, perhaps using tools like Git, Jupyter Notebook, Python, two monitors, and a large cup of espresso (obviously). The in-tray here would be the directories, databases, or systems in which we decide to keep all these inputs.
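As a toy illustration, capturing a few raw inputs into one in-tray with Pandas might look like the sketch below; the source names and columns are entirely made up:

```python
import pandas as pd

# Hypothetical raw sources in the "in-tray". In a real project these
# would come from CSV files, database tables, or API pulls.
sources = {
    "web": pd.DataFrame({"user_id": [1, 2], "clicks": [10, 3]}),
    "crm": pd.DataFrame({"user_id": [1, 3], "plan": ["pro", "free"]}),
}

# Tag each capture with where it came from, then stack everything
# into a single frame for later clarifying and organizing.
in_tray = pd.concat(
    (df.assign(source=name) for name, df in sources.items()),
    ignore_index=True,
)
print(in_tray.shape)  # (4, 4): 4 rows; user_id, clicks, source, plan
```

The point is only that capture is systematic: every input lands in one known place, tagged with its origin, before any analysis begins.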
Clarify
In Data Science, this step is often split along the same non-actionable and actionable categories. Non-actionable items we simply eliminate from the dataset; alternatively, we could file them for reference on another project, or “incubate” them in this project by not filtering them out completely. Actionable items we separate so that we can accurately determine the next actions required for proper analysis. In other words, the clarify step can be compared to cleaning the data: making sure that we have datasets ready for proper organization, or analysis.
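A minimal Pandas sketch of eliminating versus incubating rows, with made-up data:

```python
import pandas as pd

# Toy dataset with some non-actionable rows (all values invented)
df = pd.DataFrame({
    "age":   [34, None, 29, 120],
    "email": ["a@x.com", "b@x.com", None, "d@x.com"],
})

# "Eliminate": drop rows missing the fields we need for analysis
cleaned = df.dropna(subset=["age", "email"])

# "Incubate": set implausible rows aside rather than deleting them,
# in case a later project (or a correction) makes them useful
incubated = cleaned[cleaned["age"] > 100]
cleaned = cleaned[cleaned["age"] <= 100]

print(len(cleaned), len(incubated))  # 1 1
```

Eliminate, file, incubate: the same triage we applied to the in-tray, applied to rows instead of envelopes.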
Organize
Organizing in the GTD productivity method should feel familiar, almost literally. “Lists” and “containers” are nearly the exact terminology used in a language like Python — lists, dictionaries, defining variables, and all that fun stuff. Establishing the right context for these lists, like importing Pandas, NumPy, Matplotlib, etc., allows for the most effective understanding and analysis of the datasets. We want to make sure everything is labeled correctly, sits exactly where we want it, and is in the shape we want before we perform analysis on it, lest our results be skewed.
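For instance, labeling and reshaping before analysis might look like this small Pandas sketch (the column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.0]})

# "Label correctly": give columns meaningful names before any analysis
df = df.rename(columns={"a": "visits", "b": "hours"})

# "Put it in the shape we want": a tidy long format,
# one observation per row, ready for grouping or plotting
tidy = df.melt(var_name="metric", value_name="value")
print(list(tidy["metric"]))  # ['visits', 'visits', 'hours', 'hours']
```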
Reflect
By now, you can probably connect the dots toward Machine Learning and A.I. But before getting ahead of ourselves, the base-level similarity is important as well. In Python (in a Jupyter Notebook, for example), we review our code on a periodic basis — at almost every line we write or function we code. We’re analyzing whether or not our capture system worked effectively, and ensuring that we’ve clarified and organized the datasets logically enough to perform basic analysis. Python will return errors if we don’t! In the bigger picture, we use Machine Learning to review our system by creating models and checking their accuracy. One small difference here is that we don’t complete, clean, and clear our space every other line — perhaps only between projects. Another major difference in Data Science is that while we make the necessary adjustments as we go, we must also review the overall process and workflow of the data we’ve cleaned to ensure its alignment with our original objectives, which leads to the last step of the GTD method:
Engage
This last step has various applications across the Data Science process.
- In the Moment: applying this model starts before the process by understanding what context to apply to these particular datasets, knowing the computational and personnel resources available for analysis, and determining what our priority outputs are for the inputs given.
- Daily: this model can apply to each action we perform while in the nitty-gritty of data analysis — knowing exactly what we mean by predefined work (e.g., cleaning, data prep, etc.), defining work (knowing exactly what types of analysis are going to be performed), and establishing a process for new work (e.g., after high-level visualizations of the data, knowing what kinds of further statistical analyses we will perform).
- Reviewing Own Work: applies throughout every step of the process, but towards the end allows us to re-evaluate if our priorities, analyses, and methods are properly aligned with the business vision and goals and general ethics. We can make sure that our current actions and current projects are aligned here as well.
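Circling back to reviewing our own work with models: the simplest possible version of that review is a held-out accuracy check. A toy pure-Python sketch (the numbers and the 0.7 threshold are made up):

```python
# Compare a model's predictions against held-out answers, the way the
# review step compares the system we built against reality.
def accuracy(predictions, actuals):
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

preds   = [1, 0, 1, 1]
actuals = [1, 0, 0, 1]
score = accuracy(preds, actuals)
print(score)  # 0.75

# If the review shows the system isn't working, adjust and iterate
assert score >= 0.7, "model needs rework before moving on"
```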
The Relationship between the GTD Methodology and the Data Science Lifecycle
In using the GTD methodology to understand the Data Science lifecycle, I might re-arrange things into this order:
- 1. Engage
- 2. Clarify
- 3. Capture
- 4. Organize
- 5. Engage (part 2)
- 6. Reflect
This re-arrangement frees up the resources required for the subsequent steps by engaging with the business needs and stakeholders before collecting any data — kind of like Marie Kondo-ing. Knowing which data to toss, which to donate (or save for later: incubate), and which to organize lightens the workload for that last category. When we clarify what we’re doing with the data, we can capture more effectively by creating an environment suited to the task at hand. Then we can organize the data by putting it where we want it for our tailored needs. Understanding the business needs ahead of time allows us to engage with the datasets again in a more effective way, knowing precisely which types of statistical analyses are appropriate in this context. Finally, the reflect process creates space for modeling and using machine learning to do more with the data, and to perform the analysis we initially set out to do.
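To make the re-ordering concrete, here’s a toy Python sketch of the re-arranged lifecycle. Every function body and return value below is a made-up placeholder, included only to show the order of operations:

```python
# Placeholder functions for each step of the re-arranged lifecycle
def engage_stakeholders():  # talk to the business first
    return {"goal": "reduce churn"}

def clarify(goal):          # decide what data the goal actually needs
    return {"needed_data": ["usage", "billing"]}

def capture(plan):          # gather only the data the plan calls for
    return {name: f"raw {name} data" for name in plan["needed_data"]}

def organize(raw):          # label/shape it for the tailored need
    return {k: v.upper() for k, v in raw.items()}

def engage_with_data(data): # run the analyses appropriate to the context
    return f"analysis of {sorted(data)}"

def reflect(result):        # model, review, and do more with the data
    return f"model built on: {result}"

goal = engage_stakeholders()
plan = clarify(goal)
raw = capture(plan)
tidy = organize(raw)
print(reflect(engage_with_data(tidy)))
```

Nothing here does real work; the value is in the call order, with stakeholder engagement up front narrowing everything captured afterwards.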
In my next post we’ll take a closer look at the CRISP-DM methodology, which is tailored more specifically to the Data Science field. In the meantime, understanding how the GTD productivity method applies to the field provides a solid framework for understanding the Data Science process.