The Big Three Sticks (of Data Science Tasks)

andrew wong
Published in Human Science AI
Oct 28, 2019 · 7 min read

INTRODUCTION

Quite often, the data scientists I have met say that they jealously guard their waking hours to work on things that make them feel good. Naturally, my next question is: “what do you mean by feeling good?” Here is The Feel Good List of Data Scientists (not an exhaustive list) that they have shared with me over the past few years:

  • Developing new predictive algorithms (Read: I want to be a creator!).
  • Translating vast amounts of data into data storytelling (Read: I want to tell the world about my work!).
  • Productionising a Jupyter Notebook into something useful (Read: I want someone to use the fruits of my hard work).
  • Concatenating different datasets / databases into one (Read: I want to uncover something new).
  • Improving predictive modeling (Read: I want to get closer to the truth).
  • Learning new tools, techniques, etc. (Read: I am a forever learner).

The Feel Good List of Data Scientists above is mostly spot on (I can attest to it myself!). What surprises me, though, is how little mention there is of some of the more important data scientist tasks, such as Data Scrubbing, Exploratory Data Analysis, and Feature Engineering.

Yes, they are time consuming.

Yes, these tasks can be a drag on Monday (or Tuesday) morning.

Yes, they can be repetitive.

The intention of this article is to bring the importance of these tasks to the front. I call these three tasks The Big Three Sticks. Not only do The Big Three Sticks consume the majority of data scientists’ time and effort (70% on Data Scrubbing alone, on average), but these tasks also have a wide blast radius (Read: they are consequential for your models and for the accurate interpretation of insights).

I have listed out most of the tasks contained in The Big Three Sticks (please leave feedback in the comment section if you can add to the list). If you are new to the data science world, you may want to take it all in (that’s the majority of your waking hours right in front of you!).

DEEP DIVE INTO THE BIG THREE STICKS

The big three sticks of data scientist tasks

Data Scrubbing

There are many interchangeable names for Data Scrubbing (names like data pre-processing or data cleansing often come into the picture). A short, data science definition of Data Scrubbing: a set of focused activities on noisy data, handling missing data, removing duplication, addressing multicollinearity, and transforming the data into a machine-readable form. The definition of done for Data Scrubbing: a dataset ready for modeling.

Pro-Tips for Data Scrubbing:

  1. Focus on the wide blast radius. Expect to invest at least 70% of your time here. The contribution-to-good-data-insight (CGDI) is very high here. What do I mean by CGDI? The higher the CGDI, the larger the blast radius of the work.
  2. Good data grounding. Run some initial data exploration first to get a feel for the data; read through at least 30 to 40 rows to really get a good grounding.
  3. Think ahead. Think ahead about what you want to do with the data, and ask this question: What are the top ten features that I will likely use? From there, drill down into these top ten features to make sure they are in a machine-readable state.
  4. Draw your data. If possible, draw up a diagram (yes, literally draw diagrams representing the features) of the features that you want to scrub. This will get you grounded: seeing the bigger picture, drawing the links, etc. (A code sketch tying these tips together follows below.)
A sample of my drawing prior to running a SQL query
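To make the definition of done concrete, here is a minimal scrubbing pass in pandas. It is only a sketch: the file name and column names (sales.csv, revenue, region) are hypothetical placeholders, not from any dataset discussed in this article.

```python
import pandas as pd

# Hypothetical dataset; swap in your own file and column names.
df = pd.read_csv("sales.csv")

# Good data grounding: eyeball 30 to 40 rows before touching anything.
print(df.sample(40, random_state=0))

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Handle missing data: drop rows missing the key field, fill the rest.
df = df.dropna(subset=["revenue"])
df = df.fillna({"region": "unknown"})

# Flag multicollinearity among numeric features; a pair with
# |correlation| > 0.95 is a candidate for dropping one of the two.
corr = df.select_dtypes("number").corr().abs()
print(corr[corr > 0.95].stack())

# Transform into a machine-readable state: encode categoricals.
df = pd.get_dummies(df, columns=["region"])
```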

Exploratory Data Analysis

Exploratory Data Analysis (EDA for short) typically refers to the process of exploring data to find patterns and data stories (Read: understanding what kind of data is out there, and whether we can pull insights without running predictive modeling). A short, data science definition of EDA: a set of focused activities for understanding the data through visualisation and further investigation. The definition of done for EDA: data patterns ready for feature engineering.
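As a sketch of what “pulling insights without running predictive modeling” can look like, a simple aggregation often surfaces a pattern on its own. The file and column names below are the same hypothetical placeholders used in the scrubbing sketch above.

```python
import pandas as pd

# Hypothetical dataset, as in the scrubbing sketch.
df = pd.read_csv("sales.csv")

# An insight without any modeling: does average revenue differ by region?
summary = df.groupby("region")["revenue"].agg(["mean", "count"])
print(summary.sort_values("mean", ascending=False))
```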

Pro-Tips for EDA:

  1. Big picture (literally). The first thing data scientists think of when they hear EDA is visualisation (and that’s not wrong). In fact, EDA is mainly about data visualisation for the purpose of: (a) getting grounded in the data, (b) finding patterns, (c) synthesising the dataset with others through visualisation, and (d) initial selling to the business sponsor to continue further work / research. I will walk through most of these in the following points (a minimal visualisation sketch follows this list).
  2. EDA as a storybook. Do EDA well in a Jupyter Notebook (or some other platform) and it will become a wonderful storybook that you can walk through with others. While you are synthesising the dataset, you and others will be thinking about the potential hypotheses or predictive modeling that you can pursue in the next few days / weeks.
  3. EDA as a map maker. As you work on the EDA, you will consciously or unconsciously start stitching the datasets together and forming an early mind map of the data. From here, I encourage you to look out for trails of interesting data and guideposts of key insights. By trails of interesting data, I mean start drawing out data trails so you can trace back and forward, just like a boy scout figuring out which trails are interesting before venturing into the forest. By guideposts of key insights, I mean start figuring out the key business problems (Read: guideposts) so you can be sure you have not lost sight of what is essential: the goal.
  4. EDA as a team builder. This may sound like a strange statement, and at first glance it is. However, if you broaden your mind a bit, you can think of EDA as a means to get everyone excited about potential insights (i.e. by reviewing the data visuals) and, more importantly, to get everyone on the same page.
  5. EDA as a marketing tool. I encourage you to use the initial EDA results as a marketing tool. Business sponsors and decision-makers love sharp-and-short visualisations. We, Data Scientists, need to get comfortable with the fact that there will be a lot of marketing and selling of our data work. This is a necessity for getting funding and executive sponsorship.
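Here is the minimal visualisation sketch promised above, in the spirit of pro-tips 1 and 2: one distribution plot to get grounded, and one correlation heatmap to hunt for patterns. Again, the file and column names are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset, as in the earlier sketches.
df = pd.read_csv("sales.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# (a) Get grounded: the distribution of a key numeric feature.
df["revenue"].plot.hist(bins=30, ax=axes[0], title="Revenue distribution")

# (b) Find patterns: how the numeric features move together.
numeric = df.select_dtypes("number")
im = axes[1].imshow(numeric.corr(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1].set_xticks(range(len(numeric.columns)))
axes[1].set_xticklabels(numeric.columns, rotation=90)
axes[1].set_yticks(range(len(numeric.columns)))
axes[1].set_yticklabels(numeric.columns)
axes[1].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1])

plt.tight_layout()
plt.show()
```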

Feature Engineering

Feature Engineering is the most interesting task of the Big Three Sticks. It is essentially like building something with Lego blocks: the Lego blocks are the features, and the building part is the predictive modelling. A short, data science definition of Feature Engineering: the application of domain knowledge to select and filter relevant features (Read: columns) to make it easier for machine learning algorithms to work. The definition of done for Feature Engineering: data ready for better modeling results.

Feature Engineering Workflow (Within the Data Science Workflow)

Pro-Tips for Feature Engineering (FE):

  1. Data Engineering:

a. Focus on bringing in other peripheral datasets. While working on exploratory data analysis, you may find other datasets that would be useful to incorporate into the main dataset. This is the creative mixing part of FE. If we do this part of FE well, we can further advance the knowledge of our peers. The opportunity here: there is value in colliding other datasets (the external synergistic value).

b. Focus on predictor variables. This is a warning: FE should only focus on the predictor variables, not the target variable. The target variable should be left untouched.

c. Focus on wrangling different data. As with bringing in other peripheral datasets, this FE activity focuses on what is within the existing dataset. The opportunity here: there is value in colliding two columns or features (the internal synergistic value). A sketch covering both kinds of collision appears after this list.

2. Domain Expert: Involve the domain knowledge experts as early as possible to help you find features with predictive power; essentially, the advice here is: do not boil the ocean.

3. Debugging data: Think of feature engineering as a debugging tool. By debugging tool, I mean using feature engineering not just at the beginning of modeling; it can also be used post-modelling for the purpose of data and model error remediation.
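To tie pro-tips 1a and 1c together, here is a minimal feature engineering sketch: a peripheral dataset merged in (external synergistic value), two existing columns collided into a new feature (internal synergistic value), and the target variable left untouched. All file and column names (weather.csv, num_orders, churned) are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical datasets; swap in your own files and column names.
df = pd.read_csv("sales.csv")         # main dataset; target column: "churned"
weather = pd.read_csv("weather.csv")  # peripheral dataset

# External synergistic value: collide the main dataset with a peripheral one.
df = df.merge(weather, on=["region", "month"], how="left")

# Internal synergistic value: collide two existing columns into a new feature.
df["revenue_per_order"] = df["revenue"] / df["num_orders"]

# Warning: engineer the predictor variables only; the target stays untouched.
X = df.drop(columns=["churned"])
y = df["churned"]
```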

TAKEAWAYS

The main takeaways of this Pocket Guidebook are:

  1. The Big Three Sticks of Data Science tasks are a reminder to us: this is the 80/20 Rule of Data Science. There are many activities you can undertake as a data scientist, but if you get the Big Three Sticks right, you are headed in the right direction.
  2. If you want to minimise re-work or waste in your work, focus on improving data scrubbing.
  3. If you want to get your data storytelling right, focus on improving exploratory data analysis.
  4. If you want to get your modelling right, focus on improving feature engineering.

END NOTES

This blog is part of the Data Scientist Pocket Guidebook Series (please check out a similar guidebook series with more focus on the product side of data science: the Product Data Scientist Pocket Guidebook Series).

Hopefully, this will be a handy reference that helps you navigate the basics of trending and challenging data science and machine learning topics. It is ideal for aspiring data scientists and machine learning engineers who want pro-tips and case studies.
