The Quick Wins (and the Long Game) of Datasets

Andrew Wong
Published in Human Science AI · 6 min read · Oct 30, 2019

For data scientists, our world is bounded by the data we have. Each time we gain new features or predictor variables, our well-trained models need to be retrained, and (hopefully) they become more valuable.

Each time we gain new features or predictor variables, the edges of our data world also expand. These are exciting yet challenging times. As we gain more and more data, we enlarge our sense of possibility, and the so-called known universe of a particular subject expands.

After many years of working through various datasets (read: false starts), I have learned smarter ways of working through them (read: fewer false starts). The purpose of this article is to share those ways with data scientists who are facing similar challenges. More specifically, it is a guide to help aspiring data scientists find the right kind of data, through rapid iteration, so we can win the long game.

Before jumping into the real stuff, I am going to be a bit whimsical.

Imagine this scenario: remember watching clouds as a child? You see clouds of different shapes and forms. You start to imagine a Ship, a Sheep, and a Snowball. Depending on the wind and pressure that day, the shapes and forms of the clouds change, and what you imagined earlier is likely to change too. Each time we arrive at a new picture in the clouds, we feel and think differently.

My question to you: can we slow down and ‘watch the clouds forming’? As data scientists, we can be an impatient bunch. Usually we rush through the data science workflow. First, we extract datasets. Then we pre-process the data for missing values, multicollinearity, etc. Next, we explore the data for insights through visualisation and rapid interpretation of what we can conclude. If we are working on predictive modeling, we get into logistic regression, support vector machines, etc. Finally, we interpret our hard work with a few world-changing insights and future scope (and perhaps a long list of limitations).

My intent here is to s-l-o-w you d-o-w-n so you can recapture the deepening experience of watching clouds as a child. Let your imagination linger and flow a bit longer. I will be your guide for the next 2–3 minutes. The Rapid Dataset Iteration Framework illustrates the tour that I will be guiding you through (read in the sequence 1–2–3–4).

How to read the matrix

Inward on dataset: This is about self. In the current instance, you are getting to know the dataset. In the future instance, you are asking yourself whether this dataset will sustain you over an extended period as you go through the data science workflow.

Outward of dataset: This is about your contribution to the world and about caring for your peers in the data science world. In the current instance, you are searching for the what-if and so-what of the dataset. In the future instance, you are shifting your focus to whether this has been done before and what is out there now.

Now: The focus is the present, and on whether the available datasets are accessible, usable, intelligible, and assessable.

Future: The focus is the future, and on what the available datasets can contribute to the next steps of the data science workflow.

Touring the four rapid dataset iteration steps

Step 1 Know the dataset (Inward)

The following are the 3 questions to ask (potentially more, of course!):

  1. What are the available datasets?
  2. What are the features (read: columns) you can use?
  3. What is the current state of the dataset: is it complete and explainable?

This is where I encourage you to slow down and take it all in. Instead of quickly reading through the first 5 or 10 lines of your dataframe extraction, I encourage you to extract 30 to 40 lines or more to get closer to the data. Here you are like a data sociologist running through the data, uncovering layer upon layer of data rows. You are thinking about potential links within the data and noting your first instincts about it. I am still a sociologist at heart (and yes, I am a trained sociologist in decision science), and I understand the importance of being grounded. The current state of mind: Stay grounded with the dataset. Have a beginner’s mind.
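To make the three questions above concrete, here is a minimal sketch using only Python's standard library. The toy CSV, its column names, and the helper names `load_rows` and `completeness` are all my own assumptions for illustration; they do not come from any real dataset. The sketch lists the features, keeps a longer preview of up to 40 rows, and reports how complete each column is.

```python
import csv
import io

# Hypothetical toy extract (made-up values); blank cells stand for missing data.
RAW_CSV = """listing_id,neighbourhood,price,reviews
1,Downtown,120,14
2,,95,
3,Harbour,,3
4,Downtown,150,27
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def completeness(rows):
    """Return the share of non-empty values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(1 for row in rows if row[col].strip()) / len(rows)
        for col in columns
    }


rows = load_rows(RAW_CSV)
features = list(rows[0].keys())   # question 2: which columns can you use?
preview = rows[:40]               # read up to 40 rows, not just the first 5
report = completeness(rows)       # question 3: how complete is each column?

print("features:", features)
for row in preview:
    print(row)
for col, share in report.items():
    print(f"{col}: {share:.0%} complete")
```

If you work in pandas, `DataFrame.head(40)` and `DataFrame.isna().mean()` cover the same ground. Either way, the numbers only tell you where to look; the point is still to read the rows yourself.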

Finding connections among the data. Stay grounded with the dataset.

Step 2 Search for what-if and so-what (Outward)

The following are the 3 questions to ask:

  1. What are the interesting problems that you can tackle?
  2. Why do these problems interest you?
  3. So what about the interests of others?

This is where I encourage you to think about others. For example, if you are going through the Airbnb dataset, I encourage you to read a few articles about Airbnb to get a sense of the domain space that you will be exploring further. Here you are like a data sociologist running through the data while also reading related articles and blogs to understand where you are. The current state of mind: Be a reader of what’s out there. Sense the outer world.

Searching for meaning in the comments. Deepening your data connection by reading other articles.

Step 3 Shift to what’s out there now

The following is the key question to ask:

  1. What can you find that is useful, interesting, or insightful for your questions or problem statement?

This is where I encourage you to broaden your worldview to the next set of questions you may want to ask: finding alternatives, understanding consequences, weighing trade-offs, and gauging the impact of data uncertainty. The current state of mind: Be an explorer of the possible.

Step 4 Sustain your work

The following are two questions to ask:

  1. What are the key indicators that these datasets can sustain your work over an extended period?
  2. Where possible, can these datasets sustain you past the feature engineering stage?

After 2–3 rapid dataset iterations of Steps 1–3, you will start to discover a greater sense of where you can go in the coming days or weeks as you go through the data science workflow. This is where I encourage you to move on when you are ready. The current state of mind: Be ready to explore further, creatively and laterally.
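The article leaves the "key indicators" open, so the sketch below is only one possible reading and not a definitive checklist. It assumes three signals: enough rows, columns that are mostly complete, and columns that actually vary. The function name `sustain_indicators`, the thresholds, and the toy rows are hypothetical and should be adjusted to your own project.

```python
def sustain_indicators(rows, min_rows=1000, min_complete=0.8):
    """Summarise a few assumed signals that a dataset can support longer work.

    The thresholds are illustrative defaults, not recommendations.
    """
    columns = list(rows[0].keys())
    usable = []
    for col in columns:
        values = [row[col] for row in rows if row[col] not in ("", None)]
        complete = len(values) / len(rows)
        varies = len(set(values)) > 1  # a constant column carries no signal
        if complete >= min_complete and varies:
            usable.append(col)
    return {
        "enough_rows": len(rows) >= min_rows,
        "usable_features": usable,
    }


# Hypothetical toy rows (made-up values), small enough to check by eye.
sample = [
    {"price": "120", "city": "Lisbon", "reviews": "14"},
    {"price": "95", "city": "Lisbon", "reviews": ""},
    {"price": "150", "city": "Lisbon", "reviews": "27"},
    {"price": "80", "city": "Lisbon", "reviews": ""},
]

result = sustain_indicators(sample, min_rows=3, min_complete=0.75)
print(result)
```

In this toy case only `price` survives: `city` never varies and `reviews` is half empty. A short list of usable features like this gives an early hint about whether the dataset can carry you past feature engineering.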

TAKEAWAYS

Going back to the title of this article: the quick wins come from deepening your understanding of the datasets you have in hand, which will help you win the long game. The long game means that, by knowing the ground truth of the dataset, you have the confidence to be more creative and think laterally.

END NOTES

This article is part of the Data Scientist Pocket Guidebook Series. (Please check out a similar guidebook series with more focus on the machine side of data science: the Product Data Scientist Pocket Guidebook Series.)

Hopefully, this will be a handy reference that helps you navigate the basics of trending and challenging data science and machine learning topics. It is ideal for aspiring data scientists and machine learning engineers who want pro tips and case studies.
