The Three P’s of Data Science

A brief overview of the Purpose, Process, People (and Python libraries) of Data Science.

Nicole Scott
4 min read · Jul 24, 2019

Purpose

At its simplest, Data Science is about exploration and prediction. As Data Scientists, we gather interesting pieces of data from the world around us and analyze them to discover unique or interesting patterns. Based on our hunches about those patterns, we then apply algorithms to gain additional insights and create value. Some practical applications of data science are in the fields of:

  • Healthcare — medical image analysis
  • Finance — fraud detection
  • Weather — storm detection
  • Business — market segmentation

Process

The discipline of Data Science — which is also an art — is evolving at the intersection of technology, human curiosity and specific domain knowledge (business, healthcare, politics, etc.). Regardless of application, there is a general Data Science process:

  1. Define the problem
  2. Collect the data
  3. Clean, explore and analyze the data
  4. Transform and prepare the data for modeling
  5. Build models and evaluate outcomes for accuracy and errors
  6. Make conclusions and recommendations

Each step in this workflow includes specific tasks which fall under the Data Science Hierarchy of Needs, as outlined in Monica Rogati’s article “The AI Hierarchy of Needs.”

Image per Monica Rogati

While there is much media focus on Artificial Intelligence, AI represents just the tip of this pyramid. Today, the vast majority of Data Science tasks represent the building blocks upon which the possibilities of AI can be explored. Data must be collected, cleaned, stored, protected and analyzed. Important decisions must be made about how best to transform, aggregate and label data in order to build meaningful and insightful models. Those models must be iteratively tuned and trained to learn the best ways to answer our most pressing questions.

People

So who will do all of this? Who will perform the requisite data wrangling, munging, modeling and analysis? Data Scientists, of course! Whether working alone in small startups as individual “generalists” wearing many hats, or as part of large teams with a more specialized focus, Data Scientists will do this work.

Image per Moazzem Hossain: “The 10 Algorithms Data Scientist must have to Know”

A well-rounded data scientist will have a variety of interdisciplinary skills, perspectives, and life experiences, along with a solid grasp of the fundamentals of statistics. Statistical methods enable us to turn large data sets into more accessible “simplifications of reality.” Key concepts include probabilities and odds, discrete (countable) vs. continuous (measurable) distributions, and statistical inference (like hypothesis testing and confidence intervals). Anyone interested in diving deeper into these concepts should study “An Introduction to Statistical Learning” (ISLR).
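
A few of these concepts fit in a handful of lines. The sketch below uses only Python's standard library to convert a probability into odds, build an approximate confidence interval for a mean, and run a simple two-sided z-test. The sample values and function names are hypothetical, and the normal approximation is an assumption: with a sample this small, a t-distribution would usually be the more careful choice.

```python
import math
import statistics


def probability_to_odds(p):
    """Odds in favor of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error


def two_sided_p_value(sample, null_mean):
    """Two-sided z-test p-value for H0: the population mean equals null_mean."""
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = (mean - null_mean) / std_error
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))


# Made-up measurements, for illustration only.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]

print("Odds for p = 0.75:", probability_to_odds(0.75))
low, high = mean_confidence_interval(sample)
print(f"95% interval for the mean: ({low:.2f}, {high:.2f})")
print("p-value for H0 (mean = 11.0):", two_sided_p_value(sample, 11.0))
```

A small p-value suggests the data would be surprising if the hypothesized mean were true; it does not measure how likely the hypothesis itself is.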

Anyone interested in more of a wade in the statistical waters should check out the 2018 PBS Documentary “NOVA: Prediction by the Numbers”. It provides an entertaining and accessible look into concepts like probability, Bayesian statistics, machine learning and ensemble models — all in less than an hour!

Inside the modern Data Scientist’s toolkit is yet another “P”: Python

Python is a powerful, versatile and stable open source programming language with an active developer community that continuously extends its functionality and maintains extensive documentation. It provides a plethora of libraries for both web development and Data Science. Some of the most popular Python libraries (and uses) are outlined below:

  1. Web Scraping & API Requests: BeautifulSoup / Requests
  2. Data Loading & Cleaning: Pandas / NumPy
  3. Exploratory Data Analysis (EDA): Pandas / Matplotlib / Seaborn
  4. Interactive Data Visualizations: Bokeh / Plotly
  5. Machine Learning: Scikit-Learn
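
As a small taste of the loading, cleaning and EDA stages, here is a hedged sketch using Pandas and NumPy. The city data is invented, and filling gaps with the column median is just one possible cleaning choice, not a recommendation for every data set.

```python
import io

import numpy as np
import pandas as pd

# Made-up data with a few missing values, standing in for a real CSV file.
CSV_TEXT = """city,temperature_c,humidity
Austin,31.5,40
Boston,,55
Chicago,24.0,60
Denver,27.5,
Austin,33.0,42
"""

# Loading: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Cleaning: fill missing numeric values with each column's median.
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# EDA: summary statistics, then the average temperature per city.
print(df.describe())
mean_temp_by_city = df.groupby("city")["temperature_c"].mean()
print(mean_temp_by_city)
```

From here, Matplotlib or Seaborn could plot the same DataFrame to look for patterns visually.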

The Data Science workflow is iterative, and there are many decisions to be made at each step along the way, including which tool to use for each task at hand. This is especially true with Machine Learning, where the first decision is usually about the type of problem to be solved. Is it a regression problem (e.g. linear regression) or a classification problem (e.g. logistic regression)? The diagram below is an excellent example of the robust documentation available to assist Data Scientists as we navigate the complexities of training, testing and tuning models.

Image per Scikit-Learn Documentation
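
To illustrate the regression vs. classification decision, the sketch below trains one model of each kind with Scikit-Learn. The data is synthetic (generated by Scikit-Learn's own helper functions), and the sample sizes, noise level and random seeds are arbitrary choices made for this example.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Regression: the target is a continuous number.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)
r2 = regressor.score(X_test, y_test)  # R^2 on held-out data

# Classification: the target is a category (here, class 0 or class 1).
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.25, random_state=0
)
classifier = LogisticRegression().fit(Xc_train, yc_train)
accuracy = classifier.score(Xc_test, yc_test)  # fraction of correct labels

print(f"Regression R^2: {r2:.2f}")
print(f"Classification accuracy: {accuracy:.2f}")
```

Both models share the same fit and score pattern, which makes it easy to swap estimators while tuning.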

Conclusion

I hope you’ve learned something new from this brief overview of the three (plus one!) P’s of Data Science. Whether you are already a Data Scientist, or a nascent one, I hope you are inspired to find some interesting data and a problem to solve. Get your hands on some free open source tools and go where your curiosity leads you! As you iterate through the process, try to think about where you are in the hierarchy of needs, keep asking the right questions, and please share what you learn.

Thanks for reading!

Nicole Scott

Data Scientist | Photographer | Capturing what I see | Communicating what I discover