Foundations Of Data Science Part 1 — Data Science: An Introduction — Learning Notes

From the freeCodeCamp.org YouTube video (Part 1)

Reksi Arismunandar
Nov 9, 2021

Data Science: An Introduction

I. The Data Science Venn Diagram

Data science is a combination of:
Coding + Statistics + Domain/Business knowledge

The Data Science Venn Diagram

Coding

- Stats: R & Python
- Database: SQL
- Command-line: Bash
- Search: Regex
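
As a small illustration of the "Search: Regex" item, here is a minimal sketch that pulls structured fields out of free text. The log lines and field names are made up for the example; they are not from the course.

```python
import re

# Hypothetical log lines, made up for illustration.
lines = [
    "2021-11-09 order=1042 status=shipped",
    "2021-11-09 order=1043 status=pending",
    "malformed line without an order",
]

# Capture the order number and the status from each well-formed line.
pattern = re.compile(r"order=(\d+) status=(\w+)")

orders = []
for line in lines:
    match = pattern.search(line)
    if match:
        orders.append((int(match.group(1)), match.group(2)))

print(orders)
```

Lines that do not match the pattern are simply skipped, which is usually what you want when searching messy text.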

Math

- Probability, algebra, regression, etc.
- Choose procedures
- Diagnose problems

Domain

- Expertise in field
- Goals, methods, & constraints
- Can implement well

Machine Learning (ML)

- Coding & math without domain
- Black box models: they throw data in, but they don’t know what it means or what language it is in
- This is different from data science

Traditional Research

- Math & domain without coding
- Data is structured: the data in this zone is ready for analysis
- Effort is in method & interpretation

Danger Zone

- Coding & domain without math?
- Unlikely to happen?
- Word counts, maps.

II. The Data Science Pathway

First: Planning
Second: Data prep.
Third: Modeling
Fourth: Follow up

Planning

1. Define goals
2. Organize resources
3. Coordinate people
4. Schedule project

Data Prep.

1. Get data
2. Clean data
3. Explore data
4. Refine data

Modeling

1. Create model: create statistical model (regression analysis, neural network, etc.)
2. Validate model
3. Evaluate model
4. Refine model
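
The create, validate, and evaluate steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the model is a one-predictor least-squares line, and the helper names `fit_line` and `rmse` are invented for the example.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
xs = [i / 10 for i in range(100)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in xs]

# Hold out 20% of the rows so the model is checked on unseen data.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

def fit_line(points):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(points, slope, intercept):
    """Root mean squared error of the line on the given points."""
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return (sum(errors) / len(errors)) ** 0.5

slope, intercept = fit_line(train)          # create the model
test_rmse = rmse(test, slope, intercept)    # validate and evaluate on held-out data
print(f"slope={slope:.2f} intercept={intercept:.2f} test RMSE={test_rmse:.2f}")
```

If the held-out error is much worse than the training error, that is a hint to go back to the refine step.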

Follow Up

1. Present model: share in a meaningful way with other people
2. Deploy model: put it to work to accomplish something. For instance, if you are working with an e-commerce site, you may be developing a recommendation engine.
3. Revisit model
4. Archive assets

- DS isn’t just technical: planning, presenting, and implementing are important
- Contextual skills matter: knowing how it works in a particular field and how it will be implemented
- One step at a time

III. Roles In Data Science

Collaborative thing — All together now

Engineer

- Focus on back end hardware, software
- Makes DS possible
- Developer, DBA
- Provides the foundation for the rest of the work

Big Data

- Focus on computer science & math
- Machine learning
- Data products: e.g., a thing that tells you which restaurant to go to

Researcher

- Focus on domain-specific research
- Physics, genetics
- Very strong statistics
- They focus on specific questions

Analyst

- Day-to-day tasks
- Web analytics, SQL
- Good for business
- Not exactly DS? Because most of the data they are working with is going to be pretty structured
- They play a critical role in business in general

Business

- Frames business-relevant questions that can be answered with the data
- Manages projects
- Must “speak data”

Entrepreneur

- Data startups
- Needs data & business skills
- Creative throughout

Full-stack “unicorn”

- Who can do everything at an expert level
- They may not actually exist

- Data science is diverse
- Different goals & skills
- Different contexts

IV. Teams In Data Science

Coding
Statistics
Design
Business

Who can do it all? The unicorn
A mythical data scientist with universal abilities

Unicorn By Team (2 People)

- Can’t do DS on your own
- People need people
- Make collective unicorns

Collective Unicorn From 2 People

V. Contrast Big Data

Data science & Big data = similar, not same

Venn Diagram Data Science
Venn Diagram Big Data
Venn Diagram Big Data Science

VI. Contrast Coding

- Data science ≠ coding
- Share tools & practices
- But statistics is critical

VII. Contrast Statistics

The world’s 7 most powerful data scientists (on Forbes.com)
Their backgrounds: 5 in CS, 3 in Math, 2 in Engineering, 1 each in Biology, Economics, Law, Speech pathology, & Statistics

- DS & stats both use data
- Different backgrounds
- Different goals & contexts

VIII. Contrast Business Intelligence

- No coding in BI
- Simple statistics
- Focus on domain expertise & utility

DS & BI

- BI is very goal-oriented
- DS prepares data & form
- DS can learn from BI

IX. Do No Harm

Privacy

- Confidentiality
- Shouldn’t share
- Sources not intended for sharing?

Anonymity

- Not hard to identify people in data
- HIPAA (Health Insurance Portability and Accountability Act): before HIPAA, it was easy to identify people from medical records.
- Proprietary data may have identifiers: if you are working for a client, the data may contain identifiers, so you may know who the people are and they are no longer anonymous. Anonymity may or may not be there, but major efforts are made to make data anonymous. The primary thing is that, even if you do know who they are, you still maintain their privacy and confidentiality.
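
One common way to reduce the exposure of identifiers is to replace them with keyed hashes before analysis. The sketch below is an assumption on my part, not something shown in the course: the key, the field names, and the `pseudonymize` helper are all hypothetical. Note that this is pseudonymization, not full anonymization, because other columns can still identify a person.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the code.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Made-up records for illustration.
records = [
    {"patient": "jane.doe@example.com", "visits": 3},
    {"patient": "john.roe@example.com", "visits": 1},
]

safe_records = [
    {"patient_id": pseudonymize(r["patient"]), "visits": r["visits"]}
    for r in records
]
print(safe_records)
```

The same input always maps to the same ID, so rows for one person can still be joined without carrying the raw identifier around.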

Copyright

- Just because something is on the web doesn’t mean that you are allowed to use it, even though scraping data is common and useful
- Check copyright

Data Security

- Be careful about hackers
- If someone is no longer part of your team but still has your organization’s data, make sure they no longer hold it or have access to it.

- DS has potential & risks
- Analyses can’t be neutral
- Good judgment is vital

X. Methods Overview

- DS includes tech
- But DS > tech
- Tech is a means to insight

XI. Sourcing Overview

Use Existing Data

- In-house → company records
- Open → public data
- Third-party → buy data from vendor

Use Data APIs

- Allows apps to communicate directly
- Get web data
- Import the data directly into a program/application
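
To make the idea of importing API data directly into a program concrete, here is a minimal sketch. Many web APIs return JSON, so the example parses a JSON body; the body and its field names are hypothetical stand-ins, and no real API is called.

```python
import json

# Stand-in for the body an API might return; the fields are hypothetical.
response_body = '''
{
  "results": [
    {"city": "Jakarta", "temp_c": 31.5},
    {"city": "Bandung", "temp_c": 24.0}
  ]
}
'''

payload = json.loads(response_body)

# Flatten the nested structure into rows that are ready for analysis.
rows = [(item["city"], item["temp_c"]) for item in payload["results"]]
print(rows)
```

In a real project the text would come from an HTTP request (for example with `urllib.request.urlopen`) instead of a hard-coded string.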

Scrape Web Data

- For web data without APIs
- HTML, PDFs, etc.
- Use apps & code for scraping data
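
When there is no API, the data has to be picked out of the page markup. The following sketch uses only Python's standard `html.parser` on a tiny made-up page; the `CellCollector` class is a name invented for this example, and a real scraper would also need to download the page and respect the site's terms.

```python
from html.parser import HTMLParser

# A tiny, made-up page; a real scraper would download the HTML first.
page = """
<table>
  <tr><td>Alice</td><td>42</td></tr>
  <tr><td>Bob</td><td>17</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self.in_cell and self.rows:
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(page)
print(parser.rows)
```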

Make Data

- Get exactly what you need
- Interviews
- Surveys
- Experiments
- Etc.

Remember this one little aphorism
GIGO:
Garbage in, garbage out
It means that if you feed bad data into your system, you are not going to get anything worthwhile, any real insights, out of it.

Pay attention to metrics & meaning.

Sum:
- Get the raw materials
- Many possible methods
- Check quality and the meaning of the data
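
One way to act on GIGO is to run simple quality checks before any analysis. The sketch below is illustrative only: the rows, the plausible ranges, and the `find_problems` helper are assumptions made for the example.

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 212, "income": 51000},   # implausible age
    {"age": 45, "income": -10},      # implausible income
]

def find_problems(row):
    """Return a list of human-readable issues found in one row."""
    problems = []
    if row["age"] is None:
        problems.append("missing age")
    elif not 0 <= row["age"] <= 120:
        problems.append("age out of range")
    if row["income"] is None:
        problems.append("missing income")
    elif row["income"] < 0:
        problems.append("negative income")
    return problems

# Map each row index to its list of issues, then keep only the clean rows.
report = {i: find_problems(row) for i, row in enumerate(rows)}
clean_rows = [row for i, row in enumerate(rows) if not report[i]]
print(report)
print(len(clean_rows), "of", len(rows), "rows pass the checks")
```

Whether a flagged row should be dropped, corrected, or investigated depends on what the values mean, which is why metrics and meaning go together.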

XII. Coding Overview

Apps

Specialized apps for working with data
- Spreadsheets: Excel, Google Sheets
- Tableau: For data visualization
- SPSS: Statistical package used in the social sciences and in business
- JASP: Free, open-source analog of SPSS
- Etc.

Data

Special formats for web data
- HTML
- XML
- JSON
- Etc.

Code

Languages that give you full control
- R
- Python
- SQL
- C, C++, & Java
- Bash
- Regex

Remember: tools are just tools. They are only part of the process. They are a means to an end, and the end, the goal, is INSIGHT. You need to know where you are trying to go and then simply choose the tools that help you reach that particular goal.

Sum:
- Use tools wisely
- A few are usually enough
- Focus on your goal

XIII. Math Overview

Math is the foundation of what we’re going to do.

Why do we need math?
1. Know which procedures to use & why
2. Know what to do when things don’t work right
3. Some math is easier & quicker by hand than by computer

Math: Data science::
- Chemistry:Cooking → You can be a wonderful cook without knowing any chemistry, but if you know some chemistry it’s going to help
- Kinesiology:Dancing
- Grammar:Writing

What kinds of math do you need for data science?
- Algebra: Elementary algebra, linear (matrix) algebra, systems of linear equations
- Calculus
- Big O: The order of a function, i.e., roughly how fast its cost grows as the input grows
- Probability
- Bayes’ theorem
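
As an example of math that is quick to work through by hand or in a few lines of code, here is Bayes' theorem applied to a diagnostic test. The numbers are illustrative assumptions, not figures from the course, and `posterior` is a helper name made up for this sketch.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Illustrative numbers: 1% prevalence, 90% sensitivity, 5% false positives.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 3))
```

With these numbers the probability of actually having the condition after a positive test is only about 15%, because false positives from the large healthy group outnumber the true positives.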

Sum:
A little bit of math:
- Can help you make informed choices when planning your analyses
- Find and fix problems
- Can even do by hand sometimes

XIV. Statistics Overview

What we are trying to do here:

Explore

- Exploratory graphics
- Exploratory statistics, a numerical exploration of the data
- Descriptive statistics
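
A numerical exploration can be as simple as a handful of descriptive statistics. This sketch uses Python's standard `statistics` module on a made-up sample.

```python
import statistics

# Made-up sample of daily order counts.
orders = [12, 15, 11, 14, 90, 13, 12]

summary = {
    "n": len(orders),
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "stdev": statistics.stdev(orders),
    "min": min(orders),
    "max": max(orders),
}
print(summary)
```

The mean is far above the median here, which is a quick hint that one value (90) is unusual and worth a closer look.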

Inference

- From samples to populations
- Hypothesis testing
- Estimation/confidence intervals
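
To illustrate going from a sample to a statement about the population, here is a rough 95% confidence interval for a mean. The sample is made up, and the interval uses a normal approximation, which is an assumption; for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics

# Made-up sample of page-load times in seconds.
sample = [2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.3, 2.6, 2.2]

n = len(sample)
mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)

# Normal approximation: the 97.5th percentile of the standard normal.
z = statistics.NormalDist().inv_cdf(0.975)
low, high = mean - z * std_err, mean + z * std_err
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```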

Details

- Feature selection
- Problems
- Validation
- Estimators
- Fit

Beware the trolls
There are people out there who will tell you that if you don’t do things exactly the way they say, your analysis is meaningless, your data is junk, and you’ve wasted all your time. You know what? They’re trolls.

You can make enough of an informed decision on your own to go ahead and do an analysis that is still useful.

A wonderful quote from a very famous statistician:
All statistical models are wrong, but some are useful
— George Box

Sum:
- Statistics allows you to explore and describe your data and infer things about the population
- Many choices available
- Goal is useful insight

XV. Machine Learning Overview

Data Space

- Dimension reduction: Try to find the most essential parts of the data
- Clustering
- k-Means Method
- Anomalies/unusual cases that show up in the data space
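
To give a feel for clustering, here is a bare-bones k-means on one-dimensional data. It is a teaching sketch under simplifying assumptions: the data is made up, the starting centers are chosen by hand, and `k_means_1d` is a name invented for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data, starting from given centers."""
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up data with two obvious groups.
values = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

A value that sits far from every center after clustering is one simple way to spot an anomaly in the data space.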

Categories

- Logistic regression
- k-nearest neighbors (kNN)
- Naive Bayes: For classification
- Decision trees
- Support Vector Machines (SVM)
- Artificial neural nets.
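
Of the classification methods listed, k-nearest neighbors is probably the easiest to write from scratch. The sketch below is a minimal version with made-up points and labels; `knn_predict` is a hypothetical helper, and squared Euclidean distance is an assumed choice.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up points: (features, label).
train = [
    ((1.0, 1.0), "small"), ((1.2, 0.9), "small"), ((0.8, 1.1), "small"),
    ((5.0, 5.2), "large"), ((5.1, 4.9), "large"), ((4.8, 5.0), "large"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.9, 5.1)))
```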

Predictions

- Linear regression
- Poisson regression: Used for modeling count or frequency data
- Ensemble models: You create several models, take the predictions from each, and combine them to get an overall more reliable prediction.
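
The simplest form of an ensemble is averaging the predictions of several models. The three "models" below are hypothetical hand-written functions standing in for fitted models, just to show the combining step.

```python
import statistics

# Three deliberately simple, hypothetical models for the same target.
def model_a(x):
    return 2.0 * x + 1.0

def model_b(x):
    return 1.8 * x + 1.5

def model_c(x):
    return 2.2 * x + 0.5

def ensemble_predict(x, models):
    """Average the predictions of several models."""
    return statistics.mean(m(x) for m in models)

prediction = ensemble_predict(10, [model_a, model_b, model_c])
print(prediction)
```

For x = 10 the individual predictions are about 21, 19.5, and 22.5, so the ensemble lands at about 21; individual errors partly cancel out.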

Sum:
- Categorize & predict
- Many choices available
- Goal is useful insight

XVI. Interpretability

When you are doing an analysis, what you’re trying to do is solve for value.

Analysis ≠ Value

Analysis * Story = Value
That’s multiplicative, not additive
So, if you have no story, it means you have no value:
Analysis * 0 = 0

What we really want to do is maximize the story, so that we can maximize the value that results from our analysis.
Analysis * Max(Story) = Max(Value)

The goal is Max(Value)

Goals

- Analysis is goal-driven
- Story should match goals
- Answer the client’s questions clearly & unambiguously

Client ≠ You

- Egocentrism
- False consensus: “Everybody knows that” is not true
- Anchoring
- Clarity at each step

Answer

- State the question
- Give your answer
- Qualify as needed
- Go in order
- Discuss process sparingly

Analysis = simplification

Everything should be made as simple as possible, but not simpler
— Albert Einstein

Less is more
— Ludwig Mies van der Rohe after Robert Browning

Be minimally sufficient
— Psychological researcher

Minimal Viable Product = Minimal Viable Analysis
— Commerce

A few tips for when you’re giving the presentation:
- More charts, less text
- Simplify charts
- Avoid tables
- Less text (again)

Sum:
- Stories give value
- Address client’s goals
- Be minimally sufficient

XVII. Actionable Insights

My thinking is first and last and always for the sake of my doing
— William James

Your analysis or your data is for the sake of your doing
— The idea

Point The Way

- Why was the project conducted?
- Goal is usually to direct action
- Analysis should guide action

Next Steps

- Give the next steps: Tell them what they need to do now
- Justify each of those recommendations with the data and your analysis
- Be specific: tell them exactly what they need to do
- Make sure it’s doable by the client
- Build on each step

Correlation Vs. Causation

- Your data gives you correlation
- But your client wants causation
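
A small simulation can show why correlation alone does not give causation. In this sketch, which is entirely made up, a third variable (temperature) drives two others; they end up strongly correlated even though neither causes the other. `pearson_r` is a helper written for the example.

```python
import random

random.seed(1)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical setup: temperature drives both variables; neither causes the other.
temperature = [random.uniform(15, 35) for _ in range(500)]
ice_cream_sales = [3 * t + random.gauss(0, 5) for t in temperature]
sunburn_cases = [2 * t + random.gauss(0, 5) for t in temperature]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"r = {r:.2f}")
```

The correlation comes out high, yet banning ice cream would do nothing for sunburn; that gap is what the client's causal question is really about.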

How To Get From Correlation To Causation?

- Experimental Studies: Randomized, controlled trials are the simplest path to causality
- Quasi-experiments: Methods that use non-randomized data for causal inference
- Theory & Experience: Research-based theory & domain-specific experience

Social Factor — To Valid Data Science

Data Science With Social

A few kinds of social understanding:
- Client’s Mission: Make sure that your recommendations are consistent with your client’s mission
- Client’s Identity
- Business Context: The competitive environment and the regulatory environment
- Social Context: Make sure your recommendations can be realized the way they need to be

Sum:
- DS is goal-focused
- Give specific next steps
- Be aware of context (be aware of the social, political, and economic context)

XVIII. Presentation Graphics

Exploratory Vs. Presentation Graphics

- Exploratory graphics need speed & responsiveness
- Presentation graphics need clarity & narrative flow: Flat, static graphics can often be more informative because they have fewer distractions in them

Sum:
- Graphics you use for Presenting ≠ Graphics you use for Exploring
- Be clear, be focused
- Create a strong narrative

XIX. Reproducible Research

Data science projects are rarely “one and done.”

Rather, they are: Incremental, Cumulative, & Adaptive

Important things here:

Show Your Work

- Revising
- Borrowing
- Handing off
- Accountability

Open Data < Open Data Science

- odsc.com
- osf.io
- j.mp/aps-op

Archives

- All data sets, both raw & processed
- All code to process & analyze data
- Comment liberally

Process

- Explain why you did it the way you did
- Include choices, consequences, backtracking

Future Proofing

- Data: Store data in non-proprietary formats, like CSV
- Storage: Place files in a secure accessible location, like GitHub
- Code: Dependency management with packrat for R or virtualenv for Python
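
As a sketch of the "non-proprietary formats" point, here is how results might be written to CSV and read back with Python's standard `csv` module. The rows and column names are made up, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Made-up processed results to archive.
rows = [
    {"customer_id": 1, "segment": "loyal", "score": 0.91},
    {"customer_id": 2, "segment": "new", "score": 0.42},
]

# Write to an in-memory buffer here; a real project would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "segment", "score"])
writer.writeheader()
writer.writerows(rows)

# Read it back to confirm the plain-text format round-trips.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0])
```

CSV stores everything as text, so numbers come back as strings; documenting the column types alongside the file makes the archive easier to reuse later.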

Explain Yourself

Put your narrative in a notebook
- Jupyter for Python
- R Markdown for R

Sum:
- Support collaboration
- Future-proof your work
- Share your narrative

XX. Next Steps

A few ideas for what to do next:
- Coding in R & Python
- Data visualization
- Statistics & math
- Machine learning

Maybe you can also try looking at Data Sourcing

Keep it in context. Data Science can be applied to marketing, sports, health, education, the arts, etc.

Maybe you can also get involved in the data science community.
- O’Reilly Strata
- Predictive Analytics World
- TapestryConference.com
- Extract by import.io
- Kaggle.com
- DataKind.org

Data science needs you! Bye~

This article is my learning notes from https://www.youtube.com/watch?v=ua-CiDNNj30
