Foundations Of Data Science Part 1 — Data Science: An Introduction — Learning Notes
From freeCodeCamp.org YouTube Video (Part 1)
Data Science: An Introduction
I. The Data Science Venn Diagram
Data science is a combination of:
Coding + Statistics + Domain/Business knowledge
Coding
- Stats: R & Python
- Database: SQL
- Command-line: Bash
- Search: Regex
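As a small illustration of the regex item above, here is a sketch using Python's built-in `re` module. The sample text and the pattern are made up for these notes; real email matching usually needs a stricter pattern.

```python
import re

# Hypothetical free-text notes; the addresses are made up for illustration.
text = "Contact ana@example.com or the team at data-team@example.org by Friday."

# A deliberately simple pattern: word characters, dots, or hyphens around an "@".
email_pattern = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

# findall returns every non-overlapping match, in order of appearance.
emails = email_pattern.findall(text)
print(emails)
```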
Math
- Probability, algebra, regression, etc.
- Choose procedures
- Diagnose problems
Domain
- Expertise in field
- Goals, methods, & constraints
- Can implement well
Machine Learning (ML)
- Coding & math without domain
- Black box models: data gets thrown in, but the practitioner doesn’t know what it means or what language it is in
- This is different from data science
Traditional Research
- Math & domain without coding
- Data is structured: the data in this zone is ready for analysis
- Effort is in method & interpretation
Danger Zone
- Coding & domain without math?
- Unlikely to happen?
- Word counts, maps.
II. The Data Science Pathway
First: Planning
Second: Data prep.
Third: Modeling
Fourth: Follow up
Planning
1. Define goals
2. Organize resources
3. Coordinate people
4. Schedule project
Data Prep.
1. Get data
2. Clean data
3. Explore data
4. Refine data
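The four data prep steps above can be sketched with Python's standard library. The CSV content, column names, and cleaning rules here are assumptions made up for illustration, not something from the video.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data; in practice this would come from a file, API, or database.
raw = """name,age,city
Ana,34,Lisbon
Budi,,Jakarta
Chen,29,Taipei
ana,34,Lisbon
"""

# 1. Get data: read the CSV text into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Clean data: drop rows with a missing age and convert age to an integer.
clean = [{**r, "age": int(r["age"])} for r in rows if r["age"].strip()]

# 3. Explore data: a quick summary of one column.
average_age = mean(r["age"] for r in clean)

# 4. Refine data: normalize name casing and remove the resulting duplicates.
seen = set()
refined = []
for r in clean:
    key = (r["name"].title(), r["age"], r["city"])
    if key not in seen:
        seen.add(key)
        refined.append({"name": key[0], "age": key[1], "city": key[2]})

print(len(rows), len(clean), len(refined))
```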
Modeling
1. Create model: create statistical model (regression analysis, neural network, etc.)
2. Validate model
3. Evaluate model
4. Refine model
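A minimal sketch of the create, validate, and evaluate steps above, assuming a simple linear regression and a held-out subset of the data. The data is synthetic (generated from a made-up line plus noise), and the split rule and error metric are illustrative choices only.

```python
import random

# Hypothetical data: y is roughly 2*x + 1 plus noise (made up for illustration).
random.seed(0)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1.0) for x in xs]

# Hold out every fourth point for validation; train on the rest.
pairs = list(zip(xs, ys))
train = [p for i, p in enumerate(pairs) if i % 4 != 0]
holdout = [p for i, p in enumerate(pairs) if i % 4 == 0]

def fit_line(points):
    """Create model: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(points, slope, intercept):
    """Validate/evaluate model: average absolute error on the given points."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

slope, intercept = fit_line(train)
train_error = mean_abs_error(train, slope, intercept)
holdout_error = mean_abs_error(holdout, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MAE={train_error:.2f} held-out MAE={holdout_error:.2f}")
```

If the held-out error were much larger than the training error, that would be a cue for the "refine model" step.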
Follow Up
1. Present model: share in a meaningful way with other people
2. Deploy model: put it to work in order to accomplish something. For instance, if you are working with an e-commerce site, you may be developing a recommendation engine.
3. Revisit model
4. Archive assets
- DS isn’t just technical: planning, presenting, and implementing are important
- Contextual skills matter: knowing how it works in a particular field and how it will be implemented
- One step at a time
III. Roles In Data Science
Collaborative thing — All together now
Engineer
- Focus on back end hardware, software
- Makes DS possible
- Developer, DBA
- Provides the foundation for the rest of the work
Big Data
- Focus on computer science & math
- Machine learning
- Data products: e.g., a thing that tells you what restaurant to go to
Researcher
- Focus on domain-specific research
- Physics, genetics
- Very strong statistics
- They focus on specific questions
Analyst
- Day-to-day tasks
- Web analytics, SQL
- Good for business
- Not exactly DS? Because most of the data they are working with is going to be pretty structured
- They play a critical role in business in general
Business
- Frames business-relevant questions that can be answered with the data
- Manages project
- Must “speak data”
Entrepreneur
- Data startups
- Needs data & business skills
- Creative throughout
Full-stack “unicorn”
- Can do everything at an expert level
- They may not actually exist
- Data science is diverse
- Different goals & skills
- Different contexts
IV. Teams In Data Science
Coding
Statistics
Design
Business
Who can do it all? The unicorn
A mythical data scientist with universal abilities
Unicorn By Team (2 People)
- Can’t do DS on your own
- People need people
- Make collective unicorns
V. Contrast Big Data
Data science & Big data = similar, not same
VI. Contrast Coding
- Data science ≠ coding
- Share tools & practices
- But statistics is critical
VII. Contrast Statistics
The world’s 7 most powerful data scientists (in Forbes.com)
5 in CS, 3 in Math, 2 in Engineering, 1 each in Biology, Economics, Law, Speech pathology, & Statistics
- DS & stats both use data
- Different backgrounds
- Different goals & contexts
VIII. Contrast Business Intelligence
- No coding in BI
- Simple statistics
- Focus on domain expertise & utility
DS & BI
- BI is very goal-oriented
- DS prepares data & form
- DS can learn from BI
IX. Do No Harm
Privacy
- Confidentiality
- Shouldn’t share
- Sources not intended for sharing?
Anonymity
- Not hard to identify people in data
- HIPAA (Health Insurance Portability and Accountability Act): Before HIPAA, it was easy to identify people from medical records.
- Proprietary data may have identifiers: If you are working for a client, the data may contain identifiers, so you may know who the people are and they are no longer anonymous. Anonymity may or may not be present, but make a major effort to anonymize the data. Above all, even if you do know who they are, you must still maintain their privacy and confidentiality.
Copyright
- Just because something is on the web doesn’t mean that you are allowed to use it. Scraping data is common and useful
- Check copyright
Data Security
- Guard against hackers
- If people who are no longer part of your team still have your organization’s data, make sure they no longer have access to it.
- DS has potential & risks
- Analyses can’t be neutral
- Good judgment is vital
X. Methods Overview
- DS includes tech
- But DS > tech
- Tech is means to insight
XI. Sourcing Overview
Use Existing Data
- In-house → company records
- Open → public data
- Third-party → buy data from vendor
Use Data APIs
- Allows apps to communicate directly
- Get web data
- Import the data directly into a program/application
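To illustrate importing API data directly into a program, here is a sketch that parses a JSON response with Python's `json` module. No real API is called; the response body and its field names (`results`, `city`, `temp_c`) are hypothetical.

```python
import json

# Hypothetical JSON body, shaped like what a web API might return.
# The field names ("results", "city", "temp_c") are assumptions, not a real API.
response_body = (
    '{"results": [{"city": "Lisbon", "temp_c": 21.5},'
    ' {"city": "Jakarta", "temp_c": 30.0}]}'
)

# Parse the JSON text into Python dictionaries and lists.
payload = json.loads(response_body)

# Pull the fields of interest straight into a structure ready for analysis.
temps = {item["city"]: item["temp_c"] for item in payload["results"]}
print(temps)
```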
Scrape Web Data
- For web data without APIs
- HTML, PDFs, etc.
- Use apps & code for scraping data
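As a rough sketch of scraping with code, the example below pulls table cells out of an HTML fragment using Python's built-in `html.parser`. The HTML is a made-up inline string; a real scraper would first download the page, and the `CellCollector` class is a hypothetical helper written for these notes.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download the HTML first
# and should respect the site's terms of use and copyright.
page = """
<table>
  <tr><td>Lisbon</td><td>21.5</td></tr>
  <tr><td>Jakarta</td><td>30.0</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self.in_cell and self.rows:
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(page)
print(parser.rows)
```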
Make Data
- Get exactly what you need
- Interviews
- Surveys
- Experiments
- Etc.
Remember this one little aphorism
GIGO: Garbage in, garbage out
It means that if you feed bad data into your system, you are not going to get any worthwhile insights out of it.
Pay attention to metrics & meaning.
Sum:
- Get the raw materials
- Many possible methods
- Check quality and the meaning of the data
XII. Coding Overview
Apps
Specialized apps for working with data
- Spreadsheets: Excel, Google Sheets
- Tableau: For data visualization
- SPSS: Statistical package used in the social sciences and in business
- JASP: Free, open-source analog of SPSS
- Etc.
Data
Special formats for web data
- HTML
- XML
- JSON
- Etc.
Code
Languages that give you full control
- R
- Python
- SQL
- C, C++, & Java
- Bash
- Regex
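As one example of how these languages combine, the sketch below runs a SQL query from Python using the built-in `sqlite3` module. The `orders` table and its rows are hypothetical and live only in memory.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("budi", 35.0), ("ana", 15.0)],
)

# SQL does the grouping and summing; Python only receives the result.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
print(totals)
```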
Remember: tools are just tools. They are only part of the process. They are a means to an end, and the end, the goal, is INSIGHT. You need to know where you are trying to go and then simply choose the tools that help you reach that particular goal.
Sum:
- Use tools wisely
- A few is usually enough
- Focus on your goal
XIII. Math Overview
Math is the foundation of what we’re going to do.
Why do we need math?
1. Know which procedures to use & why
2. Know what to do when things don’t work right
3. Some math is easier & quicker by hand than computer
Math: Data science::
- Chemistry:Cooking → You can be a wonderful cook without knowing any chemistry, but if you know some chemistry it’s going to help
- Kinesiology:Dancing
- Grammar:Writing
What kinds of math do you need for data science?
- Algebra: Elementary algebra, linear (matrix) algebra, systems of linear equations
- Calculus
- Big O: Has to do with the order of a function, roughly how fast its cost grows as the input gets larger
- Probability
- Bayes’ theorem
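Bayes' theorem from the list above is small enough to work through directly. This sketch assumes a made-up diagnostic test; the prevalence, sensitivity, and false positive rate are hypothetical numbers chosen for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                    # P(positive and condition)
    false_pos = false_positive_rate * (1.0 - prior)   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 3))
```

With these assumed numbers the posterior is roughly 15%, far lower than the 90% sensitivity might suggest, because the condition is rare to begin with.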
Sum:
A little bit of math:
- Can help you make informed choices when planning your analyses
- Find and fix problems
- Can even do by hand sometimes
XIV. Statistics Overview
What we are trying to do here:
Explore
- Exploratory graphics
- Exploratory statistics, a numerical exploration of the data
- Descriptive statistics
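A quick numerical exploration like the one described above might look like this with Python's `statistics` module. The order counts are made up and include one deliberate outlier.

```python
import statistics

# Hypothetical sample: daily order counts (made-up numbers, one outlier).
orders = [12, 15, 11, 15, 90, 14, 13]

summary = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "stdev": statistics.stdev(orders),  # sample standard deviation
    "min": min(orders),
    "max": max(orders),
}
print(summary)
```

Comparing the mean with the median is a cheap way to notice that a single unusual value is pulling the average upward.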
Inference
- From samples to populations
- Hypothesis testing
- Estimation/confidence intervals
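To make the estimation item concrete, here is a sketch of a 95% confidence interval for a mean. The sample values are made up, and the normal-approximation multiplier 1.96 is an assumption that works best for larger samples.

```python
import math
import statistics

# Hypothetical sample of measurements (made-up numbers).
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
sample_mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 1.96 is the normal-approximation multiplier for a 95% interval; with a
# sample this small, a t multiplier (about 2.26 for 9 degrees of freedom)
# would give a slightly wider and more appropriate interval.
z = 1.96
ci_low, ci_high = sample_mean - z * std_err, sample_mean + z * std_err
print(f"mean={sample_mean:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```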
Details
- Feature selection
- Problems
- Validation
- Estimators
- Fit
Beware the trolls
There are people out there who will tell you that if you don’t do things exactly the way they say, your analysis is meaningless, your data is junk, and you’ve wasted all your time. You know what? They’re trolls. You can make enough of an informed decision on your own to go ahead and do an analysis that is still useful.
This wonderful quote from a very famous statistician:
All statistical models are wrong, but some are useful
— George Box
Sum:
- Statistics allows you to explore, describe your data, and infer things about the population
- Many choices available
- Goal is useful insight
XV. Machine Learning Overview
Data Space
- Dimension reduction: Try to find the most essential parts of the data
- Clustering
- k-Means Method
- Anomalies/unusual cases that show up in the data space
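The k-means method listed above can be sketched in a few lines of plain Python. This is a bare-bones version with hand-picked starting centers and made-up 2-D points; library implementations handle initialization and convergence checks more carefully.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate between assigning points and moving centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical 2-D data with two visible groups (made-up numbers).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# Starting centers are picked by hand here; real implementations usually
# choose them randomly or with a scheme such as k-means++.
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
```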
Categories
- Logistic regression
- k-nearest neighbors (kNN)
- Naive Bayes: For classification
- Decision trees
- Support Vector Machines (SVM)
- Artificial neural nets.
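Of the classifiers above, k-nearest neighbors is the easiest to sketch from scratch. The labeled points and the `knn_predict` helper below are hypothetical, written only to show the majority-vote idea.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (features, label), made up for illustration.
train = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.5), "small"),
    ((6.0, 6.5), "large"), ((7.0, 6.0), "large"), ((6.5, 7.0), "large"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.8, 6.2)))
```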
Predictions
- Linear regression
- Poisson regression: Used for modeling count or frequency data
- Ensemble models: Where you create several models, take the predictions from each of them, and put them together to get an overall more reliable prediction.
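The ensemble idea can be sketched by averaging the outputs of several models. The three "models" here are hypothetical rules of thumb with made-up coefficients, and simple averaging is only one of several ways to combine predictions.

```python
from statistics import mean

# Three hypothetical models, each a simple rule of thumb for predicting
# a price from a size; the coefficients are made up for illustration.
def model_a(size):
    return 2.0 * size + 5.0

def model_b(size):
    return 2.5 * size

def model_c(size):
    return 1.8 * size + 12.0

def ensemble_predict(size, models=(model_a, model_b, model_c)):
    """Average the individual predictions into one combined prediction."""
    return mean(m(size) for m in models)

individual = [m(10.0) for m in (model_a, model_b, model_c)]
combined = ensemble_predict(10.0)
print(individual, combined)
```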
Sum:
- Categorize & predict
- Many choices available
- Goal is useful insight
XVI. Interpretability
When you are doing an analysis, what you’re trying to do is solve for value.
Analysis ≠ Value
Analysis * Story = Value
That’s multiplicative, not additive
So, if you have no story, it means you have no value:
Analysis * 0 = 0
What we really want to do is maximize the story, so that we can maximize the value that results from our analysis.
Analysis * Max(Story) = Max(Value)
The goal is Max(Value)
Goals
- Analysis is goal-driven
- Story should match goals
- Answer question from client clearly & unambiguously
Client ≠ You
- Egocentrism
- False consensus: “Everybody knows that” is not true
- Anchoring
- Clarity at each step
Answer
- State the question
- Give your answer
- Qualify as needed
- Go in order
- Discuss process sparingly
Analysis = simplification
Everything should be made as simple as possible, but not simpler
— Albert Einstein
Less is more
— Ludwig Mies van der Rohe after Robert Browning
Be minimally sufficient
— Psychological researcher
Minimum Viable Product = Minimum Viable Analysis
— Commerce
A few tips for when you’re giving the presentation:
- More charts, less text
- Simplify charts
- Avoid tables
- Less text (again)
Sum:
- Stories give value
- Address client’s goals
- Be minimally sufficient
XVII. Actionable Insights
My thinking is first and last and always for the sake of my doing
— William James
Your analysis or your data is for the sake of your doing
— The idea
Point The Way
- Why was the project conducted?
- Goal is usually to direct action
- Analysis should guide action
Next Steps
- Give the next steps: Tell them what they need to do now
- Justify each of those recommendations with the data and your analysis
- Be specific: tell them exactly what they need to do
- Make sure it’s doable by the client
- Build on each step
Correlation Vs. Causation
- Your data gives you correlation
- But your client wants causation
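A small sketch of why correlation alone is not causation: two series that both depend on a third factor correlate strongly with each other. All numbers are made up, and `pearson_r` is a hand-written helper for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures (made up): both series rise with temperature,
# so they correlate strongly even though neither causes the other.
temperature = [10, 15, 20, 25, 30]
ice_cream_sales = [20 + 3 * t for t in temperature]
sunburn_cases = [2 + t + e for t, e in zip(temperature, [1, -1, 0, 1, -1])]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(round(r, 2))
```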
How To Get From Correlation To Causation?
- Experimental Studies: Randomized, controlled trials are the simplest path to causality
- Quasi-experiments: Methods that use non-randomized data for causal inference
- Theory & Experience: Research-based theory & domain-specific experience
Social Factors in Valid Data Science
A few kinds of social understanding
- Client’s Mission: Make sure that your recommendations are consistent with your client’s mission
- Client’s Identity
- Business Context: The competitive environment and the regulatory environment
- Social Context: Make sure your recommendations can be realized the way they need to be
Sum:
- DS is goal-focused
- Give specific next steps
- Be aware of context (be aware of the social, political, and economic context)
XVIII. Presentation Graphics
Exploratory Graphics
- Need speed & responsiveness
Presentation Graphics
- Need clarity & narrative flow: Flat, static graphics can often be more informative because they have fewer distractions in them
Sum:
- Graphics you use for Presenting ≠ Graphics you use for Exploring
- Be clear, be focused
- Create a strong narrative
XIX. Reproducible Research
Data science projects are rarely “one and done.”
Rather, they are: Incremental, Cumulative, & Adaptive
Important things here:
Show Your Work
- Revising
- Borrowing
- Handing off
- Accountability
Open Data < Open Data Science
- odsc.com
- osf.io
- j.mp/aps-op
Archives
- All data sets, both raw & processed
- All code to process & analyze data
- Comment liberally
Process
- Explain why you did it the way you did
- Include choices, consequences, backtracking
Future Proofing
- Data: Store data in non-proprietary formats, like CSV
- Storage: Place files in a secure accessible location, like GitHub
- Code: Dependency management with packrat for R or virtualenv for Python
Explain Yourself
Put your narrative in a notebook
- Jupyter for Python
- R Markdown for R
Sum:
- Support collaboration
- Future-proof your work
- Share your narrative
XX. Next Steps
A few ideas for what to do next:
- Coding in R & Python
- Data visualization
- Statistics & math
- Machine learning
Maybe you can also try looking at Data Sourcing
Keep it in context. Data Science can be applied to marketing, sports, health, education, the arts, etc.
Maybe you can also get involved in the data science community.
- O’Reilly Strata
- Predictive analytics world
- TapestryConference.com
- Extract by import.io
- Kaggle.com
- DataKind.org
Data science needs you! Bye~
This article is my learning notes from https://www.youtube.com/watch?v=ua-CiDNNj30