Half-Year Plan: The Way to Becoming a Data Scientist

This post records my learning journey, so I can see clearly which stage I am at.

After half a year of self-study, I am fairly comfortable with R packages, but I still lack solid coding ability, so data structures and algorithms (DSA) will be the focus of the next six months. A while back I got distracted looking for an internship, falling back on my old marketing trade; that was a way of avoiding the hard parts. I am actually grateful for that period of stumbling, because now I am very clear about my direction and what I can accomplish.

A reminder not to forget why I came to the United States in the first place: to join a FLAG company and become a top data scientist.

Over the next six months (July–December) I will concentrate on CS fundamentals and start grinding coding problems as circumstances allow. In the following semester (January–April) the emphasis shifts to project work, turning this half-year's knowledge into experience. The goal is to start sending resumes to large companies around February or March next year, join a startup in April as an intern with the title "Data Analyst," and intern at a large company in another city in July.

Self-study takes enormous patience and perseverance. Thank you to everyone beside me who supports and loves me. This road is not easy, but I genuinely enjoy it, and your company gives every day meaning. The rest can wait until I succeed.

If you happen to read this article and share the same goal or are working through similar courses, feel free to contact me: weichien711@gmail.com

A Simple Roadmap

Quick Start with Python: Codecademy (1–2 weeks)

The goal here is only to understand basic programming logic and commands.

1. Python Syntax
2. Strings and Console Output
3. Conditionals and Control Flow
4. Functions
5. Lists & Dictionaries
6. Student Becomes the Teacher
7. Lists and Functions
8. Loops
9. Exam Statistics
10. Advanced Topics in Python
11. Introduction to Classes
12. File Input and Output
13. Python Final Project
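As a taste of what these modules cover, here is a small exercise in the spirit of the "Exam Statistics" and "Lists and Functions" modules, combining lists, dictionaries, functions, and loops. The function names and data are my own, not Codecademy's:

```python
# Lists, dictionaries, functions, and loops in one small exercise.

def grades_average(grades):
    """Return the mean of a list of numeric grades."""
    total = 0
    for g in grades:
        total += g
    return total / len(grades)

def class_average(students):
    """Average the per-student averages for a list of student dicts."""
    results = [grades_average(s["grades"]) for s in students]
    return grades_average(results)

students = [
    {"name": "alice", "grades": [90, 80, 70]},
    {"name": "bob", "grades": [100, 90, 80]},
]
print(class_average(students))  # 85.0
```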

Basic Computer Science Knowledge & Python Practice

The focus is on two MIT CS courses, both taught in Python.

MITx: 6.00.1x Introduction to Computer Science and Programming Using Python

1. Introduction to computing 2. Data structures (basic) 3. Introduction to algorithms (basic) 4. Object-oriented design 5. Debugging (at the program-syntax level)

Lecture 1 — Introduction to Python:

Machines / Languages / Types / Variables / Operators and Branching

Lecture 2 — Core elements of programs:
Strings / Input & Output / IDEs / Control Flow / Iteration / Guess and Check

Lecture 3 — Simple Programs:
Approximate Solutions / Bisection Search / Floats and Fractions / Newton-Raphson
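The bisection-search idea from this lecture can be sketched in a few lines: repeatedly halve an interval that must contain the answer. A minimal version for approximating a square root (function and parameter names are mine, not the course's):

```python
# Bisection search for an approximate square root.

def sqrt_bisection(x, epsilon=1e-6):
    """Approximate sqrt(x) for x >= 0 to within epsilon using bisection."""
    low, high = 0.0, max(x, 1.0)  # sqrt(x) always lies in [0, max(x, 1)]
    guess = (low + high) / 2
    while abs(guess * guess - x) >= epsilon:
        if guess * guess < x:
            low = guess
        else:
            high = guess
        guess = (low + high) / 2
    return guess

print(round(sqrt_bisection(25), 4))  # 5.0
```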

Lecture 4 — Functions:
Decomposition and Abstraction / Functions and Scope / Keyword Arguments / Specifications / Iteration vs Recursion / Inductive Reasoning / Towers of Hanoi /
Fibonacci / Recursion on non-numerics / Files

Lecture 5 — Tuples and Lists:
Tuples / Lists / List Operations / Mutation, Aliasing, Cloning

Lecture 6 — Dictionaries:
Functions as Objects / Dictionaries / Example with a Dictionary / Fibonacci and Dictionaries / Global Variables
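The "Fibonacci and Dictionaries" topic is about caching results in a dict so the naive exponential recursion becomes linear. A minimal sketch (the base-case indexing convention here is my own choice):

```python
def fib_memo(n, memo=None):
    """Fibonacci with a dictionary cache, avoiding exponential recursion."""
    if memo is None:
        memo = {0: 0, 1: 1}
    if n not in memo:
        memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]

print(fib_memo(30))  # 832040
```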

Lecture 7 — Debugging:
Programming Challenges / Classes of Tests / Bugs / Debugging / Debugging Examples

Lecture 8 — Assertions and Exceptions:
Assertions / Exceptions / Exception Examples

Lecture 9 — Classes and Inheritance:
Object Oriented Programming / Class Instances / Methods / Classes Examples / Why OOP
Hierarchies / Your Own Types

Lecture 10 — An Extended Example:
Building a Class / Visualizing the Hierarchy / Adding another Class / Using Inherited Methods / Gradebook Example / Generators

Lecture 11 — Computational Complexity:
Program Efficiency / Big Oh Notation / Complexity Classes / Analyzing Complexity

Lecture 12 — Searching and Sorting Algorithms:
Indirection / Linear Search / Bisection Search / Bogo and Bubble Sort / Selection Sort / Merge Sort
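Of the sorts listed here, merge sort is the one worth internalizing, since it is the course's first real divide-and-conquer algorithm. A compact version (my own sketch, not the lecture code):

```python
def merge_sort(xs):
    """Divide-and-conquer merge sort; returns a new sorted list, O(n log n)."""
    if len(xs) <= 1:
        return xs[:]
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    # Merge the two sorted halves.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```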

Lecture 13 — Visualization of Data:
Visualizing Results / Overlapping Displays / Adding More Documentation / Changing Data Display / An Example Lecture

MITx: 6.00.2x Introduction to Computational Thinking and Data Science

1. Plotting in Python (a MATLAB-like system) 2. Probability and statistics (basic) 3. Numerical analysis (basic) 4. Introduction to algorithms (optimization) 5. Computational intelligence (basic)

Lecture 1 — Optimization and Knapsack Problem:
Computational models / Intro to optimization / 0/1 Knapsack Problem / Greedy solutions
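A greedy solution to the 0/1 knapsack problem, in the spirit of this lecture: take items in order of value-to-weight density until capacity runs out. Item data here is invented; note that greedy does not guarantee the optimum for 0/1 knapsack, which is exactly the lecture's point:

```python
def greedy_knapsack(items, capacity):
    """items: list of (name, value, weight). Greedy by density.
    Returns (total_value, names_taken)."""
    taken, total = [], 0
    for name, value, weight in sorted(
            items, key=lambda it: it[1] / it[2], reverse=True):
        if weight <= capacity:
            capacity -= weight
            total += value
            taken.append(name)
    return total, taken

items = [("a", 60, 10), ("b", 100, 20), ("c", 120, 30)]
print(greedy_knapsack(items, 50))  # (160, ['a', 'b'])
```

Here the optimum is actually 220 (items b and c), so the greedy answer of 160 shows why the later lectures move on to decision trees and dynamic programming.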

Lecture 2 — Decision Trees and Dynamic Programming:
Decision tree solution to knapsack / Dynamic programming and knapsack /
Divide and conquer

Lecture 3 — Graphs:
Graph problems / Shortest path / Depth first search / Breadth first search
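Breadth-first search finds shortest paths in unweighted graphs, as covered here. A minimal sketch using an adjacency dict (the representation is my choice, not the course's code):

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return a shortest start->goal path as a list of nodes, or None."""
    frontier = deque([[start]])  # queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(path + [neighbor])
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(bfs_shortest_path(graph, "a", "e"))  # ['a', 'b', 'd', 'e']
```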

Lecture 4 — Plotting:
Visualizing Results / Overlapping Displays / Adding More Documentation /
Changing Data Display / An Example

Lecture 5 — Stochastic Thinking:
Rolling a Die / Random walks

Lecture 6 — Random Walks:
Drunk walk / Biased random walks / Treacherous fields
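The drunk-walk simulations in these two lectures reduce to a few lines: after n random ±1 steps, the expected distance from the origin grows roughly like the square root of n. A minimal 1-D sketch (my own, with a fixed seed for reproducibility):

```python
import random

def walk_distance(steps):
    """Simulate one 1-D drunk walk; return final distance from the origin."""
    position = 0
    for _ in range(steps):
        position += random.choice((-1, 1))
    return abs(position)

random.seed(0)
trials = [walk_distance(100) for _ in range(1000)]
# Theory predicts a mean distance near sqrt(2 * 100 / pi) ~ 8 after 100 steps.
print(sum(trials) / len(trials))
```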

Lecture 7 — Inferential Statistics:
Probabilities / Confidence intervals

Lecture 8 — Monte Carlo Simulations:

Lecture 9 — Monte Carlo Simulations:
Sampling / Standard error
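The classic Monte Carlo example from these lectures is estimating pi by throwing random points at a unit square, then using the spread of repeated estimates to gauge the standard error. A sketch with numbers of my choosing:

```python
import random

def estimate_pi(points):
    """4 * (fraction of random points inside the quarter circle) estimates pi."""
    inside = sum(1 for _ in range(points)
                 if random.random() ** 2 + random.random() ** 2 <= 1)
    return 4 * inside / points

random.seed(17)
estimates = [estimate_pi(10_000) for _ in range(30)]
mean = sum(estimates) / len(estimates)
std = (sum((e - mean) ** 2 for e in estimates) / len(estimates)) ** 0.5
stderr = std / len(estimates) ** 0.5
print(round(mean, 3), round(stderr, 4))  # mean lands close to 3.14
```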

Lecture 10 — Experimental Data:
Errors in Experimental Observations / Curve Fitting

Lecture 11 — Experimental Data:
Goodness of Fit / Using a Model for Predictions

Lecture 12 — Machine Learning:
Feature Vectors / Distance Metrics / Clustering

Lecture 13 — Statistical Fallacies:
Misusing Statistics / Garbage In Garbage Out / Data Enhancement

Advanced Algorithms & Data Structures Courses

These complement the MIT courses above; the material leans more theoretical, and the assignments double as Python project practice.

Data Structures and Algorithms Specialization @UCSD (6 parts)

Basic knowledge of at least one programming language (C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, and Scala): loops, arrays, stacks, recursion. Basic knowledge of mathematics: proof by induction, proof by contradiction.

Course1 — Algorithmic Toolbox
Greedy Algorithms
Divide-and-Conquer
Dynamic Programming

Course2 — Data Structures
Basic Data Structures
Dynamic Arrays and Amortized Analysis
Priority Queues and Disjoint Sets
Hash Tables
Binary Search Trees
Binary Search Trees 2

Course3 — Algorithms on Graphs
Decomposition of Graphs 1
Decomposition of Graphs 2
Paths in Graphs 1
Paths in Graphs 2
Minimum Spanning Trees
Advanced Shortest Paths Project (Optional)

Course4 — Algorithms on Strings
Suffix Trees
Burrows-Wheeler Transform and Suffix Arrays
Knuth–Morris–Pratt Algorithm
Constructing Suffix Arrays and Suffix Trees

Course5 — Advanced Algorithms and Complexity
Flows in Networks
Linear Programming
NP-complete Problems
Coping with NP-completeness
Streaming Algorithms (Optional)

Statistics

I have been studying statistics and R for half a year, and the basic theoretical framework is clear. As the school term starts, class assignments and projects will serve as practice, with periodic review of the most important theorems. In six months I will move up to machine learning, which will still require digging deeper into this area.

Statistics with R — Duke University

COURSE 1 Introduction to Probability and Data
COURSE 2 Inferential Statistics
COURSE 3 Linear Regression and Modeling
COURSE 4 Bayesian Statistics
Decision Making
Bayesian Regression
Perspectives on Bayesian Applications
Data Analysis Project

COURSE 5 Capstone (the project can be finished this month, earning the certificate)

SQL & Data Warehousing (supplementary skill)

I have been taking this for almost a month and understand basic SQL operations, but the immediate priority is DSA. Since a Data Management course with largely the same content starts next term, I will pause after finishing the first course.

Data Warehousing for Business Intelligence Specialization
University of Colorado

Course1 — Database Management Essentials
Relational Data Model and the CREATE TABLE Statement
Basic Query Formulation with SQL
Extended Query Formulation with SQL
Notation for Entity Relationship Diagrams
ERD Rules and Problem Solving
Developing Business Data Models
Data Modeling Problems and Completion of an ERD
Schema Conversion
Normalization Concepts and Practice
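The CREATE TABLE and basic query formulation topics from Course 1 can be tried out directly in Python's built-in sqlite3 module, with nothing to install. The table and column names below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE with a primary key and a NOT NULL constraint.
cur.execute("""
    CREATE TABLE student (
        std_no TEXT PRIMARY KEY,
        name   TEXT NOT NULL,
        gpa    REAL
    )
""")
cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("s1", "alice", 3.9), ("s2", "bob", 3.2), ("s3", "carol", 3.6)],
)

# Basic query formulation: project, filter, sort.
cur.execute("SELECT name FROM student WHERE gpa >= 3.5 ORDER BY gpa DESC")
honors = [row[0] for row in cur.fetchall()]
print(honors)  # ['alice', 'carol']
conn.close()
```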

Course2 — Data Warehouse Concepts, Design, and Data Integration
Data Warehouse Concepts and Architectures
Multidimensional Data Representation and Manipulation
Data Warehouse Design Practices and Methodologies
Data Integration Concepts, Processes, and Techniques
Architectures, Features, and Details of Data Integration Tools

Course3 — Relational Database Support for Data Warehouses
DBMS Extensions and Example Data Warehouses
SQL Subtotal Operators
SQL Analytic Functions
Materialized View Processing and Design
Physical Design and Governance

Course4 — Business Intelligence Concepts, Tools, and Applications
Decision Making and Decision Support Systems
Business Intelligence Concepts and Platform Capabilities
Data Visualization and Dashboard Design
Business Performance Management Systems
BI Maturity, Strategy, and Summative Project

Course5 — Design and Build a Data Warehouse for Business Intelligence Implementation

Reading list: two books, to be read thoroughly and mastered.

Programming Collective Intelligence

Chapter 1, Introduction to Collective Intelligence
Chapter 2, Making Recommendations Introduces collaborative filtering, a technique for recommending items based on the preferences of similar people.

Chapter 3, Discovering Groups Builds on some of the ideas in Chapter 2 and introduces two different methods of clustering, which automatically detect groups of similar items in a large dataset.

Chapter 4, Searching and Ranking Describes the various parts of a search engine including the crawler, indexer, and query engine. It covers the PageRank algorithm for scoring pages based on inbound links and shows you how to create a neural network that learns which keywords are associated with different results.

Chapter 5, Optimization Introduces algorithms for optimization, which are designed to search millions of possible solutions to a problem and choose the best one. The wide variety of uses for these algorithms is demonstrated with examples that find the best flights for a group of people traveling to the same location, find the best way of matching students to dorms, and lay out a network with the minimum number of crossed lines.

Chapter 6, Document Filtering Demonstrates Bayesian filtering, which is used in many free and commercial spam filters for automatically classifying documents based on the type of words and other features that appear in the document. This is applied to a set of RSS search results to demonstrate automatic classification of the entries.
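The Bayesian filtering this chapter describes can be sketched as a tiny naive Bayes classifier with add-one smoothing. This is my own minimal version, not the book's classifier class, and the training texts are invented:

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Bare-bones naive Bayes word classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        best, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label].get(word, 0)
                score += math.log((count + 1) / (n + vocab))
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("meeting notes attached", "ham")
print(nb.classify("buy cheap pills"))  # spam
```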

Chapter 7, Modeling with Decision Trees Introduces decision trees as a method not only of making predictions, but also of modeling the way the decisions are made. The first decision tree is built with hypothetical data from server logs and is used to predict whether or not a user is likely to become a premium subscriber. The other examples use data from real web sites to model real estate prices and “hotness.”

Chapter 8, Building Price Models Approaches the problem of predicting numerical values rather than classifications using k-nearest neighbors techniques, and applies the optimization algorithms from Chapter 5. These methods are used in conjunction with the eBay API to build a system for predicting eventual auction prices for items based on a set of properties.
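The k-nearest-neighbors idea here is simple enough to sketch in a few lines: predict a price by averaging the prices of the k most similar known items. The feature vectors and prices below are invented for illustration, not eBay data:

```python
def knn_estimate(data, query, k=3):
    """data: list of (feature_vector, price) pairs. Predict the query's
    price by averaging the k nearest neighbors (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(data, key=lambda item: dist(item[0], query))[:k]
    return sum(price for _, price in nearest) / k

# (age_years, capacity) -> observed auction price, invented numbers.
data = [((1, 16), 90), ((2, 16), 80), ((5, 8), 30), ((6, 8), 25)]
print(knn_estimate(data, (1.5, 16), k=2))  # 85.0
```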

Chapter 9, Advanced Classification: Kernel Methods and SVMs Shows how support-vector machines can be used to match people in online dating sites or when searching for professional contacts. Support-vector machines are a fairly advanced technique and this chapter compares them to other methods.

Chapter 10, Finding Independent Features Introduces a relatively new technique called non-negative matrix factorization, which is used to find the independent features in a dataset. In many datasets the items are constructed as a composite of different features that we don’t know in advance; the idea here is to detect these features. This technique is demonstrated on a set of news articles, where the stories themselves are used to detect themes, one or more of which may apply to a given story.

Chapter 11, Evolving Intelligence Introduces genetic programming, a very sophisticated set of techniques that goes beyond optimization and actually builds algorithms using evolutionary ideas to solve a particular problem. This is demonstrated by a simple game in which the computer is initially a poor player that improves its skill by improving its own code the more the game is played.

Chapter 12, Algorithm Summary Reviews all the machine-learning and statistical algorithms described in the book and compares them to a set of artificial problems. This will help you understand how they work and visualize the way that each of them divides data.

(Excerpted from the book's introduction)

Introduction to Algorithms

1. Growth of Functions
2. Divide-and-Conquer
3. Probabilistic Analysis and Randomized Algorithms
4. Quicksort
5. Heapsort
6. Sorting in Linear Time
7. Medians and Order Statistics

Data Structures Introduction

8. Elementary Data Structures
9. Hash Tables
10. Binary Search Trees
11. Red-Black Trees
12. Augmenting Data Structures

Advanced Design and Analysis Techniques Introduction

13. Dynamic Programming
14. Greedy Algorithms
15. Amortized Analysis

Advanced Data Structures Introduction

16. B-Trees
17. Fibonacci Heaps
18. van Emde Boas Trees

Graph Algorithms

19. Data Structures for Disjoint Sets
20. Elementary Graph Algorithms
21. Minimum Spanning Trees
22. Single-Source Shortest Paths
23. All-Pairs Shortest Paths
24. Maximum Flow

Selected Topics Introduction

25. Multithreaded Algorithms
26. Matrix Operations
27. Linear Programming
28. Polynomials and the FFT
29. Number-Theoretic Algorithms
30. String Matching
31. Computational Geometry
32. NP-Completeness
33. Approximation Algorithms

To close, here is a post I came across on Zhihu, quoted in full:

These math terms look high-end and cutting-edge, but really they are just toying with you, something that makes CS undergrads think mathematicians are amazing. In other words, you do not need that much math to understand the papers.

The most fundamental, and therefore most important, subject is of course Discrete Math, which covers the basics of combinatorics, graph theory, number theory, and so on.

The first level of mastery is knowing the terminology, so reading papers becomes less painful. The second is understanding the definitions well enough to work out basic proofs and to know the application areas; as a CS major you have surely heard plenty of this. The third is being able to compute and extend results on your own. It is a university CS fundamentals course that feels a lot like high-school olympiad math. If you already know all of it, pretend I said nothing. If much of it is new to you, go take a Discrete Math course: Coursera has one, and so does MIT OpenCourseWare (no waiting with MIT OCW: Mathematics for Computer Science).

In case you cannot get through the papers, you need something to do outside academia anyway. For a CS major, I will assume your Data Structures and Algorithms are passable, or you would not find a job; of the two, algorithms has the stronger ties to math. That said, I do not know how anyone learns algorithms well without decent discrete math... On the academic side, algorithms overlaps heavily with graph theory, but again I will assume your discrete math is fine, so reading CLRS (Introduction to Algorithms) up through the NP-Completeness proofs is enough.

Next comes Operations Research / Linear Programming, an important part of combinatorial optimization. I am guessing the papers you are reading touch on it. The subject sounds nasty but is very, very easy to learn; it only requires basic linear algebra, usually not even eigenvalues. You should have met ordinary linear programming while studying algorithms; if not, review CLRS. For the applied side, Coursera has Linear & Discrete Optimization and Linear & Integer Programming. Pick whichever instructor looks more agreeable; six weeks should do it.

Probability theory and stochastic processes are really two courses, but probability is fairly easy (assuming your discrete math is solid), so after linear optimization comes Stochastic Optimization/Methods/Processes. Any university textbook will do; it is a well-established course that applied math and engineering students all take. Honestly, if you only want to understand one paper, just look up the model it uses.

Algebra is too abstract for you and far removed from CS applications, unless you are working on chemistry-related topics. Only after brutal math-department courses like Representation Theory and the math-heavy side of Fourier analysis (Princeton Lectures: Fourier Analysis by Stein) did I deeply understand things like the DFT and FFT that undergraduate engineers use every day as black boxes, and only then did I appreciate that extracting this CS-algebra abstraction takes most people many years of study. It also has little to do with the papers you want to read, so it is unnecessary for now.

Mathematical analysis is very useful, but it is not a subject outsiders enjoy, and the odds that it relates to your paper are tiny. Even if it did, the relevant material would probably be something you would not reach in half a year of study; it is a losing trade.

As for super-hot fields like machine learning and data mining, I have not reviewed many textbooks, so I cannot claim to know everything. But from what I have seen, linear algebra gets used everywhere. I know feature extraction always involves some signal processing and spectral theory, so it pays to learn linear algebra in some depth. Linear algebra itself is endless, though, so at minimum learn change of basis and everything eigen, and ideally understand the pseudo-inverse.
https://www.zhihu.com/question/22539111