Guide To Breaking Into Data Science
This article concisely explains what data science is and why it is useful, and recommends some of the best resources for learning the discipline.
Data Science is the quantitative science of processing large amounts of data, much of it messy or unstructured, to form clearer insights and organized information.
Data may be considered “small”, meaning it fits in a typical spreadsheet or computer file, or “big”, meaning it is beyond the processing ability of a single computer (such as terabytes of data).
The goal is to condense all this noise to form insights that humans can understand. Sample tasks in data science include exploratory data analysis, predictive modeling, visualizations, statistical modeling and much more.
See below for much more detail on common important applications of data science, via machine learning and other approaches.
The most popular programming languages for data science are Python and R. Many open-source libraries have been developed to make common tasks easier, offering ready-made functions for manipulating data.
Common data science libraries include NumPy, Pandas, Matplotlib and several Machine Learning libraries like Scikit-learn, TensorFlow, PyTorch, Keras and others.
Machine Learning models can be grouped into basic categories based on their intended functions.
This is part of a larger series known as the Autodidact Project. For more information about this journey and to join — see this article:
Common Real-World Use Cases of Machine Learning:
Here are some uses of machine learning and deep learning models:
Machine Learning algorithms:
- Predicting future stock prices based on past trading data (a regression task, which produces a quantitative output, e.g. $723.11)
- Predicting whether someone will default on a bank loan (a classification task, which produces a binary output, e.g. Yes or No)
- Clustering data into related groups. Which data records share similar qualities and what do they share in common?
- Reducing the dimensionality of multi-omics medical data to its most important characteristics (PCA, principal component analysis)
- Detecting whether an event is unusual (anomaly detection)
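To make the last item concrete, here is a minimal sketch of anomaly detection using only Python's standard library. It flags values that sit more than a chosen number of standard deviations from the mean. The `find_anomalies` helper, the threshold of 2.0 and the sensor readings are all assumptions made up for illustration; real projects usually rely on more robust methods.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious outlier (25.0).
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
anomalies = find_anomalies(readings)
print(anomalies)
```

One caveat of this simple approach: a large outlier inflates the standard deviation itself, so the threshold has to be chosen with care.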
Deep Learning algorithms:
- Understanding human handwriting (image recognition)
- Converting text to speech, or speech to text (speech synthesis and speech recognition)
- Teaching a computer how to play and beat a video game (reinforcement learning)
Decision-making algorithms:
- YouTube, Netflix, Amazon search recommendations based on your video views or shopping history. What is a user likely to want to watch or buy based on their past viewing history or shopping cart?
- Creating a decision tree. Building a visual and human-interpretable series of if-then statements.
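As a rough illustration of a decision tree's if-then structure, the sketch below trains a shallow tree with Scikit-learn and prints its rules as text. The choice of the built-in iris dataset and a maximum depth of 2 are assumptions made to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small built-in dataset of flower measurements.
iris = load_iris()

# A shallow tree keeps the printed rules short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the fitted tree as nested if-then rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printed output is one condition on a feature, and each leaf ends with the predicted class.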
Natural Language Processing (NLP):
- Conversational agents (e.g. iPhone’s Siri personal assistant can understand some of our spoken requests)
- Translating between languages (e.g. Google Translate)
- Sentiment analysis (e.g. Is this a happy or upset Yelp review?)
- Building a chatbot (A computer listens and responds “on-the-fly” with pre-programmed responses)
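The chatbot idea above can be sketched in a few lines. This is a deliberately simple, rule-based version that looks for known keywords and returns a pre-programmed reply. The `RESPONSES` table, the `FALLBACK` text and the `reply` function are hypothetical names invented for this example.

```python
import re

# Hypothetical keyword-to-reply table; a real chatbot would be far richer.
RESPONSES = {
    "hello": "Hi there! How can I help?",
    "hours": "We are open 9am to 5pm, Monday to Friday.",
    "bye": "Goodbye!",
}
FALLBACK = "Sorry, I did not understand that."


def reply(message):
    """Return the canned reply for the first known keyword in the message."""
    words = re.findall(r"[a-z']+", message.lower())
    for word in words:
        if word in RESPONSES:
            return RESPONSES[word]
    return FALLBACK


print(reply("What are your hours?"))
```

No machine learning is involved here. Modern NLP systems learn their responses from data rather than from a hand-written table.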
The Typical Stages of a Machine Learning Project:
Machine Learning projects tend to follow a similar sequence of steps. Understanding this high-level model will make your learning process easier. The typical stages of a machine learning or data science project are as follows:
- Acquire the dataset. Common approaches include: fetching from a database via a SQL query, importing a computer file or downloading data from the web.
- Clean the data. Remove or fill in missing values, reformat columns and exclude unnecessary information.
- Exploratory data analysis. Examine the data, produce summary statistics and preliminary plots to enable idea generation.
- Decide which model design best fits your data.
- Build a version of the model using an algorithm from an imported library. For example: building a linear model or classifier.
- Train the predictive model based on your data (train set).
- Test how well it performs on new data it has never seen (test set).
- Tweak the hyperparameters to increase the model’s performance.
- Perform data storytelling. Produce plots and share the results of your final model.
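The stages above can be sketched end to end with Pandas and Scikit-learn. This is one possible walkthrough under stated assumptions, not a definitive recipe: it uses the built-in iris dataset in place of a real data source, a logistic regression classifier as the model, and a small grid of values for the `C` hyperparameter. Here the tuning step uses cross-validation on the train set, so the test set is only used for final scoring.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Acquire the dataset (a built-in example stands in for a SQL query or file).
df = load_iris(as_frame=True).frame

# 2. Clean the data (this dataset has no missing values, so this is a no-op).
df = df.dropna()

# 3. Exploratory data analysis: summary statistics for every column.
summary = df.describe()

# 4-5. Choose and build a model: scaled features fed to a linear classifier.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 6. Train on the train set.
model.fit(X_train, y_train)

# 7. Test on data the model has never seen.
baseline_accuracy = model.score(X_test, y_test)

# 8. Tweak a hyperparameter using cross-validation on the train set only.
search = GridSearchCV(model, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)

# 9. Report the results (plots would normally accompany these numbers).
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Tuned accuracy:    {tuned_accuracy:.2f} with {search.best_params_}")
```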
Field overview:
Consider these very helpful articles and YouTube videos to get a deeper dive on the data science profession and interesting machine learning projects.
I recommend reading these articles and watching these videos before starting your own learning plan (aka an Autodidact Project).
Recommended Courses and Projects (8):
Consider taking these courses — in this sequence. Feel free to modify your own Autodidact Project.
(1) Intro to Tableau
Purpose: Study how to use Tableau, a simple program for creating a series of visual and interactive workbooks.
Why?
It makes both data exploration (preliminary analysis) and the reporting of results (post-analysis) easier.
Workbooks can be easily published online and used for data storytelling.
Format: Online class, Udacity
Link:
Length: 1–2 weeks
Difficulty: Easy
Cost: Free
(2) Practical Statistics for Data Scientists
Purpose: A comprehensive walkthrough of data science, starting from the basics of the statistics you need to know. It progresses to the main categories of machine learning in solid detail: (a) regression models, (b) classification models and (c) unsupervised learning.
The first 3 chapters cover statistics. The last 4 chapters cover machine learning principles.
Format: Book, available on Amazon
Link:
https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential/dp/1491952962
Length: 7 chapters, 284 pages (1–2 months)
Difficulty: Medium
Cost: $25
(3) Python for Data Analysis
These are the core data science Python libraries you should gain more proficiency with:
- NumPy: how to form N-dimensional arrays for extremely fast data processing
- Matplotlib: how to make basic graphs and plots using Python
- Pandas: how to load data from CSV files, Excel files and other 2D data with rows and columns into a DataFrame data structure
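For a first taste of two of these libraries, here is a short sketch. It builds a 2D NumPy array and applies arithmetic to every element at once, then loads the same numbers from CSV text into a Pandas DataFrame. The city names and temperatures are made up for illustration, and plotting with Matplotlib is left out so the example runs without a display.

```python
import io

import numpy as np
import pandas as pd

# NumPy: a 2D array (2 cities x 3 days) of hypothetical temperatures in Celsius.
temps_c = np.array([[20.0, 22.5, 19.0],
                    [25.0, 27.5, 24.0]])

# Arithmetic applies to every element at once, with no explicit loop.
temps_f = temps_c * 9 / 5 + 32
city_means = temps_c.mean(axis=1)  # average across days for each city

# Pandas: load the same data from CSV text into a DataFrame.
csv_text = """city,day,temp_c
Oslo,1,20.0
Oslo,2,22.5
Oslo,3,19.0
Lima,1,25.0
Lima,2,27.5
Lima,3,24.0
"""
df = pd.read_csv(io.StringIO(csv_text))
mean_by_city = df.groupby("city")["temp_c"].mean()

print(temps_f)
print(mean_by_city)
```

In practice, `pd.read_csv` would be given a file path rather than in-memory text.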
Format: Book, available on Amazon
Link:
https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/
Length: 12 chapters, 466 pages
Difficulty: Medium
Cost: $29
(4) Any Machine Learning workshop
Pick any class that gives you hands-on practice creating and using your own machine-learning model.
Format: Online
Length: 1–2 months
Difficulty: Hard
Cost: Varies
(5) Generate a Machine Learning portfolio project!
Build any project that interests you!
Consider publishing it on a GitHub account and listing it on your resume.
Free Udacity class: How To Use GitHub
For inspiration, look at projects on Kaggle.com.
(6) Deep Learning with Python
Purpose:
Deep Learning is a subset of Machine Learning.
Machine Learning is typically focused on building a single model (regression, classification, etc.) to form a prediction.
Deep Learning stacks layers of simpler models and equations, which together depend on thousands (often millions) of parameters that are optimized during training. It is a tool with many applications in image recognition, speech recognition and understanding extraordinarily complex data.
Study Artificial Neural Networks, including recurrent neural networks and convolutional neural networks.
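To give a feel for what "layers" and "parameters" mean, here is a minimal NumPy sketch of a two-layer neural network making predictions. The layer sizes are arbitrary assumptions and the weights are random rather than trained, so the outputs are meaningless. The point is only to show the structure and to count the parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Arbitrary sizes chosen for illustration.
n_features, n_hidden, n_classes = 8, 16, 3

# Each layer is a weight matrix plus a bias vector. These are the parameters
# that training would optimize; here they are simply random or zero.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)


def forward(x):
    """Pass a batch of inputs through both layers and return class probabilities."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # layer 1 with ReLU activation
    logits = hidden @ W2 + b2              # layer 2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # softmax


batch = rng.normal(size=(4, n_features))  # 4 made-up input rows
probabilities = forward(batch)
n_parameters = W1.size + b1.size + W2.size + b2.size

print(probabilities.shape, n_parameters)
```

Even this tiny network has 195 parameters. Libraries such as Keras or PyTorch handle the training step that is omitted here.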
Format: Book, available on Amazon
Link:
https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438/
Length: 2–3 months
Difficulty: Hard
Cost: $32
(7) Practical SQL in 10 minutes
Purpose:
The SQL programming language is used for fetching large amounts of data from databases.
Data is often structured in relational tables. These tables have column names with set data types, character limits and formatting.
Learn how to build SQL queries to fetch any data you need for data science, easily.
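Here is a small sketch of what such a query looks like, run from Python with the built-in `sqlite3` module against an in-memory database. The `loans` table, its columns and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A relational table: named columns, each with a set data type.
conn.execute(
    """
    CREATE TABLE loans (
        id INTEGER PRIMARY KEY,
        borrower TEXT NOT NULL,
        amount REAL NOT NULL,
        defaulted INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO loans (borrower, amount, defaulted) VALUES (?, ?, ?)",
    [("Ana", 5000.0, 0), ("Ben", 12000.0, 1), ("Cy", 7500.0, 0), ("Dee", 20000.0, 1)],
)

# Fetch only the rows and columns needed, largest loans first.
defaulted_loans = conn.execute(
    "SELECT borrower, amount FROM loans WHERE defaulted = 1 ORDER BY amount DESC"
).fetchall()

# Aggregate: average loan amount for each outcome.
average_by_outcome = conn.execute(
    "SELECT defaulted, AVG(amount) FROM loans GROUP BY defaulted ORDER BY defaulted"
).fetchall()

conn.close()
print(defaulted_loans)
print(average_by_outcome)
```

The same `SELECT ... FROM ... WHERE` pattern carries over to larger databases such as PostgreSQL or MySQL, though connection details differ.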
Format: Book, available on Amazon
Link:
Length: 1–2 months, 23 short chapters
Difficulty: Medium
Cost: $28
(8) Build one more deep learning project, to place in your portfolio
Other / bonus preparation:
Use these additional learning resources to further your study and career advancement.
Database and Data Warehouse Design
Study how data is organized and stored.
Kaggle.com — a collection of data science competitions
How do I find interesting practice problems?
Andrew Ng’s famous Machine Learning Course — Coursera
Probably the world’s most famous intro to machine learning course
Natural Language Processing (NLP) book
How do I process textual data?
Use this for more practice and learning. Lots of helpful resources here.
Best of luck on this journey! I hope you enjoy it and build projects that are exciting and meaningful for you.
Feel free to comment below with your thoughts and Tweet me at @sivx76 or @autodidactproj as we post a new learning resource, book review, study guide and much more every week.