Guide To Breaking Into Data Science

What data science is, why it matters and how to learn it

Ben Alemu
7 min read · Jan 15, 2020

The purpose of this article is to concisely explain what data science is and why it is useful, and to recommend some of the best resources for learning the discipline.

Data Science is the quantitative science of processing large amounts of unstructured data to form clearer insights and organized information.

Data may be considered “small”, such as a typical spreadsheet or computer file, or “big”, meaning beyond the processing ability of a single computer (such as terabytes of data).

The goal is to condense all this noise to form insights that humans can understand. Sample tasks in data science include exploratory data analysis, predictive modeling, visualizations, statistical modeling and much more.

See below for much more detail on common important applications of data science, via machine learning and other approaches.

The most popular programming languages for data science are Python and R. Many open-source libraries provide ready-made functions for manipulating data, which makes common tasks easier to perform.

Common data science libraries include NumPy, Pandas, Matplotlib and several Machine Learning libraries like Scikit-learn, TensorFlow, PyTorch, Keras and others.

Machine Learning models can be grouped into basic categories based on their intended functions.

This is part of a larger series known as the Autodidact Project. For more information about this journey and to join — see this article:

Common Real-World Cases of Machine Learning:

Here are some uses of machine learning and deep learning models:

Machine Learning algorithms:

  • Predicting future stock prices based on past trading data (an example of a regression task, which produces a quantitative output, e.g. $723.11)
  • Predicting whether someone will default on a bank loan (an example of a classification task, which here produces a binary output, e.g. Yes or No)
  • Clustering data into related groups. Which data records share similar qualities, and what do they have in common?
  • Reducing the dimensionality of multi-omics medical data to its most important characteristics (PCA, principal component analysis)
  • Detecting whether an event is unusual (anomaly detection)
Plot of a regression model.
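
To make the difference between the first two tasks concrete, here is a minimal sketch using Scikit-learn. The data is synthetic and invented purely for illustration (it is not stock or loan data), and the two model choices, LinearRegression and LogisticRegression, are just one reasonable pair among many.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (made up for illustration): one feature, 50 rows.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))

# Regression: the target is a number (roughly 3 * x + 5, plus noise).
y_number = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=50)
regressor = LinearRegression().fit(X, y_number)
predicted_value = regressor.predict([[4.0]])[0]

# Classification: the target is a yes/no label (1 when x > 5, else 0).
y_label = (X[:, 0] > 5).astype(int)
classifier = LogisticRegression().fit(X, y_label)
predicted_class = classifier.predict([[9.0]])[0]

print(f"Regression output (a quantity): {predicted_value:.2f}")
print(f"Classification output (a label): {predicted_class}")
```

The regression model returns a continuous value, while the classifier returns one of a fixed set of labels.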

Deep Learning algorithms:

  • Understanding human handwriting (image recognition)
  • Generating speech from text (speech synthesis)
  • Teaching a computer how to play and beat a video game (reinforcement learning)
A generalized mental model for distinguishing between AI, ML and Deep Learning.

Decision-making algorithms:

  • YouTube, Netflix, Amazon search recommendations based on your video views or shopping history. What is a user likely to want to watch or buy based on their past viewing history or shopping cart?
  • Creating a decision tree. Building a visual and human-interpretable series of if-then statements.
Example of a decision tree.
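
As a rough sketch of what "human-interpretable if-then statements" can look like in practice, the snippet below fits a small decision tree with Scikit-learn and prints its rules as text. The loan records, the feature names (income, debt) and the labels are all hypothetical values invented for this example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan records: [annual income, existing debt], both in thousands.
X = [[20, 15], [25, 20], [30, 5], [60, 10],
     [80, 40], [90, 5], [40, 30], [100, 20]]
# 1 = defaulted, 0 = repaid (labels invented for illustration).
y = [1, 1, 0, 0, 1, 0, 1, 0]

# Limit the depth so the resulting rules stay short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if-then conditions.
rules = export_text(tree, feature_names=["income", "debt"])
print(rules)
```

Each indented line of the printed output is one condition, and each leaf ends in a predicted class, so the model can be read from top to bottom like a checklist.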

Natural Language Processing (NLP):

  • Conversational agents (e.g. the iPhone’s Siri personal assistant can understand some of our spoken requests)
  • Translating between languages (e.g. Google Translate)
  • Sentiment analysis (e.g. is this a happy or upset Yelp review?)
  • Building a chatbot (a computer listens and responds “on the fly” with pre-programmed responses)

The Typical Stages of a Machine Learning Project:

Machine Learning projects tend to follow a similar sequence of steps. Understanding this high-level model will make your learning process easier. The typical stages of a machine learning or data science project are as follows:

  1. Acquire the dataset. Common approaches include: fetching from a database via a SQL query, importing a computer file or downloading data from the web.
  2. Clean the data. Remove or fill in missing values, reformat columns and exclude unnecessary information.
  3. Exploratory data analysis. Examine the data, produce summary statistics and preliminary plots to enable idea generation.
  4. Decide which model design best fits your data.
  5. Build a version of the model using an algorithm from an imported library. For example: building a linear model or classifier.
  6. Train the predictive model based on your data (train set).
  7. Test how well it performs on new data it has never seen (test set).
  8. Tweak the hyperparameters to increase the model’s performance.
  9. Perform data storytelling. Produce plots and share the results of your final model.
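
The stages above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the dataset is synthetic (generated with Scikit-learn's make_classification instead of a SQL query or file), the column names f1 to f4 are made up, and logistic regression is an arbitrary model choice.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Acquire: synthetic data stands in for a database query or file import.
features, target = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, random_state=0,
)
df = pd.DataFrame(features, columns=["f1", "f2", "f3", "f4"])
df["label"] = target
df.loc[[3, 17, 42], "f2"] = np.nan  # simulate a few missing values

# 2. Clean: drop the rows that contain missing values.
df = df.dropna().reset_index(drop=True)

# 3. Explore: summary statistics for every column.
summary = df.describe()

# 4-5. Choose and build a model: here, a logistic regression classifier.
X = df.drop(columns="label")
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 6 and 8. Train while trying several values of the hyperparameter C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5
)
search.fit(X_train, y_train)

# 7. Test on data the model has never seen.
test_accuracy = search.score(X_test, y_test)

# 9. Report the results.
print(summary.loc[["mean", "std"]])
print("Best C:", search.best_params_["C"])
print(f"Test accuracy: {test_accuracy:.2f}")
```

One design choice worth noting: the hyperparameter search here uses cross-validation on the training set only, so the test set stays untouched until the final score is computed.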

Field overview:

Consider these very helpful articles and YouTube videos for a deeper dive into the data science profession and interesting machine learning projects.

I recommend reading these articles and watching these videos before starting your own learning plan (aka an Autodidact Project).

Recommended Courses and Projects (8):

Consider taking these courses — in this sequence. Feel free to modify your own Autodidact Project.

(1) Intro to Tableau

Purpose: Study how to use Tableau, a simple program for creating a series of visual and interactive workbooks.

Why?

It makes both data exploration (preliminary analysis) and the reporting of results (post-analysis) easier.

Workbooks can be easily exported online and used for data storytelling.

Format: Online class, Udacity

Link:

Length: 1–2 weeks

Difficulty: Easy

Cost: Free

(2) Practical Statistics for Data Science

Purpose: A comprehensive walkthrough of data science, starting from the basics of the relevant statistics you need to know. It progresses to the main categories of machine learning in solid detail: (a) regression models, (b) classification models and (c) unsupervised learning.

The first 3 chapters cover statistics. The last 4 chapters cover machine learning principles.

Format: Book, available on Amazon

Link:

https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential/dp/1491952962

Length: 7 chapters, 284 pages (1–2 months)

Difficulty: Medium

Cost: $25

(3) Python for Data Analysis

These are the core data science Python libraries you should gain more proficiency with:

  • NumPy: how to form N-dimensional arrays for extremely fast data processing
  • Matplotlib: how to make basic graphs and plots using Python
  • Pandas: how to load CSV files, Excel files and other 2D data with rows and columns into a DataFrame data structure
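
Here is a small taste of each of the three libraries. The temperature values, city names and CSV text are invented for illustration, and the plot is written to an in-memory buffer only so the sketch runs without a display or leftover files.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render without needing a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: an N-dimensional array with fast, element-wise arithmetic.
temps_c = np.array([[18.0, 21.5, 19.0],
                    [22.0, 25.5, 24.0]])  # 2 cities x 3 days
temps_f = temps_c * 9 / 5 + 32  # applied to every element at once

# Pandas: load CSV text into a DataFrame (a file path works the same way).
csv_text = "city,day,temp_c\nOslo,1,18.0\nOslo,2,21.5\nRome,1,22.0\nRome,2,25.5\n"
df = pd.read_csv(io.StringIO(csv_text))
mean_by_city = df.groupby("city")["temp_c"].mean()

# Matplotlib: a basic line plot, saved as a PNG image.
oslo = df[df["city"] == "Oslo"]
fig, ax = plt.subplots()
ax.plot(oslo["day"], oslo["temp_c"], marker="o", label="Oslo")
ax.set_xlabel("Day")
ax.set_ylabel("Temperature (C)")
ax.legend()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # pass a filename instead to write to disk
plt.close(fig)

print(temps_f)
print(mean_by_city)
```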

Format: Book, available on Amazon

Link:

https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/

Length: 12 chapters, 466 pages

Difficulty: Medium

Cost: $29

(4) Any Machine Learning workshop

Pick any class that gives you hands-on practice creating and using your own machine-learning model.

Format: Online

Length: 1–2 months

Difficulty: Hard

Cost: Varies

(5) Generate a Machine Learning portfolio project!

Build any project that interests you!

Consider placing it on a GitHub account and listing it on your resume.

Free Udacity class on How To Use GitHub — Link

For inspiration — look at projects at Kaggle.com

(6) Deep Learning with Python

Purpose:

Deep Learning is a subset of Machine Learning.

Machine Learning is focused on building one model (regression, classification, etc.) to form a prediction.

Deep Learning stacks layers of multiple models and equations, which operate on thousands of parameters (often far more) that are optimized during training. Deep Learning is a tool with many applications in image recognition, speech recognition and understanding extraordinarily complex data.

Study Artificial Neural Networks, including recurrent neural networks and convolutional neural networks.
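
To give a feel for what "layers" and "parameters" mean, here is a minimal sketch of a two-layer neural network's forward pass written with plain NumPy. The layer sizes are arbitrary and the weights are random and untrained, so the outputs are meaningless. In a real project a library such as Keras or PyTorch would build the layers and optimize the weights for you.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    """Hidden-layer activation: negative values become zero."""
    return np.maximum(0, x)


def sigmoid(x):
    """Output activation: squashes any number into the range (0, 1)."""
    return 1 / (1 + np.exp(-x))


# Layer sizes are arbitrary choices for illustration.
n_inputs, n_hidden, n_outputs = 4, 8, 1

# The parameters: one weight matrix and one bias vector per layer.
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)


def forward(x):
    """Pass a batch of inputs through both layers."""
    hidden = relu(x @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)


batch = rng.normal(size=(3, n_inputs))  # 3 made-up examples
outputs = forward(batch)
n_parameters = W1.size + b1.size + W2.size + b2.size

print("Output shape:", outputs.shape)
print("Number of parameters:", n_parameters)
```

Even this toy network has 49 parameters. Adding layers or widening them grows that count quickly, which is why training deep models needs a lot of data and compute.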

Format: Book, available on Amazon

Link:

https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438/

Length: 2–3 months

Difficulty: Hard

Cost: $32

(7) Practical SQL in 10 minutes

Purpose:

The SQL programming language is used for fetching large amounts of data from databases.

Data is often structured in relational tables. These tables have column names with set data types, character limits and formatting.

Learn how to build SQL queries to fetch any data you need for data science, easily.
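
As a small illustration, the sketch below uses Python's built-in sqlite3 module to create a relational table with typed columns and then fetch rows with a SQL query. The customers table, its columns and its rows are hypothetical, and an in-memory SQLite database stands in for a real production database.

```python
import sqlite3

# An in-memory database stands in for a real database server.
connection = sqlite3.connect(":memory:")

# Columns have names and declared data types.
connection.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city VARCHAR(50),
        balance REAL
    )
    """
)

# Insert a few made-up rows.
connection.executemany(
    "INSERT INTO customers (name, city, balance) VALUES (?, ?, ?)",
    [("Ana", "Oakland", 120.5), ("Ben", "Fresno", 80.0), ("Cy", "Oakland", 310.0)],
)

# Fetch only the rows and columns needed, sorted by balance.
rows = connection.execute(
    "SELECT name, balance FROM customers WHERE city = ? ORDER BY balance DESC",
    ("Oakland",),
).fetchall()
connection.close()

print(rows)
```

The SELECT statement names the columns to return, WHERE filters the rows, and ORDER BY sorts the result. The same pattern carries over to most other relational databases.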

Format: Book, available on Amazon

Link:


Length: 1–2 months, 23 short chapters

Difficulty: Medium

Cost: $28

(8) Build one more deep learning project, to place in your portfolio

Other / bonus preparation:

Use these additional learning resources to further your study and career advancement.

Database and Data Warehouse Design

Study how data is organized and stored.

Kaggle.com — a collection of data science competitions

How do I find interesting practice problems?

Andrew Ng’s famous Machine Learning Course — Coursera

Probably the world’s most famous intro to machine learning course

Natural Language Processing (NLP) book

How do I process textual data?

Deeplearning.ai

Use this for more practice and learning. Lots of helpful resources here.

Fast.ai

Use this for more practice and learning. Lots of helpful resources here.

Best of luck on this journey! I hope you enjoy it and build projects that are exciting and meaningful for you.

Feel free to comment below with your thoughts and tweet me at @sivx76 or @autodidactproj. We post a new learning resource, book review, study guide and much more every week.


Ben Alemu

Benyam Alemu Sood works as a Mobile App Developer. He teaches iOS Class at Stanford CSP. He & his wife consult for Fortune 1000 companies and small businesses.