How I Got Started in Data Science

Caitlin French
Published in The Startup · 11 min read · Sep 16, 2020

Contents

  1. Why Learn Data Science?
  2. What Exactly is Machine Learning (ML) in Data Science?
  3. An ML ‘Syllabus’
  4. Online Courses
  5. Step by Step ML Process
  6. Example ML Algorithms
  7. ML Applications
  8. Taking this Further

1. Why Learn Data Science?

Data Science is a skill set formed from multiple disciplines including maths, statistics, programming, and business knowledge, which can be applied to a broad range of problems. It is a field for organising large data sets, analysing data and coding solutions to address business challenges.

Data Science — the intersection of maths/statistics, computer science and business domain knowledge

With this knowledge, you could:

  • Build a music recommendation system like Spotify’s ‘Discover Weekly’
  • Predict future stock prices
  • Build a facial recognition system

Not experienced in these fields? No problem! Now it’s easier than ever to learn Data Science, with a wealth of online resources, Medium articles and YouTube tutorials!

Just a note about this article: it is quite a detailed overview — you don’t need to read the whole thing at once! Treat this as a guide to keep checking back to as you learn, and feel free to skip bits you don’t need or are already familiar with!

2. What Exactly is Machine Learning (ML) in Data Science?

Machine learning is where a computer takes a series of inputs, learns patterns, and produces outputs. Returning to our examples:

  • Spotify: input features to describe the songs you like (e.g. time signature, key, lyrics), learn patterns (often drawing on patterns from listeners similar to you), output your ‘Discover Weekly’ playlist
  • Predict future stock prices: input previous stock market data, detect underlying trends, output prediction
  • Build a facial recognition system: input an image of someone’s face, compare it to a stored database of faces, identify the person

In order to do these things, a computer must first learn from training data, and then be checked using test data. In supervised learning (explained below), you provide input (x) and output (y) data in both cases.

When starting out, there are 2 broad categories of ML, supervised and unsupervised:

Types of ML — an overview

Supervised — labelled training data. The algorithm learns the rules connecting an input to a given output, and uses those rules to make predictions, e.g. using data on which past job applicants got hired (each example labelled yes/no for whether the applicant was hired) to decide whether to hire a new applicant.

Generally 5 variables to deal with:

- x_train: features based on someone’s CV e.g. number of GCSEs, number of job requirements matched

- y_train: binary (yes/no) labels for whether or not someone got hired

- x_test: same as x_train but for new data

- y_pred: the predicted output when the model is trained on x_train and y_train, and then fed the input data x_test

- y_test: same as y_train but for the new data. y_test is compared to y_pred to evaluate how well the model performs
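
To make these five variables concrete, here is a minimal sketch of the hiring example with scikit-learn; the feature values, labels, and the choice of LogisticRegression are illustrative assumptions, not from the original:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical CV features: [number of GCSEs, job requirements matched]
X = [[8, 3], [10, 5], [5, 1], [9, 4], [6, 2], [11, 5]]
y = [1, 1, 0, 1, 0, 1]  # 1 = hired, 0 = not hired

# Hold back a third of the applicants as unseen test data
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

model = LogisticRegression()
model.fit(x_train, y_train)     # learn the rules from labelled examples
y_pred = model.predict(x_test)  # predictions for the unseen applicants
print(y_pred, y_test)           # compare predictions to the true labels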

Supervised Learning — Regression vs Classification

Unsupervised — unlabelled training data. The algorithm finds structure and patterns in the inputs on its own, e.g. clustering types of customer based on demographic info and their spending habits. There are no labels because the customer segments are not yet known in this example.

Generally 2 variables:

- X: features about a customer

- y: the model’s output (e.g. cluster assignments) once trained on the X data

Unsupervised Learning — Clustering in 3D, with 3 features used to identify types of flower
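
A minimal sketch of clustering with scikit-learn, using the built-in iris flower measurements as a stand-in dataset (the choice of KMeans and of 3 clusters is an illustrative assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, :3]  # three flower measurements as features

# n_clusters is a choice you make up front, not something the model learns
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y = kmeans.fit_predict(X)    # cluster assignment for each flower
print(y[:10])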

There are also a number of other types of ML. A common starting point is a simple artificial neural network (ANN), loosely modelled on the brain: connections between neurons strengthen the more often they fire. Beyond that there are reinforcement learning, natural language processing (NLP), convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), and many more!

A simple neural network

3. An ML ‘Syllabus’

This is the most comprehensive video I’ve found so far on how to get started. These steps are outlined in the video:

1. Install Python/R and the relevant libraries

It’s up to you which: R was developed for statisticians, whilst Python is more general-purpose and the most popular option.

Let’s go with Python. Download Anaconda, which comes with Jupyter Notebook, a common environment for data science work. You can find lots of useful Python libraries on PyPI. You will want to install:

  • numpy — indexing, basic operations on arrays, reshaping, broadcasting arrays
  • pandas — dataframes, series, feature engineering. For more advanced pandas tips, see this video
  • Visualisation libraries: matplotlib, seaborn
  • Also helpful: scikit-learn — for ML models. Later on you might want tensorflow, keras, pytorch, but leave these for the moment
  • Just for fun: geopandas — for plotting maps and spatial coordinates
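
Once installed (with conda or pip), these conventional import aliases are used almost everywhere, so it’s worth adopting them from the start:

# Standard aliases seen in nearly every data science notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns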

2. Statistics

  • Mean, median, mode
  • Normal (Gaussian) and standard normal distributions
  • Correlations

No need to learn formulae by heart. You mostly want a general understanding of the data you’ve got and how its features are related.
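
In practice, pandas computes these statistics directly; a quick sketch on a small made-up dataframe:

import pandas as pd

df = pd.DataFrame({"height": [150, 160, 165, 170, 180],
                   "weight": [50, 60, 62, 70, 80]})

print(df.mean(), df.median(), df.mode(), sep="\n")  # central tendency
print(df.describe())  # count, mean, std, quartiles for each feature
print(df.corr())      # pairwise correlations between features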

3. Exploratory Data Analysis (EDA)

  • Understanding the features in a data set
  • Data scaling — MinMax, Standard, LogNormal

4. Understanding ML algorithms, focussing on

  • Intuition — watch videos with good diagrams of what’s going on
  • Implementation with Python libraries — you don’t need to code all the inner workings of an ML model yourself; someone has already done it for you!

5. Deployment

  • Cloud computing deployment: AWS (Amazon Web Services), GCP (Google Cloud Platform), Microsoft Azure. You may need to pay for these services, but there are free online tutorials to get started
  • Flask & Django — Python frameworks to turn your model into an interactive web interface. Flask is easier to get started with than Django

6. Databases

  • SQL for structured data (in table format). Click here for a SQL tutorial website
  • MongoDB for unstructured data (e.g. in json format)
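
To get a feel for SQL without setting up a database server, Python’s built-in sqlite3 module is enough; the table and values below are made up:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Alice", 120.5), ("Bob", 80.0)])

# A structured query over the table
for row in conn.execute("SELECT name, spend FROM customers WHERE spend > 100"):
    print(row)  # ('Alice', 120.5)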

7. Other visualisation software

  • Tableau — free on Tableau Public
  • PowerBI — paid
  • Qlik Sense — free trial

4. Online Courses

Next I opted for these two Udemy courses on Machine Learning and Deep Learning. You can find the syllabus info for these courses here and here. Note there is some overlap between the courses. Only buy them when they’re on offer (about £15 each; try an incognito tab or check back regularly, as Udemy often has sales on). They give a great overview of different ML techniques.

5. Step by Step Machine Learning (ML) Process

Step by step ML process

Start with a dataset — this can be provided for you on Kaggle, or you can use other online datasets, APIs, or collect your own data. You can start with an Excel file or CSV. For unstructured data, use JSON. For large files, HDF files are helpful. If you want, you can query your data from a database and use it directly (see the loading sketch below).

  • You will need a sufficient amount of data to be able to train a model — a good rule of thumb is roughly 10 times as many data points as features (inputs) in your model. For example, if you have 4 features in a salary prediction algorithm (job sector, role title, years of experience, performance review score), you will want data on at least 40 people
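
Loading each of those formats with pandas looks like this; the file names are placeholders (and the Excel/HDF readers need the optional openpyxl/tables packages installed):

import pandas as pd

df = pd.read_csv("data.csv")           # plain tables
df = pd.read_excel("data.xlsx")        # Excel workbooks
df = pd.read_json("data.json")         # semi-structured records
df = pd.read_hdf("data.h5", key="df")  # large binary files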

Data cleaning — Check how much data you actually have and its quality

  • For each feature, what % of the data has a NaN (missing or unknown) value? You can often handle NaN values by replacing the NaN with the mean

# Replace missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

  • Check for and potentially remove anomalies — anomalies may skew the data when calculating the mean or training a general model, but may be useful if the purpose of your model is to spot these anomalies (e.g. in a fraud detection system aiming to spot unusual behaviour)
  • Ensure that the data you have is of the correct type (sometimes numbers get stored as strings and need to be converted back to int/float)

Feature engineering — decide which features will go into your model

  • You can create new features by manipulating the data e.g. for a car, take the ratio of fuel usage to distance travelled, or for a time series problem, calculate rolling averages or find the mean/standard deviation/frequency as features
  • Dimensionality reduction — cutting down on the number of features you’ve got. For this you can use Principal Component Analysis (PCA): this method lets you see which variables are most important in explaining the variance of the dataset
  • If you have any categorical variables (e.g. favourite colour), you will want to OneHotEncode them, meaning split them into separate columns for each category (e.g. red, yellow, blue), each taking a value of 0 or 1 (see the sketch after this list)
  • You can produce correlation and covariance plots. If two features are correlated, you may not want to include them both in the model as the second feature will be redundant (e.g. how many products you sell, how much profit you make from that product type). This could result in over-fitting, where a model works very well on train data but not on test data

Correlation Matrix — red indicates positive correlation, blue indicates negative correlation
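
A sketch of one-hot encoding and PCA together, using pandas and scikit-learn; the ‘colour’ column and the numbers are made up for illustration:

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"colour": ["red", "yellow", "blue", "red"],
                   "height": [1.2, 3.4, 2.2, 1.8],
                   "width": [0.7, 1.1, 0.9, 0.8]})

# One-hot encode the categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["colour"])

# Project the features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(df)
print(pca.explained_variance_ratio_)  # share of variance each component explains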

Next is the train-test split. Split the data (usually randomly, but chronologically for time-series problems), typically with an 80:20 train:test ratio.

from sklearn.model_selection import train_test_split

# 80:20 split; random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  • You’ll need to feature scale the data, which often means normalising it first (if a feature has a skewed distribution, you can take log(x) to normalise it) and then scaling. There are many types of scaling, but a common one is min-max scaling between 0 and 1. This is so that the raw magnitude of a feature doesn’t affect the model too much (e.g. when looking at car safety, a speed of 60mph shouldn’t be weighted more highly than an acceleration of 2m/s^2 just because speed values generally tend to be larger numbers)

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn the mean/std from the train set only
X_test = sc_X.transform(X_test)        # reuse those train-set statistics on the test set

Comparing different feature scaling methods

  • With this normalisation fitted on the train set, you’ll want to apply the same normalisation separately to the test set, as in the snippet above. Don’t scale all the data together or there will be data leakage effects, where knowledge from the test set creeps into the train set so the ML model can cheat

Apply the learning algorithm (examples outlined below) to the train set, using:

model.fit(X_train, y_train)

Then apply the model to the test set:

y_pred = model.predict(X_test)

  • You can then make a plot of the output, or compare y_pred to y_test — there are a number of metrics for model scoring, which use these terms:

True Positive (TP) — e.g. is ill and tests positive (correct)

False Negative (FN) — e.g. is ill but tests negative (incorrect) — a type 2 error (often the more serious kind here)

False Positive (FP) — e.g. is not ill but tests positive (incorrect) — a type 1 error (usually less serious)

True Negative (TN) — e.g. is not ill and tests negative (correct)

  • There are lots of metrics you can use, as shown in the diagrams: accuracy is a good starting point, and then either precision or recall matters more depending on the context of your model

ML model evaluation metrics — precision & recall

  • It can also be helpful to make a confusion matrix, formed from the 4 white squares in the middle of the diagram below, with each square containing a count. In general you want high numbers of data points in TP and TN, and low numbers in FP and FN

Confusion matrix to show the numbers of TP, TN, FP, FN values
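
scikit-learn computes all of these from y_test and y_pred; a short sketch with made-up labels:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_test = [1, 0, 1, 1, 0, 0, 1, 0]  # true labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (illustrative)

print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_test, y_pred))    # (TP + TN) / total
print(precision_score(y_test, y_pred))   # TP / (TP + FP)
print(recall_score(y_test, y_pred))      # TP / (TP + FN)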

Evaluate the model — try out a few different models (outlined below) and compare their confusion matrices and metrics to see which performs best; the right choice depends on the problem you are trying to solve.

  • Also try some k-fold cross validation: split the data into k folds, train on k−1 of them and validate on the remaining fold, rotating so that each fold is held out once. If the model scores similarly across folds, it didn’t just get lucky with one particular train-test split

K-fold cross validation
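
A sketch with scikit-learn’s cross_val_score, using the built-in iris data as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
model = LogisticRegression(max_iter=1000)

# 5 folds: each fold takes one turn as the held-out validation set
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())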

6. Example ML Algorithms

Within the ML categories, we have a number of different algorithms and libraries:

Types of ML algorithms

To discuss some of these further:

Regression — find the straight line or curve that best fits the data (see the sketch after this list)

  • Linear — find the coefficients of the relationship: y = b0 + b1x1
  • Polynomial — find the coefficients of the relationship: y = b0 + b1x1 + b2x1^2 + … + bnx1^n
  • Decision Tree — find the hidden rules that determine an outcome

Decision Tree for deciding whether or not to grant a loan
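
As a sketch, here is a polynomial regression fitted with scikit-learn; the x and y values are made up (roughly quadratic) for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x1 = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.3, 9.2, 16.1, 24.8])  # made-up, roughly quadratic values

# Expand x1 into [x1, x1^2] so a linear model can fit the polynomial coefficients
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x1)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # b0, then b1 and b2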

Random Forest — an ensemble learning method combining results from many decision trees to produce a single output

Random Forest made up of Decision Trees
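
A sketch comparing a single decision tree with a random forest on synthetic data (make_classification is just a convenient scikit-learn data generator):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The ensemble of trees usually generalises better than any single tree
print(tree.score(X_test, y_test), forest.score(X_test, y_test))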

Classification

  • KNN — K-Nearest Neighbours — k is a parameter set by you. For each data point in a distribution, pick the k nearest points, and the data point in question gets assigned the same label as the majority of its k nearest neighbours (see the sketch after this list)
  • Logistic Regression — like linear regression but a sigmoid curve is used to produce an output between 0 and 1, which can be converted to a binary output by rounding up or down

Linear Regression vs Logistic Regression

  • Naïve-Bayes — a probabilistic classifier based on applying Bayes’ theorem with strong independence assumptions between features
  • SVM — Support Vector Machine — a supervised method for classification, based on finding the line or hyperplane that best separates the classes of data points, maximising the margin between them

Support Vector Machine
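
To illustrate the k parameter in KNN, a quick sketch; the points and the choice of k = 3 are arbitrary:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

# k = 3: a new point takes the majority label of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))  # [0 1]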

And here’s a higher-level flowchart for deciding which to use:

Deciding which ML method to use

7. ML Applications

There are a number of exciting applications of these types of ML:

Applications of different types of ML

8. Taking this Further

The easiest next step is probably Kaggle, which has example data sets, model solutions, and competitions to take part in!

There are a number of popular datasets online:

For further reading, these websites are great for tutorials or project ideas:

And you can go in search of your own data sets for more personalised projects:

Hope you found this helpful, and good luck in your Data Science journey!
