How I Got Started in Data Science

Caitlin French
Published in The Startup · 11 min read · Sep 16, 2020

Contents

  1. Why Learn Data Science?
  2. What Exactly is Machine Learning (ML) in Data Science?
  3. An ML ‘Syllabus’
  4. Online Courses
  5. Step by Step ML Process
  6. Example ML Algorithms
  7. ML Applications
  8. Taking this Further

1. Why Learn Data Science?

Data Science is a skill set formed from multiple disciplines including maths, statistics, programming, and business knowledge, which can be applied to a broad range of problems. It is a field for organising large data sets, analysing data and coding solutions to address business challenges.

Data Science — the intersection of maths/statistics, computer science and business domain knowledge

With this knowledge, you could:

  • Build a music recommendation system like Spotify’s ‘Discover Weekly’
  • Predict future stock prices
  • Build a facial recognition system

Not experienced in these fields? No problem! Now it’s easier than ever to learn Data Science, with a wealth of online resources, Medium articles and YouTube tutorials!

Just a note about this article: it is quite a detailed overview — you don’t need to read the whole thing at once! Treat this as a guide to keep checking back to as you learn, and feel free to skip bits you don’t need or are already familiar with!

2. What Exactly is Machine Learning (ML) in Data Science?

Machine learning is where a computer takes a series of inputs, learns patterns, and produces outputs. Returning to our examples:

  • Spotify: input features to describe the songs you like (e.g. time signature, key, lyrics), learn patterns (often drawing on patterns from listeners similar to you), output your ‘Discover Weekly’ playlist
  • Predict future stock prices: input previous stock market data, detect underlying trends, output prediction
  • Build a facial recognition system: input an image of someone’s face, compare it to a stored database of faces, identify the person

In order to do these things, a computer must first learn from training data, and then be checked using test data. In supervised learning (explained below), you provide input (x) and output (y) data in both cases.

When starting out, there are 2 broad categories of ML, supervised and unsupervised:

Types of ML — an overview

Supervised — labelled training data. The algorithm learns the rules connecting an input to a given output, and uses those rules to make predictions, e.g. using data on which past job applicants got hired (each example labelled yes/no for whether the applicant was hired) to decide whether to hire a new applicant.

Generally 5 variables to deal with:

- x_train: features based on someone’s CV e.g. number of GCSEs, number of job requirements matched

- y_train: binary (yes/no) labels for whether or not someone got hired

- x_test: same as x_train but for new data

- y_pred: the predicted output when the model is trained on x_train and y_train, and then fed the input data x_test

- y_test: same as y_train but for the new data. y_test is compared to y_pred to evaluate how well the model performs
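
To make these five variables concrete, here is a minimal sketch of the hiring example with scikit-learn; the feature values, labels, and the choice of LogisticRegression are illustrative assumptions, not from the original:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical CV features: [number of GCSEs, job requirements matched]
X = [[8, 3], [10, 5], [5, 1], [9, 4], [6, 2], [11, 5]]
y = [1, 1, 0, 1, 0, 1]  # 1 = hired, 0 = not hired

# Hold back a third of the applicants as unseen test data
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

model = LogisticRegression()
model.fit(x_train, y_train)     # learn the rules from labelled examples
y_pred = model.predict(x_test)  # predictions for the unseen applicants
print(y_pred, y_test)           # compare predictions to the true labels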

Supervised Learning — Regression vs Classification

Unsupervised — unlabelled training data. The algorithm finds structure and patterns in the inputs on its own, e.g. clustering types of customer based on demographic info and their spending habits. There are no labels because the customer segments are not yet known in this example.

Generally 2 variables:

- X: features about a customer

- y: the model’s output (e.g. cluster assignments) once trained on the X data

Unsupervised Learning — Clustering in 3D, with 3 features used to identify types of flower
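
A minimal sketch of clustering with scikit-learn, using the built-in iris flower measurements as a stand-in dataset (the choice of KMeans and of 3 clusters is an illustrative assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, :3]  # three flower measurements as features

# n_clusters is a choice you make up front, not something the model learns
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y = kmeans.fit_predict(X)    # cluster assignment for each flower
print(y[:10])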

There are also a number of other types of ML. A common starting point is a simple artificial neural network (ANN), loosely modelled on the brain: connections between neurons strengthen the more often they fire. Beyond that there are reinforcement learning, natural language processing (NLP), convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), and many more!

A simple neural network

3. An ML ‘Syllabus’

This is the most comprehensive video I’ve found so far on how to get started. These steps are outlined in the video:

1. Install Python/R and the relevant libraries

It’s up to you which: R was developed for statisticians, whilst Python is more general-purpose and the most popular option.

Let’s go with Python. Download Anaconda, which comes with Jupyter Notebook, a common environment for data science work. You can find lots of useful Python libraries on PyPI. You will want to install:

  • numpy — indexing, basic operations on arrays, reshaping, broadcasting arrays
  • pandas — dataframes, series, feature engineering. For more advanced pandas tips, see this video
  • Visualisation libraries: matplotlib, seaborn
  • Also helpful: scikit-learn — for ML models. Later on you might want tensorflow, keras, pytorch, but leave these for the moment
  • Just for fun: geopandas — for plotting maps and spatial coordinates
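
Once installed (with conda or pip), these conventional import aliases are used almost everywhere, so it’s worth adopting them from the start:

# Standard aliases seen in nearly every data science notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns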

2. Statistics

  • Mean, median, mode
  • Normal (Gaussian) and standard normal distributions
  • Correlations

No need to learn formulae by heart. You mostly want a general understanding of the data you’ve got and how its features are related.
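
In practice, pandas computes these statistics directly; a quick sketch on a small made-up dataframe:

import pandas as pd

df = pd.DataFrame({"height": [150, 160, 165, 170, 180],
                   "weight": [50, 60, 62, 70, 80]})

print(df.mean(), df.median(), df.mode(), sep="\n")  # central tendency
print(df.describe())  # count, mean, std, quartiles for each feature
print(df.corr())      # pairwise correlations between features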

3. Exploratory Data Analysis (EDA)

  • Understanding the features in a data set
  • Data scaling — MinMax, Standard, LogNormal

4. Understanding ML algorithms, focussing on

  • Intuition — watch videos with good diagrams of what’s going on
  • Implementation with Python libraries — you don’t need to code all the inner workings of an ML model yourself; someone has already done it for you!

5. Deployment

  • Cloud computing deployment: AWS (Amazon Web Services), GCP (Google Cloud Platform), Microsoft Azure. You may need to pay for these services, but there are free online tutorials to get started
  • Flask & Django — Python frameworks to turn your model into an interactive web interface. Flask is easier to get started with than Django

6. Databases

  • SQL for structured data (in table format). Click here for a SQL tutorial website
  • MongoDB for unstructured data (e.g. in json format)
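
To get a feel for SQL without setting up a database server, Python’s built-in sqlite3 module is enough; the table and values below are made up:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Alice", 120.5), ("Bob", 80.0)])

# A structured query over the table
for row in conn.execute("SELECT name, spend FROM customers WHERE spend > 100"):
    print(row)  # ('Alice', 120.5)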

7. Other visualisation software

  • Tableau — free on Tableau Public
  • PowerBI — paid
  • Qlik Sense — free trial

4. Online Courses

Next I opted for these two Udemy courses on Machine Learning and Deep Learning. You can find the syllabus info for these courses here and here. Note there is some overlap between the courses. Only buy them when they’re on offer (about £15 each; try an incognito tab or check back regularly, as Udemy often has sales on). They give a great overview of different ML techniques.

5. Step by Step Machine Learning (ML) Process

Step by step ML process

Start with a dataset — this can be provided for you on Kaggle, or you can use other online datasets, APIs, or collect your own data. You can start with an Excel file or CSV. For unstructured data, use JSON. For large files, HDF files are helpful. If you want, you can query your data from a database and use it directly (see the loading sketch below).

  • You will need a sufficient amount of data to be able to train a model — a good rule of thumb is roughly 10 times as many data points as features (inputs) in your model. For example, if you have 4 features in a salary prediction algorithm (job sector, role title, years of experience, performance review score), you will want data on at least 40 people
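
Loading each of those formats with pandas looks like this; the file names are placeholders (and the Excel/HDF readers need the optional openpyxl/tables packages installed):

import pandas as pd

df = pd.read_csv("data.csv")           # plain tables
df = pd.read_excel("data.xlsx")        # Excel workbooks
df = pd.read_json("data.json")         # semi-structured records
df = pd.read_hdf("data.h5", key="df")  # large binary files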

Data cleaning — Check how much data you actually have and its quality

  • For each feature, what % of the data has a NaN (missing or unknown) value? You can often handle NaN values by replacing the NaN with the mean

# Replace missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

  • Check for and potentially remove anomalies — anomalies may skew the data when calculating the mean or training a general model, but may be useful if the purpose of your model is to spot these anomalies (e.g. in a fraud detection system aiming to spot unusual behaviour)
  • Ensure that the data you have is of the correct type (sometimes numbers get stored as strings and need to be converted back to int/float)

Feature engineering — decide which features will go into your model

  • You can create new features by manipulating the data e.g. for a car, take the ratio of fuel usage to distance travelled, or for a time series problem, calculate rolling averages or find the mean/standard deviation/frequency as features
  • Dimensionality reduction — cutting down on the number of features you’ve got. For this you can use Principal Component Analysis (PCA): this method lets you see which variables are most important in explaining the variance of the dataset
  • If you have any categorical variables (e.g. favourite colour), you will want to OneHotEncode them, meaning split them into separate columns for each category (e.g. red, yellow, blue), each taking a value of 0 or 1 (see the sketch after this list)
  • You can produce correlation and covariance plots. If two features are correlated, you may not want to include them both in the model as the second feature will be redundant (e.g. how many products you sell, how much profit you make from that product type). This could result in over-fitting, where a model works very well on train data but not on test data

Correlation Matrix — red indicates positive correlation, blue indicates negative correlation
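
A sketch of one-hot encoding and PCA together, using pandas and scikit-learn; the ‘colour’ column and the numbers are made up for illustration:

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"colour": ["red", "yellow", "blue", "red"],
                   "height": [1.2, 3.4, 2.2, 1.8],
                   "width": [0.7, 1.1, 0.9, 0.8]})

# One-hot encode the categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["colour"])

# Project the features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(df)
print(pca.explained_variance_ratio_)  # share of variance each component explains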

Next is the train-test split. Split the data (usually randomly, but chronologically for time-series problems), typically with an 80:20 train:test ratio.

from sklearn.model_selection import train_test_split

# 80:20 split; random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  • You’ll need to feature scale the data, which often means normalising it first (if a feature has a skewed distribution, you can take log(x) to normalise it) and then scaling. There are many types of scaling, but a common one is min-max scaling between 0 and 1. This is so that the raw magnitude of a feature doesn’t affect the model too much (e.g. when looking at car safety, a speed of 60mph shouldn’t be weighted more highly than an acceleration of 2m/s^2 just because speed values generally tend to be larger numbers)

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn the mean/std from the train set only
X_test = sc_X.transform(X_test)        # reuse those train-set statistics on the test set

Comparing different feature scaling methods

  • With this normalisation fitted on the train set, you’ll want to apply the same normalisation separately to the test set, as in the snippet above. Don’t scale all the data together or there will be data leakage effects, where knowledge from the test set creeps into the train set so the ML model can cheat

Apply the learning algorithm (examples outlined below) to the train set, using:

model.fit(X_train, y_train)

Then apply the model to the test set:

y_pred = model.predict(X_test)

  • You can then make a plot of the output, or compare y_pred to y_test — there are a number of metrics for model scoring, which use these terms:

True Positive (TP) — e.g. is ill and tests positive (correct)

False Negative (FN) — e.g. is ill but tests negative (incorrect) — a type 2 error (often the more serious kind here)

False Positive (FP) — e.g. is not ill but tests positive (incorrect) — a type 1 error (usually less serious)

True Negative (TN) — e.g. is not ill and tests negative (correct)

  • There are lots of metrics you can use, as shown in the diagrams: accuracy is a good starting point, and then either precision or recall matters more depending on the context of your model

ML model evaluation metrics — precision & recall

  • It can also be helpful to make a confusion matrix, formed from the 4 white squares in the middle of the diagram below, with each square containing a count. In general you want high numbers of data points in TP and TN, and low numbers in FP and FN

Confusion matrix to show the numbers of TP, TN, FP, FN values
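
scikit-learn computes all of these from y_test and y_pred; a short sketch with made-up labels:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_test = [1, 0, 1, 1, 0, 0, 1, 0]  # true labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (illustrative)

print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_test, y_pred))    # (TP + TN) / total
print(precision_score(y_test, y_pred))   # TP / (TP + FP)
print(recall_score(y_test, y_pred))      # TP / (TP + FN)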

Evaluate the model — try out a few different models (outlined below) and compare their confusion matrices and metrics to see which performs best; the right choice depends on the problem you are trying to solve.

  • Also try some k-fold cross validation: split the data into k folds, train on k−1 of them and validate on the remaining fold, rotating so that each fold is held out once. If the model scores similarly across folds, it didn’t just get lucky with one particular train-test split

K-fold cross validation
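
A sketch with scikit-learn’s cross_val_score, using the built-in iris data as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
model = LogisticRegression(max_iter=1000)

# 5 folds: each fold takes one turn as the held-out validation set
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())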

6. Example ML Algorithms

Within the ML categories, we have a number of different algorithms and libraries:

Types of ML algorithms

To discuss some of these further:

Regression — find the straight line or curve that best fits the data (see the sketch after this list)

  • Linear — find the coefficients of the relationship: y = b0 + b1x1
  • Polynomial — find the coefficients of the relationship: y = b0 + b1x1 + b2x1^2 + … + bnx1^n
  • Decision Tree — find the hidden rules that determine an outcome

Decision Tree for deciding whether or not to grant a loan
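
As a sketch, here is a polynomial regression fitted with scikit-learn; the x and y values are made up (roughly quadratic) for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x1 = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.3, 9.2, 16.1, 24.8])  # made-up, roughly quadratic values

# Expand x1 into [x1, x1^2] so a linear model can fit the polynomial coefficients
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x1)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # b0, then b1 and b2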

Random Forest — an ensemble learning method combining results from many decision trees to produce a single output

Random Forest made up of Decision Trees
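
A sketch comparing a single decision tree with a random forest on synthetic data (make_classification is just a convenient scikit-learn data generator):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The ensemble of trees usually generalises better than any single tree
print(tree.score(X_test, y_test), forest.score(X_test, y_test))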

Classification

  • KNN — K-Nearest Neighbours — k is a parameter set by you. For each data point in a distribution, pick the k nearest points, and the data point in question gets assigned the same label as the majority of its k nearest neighbours (see the sketch after this list)
  • Logistic Regression — like linear regression but a sigmoid curve is used to produce an output between 0 and 1, which can be converted to a binary output by rounding up or down

Linear Regression vs Logistic Regression

  • Naïve-Bayes — a probabilistic classifier based on applying Bayes’ theorem with strong independence assumptions between features
  • SVM — Support Vector Machine — a supervised method for classification, based on finding the line or hyperplane that best separates the classes of data points, maximising the margin between them

Support Vector Machine
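
To illustrate the k parameter in KNN, a quick sketch; the points and the choice of k = 3 are arbitrary:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

# k = 3: a new point takes the majority label of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))  # [0 1]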

And here’s a higher-level flowchart for deciding which to use:

Deciding which ML method to use

7. ML Applications

There are a number of exciting applications of these types of ML:

Applications of different types of ML

8. Taking this Further

The easiest next step is probably Kaggle, which has example data sets, model solutions, and competitions to take part in!

There are a number of popular datasets online:

For further reading, these websites are great for tutorials or project ideas:

And you can go in search of your own data sets for more personalised projects:

Hope you found this helpful, and good luck in your Data Science journey!
