How good is your machine learning model?

This is a memo to share what I have learnt in Model Validation (using Python), capturing the learning objectives as well as my personal notes. The course is taught by Kasey Jones from DataCamp.

Photo by Craige McGonigle on Unsplash

A machine learning model needs to go through proper validation in order to ensure optimum model performance on new data.

I have learnt the following topics:

  • Basics of model validation
  • Accuracy and evaluation metrics
  • Splitting data into train, validation, and test sets
  • Validation techniques: cross-validation and leave-one-out cross-validation (LOOCV)
  • Tools for creating validated and high-performing models
  • Hyperparameter tuning
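
As a rough sketch of the topics above, the snippet below splits data into train, validation, and test sets and then runs 5-fold cross-validation with scikit-learn. The synthetic dataset, the random forest model, and the 60/20/20 proportions are my own assumptions for illustration, not taken from the course.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hold out 20% as the final test set, then split the remainder 75/25,
# giving 60% train, 20% validation, and 20% test overall.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation on the non-test data: each fold is used once
# as the held-out set, so the score depends less on a single lucky split.
cv_scores = cross_val_score(model, X_temp, y_temp, cv=5, scoring="accuracy")

# Fit on the training set, compare on validation, report once on test.
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
test_accuracy = model.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
print(f"Validation accuracy: {val_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The test set is touched only once, at the very end, so it stays an honest estimate of performance on new data.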

More notes and code can be found on my GitHub.

Overall, I enjoyed this course and would highly recommend it!

Fundamental concepts in supervised machine learning

This is a memo to share what I have learnt in Machine Learning with Tree-Based Models (using Python), capturing the learning objectives as well as my personal notes. The course is taught by Elie Kawerk from DataCamp.

Photo by Keith Jonson on Unsplash

Decision trees are supervised learning models used for problems involving classification and regression.

I have learnt the following topics:

  • Use Python to train decision trees and tree-based models.
    Decision-tree learning: applying the CART algorithm to train decision trees for classification and regression problems.
  • Generalization error of a supervised learning model, and how to diagnose underfitting and overfitting using cross-validation.
    Ensembling can produce better results than individual decision trees.
  • Advantages and disadvantages of trees.
    Bagging: applying randomization through bootstrapping to construct a diverse set of trees in an ensemble. …
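
To make the bagging idea concrete, here is a minimal sketch that compares a single decision tree with a bagged ensemble of trees using cross-validation. The synthetic dataset and all hyperparameter values are assumptions chosen for illustration; whether the ensemble actually wins depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=1
)

# A single CART-style decision tree.
tree = DecisionTreeClassifier(max_depth=4, random_state=1)

# Bagging: each of the 50 trees is trained on a bootstrap sample
# (drawn with replacement) of the training data, and predictions are
# combined by voting.
bagged = BaggingClassifier(tree, n_estimators=50, random_state=1)

# Cross-validation estimates the generalization accuracy of each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
bagged_scores = cross_val_score(bagged, X, y, cv=5)

print(f"Single tree CV accuracy: {tree_scores.mean():.3f}")
print(f"Bagged trees CV accuracy: {bagged_scores.mean():.3f}")
```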

A beginner’s guide to the basic concepts of Apache Airflow

This is a memo to share what I have learnt in Apache Airflow, capturing the learning objectives as well as my personal notes. The course is taught by Mike Metzger from DataCamp.

Photo by Jacek Dylag on Unsplash

A data engineer’s job includes writing scripts, adding complex CRON tasks, and trying various ways to meet an ever-changing set of requirements to deliver data on schedule. Airflow can handle all of this while adding scheduling, error handling, and reporting.

I have learnt the following topics:

  • Workflows / DAGs / Tasks
  • Operators (BashOperator, PythonOperator, BranchPythonOperator, EmailOperator)
  • Dependencies between tasks / Bitshift operators
  • Sensors (to react to workflow conditions and…
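
To illustrate how a bitshift operator can express task dependencies, here is a toy sketch in plain Python. It is not Airflow's implementation and does not use the Airflow library: the `Task` class and `execution_order` helper are hypothetical names of my own, and the traversal is only meant for a simple linear chain.

```python
class Task:
    """Toy stand-in for a workflow task (hypothetical, not an Airflow class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records that b should run after a.
        self.downstream.append(other)
        # Returning `other` lets dependencies be chained: a >> b >> c.
        return other


def execution_order(root):
    """Walk downstream from `root`; adequate for a simple linear chain."""
    order = []
    queue = [root]
    while queue:
        task = queue.pop(0)
        if task.task_id not in order:
            order.append(task.task_id)
            queue.extend(task.downstream)
    return order


extract = Task("extract")
transform = Task("transform")
load = Task("load")

# Read left to right: extract runs first, then transform, then load.
extract >> transform >> load

print(execution_order(extract))
```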

Multiple Linear Regression, R², Adjusted R², MSE, p-value

Statistics and coding are fundamentally important in the data science field. Since a lot of data science work is carried out with code, I would highly recommend learning statistics with a heavy focus on coding, preferably in Python or R.

Photo by Michael Dziedzic on Unsplash

In my previous article, I showed how to code the summary statistics of a dataset (Mean, Median, Mode, Max, Min, Range, Quartile, Inter-Quartile Range, Standard Deviation, Variance) and Simple Linear Regression.

In this article, I shall cover the following topics with code in Python 3:
• multiple linear regression models
• model performance metrics: R², Adjusted R², MSE…
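
As a small sketch of these metrics, the snippet below fits a multiple linear regression by ordinary least squares with NumPy and computes R², Adjusted R², and MSE. The simulated data and its coefficients are assumptions for illustration, not the dataset used in the article.

```python
import numpy as np

# Simulated data with n observations and p predictors (assumption).
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Add a column of ones so the model has an intercept, then solve
# the least-squares problem.
X_design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_pred = X_design @ coef

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 penalises extra predictors:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Mean squared error of the fitted values.
mse = ss_res / n

print(f"Coefficients (intercept first): {coef}")
print(f"R^2: {r2:.4f}, Adjusted R^2: {adj_r2:.4f}, MSE: {mse:.4f}")
```

Adjusted R² is never larger than R², which is why it is the safer number to compare when models have different numbers of predictors.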

When your ecommerce business grows

Photo by Mark König on Unsplash

If your ecommerce business is progressing to the Cloud, you need to be familiar with these three main types of cloud computing:

  • IaaS — Infrastructure as a Service
  • PaaS — Platform as a Service
  • SaaS — Software as a Service

These are all experiencing a surge in popularity as more businesses move to the Cloud. Gartner forecasts worldwide public cloud revenue to grow 17% in 2020. With growth rates like these, cloud computing will soon be the industry norm, and many businesses are phasing out on-prem software altogether.

Utilizing cloud computing is a great way to future-proof your business.

Photo by Donald Giannatti on Unsplash

Definition of On-Prem, SaaS, PaaS, IaaS

Not so long ago, every company’s IT systems were on-prem (located at the company’s premises), and clouds were only the white fluffy things in the sky. …

Machine Learning from labelled data to make predictions

This is a tutorial to share what I have learnt in Supervised Learning with scikit-learn, capturing the learning objectives as well as my personal notes. The course is taught by Hugo Bowne-Anderson from DataCamp.

Photo by Andy Kelly on Unsplash

Is a particular email spam?
Will a tumor be benign or malignant?
Which of your customers will take their business elsewhere?

These questions can be answered by machine learning algorithms, where computers learn from existing data to make predictions on new data.

I have learnt the following topics:

  • Using machine learning techniques to build predictive models for both regression and classification problems, using real-world data
  • Concept of underfitting and overfitting
  • Train-Test split
  • Cross-validation
  • Grid search to fine tune models and report performance
  • Regularisation: lasso, ridge, elasticnet
  • Data preprocessing
  • Pipeline
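
Several of these topics fit together in one short sketch: a train-test split, a preprocessing step inside a pipeline, ridge regularisation, and a grid search with cross-validation. The synthetic data and the candidate `alpha` values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Keep 30% of the data aside for the final performance report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The pipeline scales features and then fits a ridge regression, so the
# scaler is re-fitted on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Grid search tries each alpha with 5-fold cross-validation.
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Score (R^2) of the best model on the held-out test set.
test_r2 = search.score(X_test, y_test)

print(f"Best alpha: {search.best_params_['ridge__alpha']}")
print(f"Test R^2: {test_r2:.3f}")
```

Putting the scaler inside the pipeline keeps information from the validation folds out of the preprocessing step.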

More notes and code can be found on my GitHub.

Overall, I enjoyed this course and would highly recommend it!

Continue to speak the statistical language of your data

Previous tutorial: Statistical Thinking in Python (Part 1)

This is a tutorial to share what I have learnt in Statistical Thinking in Python (Part 2), capturing the learning objectives as well as my personal notes. The course is taught by Justin Bois from DataCamp.

Photo by ThisisEngineering RAEng on Unsplash

The aim is to build the probabilistic mindset and the foundational statistical coding skills needed to dive into data sets and extract useful information from them.

I have learnt the following statistical thinking skills:

1. Perform EDA
(a) Generate effective plots like ECDFs
(b) Compute summary statistics

2. Estimate parameters
(a) By optimisation, including linear regression
(b) Determine confidence intervals

3. Formulate and test statistical…
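
One common way to determine a confidence interval is the bootstrap. The sketch below resamples the data with replacement and takes percentiles of the resampled means. The simulated data and the helper name `bootstrap_replicates` are my own assumptions for illustration.

```python
import numpy as np

# Simulated measurements (assumption for illustration).
rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)


def bootstrap_replicates(data, func, size, rng):
    """Apply `func` to `size` resamples of `data` drawn with replacement."""
    n = len(data)
    return np.array(
        [func(rng.choice(data, size=n, replace=True)) for _ in range(size)]
    )


# 2000 bootstrap replicates of the sample mean.
reps = bootstrap_replicates(data, np.mean, 2000, rng)

# The 2.5th and 97.5th percentiles bound a 95% confidence interval.
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])

print(f"Sample mean: {data.mean():.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```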

Speak the statistical language of your data

This is a tutorial to share what I have learnt in Statistical Thinking in Python (Part 1), capturing the learning objectives as well as my personal notes. The course is taught by Justin Bois from DataCamp, and it includes 4 chapters.

Photo by Chris Liverani on Unsplash

The end goal of gathering data is to make clear, summary conclusions from them. This crucial last step of a data analysis pipeline hinges on the principles of statistical inference.

I have learnt the following statistical thinking skills:

  • Graphical exploratory data analysis (EDA), Quantitative EDA
  • Construct (beautiful) instructive plots, including histogram, swarmplot, Empirical Cumulative Distribution Functions (ECDF), Box Plots, Scatter…
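
An ECDF is simple to compute by hand. Here is a minimal sketch with NumPy; the function name `ecdf` and the sample values are my own choices for illustration.

```python
import numpy as np


def ecdf(data):
    """Return x (sorted data) and y (fraction of points <= each x)."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 4.0])
print(x)  # sorted values
print(y)  # cumulative fractions, ending at 1.0
```

Plotting `y` against `x` as points (for example with matplotlib) shows every data point without the binning choices a histogram requires.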

How feature extraction techniques can reduce dimensionality

This is a tutorial to share what I have learnt in Dimensionality Reduction in Python, capturing the learning objectives as well as my personal notes. The course is taught by Jeroen Boeye from DataCamp, and it includes 4 chapters.

Photo by Aditya Chinchure on Unsplash

High-dimensional datasets are complex and can be computationally expensive to process. Dimensionality can be reduced by dropping features that duplicate other features, dropping irrelevant features, and using feature extraction techniques (through the calculation of uncorrelated principal components).

I have learnt the following topics:

  • Why dimensional reduction is important and when to use it
  • How to explore high dimensional data
  • How to identify duplicate features (high correlation in correlation…
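
The two ideas above, dropping near-duplicate features and extracting uncorrelated principal components, can be sketched as follows. The simulated data, the 0.95 correlation threshold, and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated data: the third column is almost a copy of the first,
# so it adds very little information (assumption for illustration).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + rng.normal(scale=0.01, size=200)])

# Feature selection: look at the upper triangle of the absolute
# correlation matrix and flag the later column of any pair above 0.95.
corr = np.corrcoef(X, rowvar=False)
upper = np.triu(np.abs(corr), k=1)
to_drop = [j for j in range(X.shape[1]) if np.any(upper[:, j] > 0.95)]
X_reduced = np.delete(X, to_drop, axis=1)

# Feature extraction: standardise, then project onto two principal
# components, which are uncorrelated with each other.
pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(X))

print(f"Columns flagged as near-duplicates: {to_drop}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
```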

Using the new Tableau version 2020.x onwards, with The World Bank GDP data preparation in Python 3

Bar chart race in action (music added): https://youtu.be/QQ9dw7gpbIM

Bar chart races have become very popular recently. At the beginning of 2020, Tableau released version 2020.x with a new Animations feature for dynamic parameters. This means that the bar chart race below can now be built easily in 6 minutes.

https://public.tableau.com/profile/blackraven#!/vizhome/Top10CountriesHistoricalGDPByYear/Top10CountriesHistoricalGDPByYear

This tutorial is a step-by-step guide to build a bar chart race based on historical Gross Domestic Product (GDP) data. To build a bar chart race is to create many discrete pages of bar charts and then string them together, just like how a traditional cartoon animation is built.
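
On the data preparation side, each "page" of the race is the ranked bar chart for one year. A minimal pandas sketch of that ranking step is shown below. The countries and GDP figures are hypothetical placeholders, not real World Bank data, and the column names are my own assumptions.

```python
import pandas as pd

# Hypothetical GDP figures for illustration only (not World Bank data).
gdp = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "gdp": [5.0, 3.0, 4.0, 5.5, 6.0, 4.2],
})

# Rank countries within each year, largest GDP first.
gdp["rank"] = (
    gdp.groupby("year")["gdp"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# Keep only the top N bars per year; each year becomes one frame.
top_n = 2
frames = (
    gdp[gdp["rank"] <= top_n]
    .sort_values(["year", "rank"])
    .reset_index(drop=True)
)

print(frames)
```

With real data, `top_n` would be 10 and the resulting table could be exported (for example with `to_csv`) for use in Tableau.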

Step 1: Get the software and data ready

Download and install Tableau Public (version 2020.1.2 or later). It is free of charge with full features. The only snag is that work can only be published to the Tableau Public server and cannot be saved locally to your desktop. This is fine if the data is not sensitive or private. …

About

Black_Raven (James Ng)

perpetual student, fitness enthusiast, passionate explorer https://www.linkedin.com/in/jnyh/
