An Introduction to Statistical Learning — Introduction

Day 1 notes from “An Introduction to Statistical Learning: with Applications in Python” by James et al., as part of my Data Science learning documentation.

Ahmad Yusuf Albadri
Python’s Gurus
Jun 19, 2024



Here, I’m combining my prior knowledge of Python and machine learning to visualize the ideas and make the book’s narrative applicable in code.

Machine (statistical) learning (ML) refers to a vast set of tools for understanding data.

It is mainly categorized into:

  • Supervised learning: building a (statistical) model for predicting or estimating an output based on one or more inputs.
  • Unsupervised learning: building a system (model or algorithm) to learn relationships and structure from data; there are inputs but no supervising output, unlike in supervised learning.

The difference between supervised and unsupervised learning lies in the availability of output data (synonyms: dependent variable, target variable, or outcome, usually symbolized as “y”).

Let’s prepare our Python code first by importing the essential libraries.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ISLP import load_data  # the ISLP package accompanies the book

Here are some examples of cases in machine learning:

1. Wage Data: Supervised Learning — Regression

Let’s import the data.

# Import wage data from ISLP
df_wage = load_data('Wage')
df_wage
Wage Data. Image by Author.

This data can be used to examine factors related to wages for a group of men from the Atlantic region of the United States. It is clear that our output variable is wage and the input variables are the remaining variables in our data.

Wage in our data is a quantitative value, which means we are facing a regression problem: predicting a continuous, or quantitative, output.

Let's say we want to understand the association of an employee’s age, education, and calendar year with his wage. We can do this by visualizing the relationship between each input variable and wage using scatterplots and a boxplot.

# Set up the matplotlib subplots
fig, ax = plt.subplots(1, 3, figsize=(15, 4))

# Fig 1: wage vs. age with a lowess smoother
sns.regplot(data=df_wage, x='age', y='wage', lowess=True, ax=ax[0],
            scatter_kws={'edgecolor': 'grey', 'facecolor': 'none', 'alpha': 0.5},
            line_kws={'color': 'red'})

# Fig 2: wage vs. year with a lowess smoother
sns.regplot(data=df_wage, x='year', y='wage', lowess=True, ax=ax[1],
            scatter_kws={'edgecolor': 'grey', 'facecolor': 'none', 'alpha': 0.5},
            line_kws={'color': 'red'})

# Fig 3: boxplot of wage per education level
temp = {}
for i in df_wage['education'].unique():
    temp[i] = df_wage[df_wage['education'] == i]['wage'].reset_index(drop=True)
temp = pd.DataFrame(temp)
# Keep only the leading digit of each education label (e.g. '1. < HS Grad' -> 1)
temp.columns = temp.columns.str.extract(r'(\d)')[0].astype(int)
temp = temp[[i for i in range(1, 6)]]  # order columns by education level 1-5
temp.plot(kind='box', ax=ax[2], color='black', xlabel='education level', ylabel='wage')

plt.show()
Associations of Age, Year, and Education Level with Wage, Respectively. Image by Author.

Based on the graph above we can see that:

  • Wages increase with age until about 40 years old, then decrease slightly and slowly after that.
  • Wages increased in a roughly linear (or straight-line) fashion, between 2003 and 2009, though this rise is very slight relative to the variability in the data.
  • Wages are also typically greater for individuals with higher education levels: men with the lowest education level (1) tend to have substantially lower wages than those with the highest education level (5).

Of course, it is possible to get more accurate predictions by combining age, education, and year rather than using each input variable separately. This can be done by fitting a machine learning model that predicts a man’s wage based on his age, education, and calendar year.
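For example, here is a minimal sketch of fitting such a model (my own addition, not code from the book); the linear regression, one-hot encoding, train/test split, and error metric are all illustrative choices:

# A minimal sketch: combine age, year, and education in one linear model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# One-hot encode the categorical education variable; age and year stay numeric
X = pd.get_dummies(df_wage[['age', 'year', 'education']], drop_first=True)
y = df_wage['wage']

# Hold out 20% of the data to check predictive accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(f'Test MAE: {mean_absolute_error(y_test, model.predict(X_test)):.2f}')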

2. Stock Market Data: Supervised Learning — Classification

Let’s import the data.

# Import stock market data from ISLP
df_smarket = load_data('Smarket')
df_smarket
Stock Market Data. Image by Author.

This data contains the daily movements in the Standard & Poor’s 500 (S&P) stock index over 5 years between 2001 and 2005. In this case, we are involved with predicting qualitative or categorical output, i.e., today’s stock market direction. This type of problem is called a classification problem. A model that could accurately predict the direction in which the market will move would be very useful!

Let’s get a feel for the data by examining how the percentage changes in S&P for Lag1, Lag2, and Lag3 relate to whether today’s market direction is up or down. We can do this by creating boxplots as follows:

# Set up the matplotlib subplots
fig, ax = plt.subplots(1, 3, figsize=(14, 4))

# One boxplot per lag: today's direction vs. that lag's percentage change
lags = [('Lag1', 'Yesterday'), ('Lag2', 'Two Days Previous'), ('Lag3', 'Three Days Previous')]
for j, (lag, title) in enumerate(lags):
    temp = {}
    for i in df_smarket['Direction'].unique():
        temp[i] = df_smarket[df_smarket['Direction'] == i][lag].reset_index(drop=True)
    temp = pd.DataFrame(temp)
    temp.plot(kind='box', color='black', ax=ax[j], xlabel="Today's Direction",
              ylabel='Percentage Change in S&P', title=title)

plt.show()
Boxplots of Today’s Stock Direction vs. the Percentage Change in S&P for Yesterday, Two Days Previous, and Three Days Previous. Image by Author.

Based on the graph above, there is no visible difference between the percentage changes in S&P on days when today’s direction is up versus down, whether for Lag1 (yesterday), Lag2 (two days previous), or Lag3 (three days previous). This suggests that there is no simple strategy for predicting how the market will move based on these three variables. That is to be expected: if such a simple pattern existed, anyone could adopt a simple trading strategy to generate profits, and the pattern would quickly disappear. Instead, a machine learning model may be able to pick up weaker, more subtle patterns and predict today’s market direction somewhat better than chance.
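As a rough sketch of what that might look like (my own addition; this is not the exact model used in the book), here is a simple logistic regression trained on the 2001–2004 data and tested on 2005:

# A minimal sketch: logistic regression on the three lag variables
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train on 2001-2004, test on 2005 to mimic real forecasting
train = df_smarket[df_smarket['Year'] < 2005]
test = df_smarket[df_smarket['Year'] == 2005]
features = ['Lag1', 'Lag2', 'Lag3']

clf = LogisticRegression().fit(train[features], train['Direction'])
print(f"Test accuracy: {accuracy_score(test['Direction'], clf.predict(test[features])):.3f}")

Don’t expect a high number here: even a small edge over 50% would be valuable in this setting.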

3. Gene Expression Data: Unsupervised Learning — Dimensionality Reduction & Clustering

Another important class of problems in machine learning involves situations in which we only observe input variables, with no corresponding output; this is called unsupervised learning. Unlike in the previous examples, here we are not trying to predict an output variable.

Let’s check the example data from gene expression:

# Import gene expression data; NCI60 loads as a dict with 'data' and 'labels'
df_gen = load_data('NCI60')
df_gen['data']  # the 64 x 6,830 expression matrix
Gene Expression Data. Image by Author.

This data consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.

Here, we can use unsupervised learning techniques such as dimensionality reduction and clustering to better understand the pattern in our data.

Let’s code!

# PCA for the first two principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Z = pca.fit_transform(df_gen['data'])

# K-Means with 4 clusters, just for illustration
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # fixed seed so the clusters are reproducible
kmeans.fit(Z)

# Visualize the results
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Fig 1
ax[0].scatter(x=Z[:,0], y=Z[:,1], edgecolors='black', facecolor='none')
ax[0].set_xlabel('Z1')
ax[0].set_ylabel('Z2')

# Fig 2
scatter = ax[1].scatter(x=Z[:,0], y=Z[:,1], c=kmeans.labels_)
ax[1].legend(*scatter.legend_elements(prop='colors'), loc="upper left", title="Cluster")
ax[1].set_xlabel('Z1')
ax[1].set_ylabel('Z2')
plt.show()
Visualized Clusters from 64 Cancer Cell Lines Using the First 2 Principal Components. Image by Author.

We are using the first two principal components of the data, which summarize the 6,830 expression measurements for each cell line down to just two numbers, or dimensions. While this dimensionality reduction likely loses some information, it makes it possible to visually examine the data for evidence of clustering. Deciding on the number of clusters is often a difficult problem; in the graph above, we use 4 clusters for the sake of illustration. Based on this graph, there is clear evidence that cell lines with reasonably similar characteristics tend to be located near each other in this two-dimensional representation.
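As a quick illustration of how one might compare candidate cluster counts (my own addition, not from the book), the silhouette score gives a rough, data-driven guide:

# Compare candidate numbers of clusters with the silhouette score
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
    print(f'k={k}: silhouette={silhouette_score(Z, labels):.3f}')

A higher silhouette score means points sit closer to their own cluster than to neighboring ones, though it is only a heuristic.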

To be continued…

This is part of my 100-day Data Science learning journey. Follow me for more updates on my learning.

You can learn from what I learned too!

Check out my plan: https://medium.com/pythons-gurus/a-journey-to-learn-data-science-100-days-plan-cfce919f6f6e
