Learning from Kaggle Kernels! How to Tackle Structured Data (LD Freeman’s kernel of Titanic Data)

Taketo Kimura

Published in

MICIN Developers

140 min read · May 28, 2019

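Introduction

Do you know Kaggle? It is a competition site where companies and researchers from around the world post data, and data scientists from around the world work on it and compete on accuracy. Some competitions carry large cash prizes and attract a great deal of attention.

It is also common for data scientists who made their name on Kaggle to start companies on the strength of that reputation. That is how much of a global standard Kaggle has become for competitive data analysis.

I had registered as a user, intending to try it someday, but never found the time to look inside. Then an acquaintance recently told me about Kernels, and I was astonished: solutions to competition problems are published there. Strictly speaking, Kernels seems to have two sides: it is a shared cloud environment provided by Kaggle (you can run programs in the browser), and it is also a community.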

Kernels Documentation | Kaggle

www.kaggle.com

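What caught my attention was the latter, the community. You can see how competition problems are solved, for free. Until now I had assumed that data scientists could only discuss their analysis process within the community they belong to. But that's America for you: the thinking is a step ahead.

So in this article I would like to unpack Kaggle Kernels and watch what kind of analysis process a global-standard data scientist follows. I hope you will stay with me to the end.

(Note: I have tried to organize this as systematically as I can, but since it is a fairly direct translation of the notebook, it ends up somewhat loosely structured… 🙇‍♂️💦)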

The global-standard analysis process as seen in the "Titanic" competition

Among Kaggle's introductory competitions is "Titanic".

Titanic: Machine Learning from Disaster

Start here! Predict survival on the Titanic and get familiar with ML basics

www.kaggle.com

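"Titanic" is one of the best-known toy datasets, probably the one most often worked on by data scientists worldwide. The Titanic is one of the most infamous shipwrecks in history. The task is to use machine learning to work out which passengers, in what circumstances, survived the sinking and which did not. In short, given a passenger's attributes, predict whether that passenger survived.

Because so many data scientists around the world have worked on "Titanic", it has been solved in many different ways. Kernels by top Kagglers in particular are very thorough, to the point where I suspect they are written partly as self-promotion. This article aims to summarize what they have written.

LD Freeman's approach

This article focuses on the approach taken by LD Freeman (hereafter, Freeman).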

A Data Science Framework: To Achieve 99% Accuracy

Using data from Titanic: Machine Learning from Disaster

www.kaggle.com

Freeman not only explains the code but also writes about his thinking on data analysis itself and his own views. He lays out his analysis process in the following table of contents.

Table of Contents
Chapter 1 — How a Data Scientist Beat the Odds
Chapter 2 — A Data Science Framework
Chapter 3 — Step 1: Define the Problem and Step 2: Gather the Data
Chapter 4 — Step 3: Prepare Data for Consumption
Chapter 5 — The 4 C’s of Data Cleaning: Correcting, Completing, Creating, and Converting
Chapter 6 — Step 4: Perform Exploratory Analysis with Statistics
Chapter 7 — Step 5: Model Data
Chapter 8 — Evaluate Model Performance
Chapter 9 — Tune Model with Hyper-Parameters
Chapter 10 — Tune Model with Feature Selection
Chapter 11 — Step 6: Validate and Implement
Chapter 12 — Conclusion and Step 7: Optimize and Strategize

Follow this and things will go well, that is the idea.
Each item is explained below.

Freeman also comments that he would like readers to aim at understanding the content rather than simply reusing the kernel elsewhere. In keeping with that wish, I will dig into the meaning of each step as well.

Chapter 1 — How a Data Scientist Beat the Odds

This is an abstract chapter, a preface along the lines of "we are about to begin the analysis".

Chapter 2 — A Data Science Framework

This is also an abstract chapter. It introduces a "Data Science Framework" that he appears to have devised himself. The framework is as follows.

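1. Define the Problem
→ Set the problem. Be issue-driven, not technology-driven.
2. Gather the Data
→ Collect the data. Keep data cleaning in mind from this point.
3. Prepare Data for Consumption
→ Preprocess the data. Also known as data cleaning or data wrangling.
4. Perform Exploratory Analysis
→ Perform exploratory analysis. A precaution against GIGO (garbage in, garbage out): correlation analysis and visualization of statistics.
5. Model Data
→ Apply machine learning. It is neither a magic wand nor a silver bullet; be a good judge of which model to choose.
6. Validate and Implement Data Model
→ Validate the accuracy of the trained model. Watch for overfitting and underfitting.
7. Optimize and Strategize
→ Optimize and strategize. Do whatever is needed to put the model into practical use.

The idea is that the analysis should proceed in this order.
Note that this framework is contained within the table of contents.
(Data Science Framework ⊂ Table of Contents)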

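Chapter 3 — Step 1: Define the Problem and Step 2: Gather the Data

This chapter covers the problem definition and how the data is gathered.

Problem: "Develop an algorithm to predict the survival outcome of passengers on the Titanic"
Data gathering: download from Kaggle.

For Kaggle data, these two steps are already done for us.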

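Chapter 4 — Step 3: Prepare Data for Consumption

Because the Kaggle data is already cleanly structured, he writes that the structuring part of this step is not needed.

The prepared data is then explained, variable by variable, as follows.

1. "Survived" is the target variable: 1 if the passenger survived, 0 if not.
2. "PassengerId" and "Ticket" are assumed to be random unique identifiers and are excluded from the analysis.
3. "Pclass" is the ticket class: 1 = upper class, 2 = middle class, 3 = lower class.
4. "Name" holds the passenger's name. The title it contains (Master, etc.) is checked for an effect on survival.
5. "Sex" and "Embarked" hold category codes, so they are converted to dummy variables.
6. "Age" and "Fare" are continuous and can be fed to machine learning as they are.
7. "SibSp" is the number of siblings and spouses aboard; "Parch" is the number of parents and children aboard. Both are discrete counts and are used to create a new feature, "FamilySize".
8. "Cabin" indicates roughly where on the ship the passenger was when the accident occurred, but it has many null values and is excluded from the analysis.

That's it. Now we know what kinds of data the "Titanic" dataset contains, along with the state of each variable and roughly how to handle it for machine learning.

Concrete code starts to appear from this chapter.
First come the library imports and the data load.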

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))

import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))

import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))

import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))

import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__)) 

import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__)) 

import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))

#misc libraries
import random
import time


#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)



# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix #pandas.tools.plotting was moved to pandas.plotting in pandas 0.20

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

The output of the code above is as follows.

Python version: 3.6.3 |Anaconda custom (64-bit)| (default, Nov 20 2017, 20:41:42) 
[GCC 7.2.0]
pandas version: 0.20.3
matplotlib version: 2.1.1
NumPy version: 1.13.0
SciPy version: 1.0.0
IPython version: 5.3.0
scikit-learn version: 0.19.1
-------------------------
gender_submission.csv
test.csv
train.csv

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

After a fairly long list of library imports, the data is loaded and previewed.
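The load step itself does not appear in the listing above, even though its output (the file list and the info() preview) does. A minimal sketch of what it might look like is below. The names data_raw, data1, data_val, and data_cleaner are taken from the later code in this article; the helper load_titanic and the tiny inline CSVs are hypothetical stand-ins so the snippet runs without the Kaggle files.

```python
import io
import pandas as pd

def load_titanic(train_src, test_src):
    """Read the two competition files and set up the working copies used later."""
    data_raw = pd.read_csv(train_src)   # untouched training data
    data_val = pd.read_csv(test_src)    # Kaggle test file, used for the final submission
    data1 = data_raw.copy(deep=True)    # working copy that the cleaning steps modify
    data_cleaner = [data1, data_val]    # lets a single loop clean both frames
    return data_raw, data1, data_val, data_cleaner

# Hypothetical in-memory stand-ins; on Kaggle the sources would be
# '../input/train.csv' and '../input/test.csv'.
train_csv = io.StringIO("PassengerId,Survived,Age\n1,0,22\n2,1,\n")
test_csv = io.StringIO("PassengerId,Age\n3,30\n")

data_raw, data1, data_val, data_cleaner = load_titanic(train_csv, test_csv)
data_raw.info()
```

Keeping data_raw untouched and cleaning a deep copy means the original values are still available for comparison after the cleaning steps.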

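Chapter 5 — The 4 C’s of Data Cleaning: Correcting, Completing, Creating, and Converting

Freeman says that data cleaning consists of four C's.
They are as follows.

1. Correcting: dealing with anomalous values and outliers
→ For some values, the decision whether to exclude them is made after visualization
→ For example, if there were a record with "Age = 800", correcting it to "Age = 80" would probably be fine
2. Completing: filling in missing values
→ "Age", "Cabin", and "Embarked" contain missing values. Some algorithms can handle missing values as they are, but for comparing models against each other it is more convenient to fill them in.
→ There are two common approaches: delete the record, or substitute a reasonable value. Candidates for a reasonable value include the mean and the median.
3. Creating: generating new features for analysis
→ Use existing features to generate new ones.
4. Converting: turning values into numbers so they can be fed to machine learning
→ Convert categorical data into 0/1 dummy variables so it can be used as machine learning input.

That's it.
Before data can be fed to machine learning, the cleaning above has to be done. Note that this kernel does not appear to carry out "1. Correcting".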
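Since the kernel skips "Correcting", here is a small sketch of what a first check could look like, purely as an illustration. The function name flag_implausible_ages and the 0 to 120 range are my own assumptions, not something from Freeman's notebook.

```python
import pandas as pd

def flag_implausible_ages(ages, low=0, high=120):
    """Return a boolean mask marking ages outside a plausible human range."""
    return (ages < low) | (ages > high)

ages = pd.Series([22.0, 38.0, 800.0, 4.0])
mask = flag_implausible_ages(ages)

# Rows to inspect by hand before deciding whether to correct or drop them
print(ages[mask])
```

Flagging rather than automatically rewriting keeps the human decision (is 800 really a typo for 80?) explicit.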

For "2. Completing", Freeman first makes the situation visible with the following code.

print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)

print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)

data_raw.describe(include = 'all')

The output of the code above is as follows.

Train columns with null values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------
Test/Validation columns with null values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
----------

Next, "2. Completing" is carried out with the following code. Unneeded columns are dropped at the same time.

###COMPLETING: complete or delete missing values in train and test/validation dataset
for dataset in data_cleaner:    
    #complete missing age with median
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    #complete embarked with mode
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    #complete missing fare with median
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())

#drop the columns excluded earlier from the train data only
#(this part is reconstructed from the output shown next, which no longer lists them for the train data)
drop_column = ['PassengerId', 'Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())

The output of the code above is as follows.

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
----------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64

Next, "3. Creating" is carried out with the following code.

###CREATE: Feature Engineering for train and test/validation dataset
for dataset in data_cleaner:
    #Discrete variables
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1

    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]


    #Continuous variable bins; qcut vs cut (frequency-based vs value-range-based): https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)


    
#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)


#preview data again
data1.info()
data_val.info()
data1.sample(10)

The output of the code above is as follows.

Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Name          891 non-null object
Sex           891 non-null object
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked      891 non-null object
FamilySize    891 non-null int64
IsAlone       891 non-null int64
Title         891 non-null object
FareBin       891 non-null category
AgeBin        891 non-null category
dtypes: category(2), float64(2), int64(6), object(4)
memory usage: 85.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 16 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
FamilySize     418 non-null int64
IsAlone        418 non-null int64
Title          418 non-null object
FareBin        418 non-null category
AgeBin         418 non-null category
dtypes: category(2), float64(2), int64(6), object(6)
memory usage: 46.8+ KB

That's it.
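The "qcut vs cut" comment in the code above is worth a quick illustration. The sketch below uses a made-up series with one large outlier (not the Titanic fares) to show how the two functions place the same values into four bins.

```python
import numpy as np
import pandas as pd

fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])  # toy values with one outlier

# qcut: bin edges are quantiles, so each bin receives about the same number of rows
freq_bins = pd.qcut(fares, 4, labels=False)

# cut: bin edges are evenly spaced over the value range, so an outlier stretches the bins
width_bins = pd.cut(fares, 4, labels=False)

freq_counts = np.bincount(freq_bins, minlength=4)
width_counts = np.bincount(width_bins, minlength=4)
print("rows per qcut bin:", freq_counts)
print("rows per cut bin: ", width_counts)
```

This is presumably why the kernel uses qcut for the heavily skewed "Fare" and cut for "Age", whose range is more evenly populated.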

As the last step of data cleaning, the following code carries out "4. Converting". Concretely, coded (categorical) columns are encoded and the discretized continuous columns are given numeric codes. In the code, LabelEncoder assigns an integer code to each category, and pd.get_dummies spreads the categories out horizontally into {0, 1} columns. Discretizing continuous columns is not a mandatory step for machine learning, but it lets a method that draws linear boundaries express non-linear ones, so it is worth doing.

#CONVERT: convert objects to category using Label Encoder for train and test/validation dataset

#code categorical data
label = LabelEncoder()
for dataset in data_cleaner:    
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])


#define y variable aka target/outcome
Target = ['Survived']

#define x variables for original features aka feature selection
data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy =  Target + data1_x
print('Original X Y: ', data1_xy, '\n')


#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')


#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')



data1_dummy.head()

The output of the code above is as follows.

Original X Y:  ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] 

Bin X Y:  ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code'] 

Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs']

This completes data cleaning with the 4 C's.
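To make the difference between the two encodings in the "Converting" step concrete, here is a toy sketch. It uses pandas category codes as a stand-in for LabelEncoder (an assumption on my part: both sort the labels, so they should give the same integers here), and a four-row frame that is not the real data.

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Label encoding: one integer per category (alphabetical order here), which
# implies an ordering C < Q < S that the data does not really have.
df['Embarked_Code'] = df['Embarked'].astype('category').cat.codes

# One-hot (dummy) encoding: one 0/1 column per category, no implied ordering.
dummies = pd.get_dummies(df['Embarked'], prefix='Embarked').astype(int)

print(df.join(dummies))
```

Tree-based models can usually cope with the integer codes, while linear models tend to be safer with the dummy columns, which may be why the kernel prepares both variants.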

Finally in this chapter, the data is split into training and test sets for the machine learning that follows.

#split train and test data with function defaults
#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)

print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))

The output of the code above is as follows.

Data1 Shape: (891, 19)
Train1 Shape: (668, 8)
Test1 Shape: (223, 8)

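Chapter 6 — Step 4: Perform Exploratory Analysis with Statistics

Now that every variable has been fully converted to numbers, he writes that the next steps are:

・Examine the data with graphical statistics to describe and summarize the variables
・Investigate the correlations between the explanatory variables and the target variable

In short, look closely at the data and form an intuitive sense of the relationships.

To that end, Freeman first prints, for each non-float variable, the mean of "Survived" (that is, the survival rate) for each of its values.

The code that prints this is below.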

#Discrete Variable Correlation by Survival using group by aka pivot table: 
for x in data1_x:
    if data1[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')
        

#using crosstabs: 
print(pd.crosstab(data1['Title'],data1[Target[0]]))

The output of the code above is as follows.

Survival Correlation by: Sex
      Sex  Survived
0  female  0.742038
1    male  0.188908
---------- 

Survival Correlation by: Pclass
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
---------- 

Survival Correlation by: Embarked
  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
---------- 

Survival Correlation by: Title
    Title  Survived
0  Master  0.575000
1    Misc  0.444444
2    Miss  0.697802
3      Mr  0.156673
4     Mrs  0.792000
---------- 

Survival Correlation by: SibSp
   SibSp  Survived
0      0  0.345395
1      1  0.535885
2      2  0.464286
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
---------- 

Survival Correlation by: Parch
   Parch  Survived
0      0  0.343658
1      1  0.550847
2      2  0.500000
3      3  0.600000
4      4  0.000000
5      5  0.200000
6      6  0.000000
---------- 

Survival Correlation by: FamilySize
   FamilySize  Survived
0           1  0.303538
1           2  0.552795
2           3  0.578431
3           4  0.724138
4           5  0.200000
5           6  0.136364
6           7  0.333333
7           8  0.000000
8          11  0.000000
---------- 

Survival Correlation by: IsAlone
   IsAlone  Survived
0        0  0.505650
1        1  0.303538
---------- 

Survived    0    1
Title             
Master     17   23
Misc       15   12
Miss       55  127
Mr        436   81
Mrs        26   99

Next, the distributions of the following variables are visualized.

"Fare"
"Age"
"FamilySize"

#IMPORTANT: Intentionally plotted different ways for learning purposes only. 

#graph distribution of quantitative data
plt.figure(figsize=[16,12])

plt.subplot(231)
plt.boxplot(x=data1['Fare'], showmeans = True, meanline = True)
plt.title('Fare Boxplot')
plt.ylabel('Fare ($)')

plt.subplot(232)
plt.boxplot(data1['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')

plt.subplot(233)
plt.boxplot(data1['FamilySize'], showmeans = True, meanline = True)
plt.title('Family Size Boxplot')
plt.ylabel('Family Size (#)')

plt.subplot(234)
plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(235)
plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(236)
plt.hist(x = [data1[data1['Survived']==1]['FamilySize'], data1[data1['Survived']==0]['FamilySize']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Family Size Histogram by Survival')
plt.xlabel('Family Size (#)')
plt.ylabel('# of Passengers')
plt.legend()

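The result of the code above is as follows.

The top row shows standard box plots. In addition to the maximum, minimum, and quartiles, a line for the mean is also plotted, which is a nice touch. It is handy that outliers are detected at the same time.

The bottom row shows histograms, in which the proportion of "Survived = 0 or 1" within each value range is also visible.

Next, the joint occurrence of the target variable's value and a given explanatory variable's value is visualized.

"Embarked" × "Survived"
"Pclass" × "Survived"
"IsAlone" × "Survived"
"FareBin" × "Survived"
"AgeBin" × "Survived"
"FamilySize" × "Survived"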

#we will use seaborn graphics for multi-variable comparison:
#graph individual features by survival
fig, saxis = plt.subplots(2, 3,figsize=(16,12))

sns.barplot(x = 'Embarked', y = 'Survived', data=data1, ax = saxis[0,0])
sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=data1, ax = saxis[0,1])
sns.barplot(x = 'IsAlone', y = 'Survived', order=[1,0], data=data1, ax = saxis[0,2])

sns.pointplot(x = 'FareBin', y = 'Survived',  data=data1, ax = saxis[1,0])
sns.pointplot(x = 'AgeBin', y = 'Survived',  data=data1, ax = saxis[1,1])
sns.pointplot(x = 'FamilySize', y = 'Survived', data=data1, ax = saxis[1,2])

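The result of the code above is as follows.

The top row shows the mean of the target variable for each code value.
For example, records with "Pclass = 1" have a mean "Survived" above 0.6, while records with "Pclass = 3" have a mean below 0.3. For a two-class problem, this can be read as the proportion of "Survived = 1".
The vertical line extending above and below the top of each bar, like the single hair on Namihei's head (the Sazae-san character), is the confidence interval of that mean. The plot is drawn by seaborn, which by default appears to run 1,000 bootstrap resamples internally to compute the interval. In practice, the interval narrows when there is more data and widens when there is less. For a 0/1 target, the interval also tends to be widest when the mean is near 0.5 and narrower as the mean approaches 0 or 1.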
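To get a feel for the error bars described above, here is a rough sketch of a percentile bootstrap for the mean. It is an assumption that this matches what seaborn does internally in every detail; the function bootstrap_ci and the toy 0/1 arrays are hypothetical and only meant to show the idea.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample with replacement n_boot times and record each resample's mean
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - ci) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

small = np.array([1] * 8 + [0] * 12)  # 20 passengers, 40% survived
large = np.tile(small, 50)            # 1,000 passengers, same survival rate

lo_s, hi_s = bootstrap_ci(small)
lo_l, hi_l = bootstrap_ci(large)
print(f"n=20:   [{lo_s:.2f}, {hi_s:.2f}]")
print(f"n=1000: [{lo_l:.2f}, {hi_l:.2f}]")
```

With the same survival rate, the interval for the larger group comes out much narrower, which is the effect visible on the bars for rare categories.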

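The bottom row means the same as the top row, just presented differently. It is fine to standardize on whichever presentation you prefer.

Incidentally, swapping the presentations between the top and bottom rows gives the following.

Either presentation works.

Next, the co-occurrence among the following three variables is examined.

"Pclass" × "Fare" × "Survived"
"Pclass" × "Age" × "Survived"
"Pclass" × "FamilySize" × "Survived"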

#graph distribution of qualitative data: Pclass
#we know class mattered in survival, now let's compare class and a 2nd feature
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(14,12))

sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = data1, ax = axis1)
axis1.set_title('Pclass vs Fare Survival Comparison')

sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = data1, split = True, ax = axis2)
axis2.set_title('Pclass vs Age Survival Comparison')

sns.boxplot(x = 'Pclass', y ='FamilySize', hue = 'Survived', data = data1, ax = axis3)
axis3.set_title('Pclass vs Family Size Survival Comparison')

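The three plots actually express the same idea, and again this comes down to taste. The middle plot represents the distribution of a continuous variable as a smoothed histogram (a violin plot), while the plots at either end use box plots. Either way, they express the same thing: how "Survived = 0 or 1" behaves in data restricted by a condition.

Shuffling the plot types here as an experiment gives the following.

What do you think? Each has points where it is easier or harder to read, so showing both may really be best.

Looks good.

Next, plots of the following co-occurrences are drawn.

"Sex" × "Embarked" × "Survived"
"Sex" × "Pclass" × "Survived"
"Sex" × "IsAlone" × "Survived"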

#graph distribution of qualitative data: Sex
#we know sex mattered in survival, now let's compare sex and a 2nd feature
fig, qaxis = plt.subplots(1,3,figsize=(14,12))

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Embarked', data=data1, ax = qaxis[0])
qaxis[0].set_title('Sex vs Embarked Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Pclass', data=data1, ax  = qaxis[1])
qaxis[1].set_title('Sex vs Pclass Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'IsAlone', data=data1, ax  = qaxis[2])
qaxis[2].set_title('Sex vs IsAlone Survival Comparison')

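The point of this plot is that it shows the co-occurrence of "code value" × "code value" × "code value". When the joint occurrence of two variables matters, I think this is a very useful plot for uncovering insights.

Next, plots of the following co-occurrences are drawn.

"Sex" × "FamilySize" × "Survived"
"Sex" × "Pclass" × "Survived"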

#more side-by-side comparisons
fig, (maxis1, maxis2) = plt.subplots(1, 2,figsize=(14,12))

#how does family size factor with sex & survival compare
sns.pointplot(x="FamilySize", y="Survived", hue="Sex", data=data1,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"], ax = maxis1)

#how does class factor with sex & survival compare
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data1,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"], ax = maxis2)

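This is basically the same representation as the three-way co-occurrence bar charts just shown. Use whichever suits your taste.

Next comes a four-way co-occurrence plot. (Once you get to four variables, the mental load on the reader gets quite heavy…)

"Sex" × "Pclass" × "Embarked" × "Survived"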

#how does embark port factor with class, sex, and survival compare
#facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
e = sns.FacetGrid(data1, col = 'Embarked')
e.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette = 'deep')
e.add_legend()

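In practice, this is a three-way co-occurrence plot drawn separately per condition. Applying this idea, you could realistically express co-occurrence among any number of variables.

Next is a plot showing the probability density of the continuous variable "Age", separately for "Survived = 0 or 1".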

#plot distributions of age of passengers who survived or did not survive
a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , data1['Age'].max()))
a.add_legend()

Next is a visualization of the following four-way co-occurrence.

"Sex" × "Pclass" × "Age" × "Survived"

#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')
h.map(plt.hist, 'Age', alpha = .75)
h.add_legend()

Next is the pair plot.

#pair plots of entire dataset
pp = sns.pairplot(data1, hue = 'Survived', palette = 'deep', size=1.2, diag_kind = 'kde', diag_kws=dict(shade=True), plot_kws=dict(s=10) )
pp.set(xticklabels=[])

This one is a little hard to read 💦

Next is the correlation heatmap.

#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(data1)

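Depending on the school of thought, some people look at correlations only in absolute value. That can be done by appending ".abs()" to "df.corr()", that is, "df.corr().abs()".

If you only want to see whether a correlation exists, the absolute-value view is clearer.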
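As a quick check of the absolute-value variant, the toy frame below (made-up columns a, b, c, not the Titanic data) compares the signed and unsigned correlation matrices.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [5, 4, 3, 2, 1],   # perfectly negatively correlated with a
    'c': [2, 1, 4, 3, 6],
})

signed = df.corr()          # values in [-1, 1]
strength = df.corr().abs()  # values in [0, 1]; the sign is discarded

print(signed.round(2))
print(strength.round(2))
```

In the heatmap function above, this would mean passing df.corr().abs() to sns.heatmap in place of df.corr(); a sequential colormap may then read better than a diverging one.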

And with that, "Step 4: Perform Exploratory Analysis with Statistics" is finished.

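Chapter 7 — Step 5: Model Data

Now that the data is ready, it is finally time to apply machine learning.

At the start of this chapter, Freeman makes the following points.

・Data science is an interdisciplinary field between "mathematics", "computer science", and "business management"
・Most people tend to lean toward one of these fields
・The three are like the legs of a three-legged stool: none of them can be missing
・For the mathematics, a "high-level overview" is enough to start with
・Business insight for framing the problem is necessary
・In the end, it is like training a guide dog: the machine learns from us, not the other way around
・As the barrier to entry falls, understanding of the tools in use falls too, which can lead to wrong conclusions and, in the worst case, make the project impossible to complete
・The important thing to learn through the kernel is not what to do but why to do it
・Reinforcement learning is a hybrid of supervised and unsupervised learning (the model is not given the right answer immediately, but after a sequence of events)
・Logistic regression has "regression" in its name, but it is actually a classification algorithm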

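That's it. There is a lot to learn here… Earlier, Freeman also said that machine learning is "neither a magic wand nor a silver bullet". Here he digs a little deeper into that point.

Japan's Data Scientist Society presents the following, which makes a similar argument.

The following is a similar way of thinking as well.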

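Freeman then goes on, under the title "Data Science 101: How to Choose a Machine Learning Algorithm (MLA)", to make the following points. ("101" means introductory, for beginners, the basics.)

・For data modeling there is the "No Free Lunch Theorem", which says the best algorithm cannot be determined in advance; there is no super algorithm that works on every dataset
・It is best to try several machine learning methods and compare their accuracy
・Some hold that the amount of data is what matters, others that the algorithm is what matters
・Decision trees, Random Forest, and Boosting are recommended for beginners
・Decision trees in particular are easy to understand intuitively

That's it. So the point is that it is important to try multiple models by trial and error. It feels like research. I can see why tree-based algorithms are recommended: algorithms rooted in regression analysis are indeed fairly hard to grasp at a conceptual level.

And in this chapter, training finally takes place. The code is below.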

#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    
    #xgboost 
    XGBClassifier()    
    ]

#split dataset in cross-validation with this splitter class: 
#note: this is an alternative to train_test_split
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%

#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = data1[Target].copy() #copy so that adding prediction columns below does not write into a slice of data1

#index through MLA and save performance to table
row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
    #score model with cross validation
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #train_score is only returned by default in older scikit-learn versions

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    #if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3   #let's know the worst that can happen!
    

    #save MLA predictions - see section 6 for usage
    alg.fit(data1[data1_x_bin], data1[Target])
    MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])
    
    row_index+=1

    
#print and sort table
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
#MLA_predict

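XGBoost came out on top in test-set performance.

As a side note, in my environment the process crashed unless I ran "os.environ['KMP_DUPLICATE_LIB_OK'] = 'True' # for XGBoost" (after "import os") before training XGBoost.

The table output above can be drawn in a more intuitive form with the following code.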

#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')

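Chapter 8 — Evaluate Model Performance

Up to this point, we have been able to predict passenger survival with about 82% accuracy. About the work from here on, Freeman says the following.

・To begin with, the results so far are not bad
・From here on, unless you are in a research position, you have to think about ROI (return on investment)
・For example, spending three months to gain 0.1% in accuracy is not good business
・Keep that firmly in mind whenever you improve a model

His point is easy to appreciate. In short, this is the QCD perspective: keeping Quality, Cost and Delivery in balance. Quality comes first, of course, but if the Cost or Delivery it requires is badly out of proportion, it does not work as a business. That is what Freeman is warning us about.

Then, under the title "Data Science 101: Determine a Baseline Accuracy", he explains the following.

・This is a two-class problem, and we know that "Survived = 0" accounts for 1,502 of 2,224 people, so the minimum accuracy is the 67.5% obtained by predicting "Survived = 0" for every row
・So let the baseline accuracy be neither 0% nor 50%, but 68%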
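The baseline arithmetic above can be checked in a couple of lines. This is a minimal sketch; the helper name majority_baseline is my own and does not appear in the kernel. It uses only the counts quoted in this article (1,502 of 2,224 historically, and 549 of 891 in the training data).

```python
# Majority-class baseline: always predict the most frequent label.
def majority_baseline(n_majority, n_total):
    """Accuracy obtained by predicting the majority class for every row."""
    return n_majority / n_total

# Historical figures quoted in the text: 1,502 deaths among 2,224 people on board.
historical = majority_baseline(1502, 2224)
# Training-set figures quoted in the text: 549 of 891 rows have Survived = 0.
training = majority_baseline(549, 891)

print('Historical baseline:   {:.1f}%'.format(historical * 100))
print('Training-set baseline: {:.1f}%'.format(training * 100))
```

The two baselines differ because the training set is a sample of the passengers, which is why the article quotes both 68% and 62% in different places.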

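That is quite right. With imbalanced data, this tendency becomes even stronger. One thing to note: the figure "1,502/2,224" above does not come from the dataset. It is the actual historical outcome as reported on Wikipedia and elsewhere.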

RMS Titanic - Wikipedia

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in 1912 after the ship struck an…

en.wikipedia.org

Note that the death toll is apparently disputed, and I could not quite tell what Freeman's figure of 1,502 is based on.

Next, under the title "Data Science 101: How-to Create Your Own Model", he gives the following explanation.

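To check whether accuracy can be improved further, it helps to build a decision tree by hand from the data, guided by intuitions such as the ones below. Doing it in Excel is fine. Rate the accuracy of that hand-built tree as "good" on the five-level scale "worst, bad, good, better, best" and use it as the reference point for discussion; that tells you whether the accuracy of the machine-derived predictor is reasonable. It should at least beat the manual one. Working through what it takes to gain accuracy also deepens your understanding of what machine learning is doing.
(Note: I have added a fair amount of my own interpretation here.)
Question 1: Were you on the Titanic?
→ If YES, the ratio of "Survived = 0 or 1" in the training data gives at least 62% accuracy (the training data splits "Survived = 0 : 1" as "549 (62%) : 342 (38%)")
Question 2: Are you male or female?
→ In the training data, 81% of men died and 74% of women survived, so this information alone gives at least male share 65% × 81% + female share 35% × 74% = 79% accuracy
Question 3A: Among women, is the passenger class (Pclass) 1, 2 or 3?
→ In the training data, 97% of class 1 and 92% of class 2 survived, with 10 or fewer deaths, while class 3 was a 50-50 split between survivors and deaths, so this question brought no accuracy gain
Question 4A: Among women in passenger class (Pclass) 3, was the port of embarkation (Embarked) C, Q or S?
→ Most of those who embarked at C or Q survived, so there is no gain there, but 63% of those who embarked at S died, so this split raises accuracy to 81%
Question 5A: Among women in passenger class (Pclass) 3 who embarked (Embarked) at S, ...
→ From this question on, by playing around with various conditions, you can create a split in which the majority with fares of 0-8 survive, raising accuracy to 82%
Question 3B: Among men, ...
→ On the male side, passenger class and port of embarkation did not yield an effective further split the way they did for women (the hand-coded tree further down does add one male rule, based on the title Master)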
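The accuracy quoted for Question 2 is a weighted average: each group's share of the passengers multiplied by the fraction of that group the rule gets right. A small sketch of that arithmetic, using only the rounded percentages quoted above (the dictionary layout is my own illustration, not Freeman's code):

```python
# Accuracy of a rule that predicts the majority outcome within each group.
# Each entry: (share of all passengers, fraction of the group predicted correctly).
groups = {
    'male':   (0.65, 0.81),  # predict "died" for every man
    'female': (0.35, 0.74),  # predict "survived" for every woman
}

accuracy = sum(share * correct for share, correct in groups.values())
print('Accuracy after Question 2: about {:.0f}%'.format(accuracy * 100))
```

The same calculation extends to deeper questions: every additional split replaces one group with its sub-groups, and it only pays off when a sub-group's majority outcome differs from its parent's.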

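This part is extremely thought-provoking. Superb. In my own experience, there is no shortage of Data Scientists who jump straight to a conclusion along the lines of "ran it through machine learning → got XX% accuracy → so XX% is the limit". That is tool usage rather than Science. Ideally a Data Scientist should be able to explain the Why behind every step of the process, and Freeman makes a real effort to convey exactly that.

He then actually codes the hand-made model from scratch. Before the hand-made decision tree above, he first introduces the "Coin Flip Model", which predicts completely at random. The code is as follows.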

#IMPORTANT: This is a handmade model for learning purposes only.
#However, it is possible to create your own predictive model without a fancy algorithm :)

#coin flip model with random 1/survived 0/died

#iterate over dataFrame rows as (index, Series) pairs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html
for index, row in data1.iterrows(): 
    #random number generator: https://docs.python.org/2/library/random.html
    if random.random() > .5:     # Random float x, 0.0 <= x < 1.0    
        data1.loc[index, 'Random_Predict'] = 1 #predict survived/1 (.loc replaces DataFrame.set_value, which was removed in pandas 1.0)
    else: 
        data1.loc[index, 'Random_Predict'] = 0 #predict died/0

#score random guess of survival. Use shortcut 1 = Right Guess and 0 = Wrong Guess
#the mean of the column will then equal the accuracy
data1['Random_Score'] = 0 #assume prediction wrong
data1.loc[(data1['Survived'] == data1['Random_Predict']), 'Random_Score'] = 1 #set to 1 for correct prediction
print('Coin Flip Model Accuracy: {:.2f}%'.format(data1['Random_Score'].mean()*100))

#we can also use scikit's accuracy_score function to save us a few lines of code
#http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
print('Coin Flip Model Accuracy w/SciKit: {:.2f}%'.format(metrics.accuracy_score(data1['Survived'], data1['Random_Predict'])*100))

The output of the code above is as follows.

Coin Flip Model Accuracy: 47.81%
Coin Flip Model Accuracy w/SciKit: 47.81%

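The "Coin Flip Model" decides every prediction at random as "0 or 1", with no bias toward either side. Long ago I once answered an exam by rolling a pencil, and this is the same thing. 😅

The accuracy of the program above is roughly 50% (47.81% in the run shown). I ran the program 1,000 times on my machine; the accuracy moved up and down a little from run to run, but the average came out at 50% (49.963%, to be precise). Since the expected accuracy of each individual prediction is 50%, the expected accuracy stays at 50% no matter how many rows it is applied to.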
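The repeated-trial experiment described above can be reproduced without the row-by-row loop. The following is a sketch under my own assumptions: it uses NumPy instead of the kernel's random module, and a stand-in label vector with the 342/549 split quoted in this article rather than the real data1 frame.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the sketch is repeatable
n_trials = 1000

# Stand-in labels: 342 survivors and 549 non-survivors, as in the training data.
y_true = np.array([1] * 342 + [0] * 549)

# Each trial flips an independent fair coin per passenger and scores the result.
flips = rng.integers(0, 2, size=(n_trials, y_true.size))
accuracies = (flips == y_true).mean(axis=1)

print('Mean accuracy over trials: {:.2f}%'.format(accuracies.mean() * 100))
print('Std of a single trial:     {:.2f}%'.format(accuracies.std() * 100))
```

For 891 fair coin flips the standard deviation of a single run's accuracy is sqrt(0.25 / 891), about 1.7 percentage points, so a single result such as 47.81% is within ordinary run-to-run variation.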

Next comes the hand-made decision tree, coded from scratch. The code is as follows.

#handmade data model using brain power (and Microsoft Excel Pivot Tables for quick calculations)
def mytree(df):
    
    #initialize table to store predictions
    Model = pd.DataFrame(data = {'Predict':[]})
    male_title = ['Master'] #survived titles

    for index, row in df.iterrows():

        #Question 1: Were you on the Titanic; majority died
        Model.loc[index, 'Predict'] = 0

        #Question 2: Are you female; majority survived
        if (df.loc[index, 'Sex'] == 'female'):
                  Model.loc[index, 'Predict'] = 1

        #Question 3A Female - Class and Question 4 Embarked gain minimum information

        #Question 5B Female - FareBin; set anything less than .5 in female node decision tree back to 0       
        if ((df.loc[index, 'Sex'] == 'female') & 
            (df.loc[index, 'Pclass'] == 3) & 
            (df.loc[index, 'Embarked'] == 'S')  &
            (df.loc[index, 'Fare'] > 8)

           ):
                  Model.loc[index, 'Predict'] = 0

        #Question 3B Male: Title; set anything greater than .5 to 1 for majority survived
        if ((df.loc[index, 'Sex'] == 'male') &
            (df.loc[index, 'Title'] in male_title)
            ):
            Model.loc[index, 'Predict'] = 1
        
        
    return Model


#model data
Tree_Predict = mytree(data1)
print('Decision Tree Model Accuracy/Precision Score: {:.2f}%\n'.format(metrics.accuracy_score(data1['Survived'], Tree_Predict)*100))


#Accuracy Summary Report with http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
#Where recall score = (true positives)/(true positive + false negative) w/1 being best:http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
#And F1 score = weighted average of precision and recall w/1 being best: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
print(metrics.classification_report(data1['Survived'], Tree_Predict))

The output of the code above is as follows.

Decision Tree Model Accuracy/Precision Score: 82.04%

             precision    recall  f1-score   support

          0       0.82      0.91      0.86       549
          1       0.82      0.68      0.75       342

avg / total       0.82      0.82      0.82       891

The "Score" here is the accuracy, and it comes to 82.04%. This matches what Freeman worked out beforehand.

Next is the code that summarizes the results.

#Plot Accuracy Summary
#Credit: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(data1['Survived'], Tree_Predict)
np.set_printoptions(precision=2)

class_names = ['Dead', 'Survived']
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, 
                      title='Normalized confusion matrix')

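And the output of the code above is as follows.

The results show that 91% of the passengers who were actually "Survived = 0" were correctly predicted as "Survived = 0", whereas only 68% of the passengers who were actually "Survived = 1" were correctly predicted as "Survived = 1". These are the per-class recall values from the report above.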
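Because plot_confusion_matrix normalizes each row (cm.sum(axis=1)), the normalized diagonal is the recall of each true class, not the precision of each prediction. The sketch below illustrates the difference with a made-up 2x2 matrix; the counts are hypothetical and are not the Titanic results.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true label, columns = predicted label).
cm = np.array([[90, 10],
               [30, 70]])

recall = cm.diagonal() / cm.sum(axis=1)      # row-normalized: share of each true class found
precision = cm.diagonal() / cm.sum(axis=0)   # column-normalized: share of each prediction that is right
accuracy = cm.diagonal().sum() / cm.sum()    # overall share of correct predictions

print('recall per class:   ', recall)
print('precision per class:', precision)
print('accuracy:           ', accuracy)
```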

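Note that the accuracy above was computed without splitting the data into training and test sets, and without Cross Validation. By contrast, the accuracies reported in the earlier "Chapter 7 — Step 5: Model Data" were obtained with Cross Validation.

Freeman touches on this point here.

・In the earlier "Chapter 7 — Step 5: Model Data", sklearn's cross_validate function was used to evaluate model performance separately on training and test data
・Separating training data from test data is extremely important when building a model
・Machine learning is good at "predicting" data it has already seen, but offers no guarantee for unseen data (it may go badly wrong), and without such a guarantee it is not "prediction" in any general sense
・Cross Validation builds separate models on data that has been split several times and evaluates each one, thereby estimating accuracy on unseen data (important for avoiding false confidence)
・The Cross Validation I use is a "customized sklearn train test splitter"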
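To make the idea concrete, here is a minimal sketch of repeated random train/test splitting with cross_validate. The data is synthetic (make_classification), and the ShuffleSplit settings (10 splits, 60% train, 30% test) are my assumption about what cv_split looks like; neither is taken from this part of the kernel.

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in for data1[data1_x_bin] and data1[Target]; not the Titanic data.
X, y = make_classification(n_samples=891, n_features=7, n_informative=4, random_state=0)

# Assumed splitter: 10 random splits, 60% train / 30% test, 10% left out each time.
cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

cv_results = model_selection.cross_validate(
    tree.DecisionTreeClassifier(random_state=0), X, y,
    cv=cv_split, return_train_score=True)  # ask for train scores explicitly

test_scores = cv_results['test_score']
print('train mean: {:.2f}'.format(cv_results['train_score'].mean() * 100))
print('test mean:  {:.2f}'.format(test_scores.mean() * 100))
print('test 3*std: +/- {:.2f}'.format(test_scores.std() * 100 * 3))
```

The gap between the train and test means is the quantity to watch: an unpruned tree typically scores near 100% on the rows it was trained on and noticeably lower on held-out rows.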

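He then attaches the following image as an illustration of ordinary Cross Validation.

Chapter 9 — Tune Model with Hyper-Parameters

Next is the chapter on hyperparameter tuning. At the start of the chapter, Freeman says the following.

・In the earlier "Chapter 7 — Step 5: Model Data", sklearn's DecisionTreeClassifier was trained with its default hyperparameters, so there is still room for parameter tuning
・Tuning parameters requires understanding what they mean, so the decision tree algorithm is explained a little further here

A hyperparameter is a parameter that controls the behaviour of a machine learning algorithm. The idea behind the "No Free Lunch Theorem" applies to hyperparameters as well. Roughly speaking, the "No Free Lunch Theorem" says that no single algorithm is best for every problem, so you cannot tell which algorithm will be most accurate until you actually try it. Likewise, the hyperparameter setting that gives the highest accuracy cannot be determined in advance. To chase accuracy, you therefore have to try many parameter values by Trial & Error.

He then explains decision trees as follows, organized around their advantages and disadvantages.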

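The advantages of decision trees are as follows
・Easy to understand and easy to interpret
・No need to rescale feature values
・One caveat: sklearn's decision trees do not support missing values, so these have to be handled beforehand
・The prediction logic is a white box (it can be explained as a series of conditional branches)
The disadvantages of decision trees are as follows
・An overly complex tree (for example, one that is allowed to grow too deep) overfits
・"Pruning", which prevents overfitting, is not supported in sklearn (this was true when the kernel was written; cost-complexity pruning via ccp_alpha was added in scikit-learn 0.22)
・Several overfitting countermeasures other than "pruning" exist, and sklearn implements the main ones
・Because it is a heuristic, greedy algorithm that looks for locally optimal boundaries, there is no guarantee that the result is globally robust (it is weak at extrapolation)
・Its oversensitivity to small variations in the data, and the lack of a global robustness guarantee, can be mitigated with an ensemble (in short, by using a Random Forest)
・If the number of training rows per class is unbalanced, the tree is built in favour of the larger class (it tends to predict only the majority class), so it is a good idea to balance the data before training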
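The overfitting point can be observed directly by varying max_depth and comparing the score on the training rows with the score on held-out rows. This is a sketch on synthetic, deliberately noisy data (flip_y adds label noise), not on the Titanic data, so the exact numbers carry no meaning for the competition.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data; 10% of labels are flipped at random.
X, y = make_classification(n_samples=891, n_features=7, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for depth in [2, 4, 6, 8, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print('max_depth={}: train={:.3f}, test={:.3f}'.format(depth, *scores[depth]))
```

Training accuracy can only go up as the tree is allowed to grow deeper, while test accuracy usually peaks at some moderate depth. That is the motivation for searching over max_depth in the grid search that follows.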

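I see, that is certainly true.

By the way, I have also explained the decision tree algorithm myself, in an article titled "GitHubコードから読み解く！DecisionTreeClassifier@scikit-learnの各種オプション解説" (an explanation of the options of scikit-learn's DecisionTreeClassifier, read from its GitHub code), so please have a look if you are interested.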

GitHubコードから読み解く！DecisionTreeClassifier@scikit-learnの各種オプション解説

Introduction

medium.com

He then gives the following code, which trains the decision tree with tuned parameters and measures the resulting score.

#base model
dtree = tree.DecisionTreeClassifier(random_state = 0)
base_results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #newer scikit-learn releases omit 'train_score' unless it is requested
dtree.fit(data1[data1_x_bin], data1[Target])

print('BEFORE DT Parameters: ', dtree.get_params())
print("BEFORE DT Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100)) 
print("BEFORE DT Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
#print("BEFORE DT Test w/bin set score min: {:.2f}". format(base_results['test_score'].min()*100))
print('-'*10)


#tune hyper-parameters: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
param_grid = {'criterion': ['gini', 'entropy'],  #scoring methodology; two supported formulas for calculating information gain - default is gini
              #'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
              'max_depth': [2,4,6,8,10,None], #max depth tree can grow; default is none
              #'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
              #'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split split (fraction is % of total); default is 1
              #'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
              'random_state': [0] #seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
             }

#print(list(model_selection.ParameterGrid(param_grid)))

#choose best model with grid_search: #http://scikit-learn.org/stable/modules/grid_search.html#grid-search
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
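tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split, return_train_score = True) #'mean_train_score' is only recorded when return_train_score is True in newer scikit-learn releases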
tune_model.fit(data1[data1_x_bin], data1[Target])

#print(tune_model.cv_results_.keys())
#print(tune_model.cv_results_['params'])
print('AFTER DT Parameters: ', tune_model.best_params_)
#print(tune_model.cv_results_['mean_train_score'])
print("AFTER DT Training w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100)) 
#print(tune_model.cv_results_['mean_test_score'])
print("AFTER DT Test w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT Test w/bin score 3*std: +/- {:.2f}". format(tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)


#duplicates gridsearchcv
#tune_results = model_selection.cross_validate(tune_model, data1[data1_x_bin], data1[Target], cv  = cv_split)

#print('AFTER DT Parameters: ', tune_model.best_params_)
#print("AFTER DT Training w/bin set score mean: {:.2f}". format(tune_results['train_score'].mean()*100)) 
#print("AFTER DT Test w/bin set score mean: {:.2f}". format(tune_results['test_score'].mean()*100))
#print("AFTER DT Test w/bin set score min: {:.2f}". format(tune_results['test_score'].min()*100))
#print('-'*10)

The output of the code above is as follows.

BEFORE DT Parameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 0, 'splitter': 'best'}
BEFORE DT Training w/bin score mean: 89.51
BEFORE DT Test w/bin score mean: 82.09
BEFORE DT Test w/bin score 3*std: +/- 5.57
----------
AFTER DT Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT Training w/bin score mean: 89.35
AFTER DT Test w/bin score mean: 87.40
AFTER DT Test w/bin score 3*std: +/- 5.00
----------

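After tuning the parameters with grid search, the mean test score over the 10 Cross Validation splits came to 87.40. Note, however, that GridSearchCV is called here with scoring = 'roc_auc', so this figure is a ROC AUC rather than an accuracy. It therefore cannot be compared directly with the 82.04% accuracy of the human-built tree or the 82.09% accuracy of the untuned tree, even though at first glance the hand-made model appears to be left well behind.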
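The effect of the scoring argument is easy to check: GridSearchCV reports best_score_ and mean_test_score in whatever metric scoring names. The sketch below runs the same grid twice on synthetic data, once with 'roc_auc' and once with 'accuracy'. The data and the ShuffleSplit settings are my own stand-ins, not the kernel's.

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in data and an assumed splitter; not the Titanic data.
X, y = make_classification(n_samples=891, n_features=7, n_informative=4, random_state=0)
cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 4, 6, 8, 10, None],
              'random_state': [0]}

results = {}
for scoring in ['roc_auc', 'accuracy']:
    search = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid,
                                          scoring=scoring, cv=cv_split)
    search.fit(X, y)
    results[scoring] = (search.best_params_, search.best_score_)
    print('{}: best score {:.2f} with {}'.format(scoring, search.best_score_ * 100, search.best_params_))
```

If the goal is to compare the tuned tree with accuracy figures elsewhere in the article, running the search with scoring = 'accuracy' keeps every number on the same scale.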

In addition, taking advantage of the fact that Cross Validation evaluates the model several times, he also reports the ±3σ range of the test score. That is an interesting way to present it!

Chapter 10 — Tune Model with Feature Selection

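This is where the term feature selection appears. Freeman touches on it as follows.

・A model built from more features (explanatory variables with many dimensions) is at risk of overfitting
・Feature selection is one more data modeling step to guard against that
・Sklearn offers several feature selection tools; here RFE (Recursive Feature Elimination) is used together with Cross Validation

Let me add some background on feature selection. Feature selection is an approach that removes features which do not contribute to predictive accuracy. In general this can be expected to improve accuracy on test data. The figure below is a typical example.

The figure above tracks Accuracy as features are removed one at a time. Starting from 37 features, accuracy is highest when 9 features remain. When even more features are removed, accuracy drops off sharply. Accuracy on test data tends to form this kind of hill-shaped curve, bulging slightly to the right. The top of the hill is the number of features that gets the most accuracy out of the model.

For training data, on the other hand, accuracy generally rises as features are added and falls as they are removed, rather like a logarithmic curve. That is where it differs from the trend on test data.

Freeman performs this feature selection with the code below, using RFE as the selection method. RFE repeats two steps: "compute feature importance" and "drop the least important features". Features can be removed one at a time, as in the figure above, or several at once, such as 5 or 10 at a time. Freeman sets "step=1", so features are removed one at a time.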
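Before reading Freeman's code, it may help to see RFECV in isolation. This is a minimal sketch on synthetic data in which only 3 of the 7 columns are informative; the data and splitter settings are my own assumptions, not the kernel's.

```python
from sklearn import feature_selection, model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in: 7 features, of which 3 carry signal and 4 are pure noise.
X, y = make_classification(n_samples=891, n_features=7, n_informative=3,
                           n_redundant=0, random_state=0)
cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

# step=1 removes one feature per round; cross validation picks the best feature count.
selector = feature_selection.RFECV(tree.DecisionTreeClassifier(random_state=0),
                                   step=1, scoring='accuracy', cv=cv_split)
selector.fit(X, y)

mask = selector.get_support()   # boolean mask over the original columns
X_reduced = X[:, mask]          # keep only the selected columns

print('features kept:', selector.n_features_)
print('ranking (1 = kept):', selector.ranking_)
```

The ranking_ attribute shows the order of elimination: columns ranked 1 survived, and larger numbers were dropped earlier.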

#base model (score of the reference model)
print('BEFORE DT RFE Training Shape Old: ', data1[data1_x_bin].shape) 
print('BEFORE DT RFE Training Columns Old: ', data1[data1_x_bin].columns.values)

print("BEFORE DT RFE Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100)) 
print("BEFORE DT RFE Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT RFE Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
print('-'*10)



#feature selection
dtree_rfe = feature_selection.RFECV(dtree, step = 1, scoring = 'accuracy', cv = cv_split)
dtree_rfe.fit(data1[data1_x_bin], data1[Target])

#transform x&y to reduced features and fit new model
#alternative: can use pipeline to reduce fit and transform steps: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
X_rfe = data1[data1_x_bin].columns.values[dtree_rfe.get_support()]
rfe_results = model_selection.cross_validate(dtree, data1[X_rfe], data1[Target], cv  = cv_split, return_train_score = True) #newer scikit-learn releases omit 'train_score' unless it is requested

#print(dtree_rfe.grid_scores_)
print('AFTER DT RFE Training Shape New: ', data1[X_rfe].shape) 
print('AFTER DT RFE Training Columns New: ', X_rfe)

print("AFTER DT RFE Training w/bin score mean: {:.2f}". format(rfe_results['train_score'].mean()*100)) 
print("AFTER DT RFE Test w/bin score mean: {:.2f}". format(rfe_results['test_score'].mean()*100))
print("AFTER DT RFE Test w/bin score 3*std: +/- {:.2f}". format(rfe_results['test_score'].std()*100*3))
print('-'*10)


#tune rfe model
rfe_tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split, return_train_score = True)
rfe_tune_model.fit(data1[X_rfe], data1[Target])

#print(rfe_tune_model.cv_results_.keys())
#print(rfe_tune_model.cv_results_['params'])
print('AFTER DT RFE Tuned Parameters: ', rfe_tune_model.best_params_)
#print(rfe_tune_model.cv_results_['mean_train_score'])
#index with rfe_tune_model.best_index_; the original indexed with tune_model.best_index_, which is only correct when both searches pick the same grid position
print("AFTER DT RFE Tuned Training w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_train_score'][rfe_tune_model.best_index_]*100)) 
#print(rfe_tune_model.cv_results_['mean_test_score'])
print("AFTER DT RFE Tuned Test w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_test_score'][rfe_tune_model.best_index_]*100))
print("AFTER DT RFE Tuned Test w/bin score 3*std: +/- {:.2f}". format(rfe_tune_model.cv_results_['std_test_score'][rfe_tune_model.best_index_]*100*3))
print('-'*10)

The output of the code above is as follows.

BEFORE DT RFE Training Shape Old:  (891, 7)
BEFORE DT RFE Training Columns Old:  ['Sex_Code' 'Pclass' 'Embarked_Code' 'Title_Code' 'FamilySize'
 'AgeBin_Code' 'FareBin_Code']
BEFORE DT RFE Training w/bin score mean: 89.51
BEFORE DT RFE Test w/bin score mean: 82.09
BEFORE DT RFE Test w/bin score 3*std: +/- 5.57
----------
AFTER DT RFE Training Shape New:  (891, 6)
AFTER DT RFE Training Columns New:  ['Sex_Code' 'Pclass' 'Title_Code' 'FamilySize' 'AgeBin_Code' 'FareBin_Code']
AFTER DT RFE Training w/bin score mean: 88.16
AFTER DT RFE Test w/bin score mean: 83.06
AFTER DT RFE Test w/bin score 3*std: +/- 6.22
----------
AFTER DT RFE Tuned Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT RFE Tuned Training w/bin score mean: 89.39
AFTER DT RFE Tuned Test w/bin score mean: 87.34
AFTER DT RFE Tuned Test w/bin score 3*std: +/- 6.21
----------

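The score that parameter tuning had raised to 87.40 became 87.34 after feature selection (both figures are ROC AUC values, since the grid search uses scoring = 'roc_auc'). So the score dropped slightly. What is worth noting, though, is that the number of features in use went down from 7 to 6. Suppose you build a predictor to judge something and put it into production: if feature selection improves accuracy, or leaves it roughly unchanged, you get to operate with fewer features, which is welcome in terms of running cost. Also, the Titanic dataset has relatively few features to begin with, so only this modest reduction was possible here, but in real-world data analysis the features are often far more numerous, and reductions to 1/10 or 1/100 of the original count are sometimes achievable.

Incidentally, when I tuned the parameters more finely on my own machine, I got 87.62% before RFE and 87.60% after RFE. A small improvement only.

After this, Freeman visualizes the decision tree he built as a graph. The visualization code is as follows.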

#Graph MLA version of Decision Tree: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
import graphviz 
dot_data = tree.export_graphviz(dtree, out_file=None, 
                                feature_names = data1_x_bin, class_names = True,
                                filled = True, rounded = True)
graph = graphviz.Source(dot_data) 
graph

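The output of the code above is as follows.

As you can see, the conditions form an impressively intricate web. If a person were to assemble the conditions by hand, building a structure this complex would be rather difficult.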
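When Graphviz is not available, the same branching rules can be inspected as plain text with sklearn.tree.export_text (available from scikit-learn 0.21). The sketch below is my own alternative, not something the kernel does; it fits a shallow tree on synthetic data with placeholder feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder feature names and synthetic data; not the Titanic columns.
feature_names = ['f{}'.format(i) for i in range(7)]
X, y = make_classification(n_samples=891, n_features=7, n_informative=4, random_state=0)

# A shallow tree keeps the printed rule set short enough to read.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(shallow_tree, feature_names=feature_names)
print(rules)
```

Each indented line is one condition on the path from the root, and each "class:" line is a leaf. It is the same information as the graph, just less decorative.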

Chapter 11 — Step 6: Validate and Implement

Next, Freeman applies a technique called Voting, which combines several machine learning algorithms, and evaluates it on the test splits.

sklearn.ensemble.VotingClassifier - scikit-learn 0.21.2 documentation

If 'hard', uses predicted class labels for majority rule voting. Else if 'soft', predicts the class label based on the…

scikit-learn.org

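This is a method that decides the final prediction by averaging the predictions of several algorithms, or by taking a majority vote among them.

Before deciding whether to use it, he first looks at the correlation between the predictions of the individual algorithms.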

#compare algorithm predictions with each other, where 1 = exactly similar and 0 = exactly opposite
#there are some 1's, but enough blues and light reds to create a "super algorithm" by combining them
correlation_heatmap(MLA_predict)

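The output of the code above is as follows.

This matrix is symmetric about the diagonal line of all 1s. The leftmost column and the top row show the correlation between the actual answer and each algorithm's predictions, so the algorithms with high values there are the ones that perform well on their own. Pairs of algorithms with a low correlation between them are making different predictions for the same data, so combining them in an ensemble may produce good results.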
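The numbers behind such a heatmap can be produced with pandas alone. The sketch below assumes that correlation_heatmap plots an ordinary Pearson correlation matrix (DataFrame.corr()); the prediction columns are made up for illustration and are not the contents of MLA_predict.

```python
import pandas as pd

# Hypothetical predictions from three models, next to the true labels.
predictions = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'model_a':  [0, 1, 1, 0, 0, 0, 0, 1],
    'model_b':  [0, 1, 0, 0, 1, 0, 1, 1],
    'model_c':  [0, 1, 1, 0, 0, 0, 0, 1],  # identical to model_a on purpose
})

corr = predictions.corr()   # Pearson correlation between every pair of columns
print(corr.round(2))
```

Under that assumption, a value of 1 means two columns always agree, and a value near 0 means they agree no more often than chance; complete disagreement would show up as -1. Here model_a and model_c correlate at exactly 1, so keeping both in an ensemble adds nothing, whereas model_b makes different mistakes and is the more useful partner.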

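Based on that idea, Freeman goes on to build an ensemble that combines the following algorithms.

・AdaBoost
・Several "decision tree × ensemble" methods provided by scikit-learn (Bagging, Extra Trees, Random Forest)
・Gradient Boosting
・XGBoost
・Logistic regression
・Gaussian process classifier
・Naive Bayes (Bernoulli-based and Gaussian-based)
・K-Nearest Neighbor
・Support Vector Machine

The code that implements the ensemble of these algorithms is as follows.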

#why choose one model, when you can pick them all with voting classifier
#http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
#removed models w/o attribute 'predict_proba' required for vote classifier and models with a 1.0 correlation to another model
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc', ensemble.BaggingClassifier()),
    ('etc',ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),

    #Gaussian Processes: http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc
    ('gpc', gaussian_process.GaussianProcessClassifier()),
    
    #GLM: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    ('lr', linear_model.LogisticRegressionCV()),
    
    #Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
    ('bnb', naive_bayes.BernoulliNB()),
    ('gnb', naive_bayes.GaussianNB()),
    
    #Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
    ('knn', neighbors.KNeighborsClassifier()),
    
    #SVM: http://scikit-learn.org/stable/modules/svm.html
    ('svc', svm.SVC(probability=True)),
    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
   ('xgb', XGBClassifier())

]


#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #newer scikit-learn releases omit 'train_score' unless it is requested
vote_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting Training w/bin score mean: {:.2f}". format(vote_hard_cv['train_score'].mean()*100)) 
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)


#Soft Vote or weighted probabilities
vote_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True) #newer scikit-learn releases omit 'train_score' unless it is requested
vote_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting Training w/bin score mean: {:.2f}". format(vote_soft_cv['train_score'].mean()*100)) 
print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
print('-'*10)

The output of the code above is as follows.

Hard Voting Training w/bin score mean: 86.59
Hard Voting Test w/bin score mean: 82.39
Hard Voting Test w/bin score 3*std: +/- 4.95
----------
Soft Voting Training w/bin score mean: 87.15
Soft Voting Test w/bin score mean: 82.35
Soft Voting Test w/bin score 3*std: +/- 4.85
----------

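The test accuracy came to 82.39% and 82.35%. The more elaborate approach has not beaten the single decision tree: it is roughly level with the untuned tree's 82.09% accuracy and below the 83.06% obtained after RFE (the tuned tree's 87.40 is a ROC AUC, so it is not directly comparable). A real No Free Lunch Theorem moment, perhaps. Let me add a few words on "Hard Voting" and "Soft Voting" as used above.

"Hard Voting" decides "Survived = 0 or 1" by majority vote among the algorithms. For example, if there are 10 algorithms and 7 of them judge "Survived = 1", the single answer of the ensemble predictor is "Survived = 1".

If there are 10 algorithms and 5 of them judge "Survived = 1", in other words if the votes for "Survived = 0" and "Survived = 1" are tied, the result is apparently decided by the ascending sort order of the class labels. The link below describes this behaviour.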

1.11. Ensemble methods - scikit-learn 0.21.2 documentation

In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box…

scikit-learn.org

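"Soft Voting" appears to work by averaging the predicted probabilities produced by each algorithm. If three algorithms predict the probability of "Survived = 1" as "90%", "30%" and "75%" respectively, the collective prediction is "65%".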
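Both voting rules can be reproduced by hand with NumPy. The sketch below is a simplified binary illustration of the idea, not scikit-learn's implementation: it uses the three probabilities from the example above, derives each model's hard vote with a 0.5 threshold (my assumption), and shows how a tie falls to the smaller label.

```python
import numpy as np

# Soft voting: average the predicted probabilities of Survived = 1.
probabilities = np.array([0.90, 0.30, 0.75])
soft_probability = probabilities.mean()
soft_label = int(soft_probability >= 0.5)

# Hard voting: each model casts one vote, and the most common label wins.
votes = (probabilities >= 0.5).astype(int)                   # votes are 1, 0, 1
hard_label = int(np.argmax(np.bincount(votes, minlength=2)))

# Tie: with equal counts, argmax returns the first (smallest) label.
tied_votes = np.array([1] * 5 + [0] * 5)
tied_label = int(np.argmax(np.bincount(tied_votes, minlength=2)))

print('soft probability:', round(soft_probability, 2), '-> label', soft_label)
print('hard votes:', votes, '-> label', hard_label)
print('tied 5-5 vote -> label', tied_label)
```

Soft voting keeps the information about how confident each model is, which is why it requires every estimator to provide predict_proba, as the comment in the kernel's vote_est list points out.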

From here on, Freeman tunes the hyperparameters of each algorithm in the ensemble. The intent is to first find the best setting for each algorithm, and then use the predictors built from those best settings for Voting.

The code below performs the hyperparameter tuning for each algorithm.

#WARNING: Running is very computational intensive and time expensive.
#Code is written for experimental/developmental purposes and not production ready!


#Hyperparameter Tune with GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]


grid_param = [
            [{
            #AdaBoostClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
            'n_estimators': grid_n_estimator, #default=50
            'learning_rate': grid_learn, #default=1
            #'algorithm': ['SAMME', 'SAMME.R'], #default=’SAMME.R
            'random_state': grid_seed
            }],
       
    
            [{
            #BaggingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
            'n_estimators': grid_n_estimator, #default=10
            'max_samples': grid_ratio, #default=1.0
            'random_state': grid_seed
             }],

    
            [{
            #ExtraTreesClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
            'n_estimators': grid_n_estimator, #default=10
            'criterion': grid_criterion, #default=”gini”
            'max_depth': grid_max_depth, #default=None
            'random_state': grid_seed
             }],


            [{
            #GradientBoostingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
            #'loss': ['deviance', 'exponential'], #default=’deviance’
            'learning_rate': [.05], #default=0.1 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
            'n_estimators': [300], #default=100 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
            #'criterion': ['friedman_mse', 'mse', 'mae'], #default=”friedman_mse”
            'max_depth': grid_max_depth, #default=3   
            'random_state': grid_seed
             }],

    
            [{
            #RandomForestClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
            'n_estimators': grid_n_estimator, #default=10
            'criterion': grid_criterion, #default=”gini”
            'max_depth': grid_max_depth, #default=None
            'oob_score': [True], #default=False -- 12/31/17 set to reduce runtime -- The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 146.35 seconds.
            'random_state': grid_seed
             }],
    
            [{    
            #GaussianProcessClassifier
            'max_iter_predict': grid_n_estimator, #default: 100
            'random_state': grid_seed
            }],
        
    
            [{
            #LogisticRegressionCV - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
            'fit_intercept': grid_bool, #default: True
            #'penalty': ['l1','l2'],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
            'random_state': grid_seed
             }],
            
    
            [{
            #BernoulliNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
            'alpha': grid_ratio, #default: 1.0
             }],
    
    
            #GaussianNB - 
            [{}],
    
            [{
            #KNeighborsClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
            'n_neighbors': [1,2,3,4,5,6,7], #default: 5
            'weights': ['uniform', 'distance'], #default = ‘uniform’
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
            }],
            
    
            [{
            #SVC - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
            #http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
            #'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': [1,2,3,4,5], #default=1.0
            'gamma': grid_ratio, #default: auto
            'decision_function_shape': ['ovo', 'ovr'], #default:ovr
            'probability': [True],
            'random_state': grid_seed
             }],

    
            [{
            #XGBClassifier - http://xgboost.readthedocs.io/en/latest/parameter.html
            'learning_rate': grid_learn, #default: .3
            'max_depth': [1,2,4,6,8,10], #default 2
            'n_estimators': grid_n_estimator, 
            'seed': grid_seed  
             }]   
        ]



start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip (vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip

    #print(clf[1]) #vote_est is a list of tuples, index 0 is the name and index 1 is the algorithm
    #print(param)
    
    
    start = time.perf_counter()        
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(data1[data1_x_bin], data1[Target])
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param) 


run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))

print('-'*10)

The output of the code above is as follows.

The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 37.28 seconds.
The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.04 seconds.
The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 68.93 seconds.
The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 38.77 seconds.
The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 84.14 seconds.
The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 6.19 seconds.
The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 9.40 seconds.
The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.24 seconds.
The best parameter for GaussianNB is {} with a runtime of 0.05 seconds.
The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 5.56 seconds.
The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 30.49 seconds.
The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 43.57 seconds.
Total optimization time was 5.96 minutes.
----------
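
As a rough illustration of what `GridSearchCV` does inside the loop above, the sketch below enumerates every combination in a parameter grid and keeps the best-scoring one. The `toy_score` function and the grid values are hypothetical stand-ins for the cross-validated ROC AUC that scikit-learn would compute; none of this is taken from Freeman's kernel.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively score every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical scoring rule that peaks at learning_rate=0.1, max_depth=4.
    # In the real loop this would be a cross-validated metric.
    return -abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 4)

# Toy grid: 5 x 6 = 30 combinations are evaluated.
grid = {
    "learning_rate": [0.01, 0.03, 0.05, 0.1, 0.25],
    "max_depth": [1, 2, 4, 6, 8, 10],
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

Because every combination is fitted once per cross-validation split, the number of model fits grows multiplicatively with each added parameter, which is consistent with the runtimes of several minutes reported above.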

Next, we use those tuned estimators for hard voting, followed by soft voting.

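#Hard Vote or majority rules w/Tuned Hyperparameters (hard voting, i.e. majority rule, using the tuned estimators)
grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
#return_train_score = True is needed for the 'train_score' key on scikit-learn 0.21 and later, where it defaults to False
grid_hard_cv = model_selection.cross_validate(grid_hard, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True)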
grid_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100)) 
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
print('-'*10)

#Soft Vote or weighted probabilities w/Tuned Hyperparameters
grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
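#return_train_score = True is needed for the 'train_score' key on scikit-learn 0.21 and later, where it defaults to False
grid_soft_cv = model_selection.cross_validate(grid_soft, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True)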
grid_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100)) 
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
print('-'*10)


#12/31/17 tuned with data1_x_bin
#The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.39 seconds.
#The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 30.28 seconds.
#The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 64.76 seconds.
#The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 34.35 seconds.
#The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 76.32 seconds.
#The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 6.01 seconds.
#The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 8.04 seconds.
#The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.19 seconds.
#The best parameter for GaussianNB is {} with a runtime of 0.04 seconds.
#The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 4.84 seconds.
#The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 29.39 seconds.
#The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 46.23 seconds.
#Total optimization time was 5.56 minutes.

The code above produces the following output.

Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 85.22
Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 82.31
Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.26
----------
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 84.76
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 82.28
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.42
----------
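
The difference between the two voting modes can be sketched in a few lines. This is a simplified binary-classification illustration, not scikit-learn's implementation; the probabilities are made-up values for a single hypothetical passenger.

```python
from collections import Counter

def hard_vote(probas, threshold=0.5):
    """Majority vote over each model's predicted class label."""
    labels = [int(p >= threshold) for p in probas]
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, threshold=0.5):
    """Average the predicted probabilities, then threshold once."""
    return int(sum(probas) / len(probas) >= threshold)

# Hypothetical P(Survived=1) from three models for one passenger.
probas = [0.45, 0.45, 0.90]

print(hard_vote(probas))  # two of the three models predict class 0
print(soft_vote(probas))  # the mean probability is about 0.60
```

Hard voting counts only the labels, so one very confident model cannot outvote two lukewarm ones. Soft voting lets that confidence count, which is why the two modes can disagree on the same inputs.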

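Accuracy is actually lower than it was before hyperparameter tuning. This is one of the tricky parts of ensembling: once each algorithm is tuned to be strong on its own, the models tend to start giving the same answers, and the benefit of the ensemble shrinks.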
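
The intuition in the paragraph above can be checked with a small calculation. The sketch assumes three classifiers that are each correct with probability 0.8 (a made-up figure, not a result from the kernel) and compares a majority vote when their errors are independent with the case where all three always agree.

```python
from itertools import product

def majority_accuracy_independent(p, n=3):
    """Probability that a majority of n independent classifiers is correct."""
    total = 0.0
    for outcome in product([True, False], repeat=n):
        if sum(outcome) > n / 2:
            prob = 1.0
            for correct in outcome:
                prob *= p if correct else (1 - p)
            total += prob
    return total

p = 0.8  # assumed accuracy of each individual classifier
independent = majority_accuracy_independent(p)
identical = p  # if all models always agree, the vote equals a single model

print(f"independent errors:    {independent:.3f}")
print(f"identical predictions: {identical:.3f}")
```

Under these assumptions the vote only helps when the models make different mistakes. Real tuned models fall somewhere between the two extremes, so this is an upper and lower bound rather than a prediction of the scores above.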

Finally, each algorithm is used to predict on the test data for submission. Freeman also records the resulting submission score for each one in the source comments.

#prepare data for modeling
print(data_val.info())
print("-"*10)
#data_val.sample(10)



#handmade decision tree - submission score = 0.77990
data_val['Survived'] = mytree(data_val).astype(int)


#decision tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_dt = tree.DecisionTreeClassifier()
#submit_dt = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
#submit_dt.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_dt.best_params_) #Best Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
#data_val['Survived'] = submit_dt.predict(data_val[data1_x_bin])


#bagging w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77990
#submit_bc = ensemble.BaggingClassifier()
#submit_bc = model_selection.GridSearchCV(ensemble.BaggingClassifier(), param_grid= {'n_estimators':grid_n_estimator, 'max_samples': grid_ratio, 'oob_score': grid_bool, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_bc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_bc.best_params_) #Best Parameters:  {'max_samples': 0.25, 'n_estimators': 500, 'oob_score': True, 'random_state': 0}
#data_val['Survived'] = submit_bc.predict(data_val[data1_x_bin])


#extra tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_etc = ensemble.ExtraTreesClassifier()
#submit_etc = model_selection.GridSearchCV(ensemble.ExtraTreesClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_etc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_etc.best_params_) #Best Parameters:  {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_etc.predict(data_val[data1_x_bin])


#random forest w/full dataset modeling submission score: defaults= 0.71291, tuned= 0.73205
#submit_rfc = ensemble.RandomForestClassifier()
#submit_rfc = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_rfc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_rfc.best_params_) #Best Parameters:  {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_rfc.predict(data_val[data1_x_bin])



#ada boosting w/full dataset modeling submission score: defaults= 0.74162, tuned= 0.75119
#submit_abc = ensemble.AdaBoostClassifier()
#submit_abc = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid={'n_estimators': grid_n_estimator, 'learning_rate': grid_ratio, 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_abc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_abc.best_params_) #Best Parameters:  {'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0}
#data_val['Survived'] = submit_abc.predict(data_val[data1_x_bin])


#gradient boosting w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77033
#submit_gbc = ensemble.GradientBoostingClassifier()
#submit_gbc = model_selection.GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid={'learning_rate': grid_ratio, 'n_estimators': grid_n_estimator, 'max_depth': grid_max_depth, 'random_state':grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_gbc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_gbc.best_params_) #Best Parameters:  {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 50, 'random_state': 0}
#data_val['Survived'] = submit_gbc.predict(data_val[data1_x_bin])

#extreme boosting w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.77990
#submit_xgb = XGBClassifier()
#submit_xgb = model_selection.GridSearchCV(XGBClassifier(), param_grid= {'learning_rate': grid_learn, 'max_depth': [0,2,4,6,8,10], 'n_estimators': grid_n_estimator, 'seed': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_xgb.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_xgb.best_params_) #Best Parameters:  {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0}
#data_val['Survived'] = submit_xgb.predict(data_val[data1_x_bin])


#hard voting classifier w/full dataset modeling submission score: defaults= 0.75598, tuned = 0.77990
#data_val['Survived'] = vote_hard.predict(data_val[data1_x_bin])
data_val['Survived'] = grid_hard.predict(data_val[data1_x_bin])


#soft voting classifier w/full dataset modeling submission score: defaults= 0.73684, tuned = 0.74162
#data_val['Survived'] = vote_soft.predict(data_val[data1_x_bin])
#data_val['Survived'] = grid_soft.predict(data_val[data1_x_bin])


#submit file
submit = data_val[['PassengerId','Survived']]
submit.to_csv("../working/submit.csv", index=False)

print('Validation Data Distribution: \n', data_val['Survived'].value_counts(normalize = True))
submit.sample(10)

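Interestingly, the comparatively simple algorithms appear to have scored better. Freeman comments on this result in the next chapter.

The output of the code above is not very meaningful in itself (the comments in the source are what matter), but it is shown below for reference.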

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
Validation Data Distribution: 
 0    0.633971
1    0.366029
Name: Survived, dtype: float64

Chapter 12 — Conclusion and Step 7: Optimize and Strategize

In this final chapter, Freeman wraps up as follows.

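・In the end, a simple decision tree algorithm had the highest accuracy, at 87%.
・Even so, the best submission score on the submission dataset was only about 78% (and the handmade decision tree reached that score!).
・The Titanic training data and the submission test data appear to be distributed differently, so despite cross-validation there was a wide margin between the accuracy measured locally and the Kaggle submission score.
・Freeman's view is that further gains should come from spending more time on preprocessing and feature engineering, and that this may be what narrows the gap between the local cross-validation score and the final submission score.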

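I see: this is written about a problem on Kaggle, but it is just as much a problem in real-world operation. Even when things work well on a small local dataset, once you move into production you often find cases that do not behave as expected. When that happens, you need to keep iterating, for example by engineering new features.

Also, as Freeman noted earlier, when pushing accuracy higher you must not lose sight of QCD (quality, cost, delivery), in other words cost-effectiveness. This is where the essential difficulty of data analysis lies, and engineering from here on will most likely require exactly this kind of balancing.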

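Closing Remarks

Through Freeman's Kaggle Kernel, we were able to learn the basic steps of data analysis, what they mean, and the mindset they require. I plan to summarize Kernels by other top Kagglers in the same way.

The following frameworks also appeared along the way. They should be a useful reference not only for people who do data analysis directly but also for those involved in it indirectly.

Below is what Freeman calls the "Data Science Framework".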

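1. Define the Problem
→ Set the problem. Be issue-driven, not technology-driven.
2. Gather the Data
→ Collect the data. Keep data cleaning in mind from this point on.
3. Prepare Data for Consumption
→ Preprocess the data. This is apparently also called data cleaning or data wrangling.
4. Perform Exploratory Analysis
→ Carry out exploratory analysis. A safeguard against GIGO (garbage in, garbage out). Correlation analysis. Visualization of summary statistics.
5. Model Data
→ Apply machine learning. It is neither a magic wand nor a silver bullet. Be a discerning judge of which model to choose.
6. Validate and Implement Data Model
→ Validate the accuracy of the trained model. Watch out for overfitting and underfitting.
7. Optimize and Strategize
→ Optimize and strategize. Do the work needed to move toward practical use.

That is all.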

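Below is what Freeman calls the "4 C's of Data Cleaning".

1. Correcting: handling anomalous values and outliers
→ For some values, the decision to exclude them is made only after visualizing the data.
→ For example, if a record had "Age=800", it would be safe to correct it to "Age=80".
2. Completing: filling in missing values
→ "Age", "Cabin", and "Embarked" contain missing values. Some algorithms can handle missing values as-is, but imputing them is more convenient when comparing models against each other.
→ There are two common approaches: delete the record, or assign a reasonable value. Candidates for a reasonable value include the mean and the median.
3. Creating: generating new features for analysis
→ Create new features from the existing ones.
4. Converting: turning data into numbers for machine-learning input
→ Convert categorical data into 0/1 dummy variables so that it can be fed to a machine-learning model.

That is all.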
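
The four steps can be sketched on a few toy records. This is a minimal illustration in plain Python rather than the pandas code used in the kernel. The records, the age cutoff of 120, and the `FamilySize = SibSp + Parch + 1` rule are assumptions chosen for the example; only the column names follow the dataset shown earlier.

```python
from statistics import median

# Hypothetical toy records; column names follow the Titanic dataset.
rows = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0, "Sex": "male", "Embarked": "S"},
    {"Age": 800.0, "SibSp": 0, "Parch": 0, "Sex": "female", "Embarked": "C"},
    {"Age": None, "SibSp": 1, "Parch": 2, "Sex": "female", "Embarked": "Q"},
]

# 1. Correcting: treat an impossible age as a typo (800 -> 80).
#    The cutoff of 120 is an assumption for this example.
for r in rows:
    if r["Age"] is not None and r["Age"] > 120:
        r["Age"] = r["Age"] / 10

# 2. Completing: fill missing ages with the median of the known ages.
age_median = median(r["Age"] for r in rows if r["Age"] is not None)
for r in rows:
    if r["Age"] is None:
        r["Age"] = age_median

# 3. Creating: derive new features from existing ones (assumed definitions).
for r in rows:
    r["FamilySize"] = r["SibSp"] + r["Parch"] + 1
    r["IsAlone"] = int(r["FamilySize"] == 1)

# 4. Converting: encode categories as 0/1 dummy columns.
ports = sorted({r["Embarked"] for r in rows})
for r in rows:
    r["Sex_female"] = int(r["Sex"] == "female")
    for port in ports:
        r[f"Embarked_{port}"] = int(r["Embarked"] == port)

for r in rows:
    print(r)
```

The order matters here: the correction in step 1 runs before the median in step 2 is computed, so the typo does not distort the imputed value.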

Simply knowing that procedures like these exist should make data analysis easier to approach.

Finally, my thanks to everyone who read this far. Thank you for reading.

P.S.

I am grateful to the MICIN interns 尾原-san, 濱野-san, and 吉川-san, who gave me the idea for this article and helped me write it. 🙇‍♂️
